Purpose of num_threads in copy_dram_to_sram_async

I’m confused about the purpose of num_threads inside copy_dram_to_sram_async. The documentation states:

num_threads (Int): Total number of threads participating in the copy operation. Defaults to the size of src_thread_layout.

However, this part of the implementation

alias num_busy_threads = src_thread_layout.size()

# We know at compile time that only partial threads copy based on the size
# of input tensors. Return if current thread doesn't have work.
@parameter
if num_threads > num_busy_threads:
    if thread_idx.x >= num_busy_threads:
        return

implies that it is there to disable threads that are not part of the copy operation. For example, given a 1D array of 1024 elements, a 1D block of 1024 threads, and a thread_layout of Layout.row_major(1, 32) for the copy operation:

alias layout = Layout.row_major(1, 1024)
input = LayoutTensor[mut=False, dtype, layout](inp.unsafe_ptr())
...
shared = tb[dtype]().row_major[1, 1024]().shared().alloc()
alias load_layout = Layout.row_major(1, 32)
copy_dram_to_sram_async[thread_layout=load_layout, num_threads=1024](shared, input)

Here, num_threads=1024 disables threads 32, ..., 1023, so they do not issue extra copy operations.
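To make the effect of that guard concrete, here is a small plain-Python model of it. This is only a sketch of the early-return logic quoted above, not the library's code: `active_threads` and its `block_size` parameter are hypothetical names introduced for illustration, and the copy itself is not modeled.

```python
def active_threads(block_size: int, num_threads: int, layout_size: int) -> list[int]:
    """Return the thread indices that get past the early-return guard.

    block_size  -- threads actually launched in the block (hypothetical name)
    num_threads -- value passed as the num_threads parameter
    layout_size -- src_thread_layout.size(), i.e. num_busy_threads
    """
    num_busy_threads = layout_size
    active = []
    for thread_idx in range(block_size):
        # Mirrors the quoted guard: the outer test is resolved at compile
        # time in Mojo, the inner test runs per thread.
        if num_threads > num_busy_threads and thread_idx >= num_busy_threads:
            continue  # this thread returns without copying
        active.append(thread_idx)
    return active


# The example from the question: 1024 threads in the block, a 1x32 thread layout.
busy = active_threads(block_size=1024, num_threads=1024, layout_size=32)
print(len(busy), busy[0], busy[-1])  # 32 0 31
```

In this model, leaving num_threads at its default (the layout size, 32 here) makes the outer condition false, so none of the 1024 threads would be filtered out by the guard. That is consistent with reading num_threads as the block size rather than the number of copying threads.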

In this case, num_threads would be the total number of threads in the block, not the number participating in the copy operation. Is this correct?

That’s correct! The documentation needs to be fixed. It should read:

num_threads (Int): Total number of threads in the thread block. Threads beyond src_thread_layout.size() will be disabled and not participate in the copy operation.

cc @arthur