I’m confused about the purpose of num_threads inside copy_dram_to_sram_async. The documentation states:
num_threads (Int): Total number of threads participating in the copy operation. Defaults to the size of src_thread_layout.
However, this excerpt from the implementation:
alias num_busy_threads = src_thread_layout.size()
# We know at compile time that only partial threads copy based on the size
# of input tensors. Return if current thread doesn't have work.
@parameter
if num_threads > num_busy_threads:
    if thread_idx.x >= num_busy_threads:
        return
implies that it is there to disable threads that are not part of the copy operation. For example, given a 1D array of 1024 elements, a 1D block of 1024 threads, and a thread_layout of Layout.row_major(1, 32) for the copy operation:
alias layout = Layout.row_major(1, 1024)
input = LayoutTensor[mut=False, dtype, layout](inp.unsafe_ptr())
...
shared = tb[dtype]().row_major[1, 1024]().shared().alloc()
alias load_layout = Layout.row_major(1, 32)
copy_dram_to_sram_async[thread_layout=load_layout, num_threads=1024](shared, input)
Here, num_threads=1024 prevents threads 32, ..., 1023 from issuing extra copy operations.
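To make that reading concrete, here is a minimal host-side sketch of the guard as I understand it. It is not the library's code: is_copy_thread and count_copy_threads are hypothetical helpers, a plain Int stands in for thread_idx.x, and nothing runs on the GPU. It only models which thread indices would get past the early return.

```mojo
# Host-side model of the early-return guard quoted above.
# `is_copy_thread` and `count_copy_threads` are hypothetical helper names.


fn is_copy_thread[
    num_threads: Int, num_busy_threads: Int
](thread_idx_x: Int) -> Bool:
    # The outer comparison uses only compile-time parameters, so it is
    # resolved at compile time, as in the quoted implementation.
    @parameter
    if num_threads > num_busy_threads:
        if thread_idx_x >= num_busy_threads:
            return False  # the real code returns here without copying
    return True


fn count_copy_threads[num_threads: Int, num_busy_threads: Int]() -> Int:
    # Count how many thread indices in the block pass the guard.
    var count = 0
    for t in range(num_threads):
        if is_copy_thread[num_threads, num_busy_threads](t):
            count += 1
    return count


def main():
    # Block of 1024 threads, thread layout of size 32 (row_major(1, 32)).
    print(count_copy_threads[1024, 32]())
    # num_threads equal to the layout size: the guard is compiled out.
    print(count_copy_threads[32, 32]())
```

Under these assumptions, only indices 0 through 31 pass the guard in the first case, which matches the behavior described above.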
In this case, num_threads would be the total number of threads in the block, not the number participating in the copy operation. Is this correct?