Hi everyone. I’m a beginner trying to learn GPU programming with Mojo.

In Puzzle 16, the solution suggests the following code to load data from global memory into shared memory:
alias load_a_layout = Layout.row_major(1, TPB)  # Coalesced loading
alias load_b_layout = Layout.row_major(1, TPB)  # Coalesced loading

@parameter
for idx in range(size // TPB):  # Perfect division: 9 // 3 = 3 tiles
    # Get tiles from A and B matrices
    a_tile = a.tile[TPB, TPB](block_idx.y, idx)
    b_tile = b.tile[TPB, TPB](idx, block_idx.x)

    # Asynchronously copy tiles to shared memory with consistent orientation
    copy_dram_to_sram_async[
        thread_layout=load_a_layout,
        num_threads=NUM_THREADS,
        block_dim_count=BLOCK_DIM_COUNT,
    ](a_shared, a_tile)
    copy_dram_to_sram_async[
        thread_layout=load_b_layout,
        num_threads=NUM_THREADS,
        block_dim_count=BLOCK_DIM_COUNT,
    ](b_shared, b_tile)
From the puzzle’s solution explanation and some digging on my own, I believe that with TPB = 3 and NUM_THREADS = TPB * TPB = 9, this code makes only 3 out of the 9 threads do the loading. The reason is that each thread can perform a SIMD load, which fetches multiple consecutive elements from memory in a single instruction.

However, as far as I understand, the SIMD load size should be a power of 2 (2, 4, 8, … bytes). With TPB = 3, how can the SIMD load be carried out efficiently?
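To make my mental model concrete, here is a small Python sketch (illustrative only, not Mojo). The thread count follows from the layout shapes in the puzzle; the `decompose_load` part is purely my guess at one plausible way a compiler could split a non-power-of-2 load, not something I know the Mojo compiler actually does:

```python
TPB = 3
NUM_THREADS = TPB * TPB  # 9 threads in the block

# row_major(1, TPB) has only 1 * TPB = TPB slots, so (as I understand it)
# only TPB of the NUM_THREADS threads take part in the copy, and each
# participating thread covers (TPB * TPB) / TPB = TPB elements of the tile.
active_threads = 1 * TPB
elements_per_thread = (TPB * TPB) // active_threads
print(f"{active_threads} of {NUM_THREADS} threads load {elements_per_thread} elements each")

def decompose_load(width: int) -> list[int]:
    """Split a load of `width` elements into power-of-2 chunks, largest
    first -- a plausible strategy, NOT necessarily what Mojo does."""
    chunks = []
    remaining = width
    while remaining > 0:
        chunks.append(1 << (remaining.bit_length() - 1))  # largest power of 2 <= remaining
        remaining -= chunks[-1]
    return chunks

print(decompose_load(TPB))  # -> [2, 1]: a 2-wide SIMD load plus a scalar load?
```

Is something like this decomposition what happens under the hood, or does the copy fall back to scalar loads entirely when the width is not a power of 2?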