Question regarding `copy_dram_to_sram_async` in Puzzle 16 MatMul

Hi everyone. I'm a beginner trying to learn GPU programming with Mojo.

In Puzzle 16 it is suggested to use the following code to load data from global memory to shared memory:

alias load_a_layout = Layout.row_major(1, TPB)  # Coalesced loading
alias load_b_layout = Layout.row_major(1, TPB)  # Coalesced loading

@parameter
for idx in range(size // TPB):  # Perfect division: 9 // 3 = 3 tiles
    # Get tiles from A and B matrices
    a_tile = a.tile[TPB, TPB](block_idx.y, idx)
    b_tile = b.tile[TPB, TPB](idx, block_idx.x)

    # Asynchronously copy tiles to shared memory with consistent orientation
    copy_dram_to_sram_async[
        thread_layout=load_a_layout,
        num_threads=NUM_THREADS,
        block_dim_count=BLOCK_DIM_COUNT,
    ](a_shared, a_tile)
    copy_dram_to_sram_async[
        thread_layout=load_b_layout,
        num_threads=NUM_THREADS,
        block_dim_count=BLOCK_DIM_COUNT,
    ](b_shared, b_tile)

From the puzzle solution explanation and some digging on my own, with TPB=3 and NUM_THREADS=TPB*TPB=9, I think this code makes only 3 threads out of 9 do the loading. The reason would be that each thread can do a SIMD load, which loads multiple consecutive elements from memory. However, I think the SIMD load size should be a power of 2 (2, 4, 8, …) bytes. With TPB=3, how can the SIMD load be carried out efficiently?

Great question. The quick answer is yes, only 3 threads out of 9 do the loading, but there is no SIMD loading.

Explanation

Using a load layout

Layout.row_major(1, TPB)

with TPB == 3 columns over a 3x3 tile distributes the copy operation over the rows: the thread layout has only 1 x TPB = 3 positions, so only 3 of the 9 block threads are active, and together they load one row (TPB columns) of a and b per step; since there are 3 rows in each tile this is performed 3 times. Alternatively, from a thread's perspective, each active thread loads a column of the a and b tiles one element at a time. You can see this by dumping the PTX while running the solution:

mojo -D DUMP_GPU_ASM=True solutions/p16/p16.mojo --idiomatic-tiled

Copy to a_tile: three 4-byte load instructions, each separated by a 36 = 9 * 4 byte stride (one SIZE_TILED-element row of the matrix).

add.s64         %rd19, %rd26, %rd49;
add.s32         %r13, %r57, %r59;
cp.async.ca.shared.global [%r13], [%rd19], 4;
add.s64         %rd20, %rd19, 36;
add.s32         %r14, %r13, 12;
cp.async.ca.shared.global [%r14], [%rd20], 4;
add.s64         %rd21, %rd19, 72;
add.s32         %r15, %r13, 24;
cp.async.ca.shared.global [%r15], [%rd21], 4;

Copy to b_tile: same as a_tile.

add.s64         %rd22, %rd25, %rd49;
add.s32         %r16, %r58, %r59;
cp.async.ca.shared.global [%r16], [%rd22], 4;
add.s64         %rd23, %rd22, 36;
add.s32         %r17, %r16, 12;
cp.async.ca.shared.global [%r17], [%rd23], 4;
add.s64         %rd24, %rd22, 72;
add.s32         %r18, %r16, 24;
cp.async.ca.shared.global [%r18], [%rd24], 4;
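
As a sanity check on those constants (my own arithmetic, not something the puzzle spells out): assuming float32 elements, each thread's global pointer advances by one row of the full SIZE_TILED-wide matrix per load, while its shared-memory pointer advances by one row of the TPB-wide tile. A minimal Mojo sketch of the arithmetic:

alias TPB = 3
alias SIZE_TILED = 9
alias DTYPE_BYTES = 4  # assuming float32 elements


fn main():
    # Byte stride between one thread's consecutive cp.async loads:
    # the global side steps down a column of the full SIZE_TILED-wide matrix,
    # the shared side steps down a column of the TPB x TPB tile.
    print("global row stride:", SIZE_TILED * DTYPE_BYTES, "bytes")  # 36, matches the +36/+72 offsets above
    print("shared row stride:", TPB * DTYPE_BYTES, "bytes")  # 12, matches the +12/+24 offsets above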

Alternatively

Layout.row_major(TPB, 1)

will distribute the work over the columns, with each thread loading a row of the a and b tiles, and

Layout.row_major(TPB, TPB)

will load one element of each tile per thread.
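
If it helps to see these mappings concretely, here is a rough CPU-side sketch (my own illustration of how I understand the distribution, not the library's implementation): each active thread starts at its position in the thread layout and strides by the layout shape until the TPB x TPB tile is covered.

alias TPB = 3


fn print_mapping(t_rows: Int, t_cols: Int):
    # Illustration only: how a (t_rows x t_cols) thread layout could cover a TPB x TPB tile.
    var active = t_rows * t_cols
    print("thread layout", t_rows, "x", t_cols, "->", active, "active threads")
    for t in range(active):
        var r = t // t_cols
        while r < TPB:
            var c = t % t_cols
            while c < TPB:
                print("  thread", t, "copies tile element (", r, ",", c, ")")
                c += t_cols
            r += t_rows


fn main():
    print_mapping(1, TPB)    # row_major(1, TPB):   3 threads, each walks a column
    print_mapping(TPB, 1)    # row_major(TPB, 1):   3 threads, each walks a row
    print_mapping(TPB, TPB)  # row_major(TPB, TPB): 9 threads, one element each

With row_major(1, TPB) this reproduces the pattern visible in the PTX above: three active threads, each stepping down a column one element at a time.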

In practice the choice of loading pattern will be algorithm dependent, e.g. should all threads participate in the load, or should only a subset load while the rest consume the data (producer/consumer pattern)?

To summarize, this problem does not demonstrate vectorized loading with aligned_load as discussed in elementwise - Basic GPU Functional Operations - Mojo 🔥 GPU Puzzles. My interpretation of the choice of 3 threads per row/column is that it makes the problem small enough to visualize, at the expense of performance.
