Question regarding `copy_dram_to_sram_async` in Puzzle 16 MatMul

Great question. The quick answer: yes, only 3 of the 9 threads do the loading, but there is no SIMD (vectorized) loading.

Explanation

Using a load layout

Layout.row_major(1, TPB)

with TPB == 3 columns over a 3x3 tile distributes the copy over the rows, i.e. the TPB active threads load a row (TPB columns) of `a` and `b`, and because there are 3 rows in each tile this is repeated 3 times. Equivalently, from a thread's perspective, each thread loads one column of the `a` and `b` tiles, one element at a time. You can see this by dumping the PTX while running the solution:

mojo -D DUMP_GPU_ASM=True solutions/p16/p16.mojo --idiomatic-tiled

Copy to `a_tile`: three 4-byte load instructions, each separated by a 36 = 9 * 4 byte stride (one SIZE_TILED-element row of the matrix).

add.s64         %rd19, %rd26, %rd49;
add.s32         %r13, %r57, %r59;
cp.async.ca.shared.global [%r13], [%rd19], 4;
add.s64         %rd20, %rd19, 36;
add.s32         %r14, %r13, 12;
cp.async.ca.shared.global [%r14], [%rd20], 4;
add.s64         %rd21, %rd19, 72;
add.s32         %r15, %r13, 24;
cp.async.ca.shared.global [%r15], [%rd21], 4;
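The address arithmetic in that PTX can be modelled in a few lines of plain Python (a sketch under the assumptions SIZE_TILED = 9, TPB = 3, float32 elements; the function names are mine, not from the puzzle code):

```python
# Model of the per-thread address arithmetic seen in the PTX above.
# Assumptions: SIZE_TILED = 9 (matrix width), TPB = 3 (tile width),
# float32 elements (4 bytes), row-major storage.
SIZE_TILED = 9
TPB = 3
ELEM = 4  # bytes per float32

def global_offsets(thread_col, tile_row0=0, tile_col0=0):
    """DRAM byte offsets touched by one thread: one tile column, top to bottom."""
    return [((tile_row0 + r) * SIZE_TILED + tile_col0 + thread_col) * ELEM
            for r in range(TPB)]

def shared_offsets(thread_col):
    """Byte offsets in the 3x3 SRAM tile written by the same thread."""
    return [(r * TPB + thread_col) * ELEM for r in range(TPB)]

# Thread 0: global addresses step by 36 = 9*4 bytes (one matrix row),
# shared addresses step by 12 = 3*4 bytes (one tile row) -- matching the
# "+36" and "+12"/"+24" increments in the PTX.
print(global_offsets(0))  # [0, 36, 72]
print(shared_offsets(0))  # [0, 12, 24]
```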

Copy to `b_tile`: same as `a_tile`.

add.s64         %rd22, %rd25, %rd49;
add.s32         %r16, %r58, %r59;
cp.async.ca.shared.global [%r16], [%rd22], 4;
add.s64         %rd23, %rd22, 36;
add.s32         %r17, %r16, 12;
cp.async.ca.shared.global [%r17], [%rd23], 4;
add.s64         %rd24, %rd22, 72;
add.s32         %r18, %r16, 24;
cp.async.ca.shared.global [%r18], [%rd24], 4;

Alternatively

Layout.row_major(TPB, 1)

will distribute the work over the columns, with each thread loading a row of the `a` and `b` tiles, and

Layout.row_major(TPB, TPB)

will have each thread load one element of each tile.
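The three layouts can be compared with a small Python model (a sketch, not the actual `copy_dram_to_sram_async` implementation; the modulo mapping of tile elements onto the thread layout is my assumption, chosen to match the behaviour described above):

```python
# Hypothetical model: a (rows, cols) thread layout partitions a 3x3 tile,
# with element (r, c) assumed to go to thread (r % rows, c % cols).
TPB = 3

def assignment(rows, cols, tile=TPB):
    """Map each active thread to the list of tile elements it loads."""
    work = {}
    for r in range(tile):
        for c in range(tile):
            work.setdefault((r % rows, c % cols), []).append((r, c))
    return work

# row_major(1, TPB): 3 active threads, each loads one column of the tile
print(assignment(1, TPB))
# row_major(TPB, 1): 3 active threads, each loads one row
print(assignment(TPB, 1))
# row_major(TPB, TPB): all 9 threads active, one element each
print(assignment(TPB, TPB))
```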

In practice the choice of loading pattern is algorithm dependent, e.g. should all threads participate in the load, or should only a subset load while the rest consume the data (a producer/consumer pattern)?

To summarize, this problem does not demonstrate vectorized loading with aligned_load as discussed in elementwise - Basic GPU Functional Operations - Mojo 🔥 GPU Puzzles. My interpretation of the choice of 3 threads per row/column is that it keeps the problem small enough to visualize, at the expense of performance.
