Question regarding `copy_dram_to_sram_async` in Puzzle 16 MatMul

Great question. The quick answer: yes, only 3 of the 9 threads do the loading, but there is no SIMD (vectorized) loading.

Explanation

Using a load layout

Layout.row_major(1, TPB)

with TPB == 3 columns over a 3x3 tile distributes the copy over the rows, i.e. the TPB active threads load a row (TPB columns) of `a` and `b`, and because there are 3 rows in each tile this is repeated 3 times. Equivalently, from a thread's perspective, each thread loads one column of the `a` and `b` tiles, one element at a time. You can see this by dumping the PTX while running the solution:

mojo -D DUMP_GPU_ASM=True solutions/p16/p16.mojo --idiomatic-tiled

Copy to `a_tile`: three 4-byte load instructions, each separated by a 36 = 9 * 4 byte stride (one SIZE_TILED-element row of the matrix).

add.s64         %rd19, %rd26, %rd49;
add.s32         %r13, %r57, %r59;
cp.async.ca.shared.global [%r13], [%rd19], 4;
add.s64         %rd20, %rd19, 36;
add.s32         %r14, %r13, 12;
cp.async.ca.shared.global [%r14], [%rd20], 4;
add.s64         %rd21, %rd19, 72;
add.s32         %r15, %r13, 24;
cp.async.ca.shared.global [%r15], [%rd21], 4;
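The address arithmetic in that PTX can be modelled in a few lines of plain Python (a sketch under the assumptions SIZE_TILED = 9, TPB = 3, float32 elements; the function names are mine, not from the puzzle code):

```python
# Model of the per-thread address arithmetic seen in the PTX above.
# Assumptions: SIZE_TILED = 9 (matrix width), TPB = 3 (tile width),
# float32 elements (4 bytes), row-major storage.
SIZE_TILED = 9
TPB = 3
ELEM = 4  # bytes per float32

def global_offsets(thread_col, tile_row0=0, tile_col0=0):
    """DRAM byte offsets touched by one thread: one tile column, top to bottom."""
    return [((tile_row0 + r) * SIZE_TILED + tile_col0 + thread_col) * ELEM
            for r in range(TPB)]

def shared_offsets(thread_col):
    """Byte offsets in the 3x3 SRAM tile written by the same thread."""
    return [(r * TPB + thread_col) * ELEM for r in range(TPB)]

# Thread 0: global addresses step by 36 = 9*4 bytes (one matrix row),
# shared addresses step by 12 = 3*4 bytes (one tile row) -- matching the
# "+36" and "+12"/"+24" increments in the PTX.
print(global_offsets(0))  # [0, 36, 72]
print(shared_offsets(0))  # [0, 12, 24]
```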

Copy to `b_tile`: same as `a_tile`.

add.s64         %rd22, %rd25, %rd49;
add.s32         %r16, %r58, %r59;
cp.async.ca.shared.global [%r16], [%rd22], 4;
add.s64         %rd23, %rd22, 36;
add.s32         %r17, %r16, 12;
cp.async.ca.shared.global [%r17], [%rd23], 4;
add.s64         %rd24, %rd22, 72;
add.s32         %r18, %r16, 24;
cp.async.ca.shared.global [%r18], [%rd24], 4;

Alternatively

Layout.row_major(TPB, 1)

will distribute the work over the columns, with each thread loading a row of the `a` and `b` tiles, and

Layout.row_major(TPB, TPB)

will have each thread load one element of each tile.
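The three layouts can be compared with a small Python model (a sketch, not the actual `copy_dram_to_sram_async` implementation; the modulo mapping of tile elements onto the thread layout is my assumption, chosen to match the behaviour described above):

```python
# Hypothetical model: a (rows, cols) thread layout partitions a 3x3 tile,
# with element (r, c) assumed to go to thread (r % rows, c % cols).
TPB = 3

def assignment(rows, cols, tile=TPB):
    """Map each active thread to the list of tile elements it loads."""
    work = {}
    for r in range(tile):
        for c in range(tile):
            work.setdefault((r % rows, c % cols), []).append((r, c))
    return work

# row_major(1, TPB): 3 active threads, each loads one column of the tile
print(assignment(1, TPB))
# row_major(TPB, 1): 3 active threads, each loads one row
print(assignment(TPB, 1))
# row_major(TPB, TPB): all 9 threads active, one element each
print(assignment(TPB, TPB))
```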

In practice the choice of loading pattern is algorithm dependent, e.g. should all threads participate in the load, or should only a subset load while the rest consume the data (a producer/consumer pattern)?

To summarize, this problem does not demonstrate vectorized loading with aligned_load as discussed in elementwise - Basic GPU Functional Operations - Mojo 🔥 GPU Puzzles. My interpretation of the choice of 3 threads per row/column is that it keeps the problem small enough to visualize, at the expense of performance.
