Great question. Quick answer is yes only 3 threads out of 9 do the loading but there is no SIMD loading.
Explanation
Using a load layout
Layout.row_major(1, TPB)
with TPB==3 columns over a 3x3 tile distributes the copy operation over the rows. i.e. The TPB active threads load a row (TPB columns) of a and b and because there are 3 rows in each tile this is performed 3 times. Alternatively (thread perspective) each thread loads a column of the a and b tile one element at a time. You can see this if you output the PTX while running solution as
mojo -D DUMP_GPU_ASM=True solutions/p16/p16.mojo --idiomatic-tiled
Copy to a_tile: Three 4 byte load instructions, each one seperated by a 36=9*4 byte stride (SIZE_TILED element row of the matrix).
add.s64 %rd19, %rd26, %rd49;
add.s32 %r13, %r57, %r59;
cp.async.ca.shared.global [%r13], [%rd19], 4;
add.s64 %rd20, %rd19, 36;
add.s32 %r14, %r13, 12;
cp.async.ca.shared.global [%r14], [%rd20], 4;
add.s64 %rd21, %rd19, 72;
add.s32 %r15, %r13, 24;
cp.async.ca.shared.global [%r15], [%rd21], 4;
Copy to b_tile : Same as a_tile
add.s64 %rd22, %rd25, %rd49;
add.s32 %r16, %r58, %r59;
cp.async.ca.shared.global [%r16], [%rd22], 4;
add.s64 %rd23, %rd22, 36;
add.s32 %r17, %r16, 12;
cp.async.ca.shared.global [%r17], [%rd23], 4;
add.s64 %rd24, %rd22, 72;
add.s32 %r18, %r16, 24;
cp.async.ca.shared.global [%r18], [%rd24], 4;
Alternatively
Layout.row_major(TPB, 1)
will distribute the work over the columns with each thread loading a row of the a and b tiles and
Layout.row_major(TPB, TPB)
will load one element of each tile per thread.
In practice the choice of loading pattern will be algorithm dependant. i.e. Should all threads participate in the load or should only a subset load with the rest consuming the data (producer, consumer pattern).
To summarize this problem does not demonstrate vectored loading with aligned_load as discussed in elementwise - Basic GPU Functional Operations - Mojo 🔥 GPU Puzzles. My interpretation of the choice of 3 threads per row/column is that it makes the problem small enough to visualize at the expense of performance.