Puzzle 23: CUDA SIMD load, store and basic ops

In the solution to Puzzle 23, the simple add function is supposed to load and store SIMD_WIDTH values at a time and, where the hardware supports it, perform the addition on those values with a single SIMD instruction.
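
To illustrate what I mean, here is a small host-side Mojo sketch of that pattern. This is my own hypothetical example rather than the puzzle code, with WIDTH standing in for SIMD_WIDTH and the function and variable names made up:

from memory import UnsafePointer

alias WIDTH = 4  # stand-in for SIMD_WIDTH with float32

fn add_vec(dst: UnsafePointer[Float32], a: UnsafePointer[Float32],
           b: UnsafePointer[Float32], i: Int):
    # One WIDTH-wide load per input, a single SIMD add, one WIDTH-wide store.
    var av = a.load[width=WIDTH](i)
    var bv = b.load[width=WIDTH](i)
    dst.store(i, av + bv)

def main():
    var a = UnsafePointer[Float32].alloc(WIDTH)
    var b = UnsafePointer[Float32].alloc(WIDTH)
    var dst = UnsafePointer[Float32].alloc(WIDTH)
    for i in range(WIDTH):
        a[i] = Float32(i)
        b[i] = Float32(10 * i)
    add_vec(dst, a, b, 0)
    print(dst.load[width=WIDTH](0))  # [0.0, 11.0, 22.0, 33.0]
    a.free()
    b.free()
    dst.free()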

Currently it performs the loads, stores, and additions on each element separately, as this PTX excerpt for compute capability 8.6 shows:

ld.global.nc.b32        %r1, [%rd2+12];
ld.global.nc.b32        %r2, [%rd2+8];
ld.global.nc.b32        %r3, [%rd2+4];
ld.global.nc.b32        %r4, [%rd2];
ld.global.nc.b32        %r5, [%rd4+12];
ld.global.nc.b32        %r6, [%rd4+8];
ld.global.nc.b32        %r7, [%rd4+4];
ld.global.nc.b32        %r8, [%rd4];
add.f32         %r9, %r4, %r8;
add.f32         %r10, %r3, %r7;
add.f32         %r11, %r2, %r6;
add.f32         %r12, %r1, %r5;
st.global.b32   [%rd6+12], %r12;
st.global.b32   [%rd6+8], %r11;
st.global.b32   [%rd6+4], %r10;
st.global.b32   [%rd6], %r9;

I’m trying to understand the internals of Mojo, and since I am not familiar with MLIR I have probably confused myself, but it looks like the LayoutTensor.load method

return self.ptr.load[width=width, alignment = Self.alignment](
    self._offset(m, n)
)

will only ever load DType-width bytes per instruction. This is because the alignment value passed to UnsafePointer.load is Self.alignment, and pop.load

return __mlir_op.`pop.load`[
        alignment = alignment._mlir_value,
        isVolatile = volatile._mlir_value,
        isInvariant = invariant._mlir_value,
    ](address)

is only passed the raw address, i.e. it has no way of knowing how many values to load per instruction apart from alignment._mlir_value, which is always going to be the alignment of the LayoutTensor's DType.

I understand that CUDA requires this alignment and that the solution is to use aligned_load and aligned_store, but I can't understand how load and store could automatically perform this vectorized operation on hardware that supports it. What am I missing here?
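
For comparison, here is a hypothetical UnsafePointer-level sketch of the difference I would expect the alignment hint to make. The function names and the 16-byte figure (4 x sizeof float32) are my own assumptions:

from memory import UnsafePointer

# Both functions request a 4-wide load; the only difference is the alignment
# hint that gets forwarded down to pop.load.
fn load4_elem_aligned(p: UnsafePointer[Float32], i: Int) -> SIMD[DType.float32, 4]:
    # 4-byte (per-element) alignment: the backend can only assume scalar
    # alignment, so this tends to lower to four ld.global.b32 instructions.
    return p.load[width=4, alignment=4](i)

fn load4_vec_aligned(p: UnsafePointer[Float32], i: Int) -> SIMD[DType.float32, 4]:
    # 16-byte (full-vector) alignment: the backend is free to emit a single
    # 128-bit ld.global.v4.b32 for the same four values.
    return p.load[width=4, alignment=16](i)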

Additionally, in the puzzle the add operation is supposed to perform a SIMD operation:

a_simd = a.load[simd_width](idx, 0)
b_simd = b.load[simd_width](idx, 0)
ret = a_simd + b_simd

As far as I know this is not supported in CUDA; which GPU architectures currently support this?

You’re right! It must be aligned_load/aligned_store. Previously, LayoutTensor was missing the alignment, so the closure had no way to know it. That was fixed, but I missed switching the load/store calls over after the alignment was included.
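
In other words, the calls inside the closure become the aligned variants, roughly like this (a sketch only; I'm writing aligned_load/aligned_store with the same index and value arguments as load/store, and output as the destination tensor):

a_simd = a.aligned_load[simd_width](idx, 0)
b_simd = b.aligned_load[simd_width](idx, 0)
ret = a_simd + b_simd
output.aligned_store[simd_width](idx, 0, ret)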

So with those, the PTX looks like:

        ld.global.nc.v4.b32     {%r4, %r5, %r6, %r7}, [%rd24];
        add.s64         %rd25, %rd2, %rd27;
        ld.global.nc.v4.b32     {%r8, %r9, %r10, %r11}, [%rd25];
        add.f32         %r12, %r7, %r11;
        add.f32         %r13, %r6, %r10;
        add.f32         %r14, %r5, %r9;
        add.f32         %r15, %r4, %r8;
        add.s64         %rd26, %rd3, %rd27;
        st.global.v4.b32        [%rd26], {%r15, %r14, %r13, %r12};

As far as I know this is not supported in CUDA; which GPU architectures currently support this?

It does, but it is limited. For example, with alias dtype = bfloat16, SIMD_WIDTH becomes 8, and running

mojo -D DUMP_GPU_ASM=True solutions/p23/p23.mojo --elementwise

we get

        ld.global.nc.v4.b32     {%r4, %r5, %r6, %r7}, [%rd24];
        add.s64         %rd25, %rd2, %rd27;
        ld.global.nc.v4.b32     {%r8, %r9, %r10, %r11}, [%rd25];
        mov.b32         %r12, 1065369472;
        fma.rn.bf16x2   %r13, %r7, %r12, %r11;
        fma.rn.bf16x2   %r14, %r6, %r12, %r10;
        fma.rn.bf16x2   %r15, %r5, %r12, %r9;
        fma.rn.bf16x2   %r16, %r4, %r12, %r8;
        add.s64         %rd26, %rd3, %rd27;
        st.global.v4.b32        [%rd26], {%r16, %r15, %r14, %r13};
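
(The mov.b32 %r12, 1065369472 above is 0x3F803F80, i.e. the bfloat16 value 1.0 packed twice, so each addition is emitted as a packed a * 1.0 + b FMA.)

As a hypothetical standalone illustration of the granularity involved (not puzzle code):

# An 8-wide bfloat16 SIMD add is handled as four packed pairs, matching the
# four fma.rn.bf16x2 instructions above.
fn add_bf16(a: SIMD[DType.bfloat16, 8], b: SIMD[DType.bfloat16, 8]) -> SIMD[DType.bfloat16, 8]:
    return a + b  # lowers to packed bf16x2 math on GPUs that support it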