You’re right, it must be `aligned_load`/`aligned_store`. Previously, `LayoutTensor` was missing the alignment, so the closure had no way to know. That was fixed, but I missed updating the load/store after the alignment was included.
So with those, the PTX looks like:

```
ld.global.nc.v4.b32 {%r4, %r5, %r6, %r7}, [%rd24];
add.s64 %rd25, %rd2, %rd27;
ld.global.nc.v4.b32 {%r8, %r9, %r10, %r11}, [%rd25];
add.f32 %r12, %r7, %r11;
add.f32 %r13, %r6, %r10;
add.f32 %r14, %r5, %r9;
add.f32 %r15, %r4, %r8;
add.s64 %rd26, %rd3, %rd27;
st.global.v4.b32 [%rd26], {%r15, %r14, %r13, %r12};
```
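As a quick sanity check on the vector widths involved: the `ld/st.global.*.v4.b32` forms above move four 32-bit registers, i.e. 16 bytes, per instruction, so the lane count depends only on the element size. A minimal Python sketch (the dtype sizes are the standard 4 bytes for float32 and 2 bytes for bfloat16):

```python
# v4.b32 = four 32-bit registers = 16 bytes per vectorized load/store.
VEC_BYTES = 4 * 4

# SIMD width per 128-bit access for each element size (bytes).
widths = {dtype: VEC_BYTES // size
          for dtype, size in [("float32", 4), ("bfloat16", 2)]}

print(widths)  # {'float32': 4, 'bfloat16': 8}
```

This matches the behavior described below: switching the dtype to bfloat16 doubles the SIMD width to 8 while the PTX still uses the same 128-bit `v4.b32` accesses.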
As far as I know this isn’t supported in CUDA. Which GPU architectures currently support this?
It is supported, but only in a limited way. For example, with `alias dtype = bfloat16` the `SIMD_WIDTH` becomes 8, and running

```
mojo -D DUMP_GPU_ASM=True solutions/p23/p23.mojo --elementwise
```

we get:
```
ld.global.nc.v4.b32 {%r4, %r5, %r6, %r7}, [%rd24];
add.s64 %rd25, %rd2, %rd27;
ld.global.nc.v4.b32 {%r8, %r9, %r10, %r11}, [%rd25];
mov.b32 %r12, 1065369472;
fma.rn.bf16x2 %r13, %r7, %r12, %r11;
fma.rn.bf16x2 %r14, %r6, %r12, %r10;
fma.rn.bf16x2 %r15, %r5, %r12, %r9;
fma.rn.bf16x2 %r16, %r4, %r12, %r8;
add.s64 %rd26, %rd3, %rd27;
st.global.v4.b32 [%rd26], {%r16, %r15, %r14, %r13};
```
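If it helps decode this sequence: the immediate `1065369472` in `mov.b32 %r12` is `0x3F803F80`, i.e. two packed bf16 lanes each equal to 1.0, so `fma.rn.bf16x2 d, a, %r12, c` computes `a * 1.0 + c`, which is just a two-lane bf16 add. A minimal Python sketch, assuming the usual bf16 encoding (the top 16 bits of the corresponding f32):

```python
import struct

# The b32 constant the compiler materializes in `mov.b32 %r12, 1065369472`.
imm = 1065369472  # 0x3F803F80: two packed bf16 lanes
hi, lo = imm >> 16, imm & 0xFFFF

def bf16_to_f32(bits: int) -> float:
    # bf16 is the high half of an f32, so widen by shifting left 16 bits.
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

print(hex(imm), bf16_to_f32(hi), bf16_to_f32(lo))  # 0x3f803f80 1.0 1.0
```

So the bf16 path emits the elementwise add as a multiply-by-one FMA on packed `bf16x2` pairs rather than a plain add instruction.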