Puzzle 23: CUDA SIMD load, store and basic ops

You’re right! It must be aligned_load/store. Previously, LayoutTensor was missing the alignment, so the closure had no idea about it. That was fixed, but I missed updating the load/store calls after adding the alignment.
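For context, the change is confined to the closure body passed to elementwise: swap the plain load/store calls for their aligned counterparts. A minimal sketch of the shape of that change, assuming LayoutTensor operands named output, a and b and the usual (idx, 0) indexing from the puzzle (the signatures here are approximate, not copied from p23.mojo):

    # Inside the @parameter closure handed to elementwise[...]:
    # previously: plain load/store, i.e.
    #   output.store[simd_width](idx, 0, a.load[simd_width](idx, 0) + b.load[simd_width](idx, 0))
    # now: the aligned variants, which is what produces the vectorized ld/st.global.v4 shown below
    output.aligned_store[simd_width](
        idx, 0, a.aligned_load[simd_width](idx, 0) + b.aligned_load[simd_width](idx, 0)
    )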

So with those changes, the PTX looks like this: the loads and stores become 128-bit v4.b32 vector accesses, while the four float32 additions remain separate scalar add.f32 instructions.

        ld.global.nc.v4.b32     {%r4, %r5, %r6, %r7}, [%rd24];
        add.s64         %rd25, %rd2, %rd27;
        ld.global.nc.v4.b32     {%r8, %r9, %r10, %r11}, [%rd25];
        add.f32         %r12, %r7, %r11;
        add.f32         %r13, %r6, %r10;
        add.f32         %r14, %r5, %r9;
        add.f32         %r15, %r4, %r8;
        add.s64         %rd26, %rd3, %rd27;
        st.global.v4.b32        [%rd26], {%r15, %r14, %r13, %r12};

As far as I know, this is not supported in CUDA. Which GPU architectures currently support this?

It does, but support is limited. For example, with alias dtype = bfloat16, SIMD_WIDTH becomes 8, and running

    mojo -D DUMP_GPU_ASM=True solutions/p23/p23.mojo --elementwise

we get:

        ld.global.nc.v4.b32     {%r4, %r5, %r6, %r7}, [%rd24];
        add.s64         %rd25, %rd2, %rd27;
        ld.global.nc.v4.b32     {%r8, %r9, %r10, %r11}, [%rd25];
        mov.b32         %r12, 1065369472;
        fma.rn.bf16x2   %r13, %r7, %r12, %r11;
        fma.rn.bf16x2   %r14, %r6, %r12, %r10;
        fma.rn.bf16x2   %r15, %r5, %r12, %r9;
        fma.rn.bf16x2   %r16, %r4, %r12, %r8;
        add.s64         %rd26, %rd3, %rd27;
        st.global.v4.b32        [%rd26], {%r16, %r15, %r14, %r13};
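A note on that listing: the mov.b32 constant 1065369472 is 0x3F803F80, i.e. two bfloat16 1.0 values packed into one 32-bit register, so each fma.rn.bf16x2 d, a, 1.0, b is just the packed a + b on two bfloat16 lanes, and the eight elements still fit in the same four 32-bit registers, which is why the loads and stores stay v4.b32. As for hardware: the bf16x2 instructions require sm_80 (Ampere) or newer, as far as I know.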