Puzzle 23: CUDA SIMD load, store and basic ops

In the solution to Puzzle 23, the simple add function is supposed to load and store SIMD_WIDTH values at a time and, where the hardware supports it, perform the addition on those values with a single SIMD instruction.
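
To illustrate what I mean, here is a small host-side Mojo sketch of that pattern. This is my own hypothetical example rather than the puzzle code, with WIDTH standing in for SIMD_WIDTH and the function and variable names made up:

from memory import UnsafePointer

alias WIDTH = 4  # stand-in for SIMD_WIDTH with float32

fn add_vec(dst: UnsafePointer[Float32], a: UnsafePointer[Float32],
           b: UnsafePointer[Float32], i: Int):
    # One WIDTH-wide load per input, a single SIMD add, one WIDTH-wide store.
    var av = a.load[width=WIDTH](i)
    var bv = b.load[width=WIDTH](i)
    dst.store(i, av + bv)

def main():
    var a = UnsafePointer[Float32].alloc(WIDTH)
    var b = UnsafePointer[Float32].alloc(WIDTH)
    var dst = UnsafePointer[Float32].alloc(WIDTH)
    for i in range(WIDTH):
        a[i] = Float32(i)
        b[i] = Float32(10 * i)
    add_vec(dst, a, b, 0)
    print(dst.load[width=WIDTH](0))  # [0.0, 11.0, 22.0, 33.0]
    a.free()
    b.free()
    dst.free()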

Currently it performs the loads, stores, and additions on each element separately, as this PTX excerpt for compute capability 8.6 shows:

ld.global.nc.b32        %r1, [%rd2+12];
ld.global.nc.b32        %r2, [%rd2+8];
ld.global.nc.b32        %r3, [%rd2+4];
ld.global.nc.b32        %r4, [%rd2];
ld.global.nc.b32        %r5, [%rd4+12];
ld.global.nc.b32        %r6, [%rd4+8];
ld.global.nc.b32        %r7, [%rd4+4];
ld.global.nc.b32        %r8, [%rd4];
add.f32         %r9, %r4, %r8;
add.f32         %r10, %r3, %r7;
add.f32         %r11, %r2, %r6;
add.f32         %r12, %r1, %r5;
st.global.b32   [%rd6+12], %r12;
st.global.b32   [%rd6+8], %r11;
st.global.b32   [%rd6+4], %r10;
st.global.b32   [%rd6], %r9;

I’m trying to understand the internals of Mojo, and since I am not familiar with MLIR I have probably confused myself, but it looks like the LayoutTensor.load method

return self.ptr.load[width=width, alignment = Self.alignment](
    self._offset(m, n)
)

will only ever load DType-width bytes per instruction. This is because the alignment value passed to UnsafePointer.load is Self.alignment, and pop.load

return __mlir_op.`pop.load`[
        alignment = alignment._mlir_value,
        isVolatile = volatile._mlir_value,
        isInvariant = invariant._mlir_value,
    ](address)

is only passed the raw address, i.e. it has no way of knowing how many values to load per instruction apart from alignment._mlir_value, which is always going to be the alignment of the LayoutTensor's DType.

I understand that CUDA requires this alignment and that the solution is to use aligned_load and aligned_store, but I can't understand how load and store could automatically perform this vectorized operation on hardware that supports it. What am I missing here?
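
For comparison, here is a hypothetical UnsafePointer-level sketch of the difference I would expect the alignment hint to make. The function names and the 16-byte figure (4 x sizeof float32) are my own assumptions:

from memory import UnsafePointer

# Both functions request a 4-wide load; the only difference is the alignment
# hint that gets forwarded down to pop.load.
fn load4_elem_aligned(p: UnsafePointer[Float32], i: Int) -> SIMD[DType.float32, 4]:
    # 4-byte (per-element) alignment: the backend can only assume scalar
    # alignment, so this tends to lower to four ld.global.b32 instructions.
    return p.load[width=4, alignment=4](i)

fn load4_vec_aligned(p: UnsafePointer[Float32], i: Int) -> SIMD[DType.float32, 4]:
    # 16-byte (full-vector) alignment: the backend is free to emit a single
    # 128-bit ld.global.v4.b32 for the same four values.
    return p.load[width=4, alignment=16](i)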

Additionally, in the puzzle the add operation is supposed to perform a SIMD operation:

a_simd = a.load[simd_width](idx, 0)
b_simd = b.load[simd_width](idx, 0)
ret = a_simd + b_simd

As far as I know this is not supported in CUDA; which GPU architectures currently support this?

You’re right! It must be aligned_load/aligned_store. Previously, LayoutTensor was missing the alignment, so the closure had no way to know it. That was fixed, but I missed switching the load/store calls over after the alignment was included.
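
In other words, the calls inside the closure become the aligned variants, roughly like this (a sketch only; I'm writing aligned_load/aligned_store with the same index and value arguments as load/store, and output as the destination tensor):

a_simd = a.aligned_load[simd_width](idx, 0)
b_simd = b.aligned_load[simd_width](idx, 0)
ret = a_simd + b_simd
output.aligned_store[simd_width](idx, 0, ret)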

So with those, the PTX looks like:

        ld.global.nc.v4.b32     {%r4, %r5, %r6, %r7}, [%rd24];
        add.s64         %rd25, %rd2, %rd27;
        ld.global.nc.v4.b32     {%r8, %r9, %r10, %r11}, [%rd25];
        add.f32         %r12, %r7, %r11;
        add.f32         %r13, %r6, %r10;
        add.f32         %r14, %r5, %r9;
        add.f32         %r15, %r4, %r8;
        add.s64         %rd26, %rd3, %rd27;
        st.global.v4.b32        [%rd26], {%r15, %r14, %r13, %r12};

As far as I know this is not supported in CUDA; which GPU architectures currently support this?

It does, but it is limited. For example, with alias dtype = bfloat16, SIMD_WIDTH becomes 8, and running

mojo -D DUMP_GPU_ASM=True solutions/p23/p23.mojo --elementwise

we get

        ld.global.nc.v4.b32     {%r4, %r5, %r6, %r7}, [%rd24];
        add.s64         %rd25, %rd2, %rd27;
        ld.global.nc.v4.b32     {%r8, %r9, %r10, %r11}, [%rd25];
        mov.b32         %r12, 1065369472;
        fma.rn.bf16x2   %r13, %r7, %r12, %r11;
        fma.rn.bf16x2   %r14, %r6, %r12, %r10;
        fma.rn.bf16x2   %r15, %r5, %r12, %r9;
        fma.rn.bf16x2   %r16, %r4, %r12, %r8;
        add.s64         %rd26, %rd3, %rd27;
        st.global.v4.b32        [%rd26], {%r16, %r15, %r14, %r13};
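
(The mov.b32 %r12, 1065369472 above is 0x3F803F80, i.e. the bfloat16 value 1.0 packed twice, so each addition is emitted as a packed a * 1.0 + b FMA.)

As a hypothetical standalone illustration of the granularity involved (not puzzle code):

# An 8-wide bfloat16 SIMD add is handled as four packed pairs, matching the
# four fma.rn.bf16x2 instructions above.
fn add_bf16(a: SIMD[DType.bfloat16, 8], b: SIMD[DType.bfloat16, 8]) -> SIMD[DType.bfloat16, 8]:
    return a + b  # lowers to packed bf16x2 math on GPUs that support it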