You’re right, it must be `aligned_load`/`aligned_store`. Previously, `LayoutTensor` was missing the alignment, so the closure had no way to know. That was fixed, but I missed updating the load/store after the alignment was included.
So with those, the PTX looks like:

```
ld.global.nc.v4.b32 {%r4, %r5, %r6, %r7}, [%rd24];
add.s64 %rd25, %rd2, %rd27;
ld.global.nc.v4.b32 {%r8, %r9, %r10, %r11}, [%rd25];
add.f32 %r12, %r7, %r11;
add.f32 %r13, %r6, %r10;
add.f32 %r14, %r5, %r9;
add.f32 %r15, %r4, %r8;
add.s64 %rd26, %rd3, %rd27;
st.global.v4.b32 [%rd26], {%r15, %r14, %r13, %r12};
```
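As a quick sanity check on the vector widths involved: the `ld/st.global.*.v4.b32` forms above move four 32-bit registers, i.e. 16 bytes, per instruction, so the lane count depends only on the element size. A minimal Python sketch (the dtype sizes are the standard 4 bytes for float32 and 2 bytes for bfloat16):

```python
# v4.b32 = four 32-bit registers = 16 bytes per vectorized load/store.
VEC_BYTES = 4 * 4

# SIMD width per 128-bit access for each element size (bytes).
widths = {dtype: VEC_BYTES // size
          for dtype, size in [("float32", 4), ("bfloat16", 2)]}

print(widths)  # {'float32': 4, 'bfloat16': 8}
```

This matches the behavior described below: switching the dtype to bfloat16 doubles the SIMD width to 8 while the PTX still uses the same 128-bit `v4.b32` accesses.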
As far as I know this isn’t supported in CUDA. Which GPU architectures currently support this?
It is supported, but only in a limited way. For example, with `alias dtype = bfloat16` the `SIMD_WIDTH` becomes 8, and running

```
mojo -D DUMP_GPU_ASM=True solutions/p23/p23.mojo --elementwise
```

we get:
```
ld.global.nc.v4.b32 {%r4, %r5, %r6, %r7}, [%rd24];
add.s64 %rd25, %rd2, %rd27;
ld.global.nc.v4.b32 {%r8, %r9, %r10, %r11}, [%rd25];
mov.b32 %r12, 1065369472;
fma.rn.bf16x2 %r13, %r7, %r12, %r11;
fma.rn.bf16x2 %r14, %r6, %r12, %r10;
fma.rn.bf16x2 %r15, %r5, %r12, %r9;
fma.rn.bf16x2 %r16, %r4, %r12, %r8;
add.s64 %rd26, %rd3, %rd27;
st.global.v4.b32 [%rd26], {%r16, %r15, %r14, %r13};
```
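If it helps decode this sequence: the immediate `1065369472` in `mov.b32 %r12` is `0x3F803F80`, i.e. two packed bf16 lanes each equal to 1.0, so `fma.rn.bf16x2 d, a, %r12, c` computes `a * 1.0 + c`, which is just a two-lane bf16 add. A minimal Python sketch, assuming the usual bf16 encoding (the top 16 bits of the corresponding f32):

```python
import struct

# The b32 constant the compiler materializes in `mov.b32 %r12, 1065369472`.
imm = 1065369472  # 0x3F803F80: two packed bf16 lanes
hi, lo = imm >> 16, imm & 0xFFFF

def bf16_to_f32(bits: int) -> float:
    # bf16 is the high half of an f32, so widen by shifting left 16 bits.
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

print(hex(imm), bf16_to_f32(hi), bf16_to_f32(lo))  # 0x3f803f80 1.0 1.0
```

So the bf16 path emits the elementwise add as a multiply-by-one FMA on packed `bf16x2` pairs rather than a plain add instruction.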