I have ported a CUDA kernel to Mojo and verified its correctness. The performance was worse, so I ran an Nsight Compute analysis, which reported the following for the code lines shown below:
"The memory access pattern for global loads from L2 might not be optimal. On average, only 4.0 of the 32 bytes transmitted per sector are utilized by each thread."
The CUDA code is doing a 128-bit SIMD load as follows:
reinterpret_cast<vec_t*>(x_vals_load)[c] = reinterpret_cast<const vec_t*>(x)[c * blockDim.x + threadIdx.x];
where x_vals_load is a thread-local float array, x is the pointer to global memory, and vec_t is a struct of 4 floats.
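For context, here is a self-contained sketch of that CUDA pattern. Everything except the load line itself (the vec_t definition, the kernel signature, the size of x_vals_load, the loop bounds, and the final reduction) is my reconstruction for illustration, not the original kernel:

// Sketch of the CUDA pattern above. vec_t is 16 bytes and 16-byte aligned,
// so each global load through it compiles to a single LDG.E.128.
struct alignas(16) vec_t {
    float x, y, z, w;
};

__global__ void vectorized_load_sketch(const float* __restrict__ x,
                                       float* __restrict__ out) {
    float x_vals_load[8];  // thread-local staging array (size chosen for the sketch)

    #pragma unroll
    for (int c = 0; c < 2; ++c) {
        // 128-bit load: the global index is in vec_t (16-byte) units.
        reinterpret_cast<vec_t*>(x_vals_load)[c] =
            reinterpret_cast<const vec_t*>(x)[c * blockDim.x + threadIdx.x];
    }

    // Consume the values so the loads are not optimized away.
    float acc = 0.0f;
    for (int i = 0; i < 8; ++i)
        acc += x_vals_load[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}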
In Mojo I translated it like this:
vec = x.load[width=4](batch_id, base_idx)
where x is a 2D LayoutTensor.
Looking at the generated SASS, I discovered that in Mojo we get four separate 32-bit loads:
LDG.E R26, [R4.64+0xc]
LDG.E R27, [R4.64+0x8]
LDG.E R32, [R4.64+0x4]
LDG.E R33, [R4.64]
but in CUDA we get:
@!P0 LDG.E.128 R72, [R2.64]
I have tried many different approaches but have not been able to get Mojo to emit these 128-bit loads. I have tested on an RTX 3090 and an H200.
here is a minimal reproduction:
from gpu import thread_idx
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

alias SIZE = 64
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn test_vectorized_loads(
    output: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    input: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    fn kernel():
        if thread_idx.x == 0:
            # 128-bit load: 4 x float32
            vec4 = input.load[4](0, 0)
            output.store[4](0, 0, vec4 * 2.0)
            # 64-bit load: 2 x float32
            vec2 = input.load[2](16, 0)
            output.store[2](16, 0, vec2 * 3.0)

    ctx.enqueue_function[kernel](grid_dim=(1, 1), block_dim=(1, 1))


def main():
    ctx = DeviceContext()

    input_buf = ctx.enqueue_create_buffer[dtype](SIZE)
    output_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)

    with input_buf.map_to_host() as input_host:
        for i in range(SIZE):
            input_host[i] = Float32(i + 1)

    input_tensor = LayoutTensor[mut=False, dtype, layout](input_buf.unsafe_ptr())
    output_tensor = LayoutTensor[mut=True, dtype, layout](output_buf.unsafe_ptr())

    test_vectorized_loads(output_tensor, input_tensor, ctx)
    ctx.synchronize()
and then we can find the 32-bit loads and stores with: mojo build --emit=llvm --debug-level=line-tables vectorization_repro.mojo
and grep -oP "ld.global.\S*" vectorization_repro.ll
I assume I am making some mistake, or is this not yet supported by Mojo?
Many thanks for any help!
Cheers,
Johannes