I have ported a CUDA kernel to Mojo and verified its correctness. The performance was worse, so I ran an Nsight Compute analysis, which reported the following for the code lines shown below:
"The memory access pattern for global loads from L2 might not be optimal. On average, only 4.0 of the 32 bytes transmitted per sector are utilized by each thread."
The CUDA code is doing a 128-bit SIMD load as follows:
reinterpret_cast<vec_t*>(x_vals_load)[c] = reinterpret_cast<const vec_t*>(x)[c * blockDim.x + threadIdx.x];
where x_vals_load is a thread-local float array, x is the pointer to global memory, and vec_t is a struct of 4 floats.
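For context, here is a self-contained sketch of that CUDA pattern. Everything except the load line itself (the vec_t definition, the kernel signature, the size of x_vals_load, the loop bounds, and the final reduction) is my reconstruction for illustration, not the original kernel:

// Sketch of the CUDA pattern above. vec_t is 16 bytes and 16-byte aligned,
// so each global load through it compiles to a single LDG.E.128.
struct alignas(16) vec_t {
    float x, y, z, w;
};

__global__ void vectorized_load_sketch(const float* __restrict__ x,
                                       float* __restrict__ out) {
    float x_vals_load[8];  // thread-local staging array (size chosen for the sketch)

    #pragma unroll
    for (int c = 0; c < 2; ++c) {
        // 128-bit load: the global index is in vec_t (16-byte) units.
        reinterpret_cast<vec_t*>(x_vals_load)[c] =
            reinterpret_cast<const vec_t*>(x)[c * blockDim.x + threadIdx.x];
    }

    // Consume the values so the loads are not optimized away.
    float acc = 0.0f;
    for (int i = 0; i < 8; ++i)
        acc += x_vals_load[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}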
In Mojo I translated it like this:
vec = x.load[width=4](batch_id, base_idx)
where x is a 2D LayoutTensor.
Looking at the generated SASS, I discovered that in Mojo we get four separate 32-bit loads:
LDG.E R26, [R4.64+0xc]
LDG.E R27, [R4.64+0x8]
LDG.E R32, [R4.64+0x4]
LDG.E R33, [R4.64]
but in CUDA we get:
@!P0 LDG.E.128 R72, [R2.64]
I have tried many different approaches but have not been able to get Mojo to emit these 128-bit loads. I have tested on an RTX 3090 and an H200.
here is a minimal reproduction:
from gpu import thread_idx
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

alias SIZE = 64
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn test_vectorized_loads(
    output: LayoutTensor[mut=True, dtype, layout, MutableAnyOrigin],
    input: LayoutTensor[mut=False, dtype, layout, MutableAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    fn kernel():
        if thread_idx.x == 0:
            # 128-bit load: 4 x float32
            vec4 = input.load[4](0, 0)
            output.store[4](0, 0, vec4 * 2.0)
            # 64-bit load: 2 x float32
            vec2 = input.load[2](16, 0)
            output.store[2](16, 0, vec2 * 3.0)

    ctx.enqueue_function[kernel](grid_dim=(1, 1), block_dim=(1, 1))


def main():
    ctx = DeviceContext()

    input_buf = ctx.enqueue_create_buffer[dtype](SIZE)
    output_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)

    with input_buf.map_to_host() as input_host:
        for i in range(SIZE):
            input_host[i] = Float32(i + 1)

    input_tensor = LayoutTensor[mut=False, dtype, layout](input_buf.unsafe_ptr())
    output_tensor = LayoutTensor[mut=True, dtype, layout](output_buf.unsafe_ptr())

    test_vectorized_loads(output_tensor, input_tensor, ctx)
    ctx.synchronize()
and then we can find the 32-bit loads and stores with: mojo build --emit=llvm --debug-level=line-tables vectorization_repro.mojo
and grep -oP "ld.global.\S*" vectorization_repro.ll
I assume I am making some mistake, or is this not yet supported by Mojo?
Many thanks for any help!
Cheers,
Johannes