`prefix_sum` incorrect results with `gpu.warp.prefix_sum` and `gpu.block.prefix_sum`

I stumbled upon the standard library’s prefix_sum and tried launching a very simple kernel that dispatches a single block of 32 threads on an NVIDIA V100 (so the block is exactly one warp, since the warp size is also 32) to compute the prefix sum over that single warp. Unfortunately, I’m getting incorrect results and am trying to figure out why.

I looked deeper into the warp.prefix_sum code and printed some intermediate values from it. My suspicion is that something odd is happening in shuffle_up, but I might not understand it well enough. I’m using the latest Mojo version (mojo 25.4.0.dev2025050605 (b840f403)).

Can someone help me figure out what I’m doing wrong?

Code and debug printing logs are attached in the Gist:


Update: it looks like shuffle_up is correct, but the stdlib’s gpu.warp.prefix_sum is not. The PR “[stdlib] Fix the warp prefix sum algorithm on gpu” (Pull Request #4508 in modular/modular, by kirillbobyrev) should fix the issue.
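
For anyone who wants to check their own kernel output, here is a minimal CPU sketch in Python of a warp-level inclusive scan built on a shuffle-up step. It is not the stdlib implementation and not the code from the PR. It is the textbook Hillis-Steele scheme, and the helper names (`shuffle_up`, `warp_inclusive_scan`, `warp_exclusive_scan`) are hypothetical. It assumes CUDA-style shuffle-up semantics, where lane i reads the value held by lane i - offset and lanes with i < offset keep their own value; whether Mojo’s shuffle_up handles the out-of-range lanes the same way is an assumption here.

```python
from itertools import accumulate

WARP_SIZE = 32  # warp size on NVIDIA GPUs


def shuffle_up(values, offset):
    """CPU model of a warp shuffle-up (assumed CUDA-style semantics).

    Lane i receives the value held by lane i - offset. Lanes with
    i < offset have no source lane and keep their own value.
    """
    return [values[i - offset] if i >= offset else values[i]
            for i in range(len(values))]


def warp_inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum over one simulated warp."""
    vals = list(values)
    offset = 1
    while offset < len(vals):
        shifted = shuffle_up(vals, offset)
        # Only lanes that actually received a neighbour's value accumulate;
        # adding unconditionally would double-count on the low lanes.
        vals = [v + s if lane >= offset else v
                for lane, (v, s) in enumerate(zip(vals, shifted))]
        offset *= 2
    return vals


def warp_exclusive_scan(values):
    """Exclusive variant: shift the inclusive result right by one lane."""
    inclusive = warp_inclusive_scan(values)
    return [0] + inclusive[:-1]


if __name__ == "__main__":
    lanes = list(range(WARP_SIZE))
    # Compare against a plain sequential running sum.
    assert warp_inclusive_scan(lanes) == list(accumulate(lanes))
    print(warp_inclusive_scan([1] * WARP_SIZE))
```

With an input of all ones, lane i of the inclusive result should hold i + 1, which makes a wrong GPU result easy to spot when comparing lane by lane.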

Huge thanks for the PR to fix this!

Random question: you mentioned over on the GPU MODE Discord that you were running this on V100, did you encounter any issues building Mojo code for that GPU? We only recently were able to lower the floor for GPU support to Turing (sm_75), so I’m surprised that this worked for you on Volta (sm_70). Did you have to hack anything in your Mojo standard library to get that to work for you?

Oh, you’re right, apologies for the confusion. I meant A100 (I’ve run on A10 and A100, and both were fine). I tried running on V100, but ran into other problems (not Mojo/Modular-related).
