I’ve stumbled upon the standard library prefix_sum and tried to launch a very simple kernel that dispatches a single block of size 32 on an NVIDIA V100 (so the block is exactly one warp, since the warp size is also 32) to compute the prefix_sum over that single warp. Unfortunately, I’m getting incorrect results and am trying to figure out why.
I’ve looked deeper into the warp.prefix_sum code and printed some values within it. My suspicion is that something odd is happening in shuffle_up, but I might not understand it well enough. I’m using the latest Mojo version (mojo 25.4.0.dev2025050605 (b840f403)).
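For reference, this is roughly what I would expect a shuffle-up based warp scan to compute, modeled on the CPU. It is only a Python sketch of the usual Hillis-Steele pattern, not the Mojo standard library implementation: the helpers shuffle_up, warp_inclusive_scan and warp_exclusive_scan below are hypothetical names of my own, and I'm assuming that lanes below the shuffle offset keep their own value (as with CUDA's __shfl_up_sync).

```python
WARP_SIZE = 32  # assumption: one warp of 32 lanes, as on the V100


def shuffle_up(values, delta):
    """Model of a warp shuffle-up: lane i reads the value held by lane i - delta.

    Assumption: lanes with i < delta have no source lane and keep their own value.
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def warp_inclusive_scan(values):
    """Hillis-Steele inclusive prefix sum over one warp, in log2(n) shuffle steps."""
    vals = list(values)
    offset = 1
    while offset < len(vals):
        shifted = shuffle_up(vals, offset)
        # Only lanes that actually received a value from a lower lane add it.
        vals = [v + s if lane >= offset else v
                for lane, (v, s) in enumerate(zip(vals, shifted))]
        offset *= 2
    return vals


def warp_exclusive_scan(values):
    """Exclusive variant: lane i gets the sum of lanes 0..i-1, lane 0 gets 0."""
    inclusive = warp_inclusive_scan(values)
    return [0] + inclusive[:-1]


if __name__ == "__main__":
    lanes = list(range(WARP_SIZE))
    print(warp_inclusive_scan(lanes))
    print(warp_exclusive_scan(lanes))
```

Comparing the per-step values from a model like this against the debug prints from the kernel should show at which shuffle offset the GPU result first diverges.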
Can someone help me figure out what I’m doing wrong?
Code and debug printing logs are attached in the Gist: