`prefix_sum` incorrect results with `gpu.warp.prefix_sum` and `gpu.block.prefix_sum`

kirillbobyrev · May 7, 2025, 2:54am

I stumbled upon the standard library `prefix_sum` and tried launching a very simple kernel that dispatches a single block of 32 threads on an NVIDIA V100 (so the block is exactly one warp of 32 lanes) and computes the prefix sum over that single warp. Unfortunately, I’m getting incorrect results and am trying to figure out why.

I’ve looked deeper into the `warp.prefix_sum` code and printed some values inside it. My suspicion is that something odd is happening in `shuffle_up`, but I might not understand it well enough. I’m using the latest Mojo version (mojo 25.4.0.dev2025050605 (b840f403)).

Can someone help me figure out what I’m doing wrong?
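For reference, here is what I would expect for the 32-element input, written as a plain CPU computation in Python rather than Mojo. It is only a sketch of the expected values: the names `values`, `inclusive`, and `exclusive` are mine, and which of the two variants a given `prefix_sum` call returns depends on how it is configured.

```python
from itertools import accumulate

# Same input as in the debug logs: lane i holds the value i.
values = list(range(32))

# Inclusive scan: element i is the sum of values[0..i].
inclusive = list(accumulate(values))

# Exclusive scan: element i is the sum of values[0..i-1], starting from 0.
exclusive = [0] + inclusive[:-1]

print(inclusive[:8])   # [0, 1, 3, 6, 10, 15, 21, 28]
print(exclusive[:8])   # [0, 0, 1, 3, 6, 10, 15, 21]
print(inclusive[-1])   # 496
```

Any lane whose output differs from these lists points at the scan itself rather than at the input setup.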

Code and debug printing logs are attached in the Gist:


https://gist.github.com/kirillbobyrev/c7a7b959dead64a60d6f3f630985bd7a

debug_logs.txt

input elements:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
Initial value= 0 id= 0
Initial value= 1 id= 1
Initial value= 2 id= 2
Initial value= 3 id= 3
Initial value= 4 id= 4
Initial value= 5 id= 5
Initial value= 6 id= 6
Initial value= 7 id= 7

(File truncated; the full version is in the Gist linked above.)

prefix_sum.mojo

from gpu.globals import WARP_SIZE
from math import ceildiv
from sys import env_get_int
from memory import memset_zero
from benchmark import Bench, Bencher, BenchId, BenchMetric, ThroughputMeasure
from builtin._closure import __ownership_keepalive
from gpu import *
from gpu.host import DeviceContext
from testing import assert_equal
from random import randint

(File truncated; the full version is in the Gist linked above.)

kirillbobyrev · May 7, 2025, 5:02am

Update: it looks like `shuffle_up` is correct, but the stdlib’s `gpu.warp.prefix_sum` is not. Pull Request #4508 in modular/modular, "[stdlib] Fix the warp prefix sum algorithm on gpu", should fix the issue.
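To make the distinction concrete, here is a small CPU model in Python of how a shuffle-based warp scan is commonly structured (a Hillis-Steele loop with doubling offsets). It is not the stdlib implementation and not the code from the PR. The helper names are hypothetical, and the out-of-range behaviour (a lane with no source lane gets its own value back) is an assumption modelled on CUDA's shuffle-up.

```python
from itertools import accumulate

WARP_SIZE = 32  # assumption: NVIDIA warp width, as in the original post


def shuffle_up(values, delta):
    """Model of a warp shuffle-up: lane i reads the value held by lane i - delta.

    Lanes with i < delta have no source lane and keep their own value.
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def warp_inclusive_scan(values):
    """Hillis-Steele inclusive scan built only from shuffle_up and adds."""
    lanes = list(values)
    offset = 1
    while offset < len(lanes):
        shifted = shuffle_up(lanes, offset)
        # Only lanes that actually received a neighbour's value accumulate it.
        lanes = [lanes[i] + shifted[i] if i >= offset else lanes[i]
                 for i in range(len(lanes))]
        offset *= 2
    return lanes


result = warp_inclusive_scan(range(WARP_SIZE))
assert result == list(accumulate(range(WARP_SIZE)))
print(result[:8], result[-1])  # [0, 1, 3, 6, 10, 15, 21, 28] 496
```

In this model the shuffle only moves data. Whether the final result is right depends on the offsets used and on the `i >= offset` guard, which is why a correct `shuffle_up` and an incorrect scan can coexist.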

BradLarson · May 7, 2025, 2:56pm

Huge thanks for the PR to fix this!

Random question: you mentioned over on the GPU MODE Discord that you were running this on a V100. Did you encounter any issues building Mojo code for that GPU? We were only recently able to lower the floor for GPU support to Turing (sm_75), so I’m surprised that this worked for you on Volta (sm_70). Did you have to hack anything in your Mojo standard library to get that to work?

kirillbobyrev · May 7, 2025, 4:09pm

Oh, you’re right, apologies for the confusion. I meant A100 (I’ve run it on A10 and A100, and both were fine). I tried running on a V100, but ran into other problems (not Mojo/Modular-related).
