Puzzle 32 shared memory data race

BeastBritish · May 12, 2026, 10:11am

In Puzzle 32: Bank Conflicts, the two-way conflict kernel doesn’t always pass the test:

$ mojo problems/p32/p32.mojo --test
Testing bank conflict kernels...
No-conflict kernel test: passed
Two-way conflict kernel test: passed
Puzzle 32 complete
Now profile with NSight Compute to see performance differences!

$ mojo problems/p32/p32.mojo --test
Testing bank conflict kernels...
No-conflict kernel test: passed
Unhandled exception caught during execution: At ./problems/p32/p32.mojo:246:36: AssertionError: 278.0 is not close to 22.0 with a diff of 256.0
mojo: error: execution exited with a non-zero result: 1

The racecheck tool reports a shared memory data race when loading data with bank conflicts:

$ compute-sanitizer --tool racecheck ./problems/p32/p32_profiler --two-way

========= COMPUTE-SANITIZER
Testing bank conflict kernels...
No-conflict kernel test: passed
========= Error: Race reported between Write access at two_way_conflict_kernel+0x500 in p32.mojo:90
=========     and Write access at two_way_conflict_kernel+0x500 in p32.mojo:90 [4298 hazards]

Doubling the size of the shared memory avoids the data race.

Alternatively, interleaving shared memory writes from the two halves of the thread block also solves the issue:

    # CONFLICT: stride-2 access creates 2-way bank conflicts
    # var conflict_index = (local_i * 2) % TPB
    var conflict_index: Int
    if local_i < (TPB / 2):
        conflict_index = (local_i * 2) % TPB
    else:
        conflict_index = (local_i * 2 + 1) % TPB
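
The interleaved mapping can be sanity-checked on the host. The snippet below is a minimal Python model of the index arithmetic only, not the Mojo kernel; `interleaved_index` is a hypothetical helper name, and TPB = 256 is taken from the thread.

```python
TPB = 256  # threads per block, as used in the puzzle


def interleaved_index(local_i: int, tpb: int = TPB) -> int:
    """Host-side model of the interleaved shared-memory index (hypothetical helper)."""
    if local_i < tpb // 2:
        # First half of the block writes the even slots 0, 2, ..., tpb - 2.
        return (local_i * 2) % tpb
    # Second half wraps around and is shifted by one, landing on the odd slots.
    return (local_i * 2 + 1) % tpb


indices = [interleaved_index(i) for i in range(TPB)]

# Every slot in [0, TPB) is written by exactly one thread, so no two writes collide.
is_bijection = sorted(indices) == list(range(TPB))
print("one writer per slot:", is_bijection)
```

Within each half of the block the stride is still 2, so the access pattern the puzzle wants to demonstrate should be preserved while each slot has a single writer.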

Looking at the original code, I don’t quite understand why it sometimes produces correct results at all, given the block-level synchronization with barrier() after loading data into shared memory.

dunnoconnor · May 12, 2026, 7:50pm

Thanks for the write-up @BeastBritish! You’re correct: this is a write-write race, not a bank conflict.

The root cause: with TPB = 256, conflict_index = (local_i * 2) % TPB makes threads i and i + 128 map to the same shared-memory address, so they write different values to the same slot. barrier() only orders the write phase before the read phase — it doesn’t decide which of the two racing writers wins. That’s why results are non-deterministic.
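
To make the collision concrete, here is a small host-side Python model of the original indexing (a sketch of the arithmetic only, not the Mojo kernel; `writers_per_slot` is a hypothetical helper name). It lists which threads write each shared-memory slot.

```python
from collections import defaultdict

TPB = 256  # threads per block, as stated above


def writers_per_slot(tpb: int) -> dict[int, list[int]]:
    """Map each shared-memory slot to the thread indices that write it."""
    slots = defaultdict(list)
    for local_i in range(tpb):
        conflict_index = (local_i * 2) % tpb  # indexing from the original kernel
        slots[conflict_index].append(local_i)
    return dict(slots)


slots = writers_per_slot(TPB)
contested = {slot: writers for slot, writers in slots.items() if len(writers) > 1}

print(len(slots), "slots written,", len(contested), "contested")
print("slot 0 writers:", slots[0])  # threads 0 and 128
```

Only the 128 even slots are ever written, and each one has two writers, i and i + 128. Whichever write lands last determines the value that is read back after the barrier.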

The fix going in is your first option: bump the buffer to 2 * TPB and use local_i * 2 without the modulo. I implemented this and tested it on a compatible GPU, so the fix will go out with the next nightly release.
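
As a rough host-side check of that fix, the Python sketch below models the new indexing. It is not the actual patch. It assumes a warp size of 32 and 32 shared-memory banks with one 4-byte element per bank word, which is typical for NVIDIA GPUs but is not stated in this thread.

```python
from collections import Counter

TPB = 256
WARP_SIZE = 32  # assumption: NVIDIA warp size
NUM_BANKS = 32  # assumption: 32 banks, one 4-byte element per bank word

# Fixed indexing: buffer of 2 * TPB elements, stride 2, no modulo.
indices = [local_i * 2 for local_i in range(TPB)]

unique = len(set(indices)) == TPB  # one writer per slot, so no race
fits = max(indices) < 2 * TPB      # every index stays inside the enlarged buffer

# Count how many threads of the first warp hit each bank.
first_warp_banks = Counter(idx % NUM_BANKS for idx in indices[:WARP_SIZE])
max_threads_per_bank = max(first_warp_banks.values())

print("unique:", unique, "fits:", fits, "threads per bank:", max_threads_per_bank)
```

Under these assumptions every slot has a single writer, while each warp still touches only the 16 even banks with two threads per bank, so the intended two-way bank conflict remains.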
