Puzzle 32 shared memory data race

In Puzzle 32: Bank Conflicts, the two-way conflict kernel doesn’t always pass the test:

$ mojo problems/p32/p32.mojo --test
Testing bank conflict kernels...
No-conflict kernel test: passed
Two-way conflict kernel test: passed
Puzzle 32 complete
Now profile with NSight Compute to see performance differences!
$ mojo problems/p32/p32.mojo --test
Testing bank conflict kernels...
No-conflict kernel test: passed
Unhandled exception caught during execution: At ./problems/p32/p32.mojo:246:36: AssertionError: 278.0 is not close to 22.0 with a diff of 256.0
mojo: error: execution exited with a non-zero result: 1

The racecheck tool reports a shared memory data race when loading data with bank conflicts:

$ compute-sanitizer --tool racecheck ./problems/p32/p32_profiler --two-way

========= COMPUTE-SANITIZER
Testing bank conflict kernels...
No-conflict kernel test: passed
========= Error: Race reported between Write access at two_way_conflict_kernel+0x500 in p32.mojo:90
=========     and Write access at two_way_conflict_kernel+0x500 in p32.mojo:90 [4298 hazards]

Doubling the size of the shared memory buffer avoids the data race.

Alternatively, interleaving shared memory writes from the two halves of the thread block also solves the issue:

    # CONFLICT: stride-2 access creates 2-way bank conflicts
    # var conflict_index = (local_i * 2) % TPB
    var conflict_index: Int
    if local_i < (TPB / 2):
        conflict_index = (local_i * 2) % TPB
    else:
        conflict_index = (local_i * 2 + 1) % TPB
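To check that the interleaved mapping gives every thread its own slot, the index arithmetic can be modeled on the host. This is a minimal Python sketch, not the puzzle's Mojo kernel: it assumes TPB = 256, and `interleaved_index` is a hypothetical helper name used only for illustration.

```python
TPB = 256  # threads per block (assumed, matching the puzzle's configuration)


def interleaved_index(local_i, tpb=TPB):
    """Host-side model of the interleaved shared-memory index."""
    if local_i < tpb // 2:
        # Lower half of the block writes to even slots 0, 2, ..., tpb - 2
        return (local_i * 2) % tpb
    # Upper half of the block writes to odd slots 1, 3, ..., tpb - 1
    return (local_i * 2 + 1) % tpb


slots = [interleaved_index(i) for i in range(TPB)]

# Every slot in [0, TPB) is written by exactly one thread.
print(sorted(slots) == list(range(TPB)))
```

Because the mapping is one-to-one, no two threads write to the same address, so there is nothing left for racecheck to flag.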

Looking at the original code, I don’t quite understand why it produces correct results even some of the time, given the block-level synchronization with barrier() after the data is loaded into shared memory.

Thanks for the write-up @BeastBritish! You’re right: this was a write-write race, not a bank conflict.

The root cause: with TPB = 256, conflict_index = (local_i * 2) % TPB maps threads i and i + 128 (for i from 0 to 127) to the same shared-memory address, so the two threads write different values to the same slot. barrier() only orders the write phase before the read phase; it doesn’t decide which of the two racing writers wins. That’s why the results are non-deterministic: the test presumably passes only on runs where the write it expects happens to land last.
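The collision can be reproduced without a GPU by enumerating the indices. The following is a small Python sketch of the arithmetic only (it is not the kernel itself); `colliding_slots` is a hypothetical helper name.

```python
from collections import defaultdict

TPB = 256  # threads per block, as stated above


def colliding_slots(tpb):
    """Group thread indices by the shared-memory slot they write to and
    keep only the slots that have more than one writer."""
    writers = defaultdict(list)
    for local_i in range(tpb):
        writers[(local_i * 2) % tpb].append(local_i)
    return {slot: ts for slot, ts in writers.items() if len(ts) > 1}


collisions = colliding_slots(TPB)
print(len(collisions))  # number of slots with two writers
print(collisions[0])    # the pair of threads writing to slot 0
```

With TPB = 256 this reports 128 contested slots, each written by a pair of threads whose indices differ by 128, which matches the write-write hazards that racecheck reports on the same source line.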

The fix going in is your first option: grow the shared buffer to 2 * TPB and index it with local_i * 2, without the modulo. I implemented and tested this on a compatible GPU, so the fix will go out with the next nightly release.
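As a sanity check on that fix, here is a hedged Python model of the new indexing. It assumes TPB = 256 and, for the bank calculation, 32 banks with one 4-byte element per bank word; both the bank count and the element size are assumptions, not something stated in this thread.

```python
from collections import Counter

TPB = 256              # threads per block
SHARED_SIZE = 2 * TPB  # doubled shared-memory buffer
NUM_BANKS = 32         # assumption: 32 banks, one 4-byte element per bank word
WARP_SIZE = 32         # assumption: 32 threads per warp

# New indexing: stride 2, no modulo
indices = [local_i * 2 for local_i in range(TPB)]

# Every thread gets its own slot, and all slots fit in the doubled buffer.
print(len(set(indices)) == TPB and max(indices) < SHARED_SIZE)

# Within the first warp, each bank that is touched is hit by two threads,
# so the stride-2 access still produces 2-way bank conflicts.
bank_hits = Counter(idx % NUM_BANKS for idx in indices[:WARP_SIZE])
print(set(bank_hits.values()))
```

Under these assumptions the race goes away while the 2-way bank conflict that the puzzle is meant to demonstrate is preserved.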