In Puzzle 32: Bank Conflicts, the two-way conflict kernel fails the test intermittently. Two consecutive runs of the same command, the first passing and the second failing:
$ mojo problems/p32/p32.mojo --test
Testing bank conflict kernels...
No-conflict kernel test: passed
Two-way conflict kernel test: passed
Puzzle 32 complete
Now profile with NSight Compute to see performance differences!
$ mojo problems/p32/p32.mojo --test
Testing bank conflict kernels...
No-conflict kernel test: passed
Unhandled exception caught during execution: At ./problems/p32/p32.mojo:246:36: AssertionError: 278.0 is not close to 22.0 with a diff of 256.0
mojo: error: execution exited with a non-zero result: 1
The racecheck tool reports a shared memory data race when loading data with bank conflicts:
$ compute-sanitizer --tool racecheck ./problems/p32/p32_profiler --two-way
========= COMPUTE-SANITIZER
Testing bank conflict kernels...
No-conflict kernel test: passed
========= Error: Race reported between Write access at two_way_conflict_kernel+0x500 in p32.mojo:90
========= and Write access at two_way_conflict_kernel+0x500 in p32.mojo:90 [4298 hazards]
Doubling the size of the shared memory avoids the data race.
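To make the collision concrete, here is a small host-side Python model of the index arithmetic (not the Mojo kernel itself). It assumes TPB = 256, which the report does not state, and it assumes the doubled buffer also wraps at twice the original size. The helper names are made up for illustration.

```python
from collections import Counter

TPB = 256  # assumed threads per block; not stated in the report


def slot_original(local_i: int, buf_size: int) -> int:
    """Stride-2 index from the puzzle, wrapped to the buffer size."""
    return (local_i * 2) % buf_size


def writers_per_slot(buf_size: int) -> Counter:
    """Count how many threads of one block write to each shared-memory slot."""
    return Counter(slot_original(i, buf_size) for i in range(TPB))


same = writers_per_slot(TPB)         # buffer of TPB elements (original)
doubled = writers_per_slot(2 * TPB)  # buffer of 2 * TPB elements (assumed fix)

print(max(same.values()), len(same))        # 2 128
print(max(doubled.values()), len(doubled))  # 1 256
```

With a buffer of TPB elements, threads local_i and local_i + TPB/2 land on the same slot, so every used slot has two writers and half of the buffer is never written. With twice the space, each thread gets its own slot.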
Alternatively, interleaving shared memory writes from the two halves of the thread block also solves the issue:
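# CONFLICT: stride-2 access creates 2-way bank conflicts
# Original: var conflict_index = (local_i * 2) % TPB
# (threads local_i and local_i + TPB/2 write to the same slot)
var conflict_index: Int
if local_i < (TPB // 2):
    # First half of the block writes to the even slots
    conflict_index = (local_i * 2) % TPB
else:
    # Second half of the block writes to the odd slots
    conflict_index = (local_i * 2 + 1) % TPB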
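As a sanity check on the interleaved indexing, the following Python sketch models it on the host and verifies two things: each thread gets a unique slot, and every warp still sees a 2-way bank conflict, so the puzzle keeps its purpose. TPB = 256, a warp size of 32, and 32 banks with one 4-byte element per bank word are assumptions, not values taken from the report.

```python
from collections import Counter

TPB = 256        # assumed threads per block
WARP_SIZE = 32   # assumed warp size
NUM_BANKS = 32   # assumed 32 banks, one 4-byte element per bank word


def slot_interleaved(local_i: int, tpb: int = TPB) -> int:
    """Even slots for the first half of the block, odd slots for the second."""
    if local_i < tpb // 2:
        return (local_i * 2) % tpb
    return (local_i * 2 + 1) % tpb


slots = [slot_interleaved(i) for i in range(TPB)]

# Every slot 0..TPB-1 is written by exactly one thread: no write-write race.
is_permutation = sorted(slots) == list(range(TPB))


def max_threads_per_bank(warp_start: int) -> int:
    """Largest number of threads in one warp that hit the same bank."""
    warp_slots = slots[warp_start:warp_start + WARP_SIZE]
    banks = Counter(slot % NUM_BANKS for slot in warp_slots)
    return max(banks.values())


conflicts = [max_threads_per_bank(w) for w in range(0, TPB, WARP_SIZE)]

print(is_permutation)                  # True
print(min(conflicts), max(conflicts))  # 2 2
```

Under these assumptions the mapping is a permutation of the slots, and each warp still touches every bank it uses exactly twice.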
Looking at the original code, I don't quite understand why it produces correct results even some of the time, given the block-level synchronization with barrier() after loading data into shared memory.