Puzzle 29 Barriers

Should all threads be synchronized after the barrier initialization and before participating in the barrier, just as in the NVIDIA Asynchronous Barriers example?

https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/async-barriers.html#initialization

Furthermore, since the mbarrier_test_wait function is a “Non-blocking check to see if all participating threads have reached the barrier”

https://docs.modular.com/mojo/std/gpu/sync/sync/mbarrier_test_wait

should the barrier test be done in a loop that waits until all the threads have reached the barrier, e.g.

    while(mbarrier_test_wait(iter_barrier.ptr, TPB) == False):
        pass

instead of making a single call

    _ = mbarrier_test_wait(iter_barrier.ptr, TPB)

Finally, since the puzzle is only supported on NVIDIA GPUs, should the example commands for running your solution on other GPUs be removed?

Hi @BeastBritish, thanks for the careful read. All three points are valid and fixes are on the way.

1. Pre-barrier sync: The solution now calls barrier() after the initial mbarrier_init block, and again after the per-iteration mbarrier_init(iter_barrier, TPB) reinit.

2. mbarrier_test_wait loop: The Mojo docs and the underlying PTX mbarrier.test_wait both describe it as a non-blocking poll. I implemented the fix you suggested:

    _ = mbarrier_arrive(bar.ptr)
    while not mbarrier_test_wait(bar.ptr, TPB):
        pass

Applied to all three call sites (init, iter, final).

3. Non-NVIDIA run instructions: The howto table already classifies all of P29 as NVIDIA-only, so the “pixi AMD” tab has been removed from both barrier.md and memory_barrier.md. A “requires NVIDIA GPU hardware” callout has been added at the top of each page.
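For anyone comparing against the CUDA side, here is a minimal sketch of points 1 and 2 written with the low-level primitives from `<cuda_awbarrier_primitives.h>`. It is not the puzzle's Mojo solution: the kernel name `barrier_demo`, the `staged` buffer, and the block size `TPB = 256` are made up for illustration, and it assumes a build for an sm_80 or newer target. Note that the CUDA primitive `__mbarrier_test_wait` takes the token returned by `__mbarrier_arrive`, which may not map one-to-one onto the arguments of the Mojo wrapper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_awbarrier_primitives.h>

// Hypothetical block size, chosen only to mirror the puzzle's TPB.
constexpr int TPB = 256;

__global__ void barrier_demo(int* out) {
    __shared__ __mbarrier_t bar;
    __shared__ int staged[TPB];

    // Point 1: a single thread initializes the barrier with the number
    // of threads expected to arrive in each phase.
    if (threadIdx.x == 0) {
        __mbarrier_init(&bar, blockDim.x);
    }
    // All threads must see the initialized barrier before arriving on it.
    __syncthreads();

    // Each thread publishes one value to shared memory.
    staged[threadIdx.x] = static_cast<int>(threadIdx.x);

    // Point 2: arrive, then poll. __mbarrier_test_wait returns
    // immediately with true or false, so it has to be called in a loop.
    __mbarrier_token_t token = __mbarrier_arrive(&bar);
    while (!__mbarrier_test_wait(&bar, token)) {
        // Spin until every thread in the block has arrived.
    }

    // The phase has completed, so a neighbour's slot is safe to read.
    out[threadIdx.x] = staged[(threadIdx.x + 1) % blockDim.x];
}

int main() {
    int* d_out = nullptr;
    cudaMalloc(&d_out, TPB * sizeof(int));

    barrier_demo<<<1, TPB>>>(d_out);

    int h_out[TPB];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out[0] = %d\n", h_out[0]);

    cudaFree(d_out);
    return 0;
}
```

Dropping the `__syncthreads()` would let some threads arrive on a barrier that thread 0 has not initialized yet, and replacing the loop with a single `__mbarrier_test_wait` call would let threads read `staged` before their neighbours have written it. Those are the two hazards the fixes above address.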

Thanks again for the high-quality bug report!