Should all threads be synchronized after the barrier initialization and before participating in the barrier, as in NVIDIA's Asynchronous Barriers example?
Hi @BeastBritish, thanks for the careful read. All three points are valid, and fixes are on the way.
1. Pre-barrier sync: The solution now calls barrier() after the initial mbarrier_init block, and again after the per-iteration mbarrier_init(iter_barrier, TPB) re-initialization, so every thread sees an initialized barrier before it arrives.
2. mbarrier_test_wait loop: The Mojo docs and the underlying PTX mbarrier.test_wait both say it’s a non-blocking poll. I implemented the fix you suggested:
    _ = mbarrier_arrive(bar.ptr)
    while not mbarrier_test_wait(bar.ptr, TPB):
        pass
Applied to all three call sites (init, iter, final).
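For anyone following along without an NVIDIA GPU, here is a rough host-side sketch of why a single test_wait call is not enough. It is a toy Python model, not the Mojo or PTX API: ModelMBarrier, single_poll_demo and spin_wait_demo are hypothetical names, and the model assumes a PTX-style phase token returned by arrive, which differs from the argument passed in the snippet above.

```python
import threading


class ModelMBarrier:
    """Toy host-side model of an arrive/test_wait barrier (assumption: not GPU code)."""

    def __init__(self, expected):
        self._lock = threading.Lock()
        self._expected = expected  # arrivals needed to complete one phase
        self._pending = expected   # arrivals still missing in the current phase
        self._phase = 0            # incremented each time a phase completes

    def arrive(self):
        """Record one arrival and return the phase the caller arrived in."""
        with self._lock:
            token = self._phase
            self._pending -= 1
            if self._pending == 0:
                # The last arrival completes the phase and re-arms the barrier.
                self._phase += 1
                self._pending = self._expected
            return token

    def test_wait(self, token):
        """Non-blocking poll: True once the phase identified by token has completed."""
        with self._lock:
            return self._phase != token


def single_poll_demo():
    """One poll right after arriving can report 'not done yet'."""
    bar = ModelMBarrier(expected=2)
    token = bar.arrive()
    early = bar.test_wait(token)  # one arrival is still pending
    bar.arrive()
    late = bar.test_wait(token)   # the phase has now completed
    return early, late


def spin_wait_demo(num_threads=4):
    """Each worker arrives, then polls in a loop until the phase completes."""
    bar = ModelMBarrier(expected=num_threads)
    slots = [0] * num_threads
    seen = [0] * num_threads

    def worker(i):
        slots[i] = i + 1                 # write before arriving
        token = bar.arrive()
        while not bar.test_wait(token):  # keep polling; test_wait never blocks
            pass
        seen[i] = sum(slots)             # all writes happened before the phase completed

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

In this model, single_poll_demo() returns (False, True): the first poll happens while one arrival is still missing, which is the case the while loop guards against. spin_wait_demo(4) returns [10, 10, 10, 10], since no worker leaves its loop until all four have written their slot and arrived.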
3. Non-NVIDIA run instructions: The howto table already classifies all of P29 as NVIDIA-only, so the “pixi AMD” tab has been removed from both barrier.md and memory_barrier.md. A “requires NVIDIA GPU hardware” callout has been added at the top of each page.