How do I sync threads between blocks i.e. device-wide?

laranzu · August 13, 2025, 10:39am

This is the scheduling problem again. You have barrier_buffer set up as an array of “block completed” booleans, but in every block thread #0 runs the _not_all loop, which stops the entire block from completing until every other block has reached the same point. GPU scheduling is not as flexible as thread or process scheduling on the host CPU: a block that has started generally keeps its execution slot until it finishes. The barrier can work if the total number of blocks and threads is under the simultaneous execution limit. As soon as you have too many blocks, though, some block will be spinning in _not_all waiting for another block to set its flag. That other block can’t be started until the first block has finished, and the first block can’t finish until the other block has started and reached the same point …
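
To make the circular wait concrete, here is a toy host-side model in plain Python, not GPU code. It assumes a scheduler that only starts a new block when a running one has finished; `run_single_kernel`, `resident_limit` and `max_steps` are names I made up for the sketch, and the `while not all(flags)` loop just stands in for your _not_all loop.

```python
from collections import deque


def block(block_id, flags):
    """One simulated block: set its own flag, then spin until every flag is set."""
    flags[block_id] = True
    while not all(flags):  # stands in for the _not_all spin loop in thread #0
        yield              # still spinning, still occupying its execution slot


def run_single_kernel(num_blocks, resident_limit, max_steps=1000):
    """Return True if every block finishes, False if the barrier never clears."""
    flags = [False] * num_blocks
    pending = deque(block(i, flags) for i in range(num_blocks))
    resident = []
    for _ in range(max_steps):
        # The scheduler can only start a block when a slot is free.
        while pending and len(resident) < resident_limit:
            resident.append(pending.popleft())
        if not resident:
            return True  # nothing running, nothing waiting: all blocks are done
        still_running = []
        for b in resident:
            try:
                next(b)                  # let the block make one step of progress
                still_running.append(b)  # it yielded, so it is still spinning
            except StopIteration:
                pass                     # block finished and released its slot
        resident = still_running
    return False  # gave up: resident blocks are waiting on blocks that cannot start


if __name__ == "__main__":
    print(run_single_kernel(num_blocks=4, resident_limit=4))
    print(run_single_kernel(num_blocks=8, resident_limit=4))
```

With 4 blocks and 4 slots everything finishes; with 8 blocks and 4 slots the first four spin forever because the flags for blocks 4 to 7 can never be set. The real limit depends on your device and kernel, so treat the numbers as placeholders.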

Sorry, I don’t think there is any way to solve this without using two kernels.
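
A rough sketch of why splitting the work helps, again as a plain Python model rather than real GPU code. It assumes the host does not start the second kernel until every block of the first has finished, so the kernel boundary itself acts as the device-wide barrier. `run_two_kernels`, `resident_limit` and the phase names are hypothetical and only there for illustration.

```python
from collections import deque


def run_two_kernels(num_blocks, resident_limit):
    """Run phase 1 and phase 2 as separate launches and return the execution log."""
    log = []
    for phase in ("phase1", "phase2"):  # one simulated kernel launch per phase
        pending = deque(range(num_blocks))
        while pending:
            # Start as many blocks as there are free slots. No block waits on
            # another, so every block in the wave runs to completion.
            wave_size = min(resident_limit, len(pending))
            wave = [pending.popleft() for _ in range(wave_size)]
            log.extend((phase, block_id) for block_id in wave)
        # All blocks of this phase are done before the next launch begins.
    return log


if __name__ == "__main__":
    for phase, block_id in run_two_kernels(num_blocks=8, resident_limit=4):
        print(phase, block_id)
```

Because no block ever spins on another block, it doesn't matter how many blocks there are compared to the number of slots: every phase 1 entry lands in the log before any phase 2 entry.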
