How do I sync threads between blocks i.e. device-wide?

I don't think you can synchronize across all the threads in a kernel. The GPU can only execute a limited number of threads at a time, which is why the block size limit exists and is smallish (typically 1024 threads). The GPU runs an entire block to completion in parallel, so barrier() is easy to implement: it just pauses the running threads of that block until they have all arrived. To make a barrier work across blocks, the GPU would have to suspend all the threads of a running block when they hit the barrier, preserve their state, start another block which runs until it hits the same barrier, and so on until every block in the grid has arrived. That execution model is not a good fit for GPUs.
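Within a single block this is exactly what CUDA's __syncthreads() gives you. A sketch of the usual pattern (the kernel and names here are illustrative, not from the original post): a shared-memory reduction, where the barrier guarantees every thread of this block, and only this block, has finished its write before anyone reads.

```
__global__ void blockSum(const float *in, float *out) {
    extern __shared__ float tile[];
    int t = threadIdx.x;
    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();  // barrier over the threads of THIS block only

    // Tree reduction in shared memory; each round halves the active threads.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) tile[t] += tile[t + s];
        __syncthreads();  // everyone must finish round s before round s/2
    }
    if (t == 0) out[blockIdx.x] = tile[0];  // one partial sum per block
}
```

Note the barrier never spans blocks: each block reduces its own tile independently, which is why combining the per-block results needs a second step.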

The same problem exists in CUDA. The recommended solution seems to be to split your kernel into two: one kernel with the code before the barrier, one with the code after it. Assuming you queue them onto the same context, all the threads of the first kernel should complete before any thread of the second one starts.
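In CUDA terms the split looks something like this (a sketch with made-up kernel names; the ordering guarantee comes from launching both kernels on the same stream, where launches execute in issue order):

```
// Conceptually: phase1(); grid_wide_barrier(); phase2();
// Since no grid-wide barrier exists, split at the barrier point.

__global__ void phase1(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // work before the "barrier"
}

__global__ void phase2(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Safe to read values written by OTHER blocks in phase1,
    // because the whole first launch finished before this one began.
    if (i < n) out[i] = in[i] + (i > 0 ? in[i - 1] : 0.0f);
}

void run(float *d_in, float *d_out, int n) {
    int block = 256, grid = (n + block - 1) / block;
    // Same (default) stream: phase2 starts only after every
    // block of phase1 has completed.
    phase1<<<grid, block>>>(d_in, n);
    phase2<<<grid, block>>>(d_in, d_out, n);
    cudaDeviceSynchronize();  // only needed before the host reads d_out
}
```

The cudaDeviceSynchronize() at the end is for the host, not for kernel-to-kernel ordering; the two launches are already ordered by the stream.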

However, I haven't tried this myself. If the ordering still isn't guaranteed, you can call context.synchronize() between the two kernel launches to force the first to finish. Not ideal, but you'll still get most of the performance boost from GPU execution.