How do I sync threads between blocks i.e. device-wide?

I don't think you can synchronize across all the threads in a kernel. The GPU can only execute a limited number of threads at a time, which is why the block size limit exists and is smallish (typically 1024 threads). The GPU runs an entire block to completion in parallel, so barrier() is easy to implement: it just pauses the running threads of that block until they have all arrived. To make a barrier work across blocks, the GPU would have to suspend all the threads of a running block when they hit the barrier, preserve their state, start another block which runs until it hits the same barrier, and so on until every block in the grid has arrived. That execution model is not a good fit for GPUs.
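Within a single block this is exactly what CUDA's __syncthreads() gives you. A sketch of the usual pattern (the kernel and names here are illustrative, not from the original post): a shared-memory reduction, where the barrier guarantees every thread of this block, and only this block, has finished its write before anyone reads.

```
__global__ void blockSum(const float *in, float *out) {
    extern __shared__ float tile[];
    int t = threadIdx.x;
    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();  // barrier over the threads of THIS block only

    // Tree reduction in shared memory; each round halves the active threads.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) tile[t] += tile[t + s];
        __syncthreads();  // everyone must finish round s before round s/2
    }
    if (t == 0) out[blockIdx.x] = tile[0];  // one partial sum per block
}
```

Note the barrier never spans blocks: each block reduces its own tile independently, which is why combining the per-block results needs a second step.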

The same problem exists in CUDA. The recommended solution seems to be to split your kernel into two: one kernel with the code before the barrier, one with the code after it. Assuming you queue them onto the same context, all the threads of the first kernel should complete before any thread of the second one starts.
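In CUDA terms the split looks something like this (a sketch with made-up kernel names; the ordering guarantee comes from launching both kernels on the same stream, where launches execute in issue order):

```
// Conceptually: phase1(); grid_wide_barrier(); phase2();
// Since no grid-wide barrier exists, split at the barrier point.

__global__ void phase1(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // work before the "barrier"
}

__global__ void phase2(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Safe to read values written by OTHER blocks in phase1,
    // because the whole first launch finished before this one began.
    if (i < n) out[i] = in[i] + (i > 0 ? in[i - 1] : 0.0f);
}

void run(float *d_in, float *d_out, int n) {
    int block = 256, grid = (n + block - 1) / block;
    // Same (default) stream: phase2 starts only after every
    // block of phase1 has completed.
    phase1<<<grid, block>>>(d_in, n);
    phase2<<<grid, block>>>(d_in, d_out, n);
    cudaDeviceSynchronize();  // only needed before the host reads d_out
}
```

The cudaDeviceSynchronize() at the end is for the host, not for kernel-to-kernel ordering; the two launches are already ordered by the stream.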

However, I haven't tried this myself. If the ordering still isn't guaranteed, you can call context.synchronize() between the two kernel launches to force the first to finish. Not ideal, but you'll still get most of the performance boost from GPU execution.