How do I sync threads between blocks, i.e. device-wide?

I have code that runs inside a loop and I need the threads to wait for all others after each iteration.

# setting up buffers (the tensor below is what the kernel receives as barrier_buffer)
var barrier_data = ctx.enqueue_create_buffer[DType.bool](
    num_blocks
).unsafe_ptr()
alias barrier_layout = Layout.row_major(num_blocks)
var barrier_tensor = LayoutTensor[mut=True, DType.bool, barrier_layout](
    barrier_data
)

# calls the kernel:

# inner iteration body (executed several times and should block all threads between iterations)
if thread_idx.x == 0:
    print(
        "processed:",
        processed,
        "block_idx.x:",
        block_idx.x,
        "setting to false",
    )
    barrier_buffer[block_idx.x] = False

# core kernel is executed here

# FIXME: this should block all threads in the GPU until all arrive
alias size = barrier_layout.shape[0].value()
barrier()
if thread_idx.x == 0:
    print(
        "processed:",
        processed,
        "block_idx.x:",
        block_idx.x,
        "setting to true",
    )
    barrier_buffer[block_idx.x] = True
barrier()

@parameter
fn _not_all() -> Bool:
    var res = SIMD[DType.bool, next_power_of_two(size)](True)

    @parameter
    for i in range(size):
        res[i] = rebind[Scalar[DType.bool]](barrier_buffer[i])
    return not all(res)

while thread_idx.x == 0 and _not_all():
    continue
barrier()
print(
    "processed:",
    processed,
    "block_idx.x:",
    block_idx.x,
    "passed barrier",
)

output (there should be 2 blocks that each run 2 iterations):

processed: 1 block_idx.x: 1 setting to false
processed: 1 block_idx.x: 0 setting to false
processed: 1 block_idx.x: 1 setting to true
processed: 1 block_idx.x: 0 setting to true
processed: 1 block_idx.x: 0 passed barrier
processed: 2 block_idx.x: 0 setting to false
processed: 2 block_idx.x: 0 setting to true
processed: 2 block_idx.x: 0 passed barrier

For some reason block_idx.x: 1 seems to get stuck in the _not_all checking loop. I assume it has something to do with barrier_buffer[i] being cached instead of read from main memory. I don't know how else to sync threads device-wide. I tried using Semaphore in several ways but I can't get it to work.

I don't think you can sync all the threads within a kernel. The GPU can only execute a given number of threads at a time, which is why the block limit exists and is smallish, probably 1024. The GPU runs an entire block to completion in parallel, so barrier() is easy: just pause a running thread. To make a barrier work across blocks, the GPU would have to suspend and preserve the state of all the threads within a block when they hit the barrier, then start a new block which runs until it hits the same barrier, and so on. Not a good fit for GPUs.

The same problem exists in CUDA. The recommended solution seems to be to split your kernel into two: one with the code before the barrier, one with the code after. Assuming you queue them onto the same context, all the threads in the first kernel should complete before any of the threads in the second start.

However, I haven't tried this myself. If it still isn't working, you'll have to call context.synchronize() between the two kernels. Not ideal, but you'll still get most of the performance boost from GPU execution.
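Untested, but a minimal sketch of what I mean is below, assuming the DeviceContext API (enqueue_create_buffer, enqueue_function, synchronize); phase_one and phase_two are just hypothetical stand-ins for the code on either side of your barrier. In your case the iteration loop would move to the host and enqueue both kernels once per iteration.

from gpu import block_dim, block_idx, thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

# Hypothetical kernels standing in for the work before and after the
# device-wide sync point.
fn phase_one(data: UnsafePointer[Float32]):
    var i = block_dim.x * block_idx.x + thread_idx.x
    data[i] = 1.0  # work that must be visible before phase_two runs

fn phase_two(data: UnsafePointer[Float32]):
    var i = block_dim.x * block_idx.x + thread_idx.x
    data[i] = data[i] * 2.0  # safe: every phase_one thread has finished

def main():
    alias num_blocks = 2
    alias threads_per_block = 1
    alias n = num_blocks * threads_per_block
    with DeviceContext() as ctx:
        var buf = ctx.enqueue_create_buffer[DType.float32](n)
        # Both kernels are enqueued on the same context, so phase_two should
        # not start until every block of phase_one has completed.
        ctx.enqueue_function[phase_one](
            buf.unsafe_ptr(),
            grid_dim=num_blocks,
            block_dim=threads_per_block,
        )
        # If the ordering guarantee turns out not to hold, a ctx.synchronize()
        # here forces it at the cost of a host round trip.
        ctx.enqueue_function[phase_two](
            buf.unsafe_ptr(),
            grid_dim=num_blocks,
            block_dim=threads_per_block,
        )
        ctx.synchronize()  # wait for everything before reading results back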

AFAIK Mojo guarantees sequential execution of enqueued functions. So yeah that would be a fallback.

What I am currently trying to do is use thread 0 of each block plus a block-level barrier to try to synchronize. But for some reason one of the blocks' thread_idx.x == 0 gets stuck in the verification loop.

This is again the scheduling problem. You have barrier_buffer set up as an array of "block completed" booleans, but every block has the _not_all loop in thread #0, which stops the entire block from completing until every other block has also reached the same point. GPU scheduling is not as flexible as threads or processes on the host CPU. It can work if the total number of blocks and threads is under the simultaneous execution limit, but as soon as you have too many blocks there will be one block spinning in _not_all until some other block sets its flag; that other block can't be started until the first block has finished, but the first block can't finish until the other block has been started and reaches the same point …

Sorry, I don’t think there is any way to solve this without using two kernels.

In this case I'm getting stuck when running only 2 blocks with 1 thread each, which is way below the limit. I could also add another branch that does the sequential enqueuing when the number of threads/blocks is beyond the limit, just like I already have a faster intra-block kernel.

In this particular case I think your original guess is right: the writes to barrier_buffer are being cached and are not visible to other blocks. A while ago I found that compute shader writes to global buffers were, under some circumstances, apparently not committed until some time after the block completed. Threads inside the block did see the new values, but an OpenGL shader drawing the results did not.

For C/CUDA programming, atomics are recommended for communicating between blocks, but I have no experience using atomics in Mojo, so this is just a suggestion.
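I can't test it, but very roughly the atomic version of your flag idea might look like the sketch below. It assumes os.atomic.Atomic.fetch_add has a static overload taking an UnsafePointer into a device buffer (worth checking against the current stdlib docs), and it only works while all blocks are resident on the GPU at the same time, which holds for your 2 blocks. grid_sync and the parameter names are just for illustration; counter would be a single Int32 device buffer (e.g. from ctx.enqueue_create_buffer[DType.int32](1)) zeroed by the host before the first launch, and iteration would be your 1-based processed value.

from gpu import barrier, thread_idx
from memory import UnsafePointer
from os.atomic import Atomic

# Untested sketch of a grid-wide sync point built on an atomic arrival counter.
fn grid_sync(
    counter: UnsafePointer[Int32],  # single Int32 in a device buffer, zeroed by the host
    num_blocks: Int32,
    iteration: Int32,  # 1-based iteration count, e.g. your processed value
):
    # Make sure every thread in this block has finished its own work first.
    barrier()
    if thread_idx.x == 0:
        # Announce this block's arrival. The counter only ever grows, so there
        # is nothing to reset between iterations.
        _ = Atomic.fetch_add(counter, 1)
        # Spin until every block has arrived for this iteration. fetch_add(0)
        # is used as an atomic read so the value cannot sit in a register or
        # cache the way the plain barrier_buffer reads apparently do.
        while Atomic.fetch_add(counter, 0) < iteration * num_blocks:
            pass
    # Release the rest of the block once thread 0 has seen all arrivals.
    barrier()

If this still deadlocks, the two-kernel split is the safe fallback.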


Could the Mojo standard library semaphore possibly be what you’re looking for?

I tried using it in several ways but it always gets stuck. I'm also not sure if I'm using it correctly even after reading several examples and its source code. I also saw there is an Atomic struct, but it's not clear to me how to use it either.