This is the scheduling problem again. You have barrier_buffer set up as an array of “block completed” booleans, but in every block thread #0 runs the _not_all loop, which stops that block from completing until every other block has also reached the same point. GPU scheduling is not as flexible as threads or processes on the host CPU: a block that has started generally holds its slot until it finishes, and it is not swapped out to let a waiting block run. So this can work if the total number of blocks and threads is under the simultaneous execution limit, but as soon as you have too many blocks there will be one block spinning in _not_all until some other block sets its flag. That other block can’t be started until the first block has finished, and the first block can’t finish until the other block has been started and reaches the same point …
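If you want to see where that limit sits on your card, a rough sketch like the following can estimate it with the runtime occupancy API. The kernel here is only a hypothetical stand-in for yours (the real limit depends on your kernel's register and shared memory use), and the block size and grid size are assumed values, so treat the number as an upper bound for an otherwise idle GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel that contains the _not_all spin loop.
// Substitute your own kernel: the result depends on its resource usage.
__global__ void barrier_kernel(int *barrier_buffer) {
    if (threadIdx.x == 0) barrier_buffer[blockIdx.x] = 1;
}

// Estimate how many blocks of barrier_kernel can be resident on device 0
// at the same time. Returns -1 if a CUDA call fails.
int max_resident_blocks(int threads_per_block) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;

    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, barrier_kernel, threads_per_block,
            0 /* dynamic shared memory in bytes */) != cudaSuccess) {
        return -1;
    }
    return blocks_per_sm * prop.multiProcessorCount;
}

int main() {
    const int threads_per_block = 256;   // assumed launch configuration
    const int requested_blocks  = 1024;  // assumed grid size

    int limit = max_resident_blocks(threads_per_block);
    if (limit < 0) {
        printf("could not query the device\n");
        return 1;
    }
    printf("at most %d blocks can be resident at once\n", limit);
    if (requested_blocks > limit) {
        printf("a grid of %d blocks can deadlock in a spin-wait barrier\n",
               requested_blocks);
    }
    return 0;
}
```

Staying under this number only avoids the deadlock on that particular GPU with nothing else running on it, so it is a diagnostic rather than a fix.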
Sorry, I don’t think there is any way to solve this without using two kernels.
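A minimal sketch of the two-kernel split, assuming your work divides into a part before the barrier and a part after it. The kernel names, the per-block values, and the grid size are all made up for illustration and are not taken from your code. Two launches in the same stream run in order, so the second kernel only starts once every block of the first has finished, and no flag array or spin loop is needed.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Phase one (hypothetical): each block publishes one result.
__global__ void phase_one(int *block_result) {
    if (threadIdx.x == 0) block_result[blockIdx.x] = blockIdx.x + 1;
}

// Phase two (hypothetical): each block reads the results of ALL blocks.
// This is safe because phase_one has completed before phase_two starts.
__global__ void phase_two(const int *block_result, long long *out,
                          int num_blocks) {
    if (threadIdx.x == 0) {
        long long sum = 0;
        for (int b = 0; b < num_blocks; ++b) sum += block_result[b];
        out[blockIdx.x] = sum;
    }
}

// Runs both phases and checks that every block saw every phase-one result.
bool run_two_phase(int num_blocks) {
    const int threads = 64;  // assumed block size
    int *block_result = nullptr;
    long long *out = nullptr;
    if (cudaMalloc(&block_result, num_blocks * sizeof(int)) != cudaSuccess)
        return false;
    if (cudaMalloc(&out, num_blocks * sizeof(long long)) != cudaSuccess) {
        cudaFree(block_result);
        return false;
    }

    // Same (default) stream: the kernel boundary acts as the grid-wide barrier.
    phase_one<<<num_blocks, threads>>>(block_result);
    phase_two<<<num_blocks, threads>>>(block_result, out, num_blocks);

    // cudaMemcpy waits for both kernels before copying.
    std::vector<long long> host(num_blocks);
    bool ok = cudaMemcpy(host.data(), out, num_blocks * sizeof(long long),
                         cudaMemcpyDeviceToHost) == cudaSuccess;

    const long long expected = (long long)num_blocks * (num_blocks + 1) / 2;
    for (int b = 0; ok && b < num_blocks; ++b) ok = (host[b] == expected);

    cudaFree(block_result);
    cudaFree(out);
    return ok;
}

int main() {
    // Far more blocks than most GPUs can keep resident at once.
    bool ok = run_two_phase(4096);
    printf("%s\n", ok ? "all blocks saw every result" : "failed");
    return ok ? 0 : 1;
}
```

The cost is one extra kernel launch, and any per-block state you need after the barrier has to go through global memory, since registers and shared memory do not survive from one kernel to the next.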