I am going through the GPU puzzles and currently on the matrix tiling problem. https://puzzles.modular.com/puzzle_14/tiled.html#tile-processing-steps I don't understand the need for the second barrier after computing the tile-wise matrix multiplication. My understanding is that a barrier is needed when we want to synchronize threads before/after some shared resource access. However, accumulating into acc can be done in parallel since it is local to each thread, and it can also be written to the output in parallel. What am I missing here?
if tiled_row < size and tiled_col < size:

    @parameter
    for k in range(min(TPB, size - tile * TPB)):
        acc += a_shared[local_row, k] * b_shared[k, local_col]

barrier()
Great question! The key is to think about what happens after the accumulation step.
If there are more tiles to be processed, then the contents of a_shared and/or b_shared can be overwritten by any threads in the block that have finished the current accumulation stage, corrupting the values read by those threads that are still participating in it.
e.g. the accumulation step above reads from shared memory; for tile == 0, without the second barrier, a thread that finishes its dot product early could start loading tile 1 into a_shared/b_shared while slower threads are still reading tile 0.
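You can reproduce the pattern on the CPU with plain Python threads. This is a hypothetical sketch, not the puzzle's Mojo code: threading.Barrier stands in for barrier(), the lists a_shared/b_shared stand in for shared memory, and TPB/SIZE are made-up small values. Both sync.wait() calls play the same roles as the two barriers in the kernel.

```python
import threading

TPB = 2    # threads per "block" along one dimension (assumed value)
SIZE = 4   # square matrix size, a multiple of TPB for brevity

a = [[float(r * SIZE + c) for c in range(SIZE)] for r in range(SIZE)]
b = [[1.0 if r == c else 0.5 for c in range(SIZE)] for r in range(SIZE)]
out = [[0.0] * SIZE for _ in range(SIZE)]

# Stand-ins for the block's shared memory tiles.
a_shared = [[0.0] * TPB for _ in range(TPB)]
b_shared = [[0.0] * TPB for _ in range(TPB)]

sync = threading.Barrier(TPB * TPB)  # one barrier for the whole "block"

def worker(local_row, local_col, block_row, block_col):
    row = block_row * TPB + local_row
    col = block_col * TPB + local_col
    acc = 0.0  # per-thread accumulator: private, needs no synchronization
    for tile in range(SIZE // TPB):
        # Write phase: each thread loads one element of the current tile.
        a_shared[local_row][local_col] = a[row][tile * TPB + local_col]
        b_shared[local_row][local_col] = b[tile * TPB + local_row][col]
        sync.wait()  # barrier 1: all loads finish before anyone reads
        # Read phase: partial dot product over the tile.
        for k in range(TPB):
            acc += a_shared[local_row][k] * b_shared[k][local_col]
        sync.wait()  # barrier 2: all reads finish before the NEXT
                     # iteration overwrites a_shared/b_shared
    out[row][col] = acc  # private result: safe to write in parallel

# Run one block at a time (each block would have its own shared memory).
for block_row in range(SIZE // TPB):
    for block_col in range(SIZE // TPB):
        threads = [
            threading.Thread(target=worker, args=(lr, lc, block_row, block_col))
            for lr in range(TPB) for lc in range(TPB)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

expected = [[sum(a[r][k] * b[k][c] for k in range(SIZE)) for c in range(SIZE)]
            for r in range(SIZE)]
assert out == expected
```

Delete the second sync.wait() and the result becomes timing-dependent: a thread can reload the tiles while its neighbors are still mid-dot-product.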
Gotcha. If you unroll the loop, you can see why the second barrier is needed. There's a read (the dot-product calculation) followed by a write (the next tile's load into shared memory), and if we don't synchronize before the write, threads could read wrong values.
Now I am starting to appreciate how difficult it is to write correct and fast GPU kernels. At least I don't have to do it in C++ and CUDA, thanks to Mojo.