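Gotcha. If you unroll the loop, you can see why the second barrier is needed. Each iteration ends with a read (the dot product calculation) and the next iteration starts with a write (loading the next block). If we don’t synchronize before that write, a faster thread could overwrite the block while a slower thread is still reading it, and the slower thread would read the wrong values.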
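To convince myself, here is a rough sketch of that hazard. It is a plain Python simulation, not real GPU or Mojo code, and it shrinks the problem to a matrix-vector product handled by a single block. The names `tiled_matvec`, `shared`, `load`, and `dot` are made up for illustration, and the "no second barrier" branch hard-codes one interleaving that a missing barrier would allow (thread 0 running ahead), rather than modeling a real scheduler.

```python
def tiled_matvec(M, x, tile, second_barrier=True):
    """Simulate one block of `tile` threads computing M @ x through a shared tile."""
    n = len(x)
    assert len(M) == tile and n % tile == 0
    shared = [0] * tile   # stands in for the block's shared memory
    acc = [0] * tile      # one accumulator per simulated thread

    def load(t, k):       # the write: thread t loads its element of tile k
        shared[t] = x[k * tile + t]

    def dot(t, k):        # the read: thread t reads the whole shared tile
        for j in range(tile):
            acc[t] += M[t][k * tile + j] * shared[j]

    num_tiles = n // tile
    for t in range(tile):
        load(t, 0)
    for k in range(num_tiles):
        has_next = k + 1 < num_tiles
        # First barrier: every load of tile k has finished here.
        if second_barrier:
            for t in range(tile):
                dot(t, k)
            # Second barrier: every read of tile k has finished here.
            if has_next:
                for t in range(tile):
                    load(t, k + 1)
        else:
            # No second barrier: thread 0 is allowed to run ahead.
            dot(0, k)
            if has_next:
                load(0, k + 1)   # overwrites shared[0] too early
            for t in range(1, tile):
                dot(t, k)        # these reads see the clobbered value
            if has_next:
                for t in range(1, tile):
                    load(t, k + 1)
    return acc


M = [[1, 1, 1, 1], [1, 1, 1, 1]]
x = [1, 2, 3, 4]
print(tiled_matvec(M, x, tile=2))                        # [10, 10]
print(tiled_matvec(M, x, tile=2, second_barrier=False))  # [10, 12]
```

With both barriers every thread gets 10. Without the second one, thread 1 picks up the element that thread 0 already loaded for the next tile and ends up with 12. On a real GPU the result would depend on timing, which is what makes this kind of bug hard to spot.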
Now I am starting to appreciate how difficult it is to write correct and fast GPU kernels. At least I don’t have to do it in C++ and CUDA, thanks to Mojo.