Tiled Matrix Multiplication Puzzle

ajinkya · July 4, 2025, 12:28am

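Gotcha. If you unroll the loop, you can see why the second barrier is needed. The read (the dot product calculation) in one iteration is followed by a write (loading the next block into shared memory) in the next iteration. If we don’t synchronize before that write, a thread that is still computing its dot product could read values that a faster thread has already overwritten.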
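To make the read-then-write hazard concrete, here is a minimal CPU-side sketch in Python. It is not the puzzle's Mojo kernel: it only imitates one thread block per output tile with Python threads, and `threading.Barrier` stands in for the GPU barrier. The function name `tiled_matmul`, the `tile` parameter, and the square-input assumption are all illustrative choices, not taken from the puzzle.

```python
import threading


def tiled_matmul(A, B, tile=2):
    """Simulate a tiled matmul, one 'thread block' of tile*tile threads per output tile.

    Assumes A and B are square n x n lists of lists with n divisible by tile.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]

    def run_block(block_row, block_col):
        # Stand-ins for shared memory, visible to every thread in the block.
        a_shared = [[0] * tile for _ in range(tile)]
        b_shared = [[0] * tile for _ in range(tile)]
        barrier = threading.Barrier(tile * tile)

        def thread(ty, tx):
            row = block_row * tile + ty
            col = block_col * tile + tx
            acc = 0
            for t in range(n // tile):
                # Write phase: each thread loads one element of each tile.
                a_shared[ty][tx] = A[row][t * tile + tx]
                b_shared[ty][tx] = B[t * tile + ty][col]
                barrier.wait()  # Barrier 1: all loads finish before anyone reads.

                # Read phase: partial dot product over the shared tiles.
                for k in range(tile):
                    acc += a_shared[ty][k] * b_shared[k][tx]
                barrier.wait()  # Barrier 2: all reads finish before the next loads.
            C[row][col] = acc

        threads = [
            threading.Thread(target=thread, args=(ty, tx))
            for ty in range(tile)
            for tx in range(tile)
        ]
        for th in threads:
            th.start()
        for th in threads:
            th.join()

    # Blocks run one after another here; on a GPU they would run concurrently.
    for block_row in range(n // tile):
        for block_col in range(n // tile):
            run_block(block_row, block_col)
    return C
```

Multiplying any 4x4 matrix by the identity with `tiled_matmul(A, I, tile=2)` should return `A` unchanged. Without the second `barrier.wait()`, a fast thread could start writing tile `t + 1` into `a_shared` and `b_shared` while a slower thread is still reading tile `t`, so the result may be wrong, though the race will not necessarily show up on every run.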

Now I am starting to appreciate how difficult it is to write correct and fast GPU kernels. At least I don’t have to do it in C++ and CUDA, thanks to Mojo.
