I am going through the GPU puzzles and currently on the matrix tiling problem. https://puzzles.modular.com/puzzle_14/tiled.html#tile-processing-steps I don't understand the need for the second barrier after computing the tile-wise matrix multiplication. My understanding is that a barrier is needed when we want to synchronize threads before/after some shared resource access. However, accumulating into acc can be done in parallel since it is local to each thread, and it can also be written to the output in parallel. What am I missing here?
if tiled_row < size and tiled_col < size:

    @parameter
    for k in range(min(TPB, size - tile * TPB)):
        acc += a_shared[local_row, k] * b_shared[k, local_col]

barrier()
Great question! The key is to think about what happens after the accumulation step.
If there are more tiles to be processed, then the contents of a_shared and/or b_shared can be overwritten by any threads in the block that have finished the current accumulation stage, corrupting the values read by those threads that are still participating in it.
e.g. the accumulation step above reads from shared memory; for tile == 0, without the second barrier, a thread that finishes its dot product early could start loading tile 1 into a_shared/b_shared while slower threads are still reading tile 0.
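You can reproduce the pattern on the CPU with plain Python threads. This is a hypothetical sketch, not the puzzle's Mojo code: threading.Barrier stands in for barrier(), the lists a_shared/b_shared stand in for shared memory, and TPB/SIZE are made-up small values. Both sync.wait() calls play the same roles as the two barriers in the kernel.

```python
import threading

TPB = 2    # threads per "block" along one dimension (assumed value)
SIZE = 4   # square matrix size, a multiple of TPB for brevity

a = [[float(r * SIZE + c) for c in range(SIZE)] for r in range(SIZE)]
b = [[1.0 if r == c else 0.5 for c in range(SIZE)] for r in range(SIZE)]
out = [[0.0] * SIZE for _ in range(SIZE)]

# Stand-ins for the block's shared memory tiles.
a_shared = [[0.0] * TPB for _ in range(TPB)]
b_shared = [[0.0] * TPB for _ in range(TPB)]

sync = threading.Barrier(TPB * TPB)  # one barrier for the whole "block"

def worker(local_row, local_col, block_row, block_col):
    row = block_row * TPB + local_row
    col = block_col * TPB + local_col
    acc = 0.0  # per-thread accumulator: private, needs no synchronization
    for tile in range(SIZE // TPB):
        # Write phase: each thread loads one element of the current tile.
        a_shared[local_row][local_col] = a[row][tile * TPB + local_col]
        b_shared[local_row][local_col] = b[tile * TPB + local_row][col]
        sync.wait()  # barrier 1: all loads finish before anyone reads
        # Read phase: partial dot product over the tile.
        for k in range(TPB):
            acc += a_shared[local_row][k] * b_shared[k][local_col]
        sync.wait()  # barrier 2: all reads finish before the NEXT
                     # iteration overwrites a_shared/b_shared
    out[row][col] = acc  # private result: safe to write in parallel

# Run one block at a time (each block would have its own shared memory).
for block_row in range(SIZE // TPB):
    for block_col in range(SIZE // TPB):
        threads = [
            threading.Thread(target=worker, args=(lr, lc, block_row, block_col))
            for lr in range(TPB) for lc in range(TPB)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

expected = [[sum(a[r][k] * b[k][c] for k in range(SIZE)) for c in range(SIZE)]
            for r in range(SIZE)]
assert out == expected
```

Delete the second sync.wait() and the result becomes timing-dependent: a thread can reload the tiles while its neighbors are still mid-dot-product.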
Gotcha. If you unroll the loop, you can see why the second barrier is needed. There's a read (the dot-product calculation) followed by a write (the next tile's load into shared memory), and if we don't synchronize before the write, threads could read wrong values.
Now I am starting to appreciate how difficult it is to write correct and fast GPU kernels. At least I don't have to do it in C++ and CUDA, thanks to Mojo.