Great questions!
Firstly, I’m really not sure about all the references to cache efficiency/utilization, so maybe we should discuss this here before submitting any alterations.
As discussed, after making the alignment corrections (thanks for the quick PR!), the cache efficiency argument becomes invalid. So I’ll modify that after merging your PR, and I’d like to include the bfloat16 results too, to match the expected SIMD behavior.
Secondly, I may be missing the point, but I think that by tiling/vectorizing you want to create a blocked arrangement where each thread has access to contiguous elements (spatial locality).
Ideally, but it clearly has shortcomings.
There is no inbuilt feature in Mojo similar to `cub::BlockExchange` to do this in a coalesced manner; is that correct?
Confirmed by the team! There’s none in Mojo.
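For reference, here’s a small Python model (not Mojo, and not a profile of our kernels; just an illustration of the arithmetic) of why the blocked, per-thread-contiguous arrangement is uncoalesced: it counts how many distinct 32-byte sectors a 32-thread warp touches per load step under a striped (coalesced) vs. blocked assignment of float32 elements. `cub::BlockExchange` exists precisely to convert between these two arrangements through shared memory.

```python
# Model a warp of 32 threads reading float32 (4-byte) elements from a
# memory served in 32-byte sectors, under two thread-to-element mappings.
WARP = 32     # threads per warp
ELEM = 4      # bytes per float32 element
SECTOR = 32   # bytes per L1/L2 sector
ITEMS = 4     # elements processed per thread

def sectors_per_step(elem_index):
    """For each of ITEMS load steps, count the distinct sectors the warp touches."""
    counts = []
    for step in range(ITEMS):
        sectors = {elem_index(t, step) * ELEM // SECTOR for t in range(WARP)}
        counts.append(len(sectors))
    return counts

# Striped (coalesced): at step s, thread t reads element s*WARP + t, so the
# warp's addresses are consecutive and pack into the minimum number of sectors.
striped = sectors_per_step(lambda t, s: s * WARP + t)

# Blocked (spatial locality per thread): thread t owns the contiguous run
# [t*ITEMS, (t+1)*ITEMS) and reads one element per step, so the warp's
# addresses are strided by ITEMS elements.
blocked = sectors_per_step(lambda t, s: t * ITEMS + s)

print("striped sectors/step:", striped)  # 4 sectors: 32 threads * 4 B = 128 B
print("blocked sectors/step:", blocked)  # 16 sectors: addresses strided by 16 B
```

With these parameters the blocked layout touches 4x the sectors per step, which is the locality-vs-coalescing tradeoff we’d want the benchmarking discussion to be framed around.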
I would suggest that we: …
I agree! The benchmarking discussion needs to be modified. I’d frame it around memory access pattern tradeoffs, and coalescing vs. locality considerations.
I don’t know what to put for the rest. From a quick inspection it looks like:
- Vectorize is slower than elementwise because of its uncoalesced memory access, which requests twice as many sectors from L1.
- Manual vectorize is slower still because the loop unrolling reduces cache efficiency further. Removing `@parameter` didn’t prevent the unrolling.
- Tiling is slow because it makes four times as many requests to/from L1 as vectorize, due to the lack of SIMD.
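The last bullet is just request-count arithmetic, so here’s a toy Python model of it (assumed numbers: 4 float32 elements per thread and SIMD width 4; the real kernels may differ): a scalar tiled loop issues one load request per element, while a width-w vector load retires w elements per request, so scalar issues w times as many requests.

```python
# Toy request-count model for a blocked layout: each thread covers ITEMS
# float32 elements; a SIMD load of width w retires w elements per request.
WARP = 32   # threads per warp
ITEMS = 4   # elements per thread (assumed, for illustration)

def warp_load_requests(simd_width):
    """Total load requests a warp issues to cover ITEMS elements per thread."""
    assert ITEMS % simd_width == 0, "width must divide the per-thread tile"
    return WARP * (ITEMS // simd_width)

scalar = warp_load_requests(1)  # tiled scalar loop: one request per element
vector = warp_load_requests(4)  # vectorized: one width-4 request per thread

print(scalar, vector, scalar // vector)  # scalar issues 4x the requests
```

This only counts requests, not sectors per request, so it captures the tiling-vs-vectorize gap but not the coalescing penalty from the earlier bullets.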
LGTM!