In Puzzle 23 the tiling approach is proposed to increase cache locality. This approach swaps the coalesced global memory access used in the elementwise approach for striped uncoalesced global memory access.
Is the point of this to front-load the cost of reading from global memory?
i.e. the puzzle has two approaches:
Elementwise Access: Coalesced access, with all reads going to global memory and no L1 cache usage. Every load instruction results in a request to global memory.
Tiled Access: For small problem sizes with a stride > 32 bytes, the first read instruction will pull all the data through the L1 cache. Every subsequent load instruction should then be serviced by the L1 cache.
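To make the access-pattern difference concrete, here is a small sketch (plain Python, not the puzzle's Mojo code; `WARP`, `TILE`, and `ELEM` are illustrative values) of which global addresses one warp touches on its first load instruction under each pattern:

```python
WARP = 32          # threads per warp
TILE = 32          # elements per tile in the tiled approach
ELEM = 4           # bytes per float32 element

# Elementwise / coalesced: thread t reads element t, so one warp touches
# 32 consecutive elements -> a single contiguous 128-byte region.
elementwise = [t * ELEM for t in range(WARP)]

# Tiled: thread t owns elements [t*TILE, (t+1)*TILE), so on its first load
# the warp's addresses sit TILE elements (128 bytes) apart -> 32 separate
# 128-byte regions for a single instruction, i.e. uncoalesced.
tiled = [t * TILE * ELEM for t in range(WARP)]

def regions_touched(addrs, region=128):
    """Count distinct 128-byte regions (cache lines) a warp touches."""
    return len({a // region for a in addrs})

print(regions_touched(elementwise))  # 1 cache line for the whole warp
print(regions_touched(tiled))        # 32 cache lines for one instruction
```

The 128-byte region size is an assumption standing in for a typical L1 cache-line size; the exact granularity varies by architecture.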
Hi @cudawarped! There are three advantages to the tiled memory access pattern:
Fewer Threads, More Work: The tiled approach launches fewer threads (one per tile) compared to the elementwise approach (one per SIMD chunk). For a SIZE=1024 vector with TILE_SIZE=32, there are only 32 threads, each responsible for a contiguous 32-element block of memory.
Sequential Access: Within each of these tiles, a single thread processes all 32 elements sequentially. For a TILE_SIZE=32, a thread accesses positions [0:32], then the next thread handles [32:64], and so on.
Excellent Spatial Locality: This sequential access pattern means that each thread is working on a contiguous chunk of memory. This creates excellent spatial locality and makes the approach highly cache-friendly. The first load of a tile pulls a block of memory into the L1 cache, and subsequent reads within that tile are then serviced from the much faster cache, not from global memory.
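The spatial-locality claim above can be sketched as a toy hit/miss count for a single thread walking its tile sequentially (assuming 4-byte elements and a 128-byte L1 line; both are illustrative values, not measured from the puzzle):

```python
LINE = 128   # assumed L1 cache-line size in bytes
ELEM = 4     # bytes per float32 element
TILE = 32    # elements per tile

def walk_tile(tile_start_elem):
    """Simulate one thread reading its tile element by element."""
    hits = misses = 0
    cached_lines = set()
    for i in range(TILE):
        line = (tile_start_elem + i) * ELEM // LINE
        if line in cached_lines:
            hits += 1
        else:
            misses += 1          # first touch fills the line
            cached_lines.add(line)
    return hits, misses

print(walk_tile(0))  # (31, 1): one line fill services the whole tile
```

In this idealized model the first access misses and the remaining 31 are serviced from L1, which is the per-thread locality the tiled pattern is after.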
There’s a breakdown of the use case for each pattern here:
That chapter is designed to teach different functional patterns. The tiling part is pedagogical, like most of the puzzles. You’re correct that tiling doesn’t improve performance for the vector addition and changes the access pattern. The benchmarking part shows that, compares the different functional patterns, and provides practical guidance on when to use each approach.
Please keep submitting your excellent PRs if you think things aren’t clear or if something is missing.
@Ehsan Thank you for your kind reply, I really would like to make this one clearer but I’m struggling.
Firstly, I’m really not sure about all the references to cache efficiency/utilization, so maybe we should discuss this here before submitting any alterations.
Secondly, I may be missing the point, but I think that by tiling/vectorizing you want to create a blocked arrangement where each thread has access to contiguous elements (spatial locality). However, there is no built-in feature in Mojo similar to cub::BlockExchange to do this in a coalesced manner; is that correct? If so, I’m not sure this makes practical sense because of the severe penalty of uncoalesced loading.
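For context, a striped-to-blocked exchange of the kind cub::BlockExchange performs can be sketched like this (toy Python model; `THREADS` and `PER_THREAD` are illustrative, and a Python list stands in for shared memory): each thread loads with a coalesced striped pattern, the block transposes through shared memory, and each thread ends up holding a contiguous blocked run.

```python
THREADS = 8      # toy block size
PER_THREAD = 4   # items per thread
data = list(range(THREADS * PER_THREAD))

# Striped load: on iteration i, thread t reads element i*THREADS + t, so
# consecutive threads touch consecutive elements (coalesced in hardware).
striped = [[data[i * THREADS + t] for i in range(PER_THREAD)]
           for t in range(THREADS)]

# Exchange through "shared memory": scatter the striped items back to
# their linear positions, then have thread t re-read the contiguous block
# [t*PER_THREAD, (t+1)*PER_THREAD).
shared = [0] * (THREADS * PER_THREAD)
for t in range(THREADS):
    for i in range(PER_THREAD):
        shared[i * THREADS + t] = striped[t][i]
blocked = [shared[t * PER_THREAD:(t + 1) * PER_THREAD]
           for t in range(THREADS)]

print(blocked[0])  # [0, 1, 2, 3]: thread 0 now owns a contiguous block
```

The point is that the blocked per-thread arrangement is obtained without any uncoalesced global loads; the strided traffic happens in fast shared memory instead.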
In the current benchmarking part of the solution, the timings are off, as they are currently dominated by the cost of memory allocation. I have “fixed” this in this PR, and the updated timings more closely reflect those from ncu/nsys. My interpretation of the results is that for this puzzle:
We can’t make any conclusions for the small/medium problem sizes as they are just too small.
For the large problem size it’s not practical to recommend anything other than coalesced memory access. You gain some convenience from a blocked arrangement, but without something similar to BlockExchange the cost is too high.
I would suggest that we:
Remove any mention of cache efficiency/locality, as this is misleading: we are increasing use of the cache at the cost of performance.
Change the conclusions. For elementwise it could be something like:
Why elementwise wins: a simple access pattern with memory coalescing leads to the transfer of the smallest number of memory sectors.
I don’t know what to put for the rest. From a quick inspection it looks like:
Vectorize is slower than elementwise because of its uncoalesced memory access, which requests twice as many sectors from L1.
Manual vectorize is slower still because the loop unrolling reduces the efficiency of the cache further. Removing @parameter didn’t prevent the unrolling.
Tiling is slow because it makes 4 times as many requests to/from L1 as vectorize, due to the lack of SIMD.
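One way a misaligned vector load can double sector traffic, which may be part of what the profiler is showing here, can be sketched as follows (assuming 32-byte L1 sectors and 16-byte per-thread vector loads; these are illustrative figures, not taken from the puzzle's ncu output):

```python
SECTOR = 32  # assumed L1 sector size in bytes

def sectors_for_load(addr, width=16):
    """Distinct 32-byte sectors touched by one thread's vector load."""
    return len({b // SECTOR for b in range(addr, addr + width)})

# When per-thread loads are not contiguous across the warp (the
# uncoalesced case), each load's sector count matters on its own:
print(sectors_for_load(0))   # 1: 16B load sits inside one 32B sector
print(sectors_for_load(24))  # 2: straddles a sector boundary -> doubled
```

For coalesced, contiguous warp accesses this misalignment penalty largely amortizes away, which is another reason the elementwise pattern wins here.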
Firstly, I’m really not sure about all the references to cache efficiency/utilization, so maybe we should discuss this here before submitting any alterations.
As discussed, after making the alignment corrections (thanks for the quick PR!), the cache efficiency argument becomes invalid. So I’ll modify that after merging your PR, and I’d like to include the bfloat16 results too to match the expected SIMD behavior.
Secondly I may be missing the point but I think that by tiling/vectorizing you want to create a blocked arrangement where each thread has access to contiguous elements (spatial locality)
Ideally, but it clearly has shortcomings.
there is no built-in feature in Mojo similar to cub::BlockExchange to do this in a coalesced manner; is that correct?
Confirmed by the team! There’s none in Mojo.
I would suggest that we: …
I agree! The benchmarking discussion needs to be modified. I’d frame it around the memory access pattern tradeoff and coalescing vs. locality considerations.
I don’t know what to put for the rest. From a quick inspection it looks like:
Vectorize is slower than elementwise because of its uncoalesced memory access, which requests twice as many sectors from L1.
Manual vectorize is slower still because the loop unrolling reduces the efficiency of the cache further. Removing @parameter didn’t prevent the unrolling.
Tiling is slow because it makes 4 times as many requests to/from L1 as vectorize, due to the lack of SIMD.