Puzzle 23. Why use strided loading of tiles?

Great questions!

Firstly, I’m really not sure about all the references to cache efficiency/utilization, so maybe we should discuss this here before submitting any alterations.

As discussed, after making the alignment corrections (thanks for the quick PR :slight_smile: ), the cache efficiency argument becomes invalid. So I’ll modify that after merging your PR, and I’d like to include the bfloat16 results too, to match the expected SIMD behavior.

Secondly, I may be missing the point, but I think that by tiling/vectorizing you want to create a blocked arrangement where each thread has access to contiguous elements (spatial locality).

Ideally, yes, but that arrangement clearly has shortcomings.
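To make the tradeoff concrete, here is a rough back-of-the-envelope model in plain Python (not Mojo, and not a profiler result). It counts how many distinct memory sectors one warp touches per scalar load under a blocked arrangement versus a striped one. The warp size of 32, the 32-byte sector size, the 4-byte element size and the 4 items per thread are all assumptions for illustration, and the function names are hypothetical.

```python
# Toy model: count the memory sectors one warp touches per load instruction.
# Assumed values (not taken from the puzzle): 32 threads per warp,
# 32-byte sectors, 4-byte (float32) elements.
WARP_SIZE = 32
SECTOR_BYTES = 32
ELEM_BYTES = 4


def sectors_per_load(indices):
    """Number of distinct sectors covered by one warp-wide scalar load."""
    return len({(i * ELEM_BYTES) // SECTOR_BYTES for i in indices})


def blocked_index(thread, step, items_per_thread):
    """Blocked: each thread owns a contiguous run of elements."""
    return thread * items_per_thread + step


def striped_index(thread, step, items_per_thread):
    """Striped: consecutive threads read consecutive elements."""
    return step * WARP_SIZE + thread


def total_sectors(index_fn, items_per_thread=4):
    """Sum the sectors touched over all per-thread load steps."""
    total = 0
    for step in range(items_per_thread):
        indices = [index_fn(t, step, items_per_thread) for t in range(WARP_SIZE)]
        total += sectors_per_load(indices)
    return total


if __name__ == "__main__":
    print("blocked:", total_sectors(blocked_index))
    print("striped:", total_sectors(striped_index))
```

Under these assumptions the blocked layout touches more sectors per scalar load because neighbouring threads sit 16 bytes apart, while the striped layout packs a whole warp into 128 contiguous bytes. A real kernel that loads each thread's run as one SIMD vector would behave differently, so treat this only as a way to reason about the access pattern.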

There is no built-in feature in Mojo similar to cub::BlockExchange to do this in a coalesced manner, is that correct?

Confirmed by the team: there’s none in Mojo.
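For anyone following along, the idea behind a striped-to-blocked exchange can be emulated in plain Python. This is only a sketch of the general technique (coalesced striped loads, staging through a shared buffer, then contiguous per-thread reads); it is not CUB code and not a Mojo API, and the function name `striped_to_blocked` is hypothetical.

```python
def striped_to_blocked(data, num_threads, items_per_thread):
    """Emulate a striped load followed by an exchange into a blocked layout.

    Hypothetical helper for illustration only; lists stand in for
    registers and shared memory.
    """
    assert len(data) == num_threads * items_per_thread

    # Phase 1: striped loads. At each step, consecutive threads read
    # consecutive elements, which is the coalescing-friendly pattern.
    registers = [
        [data[step * num_threads + t] for step in range(items_per_thread)]
        for t in range(num_threads)
    ]

    # Phase 2: every thread writes its items to the shared buffer at the
    # element's original linear position.
    shared = [None] * len(data)
    for t in range(num_threads):
        for step in range(items_per_thread):
            shared[step * num_threads + t] = registers[t][step]

    # (A real kernel needs a block-wide barrier here.)

    # Phase 3: each thread reads back its own contiguous block.
    return [
        shared[t * items_per_thread:(t + 1) * items_per_thread]
        for t in range(num_threads)
    ]


if __name__ == "__main__":
    print(striped_to_blocked(list(range(8)), num_threads=2, items_per_thread=4))
```

The point of the staging step is that global memory only ever sees the striped pattern, while each thread still ends up holding contiguous elements.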

I would suggest that we: …

I agree! The benchmarking discussion needs to be modified. I’d frame it around the memory access pattern tradeoff and coalescing vs. locality considerations.

I don’t know what to put for the rest. From a quick inspection it looks like:

  • Vectorize is slower than elementwise because its uncoalesced memory access requests twice as many sectors from L1.
  • Manual vectorize is slower still because the loop unrolling reduces the efficiency of the cache further. Removing @parameter didn’t prevent the unrolling.
  • Tiling is slow because it makes 4 times as many requests to/from L1 as vectorize, due to the lack of SIMD.

LGTM :+1:
