Great questions!
Firstly, I’m really not sure about all the references to cache efficiency/utilization, so maybe we should discuss this here before submitting any alterations.
As discussed, after making the alignment corrections (thanks for the quick PR!), the cache efficiency argument becomes invalid. So I’ll modify that after merging your PR, and I’d like to include the bfloat16 results too, to match the expected SIMD behavior.
Secondly, I may be missing the point, but I think that by tiling/vectorizing you want to create a blocked arrangement where each thread has access to contiguous elements (spatial locality).
Ideally, but it clearly has shortcomings.
There is no inbuilt feature in Mojo similar to `cub::BlockExchange` to do this in a coalesced manner; is that correct?
Confirmed by the team! There’s none in Mojo.
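For reference, here’s a small Python model (not Mojo, and not a profile of our kernels; just an illustration of the arithmetic) of why the blocked, per-thread-contiguous arrangement is uncoalesced: it counts how many distinct 32-byte sectors a 32-thread warp touches per load step under a striped (coalesced) vs. blocked assignment of float32 elements. `cub::BlockExchange` exists precisely to convert between these two arrangements through shared memory.

```python
# Model a warp of 32 threads reading float32 (4-byte) elements from a
# memory served in 32-byte sectors, under two thread-to-element mappings.
WARP = 32     # threads per warp
ELEM = 4      # bytes per float32 element
SECTOR = 32   # bytes per L1/L2 sector
ITEMS = 4     # elements processed per thread

def sectors_per_step(elem_index):
    """For each of ITEMS load steps, count the distinct sectors the warp touches."""
    counts = []
    for step in range(ITEMS):
        sectors = {elem_index(t, step) * ELEM // SECTOR for t in range(WARP)}
        counts.append(len(sectors))
    return counts

# Striped (coalesced): at step s, thread t reads element s*WARP + t, so the
# warp's addresses are consecutive and pack into the minimum number of sectors.
striped = sectors_per_step(lambda t, s: s * WARP + t)

# Blocked (spatial locality per thread): thread t owns the contiguous run
# [t*ITEMS, (t+1)*ITEMS) and reads one element per step, so the warp's
# addresses are strided by ITEMS elements.
blocked = sectors_per_step(lambda t, s: t * ITEMS + s)

print("striped sectors/step:", striped)  # 4 sectors: 32 threads * 4 B = 128 B
print("blocked sectors/step:", blocked)  # 16 sectors: addresses strided by 16 B
```

With these parameters the blocked layout touches 4x the sectors per step, which is the locality-vs-coalescing tradeoff we’d want the benchmarking discussion to be framed around.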
I would suggest that we: …
I agree! The benchmarking discussion needs to be modified. I’d frame it around memory access pattern tradeoffs, and coalescing vs. locality considerations.
I don’t know what to put for the rest. From a quick inspection it looks like:
- Vectorize is slower than elementwise because of its uncoalesced memory access, which requests twice as many sectors from L1.
- Manual vectorize is slower still because the loop unrolling reduces cache efficiency further. Removing `@parameter` didn’t prevent the unrolling.
- Tiling is slow because it makes four times as many requests to/from L1 as vectorize, due to the lack of SIMD.
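The last bullet is just request-count arithmetic, so here’s a toy Python model of it (assumed numbers: 4 float32 elements per thread and SIMD width 4; the real kernels may differ): a scalar tiled loop issues one load request per element, while a width-w vector load retires w elements per request, so scalar issues w times as many requests.

```python
# Toy request-count model for a blocked layout: each thread covers ITEMS
# float32 elements; a SIMD load of width w retires w elements per request.
WARP = 32   # threads per warp
ITEMS = 4   # elements per thread (assumed, for illustration)

def warp_load_requests(simd_width):
    """Total load requests a warp issues to cover ITEMS elements per thread."""
    assert ITEMS % simd_width == 0, "width must divide the per-thread tile"
    return WARP * (ITEMS // simd_width)

scalar = warp_load_requests(1)  # tiled scalar loop: one request per element
vector = warp_load_requests(4)  # vectorized: one width-4 request per thread

print(scalar, vector, scalar // vector)  # scalar issues 4x the requests
```

This only counts requests, not sectors per request, so it captures the tiling-vs-vectorize gap but not the coalescing penalty from the earlier bullets.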
LGTM!