Can I get a review on this GPU-based benchmark?
Gist: amdgcn DPP `shuffle_xor(val, 1)` implementation in Mojo
I think this makes `shuffle_xor(val, n)` (where `0 <= n < 4`) about 3x faster on MI300X. Is this a good microbenchmark, though, and can I benchmark it on some real-world reductions that use this kind of shuffle?
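For context, here's a minimal sketch of the kind of real-world use I have in mind: a butterfly reduction where every lane in a group of 4 ends up with the sum. The import path and `shuffle_xor(val, offset)` signature are assumptions based on Mojo's `gpu.warp` module, not the gist itself:

```mojo
from gpu.warp import shuffle_xor  # assumed import path

fn quad_sum(val: Float32) -> Float32:
    # Butterfly reduction over a group of 4 lanes: two xor-shuffle
    # steps, after which every lane in the quad holds the full sum.
    # Both masks (2 and 1) are below 4, so they'd be DPP-eligible.
    var v = val
    v += shuffle_xor(v, 2)  # exchange across the two pairs
    v += shuffle_xor(v, 1)  # exchange within each pair
    return v
```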
Measured on MI300X and RX 6900XT. I have no idea why RDNA is benefiting from a mix of VALU & LDS instructions. My hypothesis was that the CDNA GPU would benefit more, since it can issue two instructions per clock when given that mix. AFAIK RDNA can only issue one instruction per cycle per SIMD, so it has nothing to gain on the IPC front from a richer instruction mix.
The downside is that the shuffle offset/XOR mask must be an immediate value encoded in the instruction, so it'd take an `unswitch()` or wiring the mask through as a parameter at the call sites that need it (see the sketch below). I think many reductions would be compatible with that approach since they know the shape of the reduction ahead of time.
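Something like this is what I mean by wiring the mask up as a parameter, plus a hand-rolled unswitch for call sites that only learn the mask at runtime. `shuffle_xor_imm` and `shuffle_xor_dyn` are hypothetical names, and the `shuffle_xor` signature is the same assumption as above:

```mojo
from gpu.warp import shuffle_xor  # assumed import path

fn shuffle_xor_imm[mask: Int](val: Float32) -> Float32:
    # The mask is a compile-time parameter, so the backend can fold it
    # into the DPP immediate at each instantiation.
    return shuffle_xor(val, UInt32(mask))

fn shuffle_xor_dyn(val: Float32, mask: UInt32) -> Float32:
    # Unswitch-style dispatch: map a runtime mask onto one of the
    # compile-time instantiations, falling back to the generic path.
    if mask == 1:
        return shuffle_xor_imm[1](val)
    elif mask == 2:
        return shuffle_xor_imm[2](val)
    elif mask == 3:
        return shuffle_xor_imm[3](val)
    return shuffle_xor(val, mask)  # generic (non-DPP) fallback
```

A reduction that knows its shape ahead of time would just call the parameterized version directly and never pay for the dispatch.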