Can I get a review on this GPU-based benchmark?
Gist: amdgcn DPP `shuffle_xor(val, 1)` implementation in Mojo
I think this makes `shuffle_xor(val, n)` (where `0 <= n < 4`) about 3x faster on MI300X. Is this a good microbenchmark, though, and can I benchmark it on some real-world reductions that use this kind of shuffle?
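For context, here's a minimal sketch of the kind of real-world use I have in mind: a butterfly reduction where every lane in a group of 4 ends up with the sum. The import path and `shuffle_xor(val, offset)` signature are assumptions based on Mojo's `gpu.warp` module, not the gist itself:

```mojo
from gpu.warp import shuffle_xor  # assumed import path

fn quad_sum(val: Float32) -> Float32:
    # Butterfly reduction over a group of 4 lanes: two xor-shuffle
    # steps, after which every lane in the quad holds the full sum.
    # Both masks (2 and 1) are below 4, so they'd be DPP-eligible.
    var v = val
    v += shuffle_xor(v, 2)  # exchange across the two pairs
    v += shuffle_xor(v, 1)  # exchange within each pair
    return v
```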
Measured on MI300X and RX 6900XT. I have no idea why RDNA is benefiting from a mix of VALU & LDS instructions. My hypothesis was that the CDNA GPU would benefit more, since it can issue two instructions per clock when given that mix. AFAIK RDNA can only issue one instruction per cycle per SIMD, so it has nothing to gain on the IPC front from a richer instruction mix.
The downside is that the shuffle offset/XOR mask must be an immediate value encoded in the instruction, so it'd take an `unswitch()` or wiring the mask through as a parameter at the call sites that need it (see the sketch below). I think many reductions would be compatible with that approach since they know the shape of the reduction ahead of time.
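Something like this is what I mean by wiring the mask up as a parameter, plus a hand-rolled unswitch for call sites that only learn the mask at runtime. `shuffle_xor_imm` and `shuffle_xor_dyn` are hypothetical names, and the `shuffle_xor` signature is the same assumption as above:

```mojo
from gpu.warp import shuffle_xor  # assumed import path

fn shuffle_xor_imm[mask: Int](val: Float32) -> Float32:
    # The mask is a compile-time parameter, so the backend can fold it
    # into the DPP immediate at each instantiation.
    return shuffle_xor(val, UInt32(mask))

fn shuffle_xor_dyn(val: Float32, mask: UInt32) -> Float32:
    # Unswitch-style dispatch: map a runtime mask onto one of the
    # compile-time instantiations, falling back to the generic path.
    if mask == 1:
        return shuffle_xor_imm[1](val)
    elif mask == 2:
        return shuffle_xor_imm[2](val)
    elif mask == 3:
        return shuffle_xor_imm[3](val)
    return shuffle_xor(val, mask)  # generic (non-DPP) fallback
```

A reduction that knows its shape ahead of time would just call the parameterized version directly and never pay for the dispatch.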