Calling all AMD RDNA users: help us bring full MAX support to your GPUs!

Hi everyone :waving_hand:

You might remember that with our 25.4 release, we officially brought full support for AMD’s powerful data center GPUs (MI300X and MI325X) to our MAX and Mojo platforms. These CDNA3-based GPUs are in our tier 1 of support: fully supported across MAX and Mojo and tested regularly in our CI systems.

But we’ve also been quietly working on something many of you have been asking for: better support for consumer-grade AMD RDNA GPUs.

Thanks to amazing community help, we’ve already made progress! You can now do general Mojo GPU programming on RDNA3 and RDNA4 GPUs, including the integrated Radeon 700M series and discrete Radeon RX 7000 and 9000 series cards.

However, we’re not quite there yet for full MAX model support on RDNA GPUs.

Here’s the gist: Many of our core Mojo kernels on AMD were originally built with CDNA GPUs in mind and aren’t yet compatible with RDNA GPUs, meaning that if you try to run a MAX model on an RDNA GPU, it will likely fail to compile. There are a few important architectural differences between CDNA and RDNA GPUs, including:

  • Wavefront size: RDNA GPUs can use either 32- or 64-wide wavefronts (the equivalent of CUDA warps: think of these as groups of threads working together), while CDNA GPUs only use 64-wide wavefronts. By default, RDNA uses 32-wide wavefronts, and specific flags need to be enabled to use 64-wide ones.
  • Matrix Cores: CDNA GPUs have dedicated matrix cores, while RDNA GPUs do not. CDNA GPUs use MFMA intrinsics to access these matrix cores, and while RDNA GPUs have WMMA instructions for accelerating matrix multiplication, they do not map directly to the MFMA instructions.

What have we done so far to support RDNA?

Since the 25.4 release, we’ve added some key improvements:

  • We’ve adjusted the default wavefront size (WARP_SIZE in the Mojo standard library) to 32 for RDNA GPUs.
  • We’ve added special functions (_is_amd_rdna(), _is_amd_rdna3(), and _is_amd_rdna4()) to allow for specialization of kernel code for RDNA GPUs.
  • We’ve also started adding WMMA intrinsics for RDNA GPU matrix multiplication.
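
To give a feel for how this specialization can look in kernel code, here's a rough sketch. The `WARP_SIZE`, `_is_amd_rdna()` function names come from the list above, but the import paths and the `mma_tile_k` helper are assumptions for illustration, not the exact API in the repo:

```mojo
# Sketch of compile-time specialization for RDNA vs. CDNA kernels.
# Import paths are assumptions; check the modular repo for the real ones.
from gpu import WARP_SIZE            # 32 on RDNA by default, 64 on CDNA
from gpu.host.info import _is_amd_rdna

fn mma_tile_k() -> Int:
    # Pick a matrix-fragment depth per architecture at compile time.
    @parameter
    if _is_amd_rdna():
        return 16   # RDNA WMMA operates on 16x16x16 fragments
    else:
        return 32   # CDNA MFMA variants support other fragment shapes
```

Because the branch is resolved at compile time, only the code path for the target architecture ends up in the generated kernel.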

We need your help to get to the finish line!

There’s still a lot of work to do to get MAX models running at peak performance on RDNA GPUs. This is where you, our awesome community members with RDNA 3 and newer GPUs, can make a huge difference!

Where to start: the flash attention kernel

Currently, one of the biggest roadblocks to running MAX models on RDNA is our flash attention kernel. It was designed for CDNA GPUs and makes assumptions about the shape of operations that don’t hold true for RDNA.

If you have an RDNA 3+ GPU and are interested in diving into kernel development, you can run the following command to build and test the Mojo standard library and kernels, and then run tests against our AMD flash attention kernel:

./bazelw test //max/kernels/test/gpu/nn:test_flash_attention_amd.mojo.test --test_output=all

This command currently fails on RDNA GPUs with a constraint error due to incompatible fragment sizes. (Note: due to a bug that we’re fixing in our Bazel GPU lookups, you may also need to edit this line in common.MODULE.bazel from “780M” to “radeon” to run this test. This is being fixed.)

Your mission: We need to add RDNA-specific code paths within our kernels that provide the correct shapes to the RDNA’s WMMA intrinsics, and port anything else needed from CDNA to RDNA. The goal is for this test to pass cleanly on RDNA 3 and 4 GPUs, just like it does on CDNA today. We of course still need to preserve all of our current CDNA support while doing so.

Everything you need to get started is in our open-source modular repository! You can dig in, experiment, and test your changes directly on your RDNA GPU. There are also other areas within our codebase that need RDNA-specific attention, so if you’re feeling adventurous, you can explore other failing GPU tests on RDNA.

Your contributions would be incredibly valuable! For anyone who makes a contribution in this area that gets merged, we’re sending out an awesome Mojo and MAX branded gamer pad. You’ll also be directly contributing to a system that countless fellow AMD GPU users would benefit from every day! :rocket:

Hi @BradLarson, could you share an update on what community members have been able to contribute on the RDNA 3/4 front? I see fixes for _mma_wmma_rdna got merged, but I’m not sure if there are any other relevant PRs.

There have been some small fixes internally, like this one, that have moved the needle forward slightly on AMD RDNA GPUs, but we haven’t had tremendous engagement on other RDNA additions. That’s mostly my fault; I need to put together a list of failing tests and other targeted smaller enhancements that’ll incrementally get us on a path to complete model support on RDNA GPUs. The original solicitation above is a bit broad, and it’s a lot to try to tackle RDNA bringup as a general area.

I’ll try to get a list of concrete tasks posted as issues on GitHub for people to attack, which I think will be much easier to manage.

“I’ll try to get a list of concrete tasks posted as issues on GitHub for people to attack”

Oh nice. I’m new to Mojo / GPU programming and currently halfway through the GPU puzzles. Once I’m done with those, I’ll probably be interested in contributing to the project. I have a 9070 and a general interest in what Modular is doing.

“Make flash attention work” is a bit too open-ended for me, but if there’s lower-hanging fruit to make RDNA4 work better, I’d want to take a stab at it.

Brad,

I am interested in getting started contributing. Have over a decade of coding experience, primarily in satellite communications realms. I’ve never coded for GPU use (the DSP guys do CUDA) but I have been looking at self-hosting an LLM for myself and kiddos, and running into all kinds of compatibility issues in another stack.

Have you seen this? It might help in getting you further along: [Draft] [Preview] Support gfx1201 by tjtanaa · Pull Request #1681 · ROCm/aiter · GitHub

Also, if it helps for some background: [Feature]: Someone please upstream this gfx1201/RDNA4 FP8 Patch into vllm-rocm · Issue #28649 · vllm-project/vllm · GitHub

My experience so far attempting to glue together new versions of supporting libraries with fixes has been that the way many of these libraries fit together is quite fragile.

Hi Brad,

I’ve recently found myself with an extended amount of free time and would be interested in learning and helping. I come from ~30 years of games industry experience, spanning strictly raster rendering to the modern era: shader languages, CUDA, and a smattering of OpenCL and C++ AMP (if you recall that). I started learning Mojo earlier this year.

Hardware I have available:

  • RDNA4, Proxmox setup w/ dual R9700
  • RDNA3, spare 7900XTX w/ eGPU harness
  • Nvidia RTX Pro 6000
  • Nvidia RTX 3090
  • MacBook Pro M4 Pro Max w/ 128GB RAM

Both OSS and the specific sort of optimizations for AI are new to me, but I’m willing and able to invest the time and effort to help.

Thanks for the offer!

One thing that’s extremely useful is in testing on various architectures to see where things fail. Often, we’re missing a comptime if or something simple, but we didn’t have the hardware available to make that obvious. Specifically on RDNA 3 / 4, that hardware can be a challenge to rent in the cloud, so you need something local to verify models and other code against. If you try out various models and encounter hardware-specific issues with them, GitHub issues are much appreciated.

For RDNA4 in particular, we don’t have device-specific optimizations for the new WMMA additions they provide, especially when it comes to float8 datatypes. I have RDNA3 and RDNA3.5 hardware that I’ve done some initial tuning on, but not RDNA4. If you wanted to get into kernels, you could look at what we have so far and see if you could extend it with the new RDNA4 WMMA extensions to push performance further on that platform.

For NVIDIA, one big opportunity for optimization is on sm_120 consumer Blackwell hardware. We don’t yet have dedicated Mojo matrix multiplications or kernels like 2-D convolution on that specific architecture. Your RTX Pro 6000 could prove to be a good test platform, if you wanted to try to develop a Mojo matmul specific to sm_120 / sm_121, as well as look at native NVFP4 support. sm_120 hardware is significantly different from the sm_100 Blackwell we’ve optimized for the data center, and good Mojo kernels for that architecture could be very helpful for developers running models locally on these GPUs.

Ha, also realized I never came back and updated this thread to note that the original goal has been achieved and we can indeed run MAX models on AMD RDNA GPUs today. I’ve also been hacking on some enhancements to matmul and 2-D convolution for RDNA 3+ GPUs that I mention above, which have significantly improved performance over our initial naive implementations of those kernels. Models like FLUX.2-klein actually run fairly well locally on an AMD Strix Halo system (Framework Desktop) using MAX in our latest nightlies.


@BradLarson Great timing on this update!


I was just about to buy a GMKtec AMD Strix Halo mini-PC (96GB unified RAM for ~$2,000) as a much more affordable alternative to dropping $4,000+ on a Mac Mini. My goal is to use this machine to integrate MAX as the backend inference engine for OpenClaw / Hermes to run local autonomous agents. Before I start setting this up, do you have any specific tips, tricks, or known quirks I should watch out for when deploying MAX on a Strix Halo system?

One quirk I’ll jump in with is that AMD has a very weird compatibility matrix, and Mojo and MAX use some parts of ROCm (such as “launch kernel”), so you may need to follow that matrix. Alternatively, doing a build of TheRock with Strix Halo enabled should be sufficient.

I’ll caution that we’re still early days, and I maybe wouldn’t make purchasing decisions for hardware based solely on current MAX support. It’s evolving rapidly across the various large shared-RAM consumer systems: Apple silicon GPUs, NVIDIA GB10, and AMD Strix Halo. Each system has advantages and disadvantages for faster computation vs. faster RAM access vs. price, and so on.

One big item I’ll call out is that we’re still getting the conventions right around shared-RAM systems, so our memory usage isn’t the best on these. For Strix Halo, as one example, you currently need to set the environment variable MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE=0 so that RAM size checks don’t fail when serving a larger model.
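
A minimal sketch of that workaround; the serve invocation below is illustrative only (the model path is a placeholder, and the exact CLI flags may differ from your installed MAX version):

```shell
# Disable MAX's up-front device memory reservation on shared-RAM systems
# like Strix Halo, where the RAM-size check can fail for larger models.
export MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE=0

# Illustrative invocation; substitute your actual model and flags.
max serve --model-path <your-model>
```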

We also have occasional breakages in kernels as we add capabilities on the CDNA side. We’re trying to catch those as they occur, but you may see temporary kernel failures on RDNA in certain models. Feel free to file issues as you encounter them; that helps with testing and identification.

Thanks @BradLarson. I will try this AMD purchase from Amazon for 30 days. If it’s frictionless, I’ll continue; otherwise I’ll switch to a DGX Spark.

Any approximate timeline for MAX inference for GenAI models on a Mac mini / AMD RDNA? Fall?

I believe we have pretty good coverage for serving of MAX GenAI models (that are sized appropriately) on AMD RDNA GPUs today. Regarding Apple silicon GPUs, I don’t want to promise any specific timeline. We’re close to getting models operational, many smaller subgraphs work on these GPUs today, but we’re tracking down some last compilation issues with specific kernels and operations that appear in GenAI models.

That’s very encouraging! I want to try this on an AMD box, run OpenClaw, and check out the performance. llama.cpp runs everywhere, and it will be good to have MAX as an option.

I just pulled the latest main and ./bazelw test //max/kernels/test/gpu/nn:test_flash_attention_amd.mojo.test is failing for me on RDNA4.
But, for example, //max/kernels/test/gpu/nn:test_layer_norm.mojo.test test passes fine.
error:

LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.wmma.f32.16x16x16.bf16
Please submit a bug report to https://github.com/modular/modular/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.      Running pass 'CallGraph Pass Manager' on module '<split-module>'.
1.      Running pass 'AMDGPU DAG->DAG Pattern Instruction Selection' on function '@nn_attention_gpu_mha_mha_D6A6A6A6A6A6A6A6AcB6A6A_99c8734ba792b877'

changes to common.MODULE.bazel that I had to make:

diff --git a/bazel/common.MODULE.bazel b/bazel/common.MODULE.bazel
index 5858eeaa21..a5e853b762 100644
--- a/bazel/common.MODULE.bazel
+++ b/bazel/common.MODULE.bazel
@@ -289,6 +289,7 @@ mojo.gpu_toolchains(
         "AMD Radeon RX 6900 XT": "rx6900xt",
         "AMD Radeon PRO W7900": "W7900",
         "AMD Radeon Pro W7900": "W7900",
+        "Radeon RX 9070": "rx9070",
         "Phoenix3": "780M",
         "Strix Halo": "strixhalo",
         "Metal 1": "",  # Unsupported, macOS updates on any hardware get you to Metal 3+
@@ -321,6 +322,7 @@ mojo.gpu_toolchains(
         "rtx5090": "nvidia:sm_120a",
         "gb10": "nvidia:sm_121",
         "rx6900xt": "amdgpu:gfx1030",
+        "rx9070": "amdgpu:gfx1201",
         "strixhalo": "amdgpu:gfx1151",
         "metal3": "metal:3",
         "metal4": "metal:4",

It’s possible that we need the gfx12 version of that intrinsic for RDNA4. I had hoped that the RDNA3 intrinsics would be forward-compatible, but that may not be the case. Don’t have an RDNA4 system of my own to test on, unfortunately.