Supporting New Accelerators in Mojo: The Case of the AMD MI300X

I am currently taking a university seminar on Mojo. I’m struggling to answer the following question from my supervisor: What steps were required to support the first AMD GPU, the MI300X, in Mojo? You can see some things in the gpu package, but what else had to be done? More generally: what steps are required to support any new accelerator in Mojo? My supervisor asked me to answer this question in order to compare the amount of work needed to support new accelerators in Mojo to that required by an alternative AI compiler. I would really appreciate any help in answering these questions.

Hi Maximilian,

The work to support the MI300X is ongoing, but for any given piece of hardware I would break it down into these general categories:

  1. Runtime/compiler integration: we need to talk to a code generator (e.g. an LLVM backend) and a low-level driver (e.g. to boot the device, enumerate devices, submit kernels, and copy data). The first sketch after this list shows what this looks like from the user's side.

  2. Kernel library: we need to adapt our kernel library to work with the compiler. Many things are standardized here, so getting things working generally goes pretty smoothly with a high-quality LLVM backend.

  3. Enable tools: debuggers, profilers, platform features like printf, etc. are all optional (but important) and take work. In the case of AMD, for example, we reimplemented printing in pure Mojo, going all the way down to the low-level interfaces, to make sure we didn't bring in OpenCL dependencies (the second sketch below shows the user-visible result). We have a whitepaper on this and can share it with the world if there's interest.

  4. Performance: the biggest piece is unlocking the power of the hardware (e.g. novel tensor cores) and figuring out the performance characteristics of the chip. This varies widely based on the target silicon and how similar it is to what we already support. There is a lot of convergence in the design of many chips, but performance is never “done”. Mojo’s support for advanced parametric programming is a huge superpower for this work (see the last sketch below).

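To make #1 and #2 concrete, here's roughly what the user-visible end of that integration looks like once it all works: a trivial kernel written against the gpu package and launched through gpu.host. This is a sketch based on recent Mojo releases; the exact API names (DeviceContext, enqueue_create_buffer, enqueue_function, map_to_host) may shift over time, so check the current docs.

```mojo
from gpu import global_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

alias WIDTH = 1024


fn scale_by_two(data: UnsafePointer[Float32]):
    # Each GPU thread handles one element.
    data[global_idx.x] = data[global_idx.x] * 2.0


def main():
    # The driver-level work from #1: find a device, allocate memory
    # on it, move data, submit a kernel, and synchronize.
    var ctx = DeviceContext()
    var buf = ctx.enqueue_create_buffer[DType.float32](WIDTH)

    # Fill the buffer from the host side.
    with buf.map_to_host() as host:
        for i in range(WIDTH):
            host[i] = Float32(i)

    # Submit the kernel: one block of WIDTH threads.
    ctx.enqueue_function[scale_by_two](
        buf.unsafe_ptr(), grid_dim=1, block_dim=WIDTH
    )
    ctx.synchronize()

    # Copy the results back and check them.
    with buf.map_to_host() as host:
        print(host[0], host[WIDTH - 1])  # 0.0 2046.0
```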
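On #3, the payoff of that printf work is simply that an ordinary print works inside a kernel, on NVIDIA and AMD alike. A minimal sketch, with the same caveat about API names:

```mojo
from gpu import thread_idx
from gpu.host import DeviceContext


fn hello_kernel():
    # Device-side printing; on AMD this bottoms out in the pure-Mojo
    # implementation described above, with no OpenCL dependency.
    print("hello from thread", thread_idx.x)


def main():
    var ctx = DeviceContext()
    ctx.enqueue_function[hello_kernel](grid_dim=1, block_dim=4)
    ctx.synchronize()
```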
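And to give a flavor of the parametric-programming point in #4: compile-time parameters let a kernel be specialized per target, so tiling and unrolling decisions become tunable knobs rather than rewrites. A hypothetical sketch; scale_kernel and its elements_per_thread parameter are made up for illustration:

```mojo
from gpu import global_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

alias WIDTH = 4096
alias BLOCK = 256


fn scale_kernel[elements_per_thread: Int](data: UnsafePointer[Float32]):
    var base = Int(global_idx.x) * elements_per_thread

    # elements_per_thread is a compile-time parameter, so this loop is
    # fully unrolled and the kernel is specialized per instantiation.
    @parameter
    for i in range(elements_per_thread):
        data[base + i] = data[base + i] * 2.0


def main():
    var ctx = DeviceContext()
    var buf = ctx.enqueue_create_buffer[DType.float32](WIDTH)

    # Pick the specialization per target chip: on one GPU 4 elements
    # per thread might win, on another 8; tuning sweeps such knobs
    # without touching the kernel body.
    alias epw = 4
    ctx.enqueue_function[scale_kernel[epw]](
        buf.unsafe_ptr(),
        grid_dim=WIDTH // (epw * BLOCK),
        block_dim=BLOCK,
    )
    ctx.synchronize()
```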
What I can tell you is that all of the above is many orders of magnitude less work than building an entire AI solution from scratch. #4 is generally the most work, and Modular doesn’t want to have to do all of it for all use cases :-). I’m excited about us open-sourcing the kernels “real soon now”, because then many folks will be able to see what this looks like.

I hope this helps. Mojo and MAX aren’t “magic”, so the cost above isn’t zero, but it is a major step forward for hardware enablement in my opinion. The cost is proportional to “how weird” the chip is compared to other things that MAX already supports, so it goes down slowly over time.

-Chris


> We have a whitepaper on this and can share it with the world if there's interest.

Yes please!


Hi Chris,

This is really helpful – thank you!

I have two follow-up questions regarding the performance category:

  1. How do you unlock the full potential of the hardware? What exactly needs to be done? I’m particularly curious about making use of features like tensor cores.
  2. How do you determine the performance characteristics of a chip?

To give you a bit more context: the alternative compiler I’m comparing against is not built from scratch – it operates at a much higher level than Mojo, and its backends include CUDA, OpenCL, and OpenMP. In my opinion, Mojo could become a much better backend for it once a similar range of accelerators is supported.

Once again, if you could share any additional information or reading material on this topic, it would be greatly appreciated. I’ve already incorporated your first response into my findings.

I’m really looking forward to using Mojo’s GPU programming capabilities in my master’s thesis in a few months.

-Maximilian

Hi @Maximilian, please attend our (virtual) talk at GPUMode on Friday or our (in-person) hackathon on May 10th, and I’d be very happy to explain all of this.

@joe and @lukas, could you share the AMD printf lessons-learned doc when you get a chance? It’d be wonderful if you could just check the markdown into OSS so other folks can benefit from your learnings.

-Chris
