Supporting New Accelerators in Mojo: The Case of the AMD MI300X

Hi Maximilian,

The work to support the MI300X is ongoing, but I would summarize it as falling into these general categories for any given piece of hardware:

  1. Runtime/compiler integration: we need to talk to a code generator (e.g. an LLVM backend) and a low-level driver (e.g. to boot the device, enumerate devices, submit kernels, and copy data). The first sketch after this list gives a feel for what this layer looks like from user-level Mojo.

  2. Kernel library adaptation: we need to adapt our kernel library to work with the compiler. Much of this is standardized, so getting things working generally goes pretty smoothly with a high-quality LLVM backend.

  3. Enable tools: debuggers, profilers, platform features like printf, etc. are all optional (but important) and take work. In the case of AMD, for example, we reimplemented printing in pure Mojo, going all the way down to the low-level interfaces, to make sure we didn’t bring in OpenCL dependencies. We have a whitepaper on this and can share it with the world if there’s interest.

  4. Performance: the biggest piece is unlocking the power of the hardware (e.g. novel tensor cores) and figuring out the performance characteristics of the chip. This varies widely based on the target silicon and how similar it is to what we already support. There is a lot of convergence in the design of many chips, but performance is never “done”. Mojo’s support for advanced parametric programming is a huge superpower for this work; the second sketch after this list shows the kind of thing I mean.
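To make items 1 and 3 a bit more concrete, here is a minimal sketch of submitting a kernel and printing from it in user-level Mojo, roughly along the lines of the public Mojo GPU programming examples. Treat the exact `DeviceContext` / `gpu.id` calls as assumptions of this illustration rather than a description of our internals; the point is that all of the driver plumbing from item 1 (and the printf work from item 3) sits underneath a few lines like these:

```mojo
from sys import has_accelerator
from gpu.host import DeviceContext
from gpu.id import block_idx, thread_idx

fn hello_kernel():
    # Device-side print: the kind of platform feature item 3 is about.
    print("block", block_idx.x, "thread", thread_idx.x)

def main():
    @parameter
    if not has_accelerator():
        print("no compatible accelerator found")
    else:
        # DeviceContext wraps the low-level driver work from item 1:
        # pick a device, submit the kernel, and synchronize.
        var ctx = DeviceContext()
        ctx.enqueue_function[hello_kernel](grid_dim=2, block_dim=4)
        ctx.synchronize()
```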

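And to show what I mean by parametric programming in item 4, here is a tiny, hypothetical example (not code from our kernel library): the element type and vector width are compile-time parameters, so one function body specializes per target with no runtime dispatch, and per-chip tuning decisions become parameter choices.

```mojo
# Hypothetical illustration; the names and the chosen width are
# assumptions for the example, not tuned values for any real chip.
fn scaled_add[dtype: DType, width: Int](
    a: SIMD[dtype, width], b: SIMD[dtype, width], scale: Scalar[dtype]
) -> SIMD[dtype, width]:
    # Specialized at compile time for the chosen element type and vector
    # width, so a per-chip width can be picked without touching the body.
    return a * SIMD[dtype, width](scale) + b

fn main():
    # Pretend one target prefers width 8 and another 16; the kernel
    # source is identical either way.
    alias width = 8
    var x = SIMD[DType.float32, width](1.0)
    var y = SIMD[DType.float32, width](2.0)
    print(scaled_add[DType.float32, width](x, y, 3.0))
```
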
What I can tell you is that all of the above is many orders of magnitude less work than building an entire AI solution from scratch. #4 is generally the most work, and Modular doesn’t want to have to do all of it for all use cases :-). I’m excited about us open-sourcing the kernels “real soon now”, because then many folks can see what this looks like.

I hope this helps. Mojo and MAX aren’t “magic”, so the cost above isn’t zero, but it is a major step forward for hardware enablement in my opinion. The cost is proportional to “how weird” the chip is (compared to other things MAX already supports), so the cost slowly goes down over time.

-Chris
