How exactly does Modular / Mojo extend beyond CPU and GPU?

I’ve just read through the “Democratizing AI Compute” blog post series. Parts 1 to 9 are amazing. Part 10 is well written, but I feel it is a bit hand-wavy about the “killer feature” of Modular / Mojo’s solution to all the problems pointed out in the previous posts, especially on how Modular / Mojo is going to extend beyond GPUs.

For starters, democratizing AI compute means that “all kinds of people like researchers and framework developers can write code that taps into the full power of hardware, rather than just hardware vendors and compiler gurus”. Take tiling, for example: from reading the standard library as well as MAX’s source code, I can see that you can parameterize the desired tile size according to the hardware you’re targeting.

But the way it abstracts different hardware configurations seems to be… a bunch of if/else statements on things like is_nvidia_gpu? And for many operators like attention, this if is_nvidia_gpu (or if is_cpu) check happens at a very high level, before the code path diverges into different functions that don’t even share the same function name. So if we’re going to add, say, XeGPU support tomorrow, are we going to do a similar thing and diverge to a different code path (e.g. if is_xegpu(): run_xegpu_attention()) at a fairly high level?
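To make the concern concrete, here is a minimal Python sketch of the dispatch shape I mean. Every name in it is hypothetical and only mirrors the pattern described above; it is not actual MAX or Mojo standard library code.

```python
# Hypothetical sketch: none of these functions exist in MAX or the Mojo stdlib.

def run_nvidia_attention(q, k, v):
    # Stand-in for an NVIDIA-specific kernel.
    return "nvidia kernel"


def run_cpu_attention(q, k, v):
    # Stand-in for a CPU-specific kernel.
    return "cpu kernel"


def attention(q, k, v, target):
    # The branch sits at the operator level, so every target gets its own
    # entry point with its own name.
    if target == "nvidia_gpu":
        return run_nvidia_attention(q, k, v)
    elif target == "cpu":
        return run_cpu_attention(q, k, v)
    # A new target (say "xegpu") needs another branch here, and in every
    # other operator that dispatches the same way.
    raise NotImplementedError(f"no attention kernel for target {target!r}")
```

Under this shape, adding a target means touching each operator's dispatch function rather than implementing one interface in one place.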

More broadly: how exactly does Modular / Mojo scale to more hardware? Is there a plugin system, or at least a unified interface, for adding new hardware?

Many people have drawn an analogy between Modular / Mojo and LLVM, which supports a wide array of different hardware. But I’m pretty sure LLVM’s hardware abstraction is more than just an if is_nvidia_gpu check at some high-level operator / instruction. For one, it has a whole infrastructure for registering a new target backend, which explicitly tells you what functionality you should implement and what features you should provide. Things like TargetLowering provide a unified interface for backend compiler passes to use, while TargetTransformInfo provides another unified interface for middle-end compiler passes. This abstraction is not perfect, and there are certainly some hardcoded is_this_target_foo checks, but the majority of the infrastructure is centered on these abstractions.
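As a rough illustration of what I mean by a unified interface, here is a small Python sketch of a target registry. It is only loosely modelled on the idea behind TargetTransformInfo; the class names, the registry, and the numbers are all hypothetical and are neither LLVM's nor Modular's API.

```python
from abc import ABC, abstractmethod


class TargetInfo(ABC):
    """Hypothetical per-target interface that every backend must implement."""

    @abstractmethod
    def vector_width_bits(self) -> int:
        ...

    @abstractmethod
    def has_matrix_unit(self) -> bool:
        ...


_TARGETS = {}  # target name -> TargetInfo subclass


def register_target(name):
    """Class decorator that adds a backend to the registry."""
    def wrap(cls):
        _TARGETS[name] = cls
        return cls
    return wrap


def get_target(name) -> TargetInfo:
    return _TARGETS[name]()


@register_target("toy_cpu")
class ToyCpu(TargetInfo):
    # Illustrative values only.
    def vector_width_bits(self) -> int:
        return 256

    def has_matrix_unit(self) -> bool:
        return False


@register_target("toy_gpu")
class ToyGpu(TargetInfo):
    # Illustrative values only.
    def vector_width_bits(self) -> int:
        return 128

    def has_matrix_unit(self) -> bool:
        return True


def elements_per_vector(target: TargetInfo, element_bits: int) -> int:
    # Target-independent code: it only talks to the interface and never
    # asks which target it is running on.
    return target.vector_width_bits() // element_bits
```

The point of the design is that a new backend is one new class that fills in the abstract methods, and instantiating a class that skips one of them fails immediately, which is the "tells you what you should implement" property.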

I don’t really see a similar abstraction design in MAX or Mojo’s standard library, other than some hardware configuration sprinkled here and there. Am I missing something?

Or am I looking in the wrong place, and this hardware abstraction only happens in the Mojo compiler, while the MAX kernels / standard library only provide building blocks for GPUs and CPUs?

You can see the start of an abstraction layer here. Initially, for the four targets that Mojo and MAX had (x86, Arm, NVIDIA, AMD), it was easier to just use if statements, but you’re correct that that approach doesn’t scale. There’s other work you can look at as well, either in the runtime or in flight, but rest assured there is little desire to manage a tree of if statements covering every relevant piece of hardware for the next 30 years. To build a good abstraction, there first needs to be a good understanding of what needs to be abstracted over and what capabilities need to be exposed. We’re starting to get to that place now.

Thanks, this looks pretty promising!

I just skimmed through it, and it seems to focus on device-driver concerns (e.g. memory management and kernel compilation) at the moment. Is there any plan to extend this HAL interface to lower-level hardware details, such as preferred vector size and preferred tiling size? Or will that be another interface?

Preferred vector size is exposed via LLVM attributes you can already query, although it’s not wired up to the stdlib yet. Tiling sizes are situation- and kernel-dependent, so you can’t really do much there aside from looking at the hardware tile sizes, which requires some information that isn’t yet plumbed through the HAL. My hope is that eventually you will get a hardware-agnostic version of a tuning guide/architecture manual to metaprogram with, but that’s a ways off.
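As a rough idea of what metaprogramming against such a description could look like, here is a Python sketch. The HardwareDescription fields and the selection heuristic are assumptions made up for illustration; nothing like this is exposed through the HAL today, and a real kernel would need a kernel-specific policy rather than this one-size-fits-all rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareDescription:
    # Hypothetical fields, not part of any existing HAL.
    native_tile: int       # edge length of the hardware matrix tile, in elements
    scratchpad_bytes: int  # fast on-chip memory available to one kernel block


def pick_square_tile(hw: HardwareDescription, element_bytes: int,
                     operands: int = 3) -> int:
    """Return the largest square tile edge that is a multiple of the native
    tile and lets `operands` tiles fit in the scratchpad at once."""
    def footprint(edge: int) -> int:
        return operands * edge * edge * element_bytes

    tile = hw.native_tile
    if footprint(tile) > hw.scratchpad_bytes:
        raise ValueError("even one native tile per operand does not fit")
    # Grow in steps of the native tile while the next size still fits.
    while footprint(tile + hw.native_tile) <= hw.scratchpad_bytes:
        tile += hw.native_tile
    return tile


# Illustrative numbers only: a 16x16 native tile and 48 KiB of scratchpad.
hw = HardwareDescription(native_tile=16, scratchpad_bytes=48 * 1024)
tile_fp16 = pick_square_tile(hw, element_bytes=2)
tile_fp32 = pick_square_tile(hw, element_bytes=4)
```

The kernel code only sees the description, so the same selection logic would apply to any target that can fill in those two fields.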

Hi Min,

Thank you for the comments, I’m glad you enjoyed the posts - I should really make time to get back to writing more. 🙂

The goal of enabling access to diverse hardware is a big one, and there isn’t “one easy way” to solve this problem - as the posts point out, many have tried. We’re taking a layered approach, with abstractions at all levels, and bringing in newer technologies (e.g. graph compilers) that are familiar from AI systems but that many software engineers aren’t familiar with.

Each of our contributions is useful-but-not-groundbreaking in its own right, but when it all stacks together, it enables us to bring up new hardware rapidly and deliver peak performance on existing hardware. This is made possible by our vertically integrated and “designed together” stack, which no one has done before.

We’ll have more to share about this over time, but you can check out some of our tech talks at the LLVM developer meetings (The LLVM Compiler Infrastructure Project) that cover some of the different components of that work.

-Chris