I’ve just read through the “Democratizing AI Compute” blog post series. Parts 1 to 9 are amazing. Part 10 is well written, but I feel like it is a bit hand-wavy about the “killer feature” of Modular / Mojo’s solution to all the problems pointed out in the previous posts, especially about how Modular / Mojo is going to extend beyond GPUs.
For starters, democratizing AI compute means that “all kinds of people like researchers and framework developers can write code that taps into the full power of hardware, rather than just hardware vendors and compiler gurus”. Take tiling, for example: from reading the standard library as well as Max’s source code, I can see that you can parameterize the desired tile size according to the hardware you’re targeting.
But the way it abstracts different hardware configurations seems to be… a bunch of if/else statements on things like is_nvidia_gpu? And for many operators like attention, this if is_nvidia_gpu (or if is_cpu) happens at a very high level, before the code path diverges into different functions that don’t even share the same function name. So if we’re going to add, say, XeGPU support tomorrow, are we going to do a similar thing and diverge into a different code path (e.g. if is_xegpu(): run_xegpu_attention()) at a fairly high level?
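To make the concern concrete, here is a minimal Python sketch of the pattern I mean. It is not actual Mojo or Max code: the function names, the string-valued target, and the placeholder return values are all made up for illustration.

```python
# Hypothetical sketch of operator-level dispatch. None of these names come
# from Mojo or Max; the "kernels" just return a label so the flow is visible.

def run_nvidia_attention(q, k, v):
    return "nvidia attention kernel"


def run_cpu_attention(q, k, v):
    return "cpu attention kernel"


def attention(q, k, v, target):
    # The branch sits at the top of the operator, so the two paths share
    # nothing below this point, not even a function name.
    if target == "nvidia_gpu":
        return run_nvidia_attention(q, k, v)
    elif target == "cpu":
        return run_cpu_attention(q, k, v)
    # A new target (say "xegpu") needs another arm here, and in every
    # other operator that branches the same way.
    raise NotImplementedError(f"no attention kernel for target {target!r}")
```

With this shape, supporting one more target means touching every operator that has such a branch, rather than implementing one interface in one place.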
More broadly: how exactly does Modular / Mojo scale to more hardware? Is there a plugin system, or at least a unified interface, for adding new hardware?
Many people have drawn an analogy between Modular / Mojo and LLVM, which supports a wide array of different hardware. But I’m pretty sure LLVM’s hardware abstraction is more than just if is_nvidia_gpu at some high-level operator / instruction. For one, it has a whole infrastructure for registering a new target backend, which explicitly tells you what functionality you should implement and what features you should provide. Things like TargetLowering provide a unified interface for backend compiler passes to use, while TargetTransformInfo provides another unified interface for middle-end compiler passes. This abstraction is not perfect, and there are certainly some hardcoded is_this_target_foo checks, but the majority of the infrastructure is centered on these abstractions.
I don’t really see a similar abstraction design in Max or Mojo’s standard library, other than some hardware configurations sprinkled here and there. Am I missing something?
Or am I looking at the wrong layer, and this hardware abstraction only happens in the Mojo compiler, while the Max kernels / standard library only provide building blocks for GPUs and CPUs?