Any chance we could get official support for NVIDIA’s Turing architecture (RTX 20xx series) in Mojo and MAX? These GPUs are still widely used by individuals, like myself, and small teams, and adding support would make MAX more accessible to a broader audience.
This is a situation where we're caught between two goals: we want to be conservative about what we commit to supporting, but aggressive about what we enable. I think this is already possible (though there may be some CUDA version limitations that are overly conservative). Are you interested in MAX enabling this architecture so you can do your own kernel programming on it, or are you asking us to do a deep dive into supporting it for a wide range of use cases?
To follow up on what Chris is asking, there are a couple of levels of GPU support: the simple ability to build Mojo code targeting a specific GPU architecture, and full kernel support for running complex models on a given architecture.
If you just want the ability to program GPUs using Mojo and basic Python MAX graphs, there's good news: you may be able to add this support for Turing GPUs yourself. As of the latest MAX nightly, the gpu module is now open-sourced within the Mojo standard library. It contains our basic support for targeting specific architectures, and you can now build on that.
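For anyone following along, here's roughly what basic GPU programming through the open-sourced gpu module looks like. This is a minimal sketch against a recent nightly; the exact module paths and `DeviceContext` signatures have been shifting between releases, so treat the details as illustrative rather than definitive:

```mojo
from gpu.id import block_dim, block_idx, thread_idx
from gpu.host import DeviceContext

alias dtype = DType.float32
alias length = 1024
alias block_size = 256

fn vector_add(
    out: UnsafePointer[Scalar[dtype]],
    lhs: UnsafePointer[Scalar[dtype]],
    rhs: UnsafePointer[Scalar[dtype]],
):
    # One thread per element.
    var i = block_idx.x * block_dim.x + thread_idx.x
    if i < length:
        out[i] = lhs[i] + rhs[i]

def main():
    var ctx = DeviceContext()

    var lhs_buf = ctx.enqueue_create_buffer[dtype](length)
    var rhs_buf = ctx.enqueue_create_buffer[dtype](length)
    var out_buf = ctx.enqueue_create_buffer[dtype](length)

    # Fill the inputs from the host.
    with lhs_buf.map_to_host() as host:
        for i in range(length):
            host[i] = Float32(i)
    with rhs_buf.map_to_host() as host:
        for i in range(length):
            host[i] = Float32(2 * i)

    # Launch one thread per element, then wait for the GPU to finish.
    ctx.enqueue_function[vector_add](
        out_buf.unsafe_ptr(),
        lhs_buf.unsafe_ptr(),
        rhs_buf.unsafe_ptr(),
        grid_dim=(length + block_size - 1) // block_size,
        block_dim=block_size,
    )
    ctx.synchronize()

    with out_buf.map_to_host() as host:
        print("out[1] =", host[1])  # expect 3.0
```

On a Turing card, this only compiles and runs once the architecture is recognized by the standard library, which is where the next step comes in.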
You could potentially build a local custom Mojo standard library containing support for Turing (sm_75) GPUs, which would let you build Mojo code that runs on your RTX 20xx GPU. Architecture support is largely contained within the gpu/host/info.mojo source file. If you want an example of how to add a new architecture, here's how we enabled basic GPU programming support for Jetson Orin devices. Some of the data structures have changed a bit since then, but it should give you the general idea.
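To make the shape of the change concrete, here's a hypothetical sketch of what a Turing entry in gpu/host/info.mojo might look like. The field names and the exact set of fields are assumptions and need to be matched against the current `GPUInfo` struct definition when you edit the file; the numbers below are for an RTX 2060:

```mojo
# Hypothetical entry for gpu/host/info.mojo; field names are illustrative
# and must be checked against the current GPUInfo struct definition.
alias RTX2060 = GPUInfo(
    name="RTX2060",
    vendor=Vendor.NVIDIA_GPU,
    api="cuda",
    arch_name="turing",
    compute=7.5,          # Turing compute capability
    version="sm_75",
    sm_count=30,          # 30 SMs on an RTX 2060
    warp_size=32,
    threads_per_sm=1024,  # Turing caps resident threads per SM at 1024
)
```

You'd also need to teach whatever lookup function maps target names to `GPUInfo` values about "sm_75", so the new entry is actually reachable.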
If you got that working locally and tested it on your GPU, I think we’d welcome a PR to add this support.
Beyond basic support, I can’t guarantee that our kernels will all work well on a Turing GPU, and they certainly won’t be tuned well for that GPU class out of the box. Stay tuned for more on how you might be able to tune the kernels used in MAX models for your specific GPU architecture.
Thanks for the explanation! I could try writing kernels for Turing GPUs once I've learned enough. I'll definitely play around with this and see what I can figure out.
So the kernels might not work well, but they would work? For example, if I want to simply run max-recipes, would I need to rebuild the standard library from source, rather than using the one installed with magic?
Without extending the standard library, either your Turing GPU won’t be recognized as a MAX-supported GPU, or you’ll get a Mojo compilation failure (“unable to satisfy a constraint: sm_75 not recognized as an architecture”, or something like that).
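For context on where that compilation failure comes from: Mojo's compile-time `constrained` check is the mechanism behind errors of that shape. This is a hedged illustration, not the actual stdlib code, but it shows why an unlisted architecture fails at compile time instead of miscompiling:

```mojo
fn require_supported_arch[arch: StringLiteral]():
    # Hypothetical guard: an arch outside the allowed set fails at
    # compile time with an "unable to satisfy constraint" diagnostic.
    constrained[
        arch == "sm_80" or arch == "sm_90",  # no "sm_75" in the set yet
        "unable to satisfy constraint: unrecognized GPU architecture",
    ]()
```

Adding "sm_75" to the recognized set (alongside the gpu/host/info.mojo entry) is what lifts that class of error.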
If you build a modified local version of the Mojo standard library with Turing GPUs added, you may be able to use that to run MAX models. However, none of these kernels have been tested on Turing, so architecture-specific differences may cause them to crash or otherwise not work correctly. Hard to tell what will happen without testing.
Hi Brad,
I think I've hit a little issue with the solution you showed: when I build the Mojo stdlib from source, I get an error, `unable to locate module 'layout'`, and I was also unable to find that module in the repository. Does this mean I need to somehow combine the library that contains layout with the one I built from source?
Sorry about that; we're open-sourcing these libraries in sequence, and the layout module is in one of the next sets to make it into the nightlies. Due to cross-module dependencies, there may be occasional challenges in getting these to build until all of them have landed. That hopefully won't last more than a few days. I'll check on this one in particular, since I know it's referenced in a few places.
Full MAX models like the Llama architectures aren't fully functional on this GPU yet, because some of the kernels need to be updated to account for the Tensor Cores and other architectural differences on Turing GPUs. We definitely welcome patches to help extend the relevant kernels to work well on Turing GPUs.