Examples of programming GPU functions using the Mojo MAX Driver API

In the latest MAX nightly, we’ve added a new group of GPU programming examples that show how to write Mojo functions that run on a GPU via the MAX Driver API. They demonstrate a programming model that will feel familiar to CUDA C programmers: defining and dispatching GPU functions within a single Mojo file. In fact, the initial examples recreate the first three CUDA samples from the popular textbook “Programming Massively Parallel Processors” in MAX to show how basic concepts translate from CUDA. And we threw in calculating the Mandelbrot set, because that’s just fun.
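To give a flavor of the single-file model, here is a rough sketch of the device side of the first of those samples, vector addition, as a Mojo GPU function. Treat the names here as assumptions: this follows the general shape of Mojo's `gpu` module rather than the exact `max.driver` API in the examples, which is still evolving; the host-side dispatch is shown in the examples themselves.

```mojo
from gpu import block_dim, block_idx, thread_idx


# Sketch of a GPU function: each thread adds one element, mirroring the
# first PMPP sample. The index math is the direct analogue of CUDA's
# blockIdx.x * blockDim.x + threadIdx.x.
fn vector_addition(
    a: UnsafePointer[Float32],
    b: UnsafePointer[Float32],
    c: UnsafePointer[Float32],
    length: Int,
):
    var i = block_idx.x * block_dim.x + thread_idx.x
    # Guard against threads past the end of the vector, since the grid
    # is usually rounded up to a whole number of blocks.
    if i < length:
        c[i] = a[i] + b[i]
```

As in CUDA, the function itself is scalar code for a single thread; the parallelism comes entirely from the launch configuration chosen on the host.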

These examples do require a MAX-compatible GPU to build and run.

We’ve also published the API documentation for the Mojo max.driver module. One caution: we anticipate that the Mojo MAX Driver API will evolve before the next stable release. We will also be updating the API docs, so some items may be missing right now. Treat these as experimental examples and interfaces, subject to rapid change.

Previously, we released examples that demonstrated how to program custom MAX Graph operations using Mojo. This is still the path we use at Modular for building the nodes in complex computational graphs, such as those for AI models, and it is what we recommend for large-scale applications. The MAX graph compiler is extremely powerful and can optimize data-parallel code in ways that aren’t accessible when building outside of a graph.

These examples show the flexibility of the MAX GPU programming model, from single-file eager execution of GPU functions all the way to complex graphs of operations in a large language model. Working directly with the MAX Driver API can provide a great on-ramp for rapid prototyping of GPU code in Mojo that can then be placed inside a larger computational graph when ready. We’ll expand the examples in the near future to better illustrate this particular GPU development journey.

Try out these examples today using the latest MAX nightly; we’d love to hear your thoughts or questions about them!

4 Likes

For those looking to try this, here are lists of the GPUs that use the same die as each of the supported data-center cards:

A10: NVIDIA GA102 GPU Specs | TechPowerUp GPU Database
A100: NVIDIA GA100 GPU Specs | TechPowerUp GPU Database
L4: NVIDIA AD104 GPU Specs | TechPowerUp GPU Database
L40: NVIDIA AD102 GPU Specs | TechPowerUp GPU Database

For consumer GPUs, that mostly means the 30 and 40 series; the last time I tested a 50-series card (on launch day), it did not work. These are the cards most likely to work, but I can’t guarantee anything.

2 Likes

On a non-moderator note, the ability to compile kernels and call them without going through vendor-specific APIs is fantastic. I was going to ask for something similar to be exposed, since I have a few places where the JIT compiler is of great use, but I only want a single thread involved on the CPU side, and going through MAX is a bit too much overhead.

The documentation is also very welcome.

1 Like

A quick update on these examples: we’ve moved them over to use the LayoutTensor interface for working with tensors inside GPU functions. Joe’s been adding a bunch of API docs for the layout module and more, and we’re working hard to document the advantages that LayoutTensor provides as an interface to multidimensional memory structures on GPUs.
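As a hedged illustration of what that change buys (the exact parameters of LayoutTensor are assumptions here and may differ in the nightly), a LayoutTensor carries its dtype, shape, and memory layout in the type, so a GPU function no longer needs raw pointers plus a separate length argument:

```mojo
from gpu import block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor

# The layout encodes the shape and element ordering at compile time;
# this size is an arbitrary illustrative choice.
alias vec_layout = Layout.row_major(1024)


fn vector_addition(
    a: LayoutTensor[DType.float32, vec_layout],
    b: LayoutTensor[DType.float32, vec_layout],
    out: LayoutTensor[DType.float32, vec_layout],
):
    var i = block_idx.x * block_dim.x + thread_idx.x
    # The bounds check comes from the tensor itself rather than an
    # extra Int parameter passed alongside bare pointers.
    if i < out.shape[0]():
        out[i] = a[i] + b[i]
```

The same indexing code then extends naturally to multidimensional layouts, which is where the interface pays off most on GPUs.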