Examples of programming GPU functions using the Mojo MAX Driver API

In the latest MAX nightly, we’ve added a new group of GPU programming examples that show how to write Mojo functions that run on a GPU via the MAX Driver API. They demonstrate a programming model that will feel familiar to CUDA C programmers: defining and dispatching GPU functions within a single Mojo file. In fact, the initial examples recreate the first three CUDA samples from the popular textbook “Programming Massively Parallel Processors” in MAX to show how basic concepts translate from CUDA. And we threw in calculating the Mandelbrot set, because that’s just fun.
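To give a flavor of the single-file model, here is a rough sketch of the device side of the first of those samples, vector addition, as a Mojo GPU function. Treat the names here as assumptions: this follows the general shape of Mojo's `gpu` module rather than the exact `max.driver` API in the examples, which is still evolving; the host-side dispatch is shown in the examples themselves.

```mojo
from gpu import block_dim, block_idx, thread_idx


# Sketch of a GPU function: each thread adds one element, mirroring the
# first PMPP sample. The index math is the direct analogue of CUDA's
# blockIdx.x * blockDim.x + threadIdx.x.
fn vector_addition(
    a: UnsafePointer[Float32],
    b: UnsafePointer[Float32],
    c: UnsafePointer[Float32],
    length: Int,
):
    var i = block_idx.x * block_dim.x + thread_idx.x
    # Guard against threads past the end of the vector, since the grid
    # is usually rounded up to a whole number of blocks.
    if i < length:
        c[i] = a[i] + b[i]
```

As in CUDA, the function itself is scalar code for a single thread; the parallelism comes entirely from the launch configuration chosen on the host.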

These examples do require a MAX-compatible GPU to build and run.

We’ve also published the API documentation for the Mojo max.driver module. One caution: we anticipate that the Mojo MAX Driver API will evolve before the next stable release. We will also be updating the API docs, so some items may be missing right now. Treat these as experimental examples and interfaces, subject to rapid change.

Previously, we released examples that demonstrated how to program custom MAX Graph operations using Mojo. This is still the path we use at Modular for building the nodes in complex computational graphs, such as those for AI models, and it is what we recommend for large-scale applications. The MAX graph compiler is extremely powerful and can optimize data-parallel code in ways that aren’t accessible when building outside of a graph.

These examples show the flexibility of the MAX GPU programming model, from single-file eager execution of GPU functions all the way to complex graphs of operations in a large language model. Working directly with the MAX Driver API can provide a great on-ramp for rapid prototyping of GPU code in Mojo that can then be placed inside a larger computational graph when ready. We’ll expand the examples in the near future to better illustrate this particular GPU development journey.

Try out these examples today using the latest MAX nightly; we’d love to hear your thoughts or questions about them!

4 Likes

For those looking to try this, here are lists of the GPUs that use the same die as each of the supported data-center cards:

A10: NVIDIA GA102 GPU Specs | TechPowerUp GPU Database
A100: NVIDIA GA100 GPU Specs | TechPowerUp GPU Database
L4: NVIDIA AD104 GPU Specs | TechPowerUp GPU Database
L40: NVIDIA AD102 GPU Specs | TechPowerUp GPU Database

For consumer GPUs, that mostly means the 30 and 40 series; the last time I tested a 50-series card (on launch day), it did not work. These are the cards most likely to work, but I can’t guarantee anything.

2 Likes

On a non-moderator note, the ability to compile kernels and call them without going through vendor-specific APIs is fantastic. I was going to ask for something similar to be exposed, since I have a few places where the JIT compiler is of great use, but I only want a single thread involved on the CPU side, and going through MAX is a bit too much overhead.

The documentation is also very welcome.

1 Like

A quick update on these examples: we’ve moved them over to use the LayoutTensor interface for working with tensors inside GPU functions. Joe’s been adding a bunch of API docs for the layout module and more, and we’re working hard to document the advantages that LayoutTensor provides as an interface to multidimensional memory structures on GPUs.
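As a hedged illustration of what that change buys (the exact parameters of LayoutTensor are assumptions here and may differ in the nightly), a LayoutTensor carries its dtype, shape, and memory layout in the type, so a GPU function no longer needs raw pointers plus a separate length argument:

```mojo
from gpu import block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor

# The layout encodes the shape and element ordering at compile time;
# this size is an arbitrary illustrative choice.
alias vec_layout = Layout.row_major(1024)


fn vector_addition(
    a: LayoutTensor[DType.float32, vec_layout],
    b: LayoutTensor[DType.float32, vec_layout],
    out: LayoutTensor[DType.float32, vec_layout],
):
    var i = block_idx.x * block_dim.x + thread_idx.x
    # The bounds check comes from the tensor itself rather than an
    # extra Int parameter passed alongside bare pointers.
    if i < out.shape[0]():
        out[i] = a[i] + b[i]
```

The same indexing code then extends naturally to multidimensional layouts, which is where the interface pays off most on GPUs.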