One undocumented and experimental capability that we’re sharing with the community is very early support for writing custom GPU operations in Mojo. We currently have two examples of this in the nightly branch of the MAX GitHub repository under examples/custom_ops: a very basic “hello world” sample that adds 1 to each element of a tensor, and a kernel that calculates the number of iterations needed to escape the Mandelbrot set. Both define their custom operations through a Mojo API and then show how to construct simple computational graphs in Python that use those operations.
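To give a feel for the kernel side, here is a rough sketch modeled on the add_one example in that directory (using the newer max.tensor import path mentioned later in this thread). Because this API is undocumented and in flux, the registration decorator, module paths, and context type used here are assumptions that may not match the nightly you’re running:

```mojo
import compiler
from max.tensor import ManagedTensorSlice, foreach
from runtime.asyncrt import MojoCallContextPtr
from utils.index import IndexList


# Register a custom "add_one" operation with a single
# destination-passing-style output tensor.
@compiler.register("add_one", num_dps_outputs=1)
struct AddOne:
    @staticmethod
    fn execute[type: DType, rank: Int](
        out: ManagedTensorSlice[type, rank],
        x: ManagedTensorSlice[type, rank],
        ctx: MojoCallContextPtr,
    ):
        @parameter
        @always_inline
        fn add_one[width: Int](idx: IndexList[rank]) -> SIMD[type, width]:
            # Load a SIMD vector of elements at this index and add 1 to each.
            return x.load[width](idx) + 1

        # foreach tiles the elementwise function across the output tensor,
        # running on CPU or GPU depending on where the graph is placed.
        foreach[add_one](out, ctx)
```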
I’ll caution that these examples are very much subject to change or breakage as we work toward our next stable release, and that they currently have little to no documentation, but we wanted to give the MAX community an early preview of this capability. Since people have already started trying out these GPU programming examples, I felt it would be useful to start a thread where we can discuss them and gather issues and requests.
One issue that surfaced in a Discord discussion, and that I wanted to document here for reference: to use custom Mojo operations with our in-development extensibility API, you need to make sure the environment variable MODULAR_ONLY_USE_NEW_EXTENSIBILITY_API is set to true. This is done for you as part of the Magic invocation.
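If you launch the examples outside of Magic, a minimal equivalent is to export the variable yourself; the script name below is just illustrative:

```sh
MODULAR_ONLY_USE_NEW_EXTENSIBILITY_API=true python addition.py
```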
As a quick update on these examples, the latest nightly contains a series of improvements:
- The addition example now shows a simpler way to create a single-operation graph in our Python API.
- The Mandelbrot set example now calculates the initial value of the complex number c inside the operation, avoiding a wasteful calculation on, and copy from, the host. It shows how you can define an operation that has no tensor inputs but still produces a tensor output (a sketch of that pattern follows this list).
- The latest nightly version of MAX no longer requires the environment variable MODULAR_ONLY_USE_NEW_EXTENSIBILITY_API to be set to true for these custom operations to work.
- In the latest nightlies, the type for an accelerator Device in the Python Driver API has changed from CUDA to Accelerator, reflecting the generalized nature of accelerator support in MAX.
- We’re in the process of gradually adding API documentation for the Mojo custom operation types and functions used for GPU programming. These will appear first in the nightly documentation, such as the new entry for ManagedTensorSlice. Keep an eye on those nightly API docs for more.
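As promised above, here is a sketch of the no-tensor-input pattern: execute takes only an output slice and the call context, and every value is derived from the element index, so nothing has to be computed on or copied from the host. This is not the actual Mandelbrot kernel, it assumes foreach vectorizes along the innermost dimension, and the names and module paths are assumptions against a moving nightly API:

```mojo
import compiler
from max.tensor import ManagedTensorSlice, foreach
from runtime.asyncrt import MojoCallContextPtr
from utils.index import IndexList


# An operation with no input tensors: the output is the only argument
# besides the call context, and values are computed from indices.
@compiler.register("fill_coordinates", num_dps_outputs=1)
struct FillCoordinates:
    @staticmethod
    fn execute[rank: Int](
        out: ManagedTensorSlice[DType.float32, rank],
        ctx: MojoCallContextPtr,
    ):
        @parameter
        @always_inline
        fn coordinate[width: Int](idx: IndexList[rank]) -> SIMD[DType.float32, width]:
            # Map each element's column index to a coordinate value on the
            # device, the same idea the Mandelbrot example uses to build
            # its initial complex value c.
            var values = SIMD[DType.float32, width]()
            for lane in range(width):
                values[lane] = Float32(idx[rank - 1] + lane) * 0.001
            return values

        foreach[coordinate](out, ctx)
```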
Note: we’re working on exposing a nightly changelog for MAX in the same way as has been done for Mojo. When that is live, you’ll be able to read these feature updates within the MAX changelog. Stay tuned!
The latest MAX nightly adds a lot of new content for GPU programming in MAX:
- As a source-breaking change, ManagedTensorSlice and foreach have moved from the tensor_utils module to max.tensor.
- The vector_addition example demonstrates how to write device-specific codepaths, as well as how to manually dispatch Mojo functions on the GPU within a custom operation. This style of programming will feel much more familiar to those used to CUDA C programming. Note that the foreach abstraction performs elementwise calculations far more efficiently than the manual functions here, because it is tuned for the underlying hardware; this example is purely instructive.
- The top_k example shows a practical use case for a custom operation that is used today within large language model graphs: a top-K token sampler. The Mojo code demonstrates a much more complex calculation, as well as how to construct a custom shape function for the operation, and the Python-side code shows how such an operation is used in practice.
- The synchronous parameter has been removed from the interfaces in the custom operation examples. It was only useful in a few cases, and we’re evaluating removing it entirely, which simplifies the operation interface.
- Once the nightly docs update (the team is working hard on deploying it right now), initial API docs will appear for the gpu and layout modules and their dependencies.
- Custom operations can now be compile-time parameterized on user-supplied Int or StringLiteral parameters that you pass in via the Graph API. The parametric_addition example shows how this works, and I had a little fun with it when implementing image blend modes parameterized by the name of the blend function (a rough sketch of that pattern follows this list).
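And here is the rough shape of the blend-mode idea mentioned in the last item, assuming (as the parametric_addition example suggests) that graph-supplied parameters surface as compile-time parameters on execute. The parameter ordering, registration details, and module paths here are assumptions, not the documented API:

```mojo
import compiler
from max.tensor import ManagedTensorSlice, foreach
from runtime.asyncrt import MojoCallContextPtr
from utils.index import IndexList


@compiler.register("blend", num_dps_outputs=1)
struct Blend:
    @staticmethod
    fn execute[mode: StringLiteral, type: DType, rank: Int](
        out: ManagedTensorSlice[type, rank],
        base: ManagedTensorSlice[type, rank],
        overlay: ManagedTensorSlice[type, rank],
        ctx: MojoCallContextPtr,
    ):
        @parameter
        @always_inline
        fn blend_elements[width: Int](idx: IndexList[rank]) -> SIMD[type, width]:
            var a = base.load[width](idx)
            var b = overlay.load[width](idx)

            # The blend function is selected at compile time from the
            # StringLiteral parameter, so only one branch is compiled
            # into the kernel.
            @parameter
            if mode == "multiply":
                return a * b
            elif mode == "screen":
                return a + b - a * b
            else:
                return b

        foreach[blend_elements](out, ctx)
```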
In the top-K example, var top_k_sram = external_memory[...]() is pulling data from the array with size given by shared_mem_bytes. Is there a reason to do it this way instead of allocating the buffer beforehand and passing it as an argument?
The work on custom ops is really interesting. AFAIK you still need the MAX engine to compile/run/partition the computational graph. Any hope that we can run this graph from Mojo soon?
Hi @owenhilyard, the memory sized by shared_mem_bytes is shared between every thread in a block, and we don’t need to access it from the CPU host. It would be a waste of resources to allocate it on the host, especially with a very large batch size in this example: each batch element gets its own block, so for a 1M-element batch we’d need to allocate 1,000,000 * K * sizeof[T] bytes (for example, with K = 64 and float32 elements, that’s roughly 256 MB).
MAX has several interface layers, from low level to high: Driver, Graph, and Serve. Mojo is part of MAX as the language for defining high-performance computation. So far, we’ve shown how to use Mojo to program GPUs within the Graph portion of the API, and you can see the Driver being used to allocate memory on the host, manage interactions with accelerators, and transfer memory between host and device.
There are both Python and Mojo interfaces to the Driver and Graph APIs, although the Python APIs have gotten the most attention recently for various reasons: they can be introduced progressively into existing Python codebases, Mojo has changed a lot since these interfaces were first designed, and so on. The Python interfaces also provide zero-cost, direct interoperability with PyTorch tensors and NumPy arrays, which is huge for usability. When staging a computational graph, overhead from the host language tends to be minimal, so Python works well for driving graph computation.
We believe that moving computation into a graph is the best path for large-scale computation, like that done in ML models. However, we also recognize that setting up and compiling a graph carries some overhead. Without spoiling anything, we’re working on something in MAX that I think you’ll like and that will address your core question here.
I was imagining using cudaMalloc (or a Mojo binding to it) to allocate the memory on the GPU and get a device memory pointer back, then submitting that pointer (those 8 bytes) as an argument to the kernel. This would let MAX reuse the buffer if multiple calls are made.
cudaMalloc allocates from global memory, which is accessible from any thread in any block; it’s not as fast as shared memory, which is shared only within a single block.
Just adding to Jack’s answer. In the custom ops examples, the input tensors are allocated in the Python code and moved to device (GPU) memory. So when you see the custom op accessing ManagedTensorSlice instances, it’s accessing a pointer to GPU global memory.
There are also low-level Mojo APIs for allocating GPU memory and copying to and from it, including methods in gpu.memory for allocating and copying to and from shared memory. gpu.host.DeviceContext additionally has methods for allocating global memory buffers, but you’re less likely to need those in a custom op.
The driver API also provides methods for moving data to and from the GPU. We’re hoping to add API docs for the Mojo driver API soon.
Thanks for the extra context! I spend a fair amount of energy trying to design programs to do all of their allocation up front, so I’ll need to spend some time poking around there.
I’m also hoping to see some unified memory options in there at some point since DIGITS is rumored to have MI300A-style unified memory.
I think a direct translation looks something like this:
```mojo
from max.graph import Graph
from max.graph import TensorType, Type
from sys.info import has_nvidia_gpu_accelerator
from max.driver import cpu_device, DeviceTensor
from max.driver._cuda import cuda_device
from max.engine import InferenceSession, SessionOptions
from tensor import TensorSpec
from max.graph.ops import sin
import random


fn build_graph(tensor_type: TensorType) raises -> Graph:
    # A graph with a single sin op applied to its only input.
    var graph = Graph(
        name="graph-with-single-sin-op",
        in_types=List(Type(tensor_type)),
        out_types=List(Type(tensor_type)),
    )
    graph.output(sin(graph[0]))
    return graph


fn main() raises:
    alias rows = 2
    alias columns = 3
    alias dtype = DType.float32
    alias tensor_type = TensorType(dtype, rows, columns)

    var graph = build_graph(tensor_type)

    # Use a CUDA device if one is available, otherwise fall back to the CPU.
    var device = cuda_device() if has_nvidia_gpu_accelerator() else cpu_device()
    var session = InferenceSession(SessionOptions(device))
    var model = session.load(graph)

    # Allocate the input on the host and fill it with random values.
    var input_tensor = DeviceTensor(
        TensorSpec(dtype, rows, columns), cpu_device(), None
    ).to_tensor[dtype, 2]()
    var num_elements = input_tensor.spec().num_elements()
    var ptr = input_tensor.unsafe_ptr()
    random.rand(ptr, num_elements)

    # Move the input to the device, run the graph, and copy the result back.
    var device_tensor = input_tensor.to_device_tensor().move_to(device)
    var result = model.execute(device_tensor)
    var output_tensor = result[0].take().to_device_tensor().move_to(cpu_device())
    var output = output_tensor.to_tensor[dtype, 2]()
    print(output)
```
No custom ops required since max.graph.ops.sin exists.
The Mojo Graph and Driver APIs do support running graphs (with custom ops) on a GPU. I will warn that the Mojo Graph and Driver interfaces have diverged from the Python APIs and may not have all of the same features.
With the latest MAX nightly, the following code should place a graph on an accelerator. For inputs and outputs, Driver Tensors (note: not max.tensor Tensors) are allocated and placed on the accelerator: the input is initialized on the host and moved to the accelerator, and the output is retrieved from the accelerator and moved back to the host.
```mojo
from max.driver import accelerator_device, cpu_device, Tensor, Device
from max.engine import InferenceSession, SessionOptions
from max.graph import Graph, TensorType, ops


def main():
    alias VECTOR_WIDTH = 6

    host_device = cpu_device()
    gpu_device = accelerator_device()

    # Build a single-operation graph that applies sin elementwise.
    graph = Graph(TensorType(DType.float32, VECTOR_WIDTH))
    result = ops.sin(graph[0])
    graph.output(result)

    # Compile and load the graph onto the accelerator.
    options = SessionOptions(gpu_device)
    session = InferenceSession(options)
    model = session.load(graph)

    # Allocate and initialize the input tensor on the host.
    input_tensor = Tensor[DType.float32, 1]((VECTOR_WIDTH), host_device)
    for i in range(VECTOR_WIDTH):
        input_tensor[i] = 1.25

    # Move the input to the accelerator, execute, and bring the result back.
    results = model.execute(input_tensor^.move_to(gpu_device))
    output = results[0].take().to_device_tensor().move_to(host_device).to_tensor[DType.float32, 1]()
    print(output)
```
Owen and I collided on this, but one difference between our two examples is that we recently generalized the Mojo Driver interface so it’s no longer hardcoded to CUDA-compatible devices, to start aligning it with the Python Driver API.
The Mojo API reference for the MAX Driver module used to be available on the Modular website, but at the moment it seems to have been removed from both the stable and nightly API docs. I’d be super glad to have it back!
Yeah, this was removed a few months ago to avoid confusion with the Python Driver API as that was brought up. We’re looking to publish the Mojo Driver API docs again soon, though; they will likely appear in the nightly docs first.