One undocumented and experimental capability that we’re sharing with the community is very early support for writing custom GPU operations in Mojo. We currently have two examples of this in the nightly branch of the MAX GitHub repository under examples/custom_ops: a very basic “hello world” sample that adds 1 to each element of a tensor, and a kernel that calculates the number of iterations needed to escape the Mandelbrot set. Both define their custom operations through a Mojo API and then show how to construct simple computational graphs in Python that use those operations.
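To give a feel for the kernel side, here is a rough sketch modeled on the add_one example in that directory (using the newer max.tensor import path mentioned later in this thread). Because this API is undocumented and in flux, the registration decorator, module paths, and context type used here are assumptions that may not match the nightly you’re running:

```mojo
import compiler
from max.tensor import ManagedTensorSlice, foreach
from runtime.asyncrt import MojoCallContextPtr
from utils.index import IndexList


# Register a custom "add_one" operation with a single
# destination-passing-style output tensor.
@compiler.register("add_one", num_dps_outputs=1)
struct AddOne:
    @staticmethod
    fn execute[type: DType, rank: Int](
        out: ManagedTensorSlice[type, rank],
        x: ManagedTensorSlice[type, rank],
        ctx: MojoCallContextPtr,
    ):
        @parameter
        @always_inline
        fn add_one[width: Int](idx: IndexList[rank]) -> SIMD[type, width]:
            # Load a SIMD vector of elements at this index and add 1 to each.
            return x.load[width](idx) + 1

        # foreach tiles the elementwise function across the output tensor,
        # running on CPU or GPU depending on where the graph is placed.
        foreach[add_one](out, ctx)
```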
I’ll caution that these examples are very much subject to change or breakage as we work toward our next stable release, and that they currently have little to no documentation, but we wanted to give the MAX community an early preview of this capability. Since people have already started trying out these GPU programming examples, I felt it would be useful to start a thread where we can discuss them and gather issues and requests.
One issue that surfaced in a Discord discussion, and that I wanted to document here for reference: to use custom Mojo operations with our in-development extensibility API, you need to make sure the environment variable MODULAR_ONLY_USE_NEW_EXTENSIBILITY_API is set to true. This is done for you as part of the Magic invocation.
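If you launch the examples outside of Magic, a minimal equivalent is to export the variable yourself; the script name below is just illustrative:

```sh
MODULAR_ONLY_USE_NEW_EXTENSIBILITY_API=true python addition.py
```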
As a quick update on these examples, the latest nightly contains a series of improvements:
- The addition example now shows a simpler way to create a single-operation graph in our Python API.
- The Mandelbrot set example now calculates the initial value of the complex number c inside the operation, avoiding a wasteful calculation on, and copy from, the host. It shows how you can define an operation that has no tensor inputs but still produces a tensor output (a sketch of that pattern follows this list).
- The latest nightly version of MAX no longer requires the environment variable MODULAR_ONLY_USE_NEW_EXTENSIBILITY_API to be set to true for these custom operations to work.
- In the latest nightlies, the type for an accelerator Device in the Python Driver API has changed from CUDA to Accelerator, reflecting the generalized nature of accelerator support in MAX.
- We’re in the process of gradually adding API documentation for the Mojo custom operation types and functions used for GPU programming. These will appear first in the nightly documentation, such as the new entry for ManagedTensorSlice. Keep an eye on those nightly API docs for more.
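As promised above, here is a sketch of the no-tensor-input pattern: execute takes only an output slice and the call context, and every value is derived from the element index, so nothing has to be computed on or copied from the host. This is not the actual Mandelbrot kernel, it assumes foreach vectorizes along the innermost dimension, and the names and module paths are assumptions against a moving nightly API:

```mojo
import compiler
from max.tensor import ManagedTensorSlice, foreach
from runtime.asyncrt import MojoCallContextPtr
from utils.index import IndexList


# An operation with no input tensors: the output is the only argument
# besides the call context, and values are computed from indices.
@compiler.register("fill_coordinates", num_dps_outputs=1)
struct FillCoordinates:
    @staticmethod
    fn execute[rank: Int](
        out: ManagedTensorSlice[DType.float32, rank],
        ctx: MojoCallContextPtr,
    ):
        @parameter
        @always_inline
        fn coordinate[width: Int](idx: IndexList[rank]) -> SIMD[DType.float32, width]:
            # Map each element's column index to a coordinate value on the
            # device, the same idea the Mandelbrot example uses to build
            # its initial complex value c.
            var values = SIMD[DType.float32, width]()
            for lane in range(width):
                values[lane] = Float32(idx[rank - 1] + lane) * 0.001
            return values

        foreach[coordinate](out, ctx)
```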
Note: we’re working on exposing a nightly changelog for MAX in the same way as has been done for Mojo. When that is live, you’ll be able to read these feature updates within the MAX changelog. Stay tuned!
The latest MAX nightly adds a lot of new content for GPU programming in MAX:
- As a source-breaking change, ManagedTensorSlice and foreach have moved from the tensor_utils module to max.tensor.
- The vector_addition example demonstrates how to write device-specific codepaths, as well as how to manually dispatch Mojo functions on the GPU within a custom operation. This style of programming will feel much more familiar to those used to CUDA C programming. Note that the foreach abstraction performs elementwise calculations far more efficiently than the manual functions here, because it is tuned for the underlying hardware; this example is purely instructive.
- The top_k example shows a practical use case for a custom operation that is used today within large language model graphs: a top-K token sampler. The Mojo code demonstrates a much more complex calculation, as well as how to construct a custom shape function for the operation, and the Python-side code shows how such an operation is used in practice.
- The synchronous parameter has been removed from the interfaces in the custom operation examples. It was only useful in a few cases, and we’re evaluating removing it entirely, which simplifies the operation interface.
- Once the nightly docs update (the team is working hard on deploying it right now), initial API docs will appear for the gpu and layout modules and their dependencies.
- Custom operations can now be compile-time parameterized on user-supplied Int or StringLiteral parameters that you pass in via the Graph API. The parametric_addition example shows how this works, and I had a little fun with it when implementing image blend modes parameterized by the name of the blend function (a rough sketch of that pattern follows this list).
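And here is the rough shape of the blend-mode idea mentioned in the last item, assuming (as the parametric_addition example suggests) that graph-supplied parameters surface as compile-time parameters on execute. The parameter ordering, registration details, and module paths here are assumptions, not the documented API:

```mojo
import compiler
from max.tensor import ManagedTensorSlice, foreach
from runtime.asyncrt import MojoCallContextPtr
from utils.index import IndexList


@compiler.register("blend", num_dps_outputs=1)
struct Blend:
    @staticmethod
    fn execute[mode: StringLiteral, type: DType, rank: Int](
        out: ManagedTensorSlice[type, rank],
        base: ManagedTensorSlice[type, rank],
        overlay: ManagedTensorSlice[type, rank],
        ctx: MojoCallContextPtr,
    ):
        @parameter
        @always_inline
        fn blend_elements[width: Int](idx: IndexList[rank]) -> SIMD[type, width]:
            var a = base.load[width](idx)
            var b = overlay.load[width](idx)

            # The blend function is selected at compile time from the
            # StringLiteral parameter, so only one branch is compiled
            # into the kernel.
            @parameter
            if mode == "multiply":
                return a * b
            elif mode == "screen":
                return a + b - a * b
            else:
                return b

        foreach[blend_elements](out, ctx)
```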
In the top-K example, var top_k_sram = external_memory[...]() is pulling data from the array with size given by shared_mem_bytes. Is there a reason to do it this way instead of allocating the buffer beforehand and passing it as an argument?
The work on custom ops is really interesting. AFAIK you still need the MAX engine to compile/run/partition the computational graph. Any hope that we can run this graph from Mojo soon?
Hi @owenhilyard, the memory sized by shared_mem_bytes is shared between every thread in a block, and we don’t need to access it from the CPU host. It would be a waste of resources to allocate it on the host, especially with a very large batch size in this example: each batch element gets its own block, so for a 1M-element batch we’d need to allocate 1,000,000 * K * sizeof[T] bytes (for example, with K = 64 and float32 elements, that’s roughly 256 MB).
MAX has several interface layers, from low level to high: Driver, Graph, and Serve. Mojo is part of MAX as the language for defining high-performance computation. So far, we’ve shown how to use Mojo to program GPUs within the Graph portion of the API, and you can see the Driver being used to allocate memory on the host, manage interactions with accelerators, and transfer memory between host and device.
There are both Python and Mojo interfaces to the Driver and Graph APIs, although the Python APIs have gotten the most attention recently for various reasons: they can be introduced progressively into existing Python codebases, Mojo has changed a lot since these interfaces were first designed, and so on. The Python interfaces also provide zero-cost, direct interoperability with PyTorch tensors and NumPy arrays, which is huge for usability. When staging a computational graph, overhead from the host language tends to be minimal, so Python works well for driving graph computation.
We believe that moving computation into a graph is the best path for large-scale computation, like that done in ML models. However, we also recognize that setting up and compiling a graph carries some overhead. Without spoiling anything, we’re working on something in MAX that I think you’ll like and that will address your core question here.
I was imagining using cudaMalloc (or a Mojo binding to it) to allocate the memory on the GPU and get a device memory pointer back, then submitting that pointer (those 8 bytes) as an argument to the kernel. This would let MAX reuse the buffer if multiple calls are made.
cudaMalloc allocates from global memory, which is accessible from any thread in any block; it’s not as fast as shared memory, which is shared only within a single block.
Just adding to Jack’s answer. In the custom ops examples, the input tensors are allocated in the Python code and moved to device (GPU) memory. So when you see the custom op accessing ManagedTensorSlice instances, it’s accessing a pointer to GPU global memory.
There are also low-level Mojo APIs for allocating GPU memory and copying to and from it, including methods in gpu.memory for allocating and copying to and from shared memory. gpu.host.DeviceContext additionally has methods for allocating global memory buffers, but you’re less likely to need those in a custom op.
The driver API also provides methods for moving data to and from the GPU. We’re hoping to add API docs for the Mojo driver API soon.
Thanks for the extra context! I spend a fair amount of energy trying to design programs to do all of their allocation up front, so I’ll need to spend some time poking around there.
I’m also hoping to see some unified memory options in there at some point since DIGITS is rumored to have MI300A-style unified memory.
I think a direct translation looks something like this:
```mojo
from max.graph import Graph
from max.graph import TensorType, Type
from sys.info import has_nvidia_gpu_accelerator
from max.driver import cpu_device, DeviceTensor
from max.driver._cuda import cuda_device
from max.engine import InferenceSession, SessionOptions
from tensor import TensorSpec
from max.graph.ops import sin
import random


fn build_graph(tensor_type: TensorType) raises -> Graph:
    # A graph with a single sin op applied to its only input.
    var graph = Graph(
        name="graph-with-single-sin-op",
        in_types=List(Type(tensor_type)),
        out_types=List(Type(tensor_type)),
    )
    graph.output(sin(graph[0]))
    return graph


fn main() raises:
    alias rows = 2
    alias columns = 3
    alias dtype = DType.float32
    alias tensor_type = TensorType(dtype, rows, columns)

    var graph = build_graph(tensor_type)

    # Use a CUDA device if one is available, otherwise fall back to the CPU.
    var device = cuda_device() if has_nvidia_gpu_accelerator() else cpu_device()
    var session = InferenceSession(SessionOptions(device))
    var model = session.load(graph)

    # Allocate the input on the host and fill it with random values.
    var input_tensor = DeviceTensor(
        TensorSpec(dtype, rows, columns), cpu_device(), None
    ).to_tensor[dtype, 2]()
    var num_elements = input_tensor.spec().num_elements()
    var ptr = input_tensor.unsafe_ptr()
    random.rand(ptr, num_elements)

    # Move the input to the device, run the graph, and copy the result back.
    var device_tensor = input_tensor.to_device_tensor().move_to(device)
    var result = model.execute(device_tensor)
    var output_tensor = result[0].take().to_device_tensor().move_to(cpu_device())
    var output = output_tensor.to_tensor[dtype, 2]()
    print(output)
```
No custom ops required since max.graph.ops.sin exists.
The Mojo Graph and Driver APIs do support running graphs (with custom ops) on a GPU. I will warn that the Mojo Graph and Driver interfaces have diverged from the Python APIs and may not have all of the same features.
With the latest MAX nightly, the following code should place a graph on an accelerator. For inputs and outputs, Driver Tensors (note: not max.tensor Tensors) are allocated and placed on the accelerator: the input is initialized on the host and moved to the accelerator, and the output is retrieved from the accelerator and moved back to the host.
```mojo
from max.driver import accelerator_device, cpu_device, Tensor, Device
from max.engine import InferenceSession, SessionOptions
from max.graph import Graph, TensorType, ops


def main():
    alias VECTOR_WIDTH = 6

    host_device = cpu_device()
    gpu_device = accelerator_device()

    # Build a single-operation graph that applies sin elementwise.
    graph = Graph(TensorType(DType.float32, VECTOR_WIDTH))
    result = ops.sin(graph[0])
    graph.output(result)

    # Compile and load the graph onto the accelerator.
    options = SessionOptions(gpu_device)
    session = InferenceSession(options)
    model = session.load(graph)

    # Allocate and initialize the input tensor on the host.
    input_tensor = Tensor[DType.float32, 1]((VECTOR_WIDTH), host_device)
    for i in range(VECTOR_WIDTH):
        input_tensor[i] = 1.25

    # Move the input to the accelerator, execute, and bring the result back.
    results = model.execute(input_tensor^.move_to(gpu_device))
    output = results[0].take().to_device_tensor().move_to(host_device).to_tensor[DType.float32, 1]()
    print(output)
```
Owen and I collided on this, but one difference between our two examples is that we recently generalized the Mojo Driver interface so it’s no longer hardcoded to CUDA-compatible devices, to start aligning it with the Python Driver API.
The Mojo API reference for the MAX Driver module used to be available on the Modular website, but at the moment it seems to have been removed from both the stable and nightly API docs. I’d be super glad to have it back!
Yeah, this was removed a few months ago to avoid confusion with the Python Driver API as that was brought up. We’re looking to publish the Mojo Driver API docs again soon, though; they will likely appear in the nightly docs first.