Examples of custom CPU / GPU operations in Mojo

We have some new additions in the MAX nightlies, post 25.1 release, for custom operation programming:

  • A new fused_attention example (alongside the others in the MAX repo) that shows how to construct an optimized implementation of the FlashAttention 2 calculation in Mojo, with both CPU and GPU paths. This starts getting into even more advanced GPU programming concepts in Mojo.
  • A new matrix_multiplication example that illustrates how to apply progressive GPU optimizations to matrix multiplication in Mojo. Each new algorithm introduces different capabilities for optimizing memory accesses on the GPU, leveraging the abstractions from the Mojo layout module. The last algorithm shows how to access Tensor Cores from Mojo. These roughly follow steps 1-6 in this great blog post.
  • As a source-breaking change, we’ve replaced MojoCallContextPtr with DeviceContextPtr in custom operation interfaces. This is part of an internal interface simplification; a minimal sketch of the updated signature follows this list.
  • Joe’s been doing some great work on expanding API documentation within the gpu module, so there’s a lot more detail contained in the cluster, globals, id, intrinsics, and semaphore module API docs.
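
For anyone migrating existing operations, here’s a minimal sketch of roughly what a registered elementwise op looks like with the new DeviceContextPtr parameter. The import paths and signature details are as best I remember them from the current nightly examples, so treat the repository examples as the source of truth:

import compiler
from max.tensor import ManagedTensorSlice, foreach
from runtime.asyncrt import DeviceContextPtr
from utils.index import IndexList


@compiler.register("add_one", num_dps_outputs=1)
struct AddOne:
    @staticmethod
    fn execute[
        # "cpu" or "gpu", selected when the graph is compiled.
        target: StringLiteral,
    ](
        out: ManagedTensorSlice,
        x: ManagedTensorSlice[out.type, out.rank],
        # The device context now arrives as a DeviceContextPtr,
        # where this was previously a MojoCallContextPtr.
        ctx: DeviceContextPtr,
    ):
        @parameter
        @always_inline
        fn add_one[width: Int](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
            return x.load[width](idx) + 1

        foreach[add_one, target=target](out, ctx)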

Noting for future thread-crawlers: I think there’s now an example that covers this as well: max/examples/graph-api/basics at main · modular/max · GitHub

Two quick updates for the latest MAX nightlies:

  • Abdul added a new custom op example that shows how to compute the histogram of a tensor. It’s not super efficient at present, but it demonstrates how to approach a problem like this in a custom op; see the sketch after this list for the underlying algorithm.
  • We’ve published the Mojo MAX Driver API docs, which I know were requested upthread. We’re still evolving the interfaces here, so keep an eye on the nightly docs as we work on these.
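
If you’d like the gist of the algorithm before reading the example, here’s a plain-Mojo sketch of the counting logic. This is not the actual example code, which runs as a custom op over tensor elements and has to make the increments safe under parallel execution:

# Framework-free sketch: each value selects a bin, and that bin's
# count is incremented. Values outside the binned range are skipped.
fn histogram(values: List[Int], num_bins: Int) -> List[Int]:
    var counts = List[Int](capacity=num_bins)
    for _ in range(num_bins):
        counts.append(0)
    for i in range(len(values)):
        var bin_index = values[i]
        if bin_index >= 0 and bin_index < num_bins:
            counts[bin_index] += 1
    return counts


def main():
    var counts = histogram(List(0, 2, 2, 1, 3, 2), 4)
    for i in range(len(counts)):
        print("bin", i, ":", counts[i])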

Will we be able to execute MAX Models with Device Tensors via a TensorMap as well? Currently the TensorMap is quite valuable to me, but it only works with the standard tensor.Tensor. We have seen that Model executions can indeed work with DeviceTensors, but it is still fairly static.

To be more specific, I would like to do the following:

var model = Model(...)
var device_tensor = DeviceTensor(...)
var tensor_map = TensorMap(...)
tensor_map.borrow[dtype, rank](device_tensor...)
var tensor_map_of_device_tensors = model.execute(tensor_map^) # execute the MAX model with this device tensor map

Would love to learn more about what’s next for Model Execution in Mojo. 🙂

Questions about the max.engine interface could fill a thread of their own, but I’ll try to answer as best I can:

Currently, the tensor input path in the engine that takes named tensors hasn’t been migrated over to support GPUs, so I believe GPU execution only supports unnamed lists of tensor inputs. The Mojo interface also needs to be updated with some of the capabilities from the Python engine interface.

If you had specific needs for named tensor maps, and applications that would be enabled by them, please let us know and that could factor into planning for these interfaces.


Many of us were at an event this last week, so this update is a little late:

  • As a source-breaking change, num_dps_outputs has been removed from @compiler.register, and ManagedTensorSlice is no longer used for operation inputs and outputs. Instead, InputTensor is now used for all tensor inputs to a custom operation and OutputTensor for all outputs. Because inputs and outputs are now distinguished by type, it’s much harder to accidentally write to an input or to mismatch the number of declared outputs with the actual outputs. See the updated examples and tutorial, and the sketch after this list, for how this looks now.
  • foreach now raises, so the operations using this need to propagate or handle potential errors.
  • Benchmarks have been added to the matrix multiplication optimization example: run magic run benchmark on a MAX-compatible GPU to see the effect that each progressive optimization has on the matmul operation’s achieved FLOP/s.
  • The Mandelbrot example has had its calculation simplified and it now prints ASCII art for the fractal.
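
As a rough before/after against the sketch earlier in the thread, the same add-one op now looks something like this. Details are from memory, so check the updated examples for the authoritative form:

import compiler
from max.tensor import InputTensor, OutputTensor, foreach
from runtime.asyncrt import DeviceContextPtr
from utils.index import IndexList


@compiler.register("add_one")  # num_dps_outputs is gone
struct AddOne:
    @staticmethod
    fn execute[
        target: StringLiteral,
    ](
        # Outputs and inputs now have distinct types, so writing to an
        # input is caught at compile time.
        output: OutputTensor,
        x: InputTensor[type = output.type, rank = output.rank],
        ctx: DeviceContextPtr,
    ) raises:
        @parameter
        @always_inline
        fn add_one[width: Int](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
            return x.load[width](idx) + 1

        # foreach now raises, so execute is marked raises and propagates.
        foreach[add_one, target=target](output, ctx)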

@BradLarson thanks for the InputTensor/OutputTensor additions! They really improved the examples.

btw, the matrix multiplication optimization example’s benchmarks.mojo was missing the “tensor_core” benchmark, and the mma instruction shape needed a minor correction.
I opened this PR on max-recipes: Fix tensor core instruction shape by nlaanait · Pull Request #19 · modular/max-recipes · GitHub

Thanks again for the great examples!


Appreciate the help! I’m guessing the MMA_K shape there may have been left over from a previous bfloat16 configuration, rather than the float32 that it now defaults to. I also forgot to enable the Tensor Core path for benchmarking, so thanks for the reminder. I’ll update the main MAX repository examples with this tomorrow.

Another great enhancement just landed in the latest MAX nightly: you no longer need a separate build step to use Mojo custom operations with a Python MAX Graph. The old way of building a .mojopkg for the Mojo operations and then pointing to it from your graph still works, but there’s an easier path now where you only need to point to the directory containing your Mojo code and compilation will occur transparently at graph building time.

For example, the way this looks in the addition.py example is:

from pathlib import Path

from max.graph import Graph, TensorType, ops

# dtype, rows, and columns are defined earlier in addition.py.
mojo_kernels = Path(__file__).parent / "kernels"

graph = Graph(
    "addition",
    forward=lambda x: ops.custom(
        name="add_one_custom",
        values=[x],
        out_types=[TensorType(dtype=x.dtype, shape=x.tensor.shape)],
    )[0].tensor,
    input_types=[
        TensorType(dtype, shape=[rows, columns]),
    ],
    custom_extensions=[mojo_kernels],
)

When you then run python3 addition.py, the Mojo compiler is invoked to build a package from the ./kernels directory transparently at graph construction time. In the next nightly, all custom operation examples will be simplified to use this new workflow, which cuts out a lot of code.
