Apple Silicon GPU support in Mojo

The latest nightly releases of Mojo (and our next stable release) include initial support for a new accelerator architecture: Apple Silicon GPUs!

We know that one of the biggest barriers to programming GPUs is access to hardware. It’s our hope that by making it possible to use Mojo to develop for a GPU present in every modern Mac, we can further democratize developing GPU-accelerated algorithms and AI models. This should also enable new paths of local-to-cloud development for AI models and more.

To get started, you need an Apple Silicon Mac (all M1–M4 series chips are supported) running macOS 15 or newer, with Xcode 16 or newer installed. The version of the Metal Shading Language we use (3.2, AIR bitcode version 2.7.0) requires the macOS 15 SDK, so you’ll get an error about incompatible bitcode versions if you run on an older macOS or use an older version of Xcode that lacks the macOS 15 SDK.

You can clone our modular repository and try out one of our GPU function examples in the examples/mojo/gpu-functions directory. All but the reduction.mojo example should work on Apple Silicon GPUs today in the latest nightlies. Additionally, puzzles 1-15 of the Mojo GPU puzzles should now work on Apple Silicon GPUs with the latest nightly, and we’ve added osx-arm64 as a supported architecture there.
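Assuming git and a current Mojo nightly are already installed, trying one of these examples looks roughly like this (run on the host Mac, not in a container):

```shell
# Clone the repo and run one of the GPU function examples.
git clone https://github.com/modular/modular.git
cd modular/examples/mojo/gpu-functions
mojo vector_addition.mojo
```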

Current capabilities

This is just the beginning of our support for Apple Silicon GPUs, and many pieces of functionality still need to be built out. Known features that don’t work today include:

  • Intrinsics for many hardware capabilities
    • Not all Mojo GPU examples work, such as reduction.mojo and the more complex matrix multiplication examples
    • GPU puzzles 16 and above need more advanced hardware features
  • Basic MAX graphs
  • MAX custom ops
  • PyTorch interoperability
  • Running AI models
  • Serving AI models

I’ll emphasize that even simple MAX graphs, and by extension AI models, don’t yet run on Apple Silicon GPUs. In our Python APIs, accelerator_count() will still return 0 until we have basic MAX graph support enabled. Hopefully, that won’t be long.
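As a quick way to check this from Python, you can query the accelerator count through the MAX driver API. This is a minimal sketch: the `max.driver` import path reflects current MAX releases and may change, and the `ImportError` fallback is only there so the snippet runs in environments without MAX installed.

```python
# Query how many accelerators MAX can see. On Apple Silicon this will
# report 0 until basic MAX graph support is enabled, as noted above.
try:
    from max.driver import accelerator_count
    count = accelerator_count()
except ImportError:
    # MAX isn't installed in this environment; treat as no accelerators.
    count = 0

print(f"Accelerators visible to MAX: {count}")
```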

Next steps

We’ve identified many of the technical blockers to progressively enable the above. The current list of what we plan to work on includes:

  • Handle MAX_THREADS_PER_BLOCK_METADATA and similar aliases (commit)
  • Support GridDim, lane_id (commit, commit)
  • Enable async_copy_*
  • Convert arguments of an array type to a pointer type (internal)
  • Support bfloat16 on ARM devices (commit)
  • Support SubBuffer
  • Enable atomic operations
  • Complete implementation of MetalDeviceContext::synchronize
  • Enable captured arguments
  • Support print and debug_assert

I apologize for some of the cryptic error messages you may get when hitting a piece of missing functionality, or encountering a system configuration we aren’t yet compatible with. We hope to improve the messaging over time, and to provide better guides for debugging failures.

How this works

To learn more about how Mojo code is compiled to target Apple Silicon GPUs, check out Amir Nassereldine’s detailed technical presentation from our recent Modular Community Meeting. He did amazing work in establishing the fundamentals during his summer internship, and we are now building on that to advance Mojo on this new architecture.

In brief, a multi-step process is used to compile and run Mojo code on an Apple Silicon GPU. First, we compile Mojo GPU functions to Apple Intermediate Representation (AIR) bitcode: the code is lowered to LLVM IR and then converted into Metal-compatible AIR.
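For intuition, the back half of this pipeline is roughly analogous to what you would do by hand with Apple’s command-line Metal tools (the file names here are hypothetical, and Mojo drives the equivalent steps internally rather than shelling out to these commands):

```shell
# Compile a Metal source file to AIR bitcode -- the same intermediate
# format that Mojo lowers its GPU functions into:
xcrun -sdk macosx metal -c add_kernel.metal -o add_kernel.air

# Package the AIR bitcode into a .metallib that the GPU can load:
xcrun -sdk macosx metallib add_kernel.air -o add_kernel.metallib
```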

Mojo handles interactions with an accelerator through the DeviceContext type. In the case of Apple Silicon GPUs, we’ve specialized this into a MetalDeviceContext that handles the next stages of compilation and execution.

The MetalDeviceContext uses the Metal-cpp API to compile the AIR representation into a .metallib for execution on device. Once the .metallib is ready, the MetalDeviceContext manages a Metal CommandQueue and queues operations for moving data, running a GPU function, and more. All of this happens behind the scenes, and a Mojo developer doesn’t need to worry about any of it.

Code that you’ve written to run on NVIDIA or AMD GPUs should mostly just work on an Apple Silicon GPU, assuming no device-specific features are used. Obviously, different patterns will be required to get the most performance out of each GPU, and we’re excited to explore this new optimization space on Apple Silicon GPUs with you.
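As a concrete example, here is the shape of a portable GPU function in Mojo, modeled loosely on the repo’s vector_addition.mojo. Treat this as a sketch against recent nightlies: the exact module paths, buffer initialization, and `enqueue_function` signature may differ from what ships.

```mojo
from math import ceildiv
from memory import UnsafePointer
from gpu import block_dim, block_idx, thread_idx
from gpu.host import DeviceContext

alias length = 1024
alias block_size = 256

fn vector_addition(
    lhs: UnsafePointer[Float32],
    rhs: UnsafePointer[Float32],
    result: UnsafePointer[Float32],
):
    # The same thread-indexing logic you'd write for an NVIDIA or AMD GPU.
    var tid = block_dim.x * block_idx.x + thread_idx.x
    if tid < length:
        result[tid] = lhs[tid] + rhs[tid]

def main():
    # DeviceContext selects the available accelerator; on a Mac this is
    # the MetalDeviceContext specialization described above.
    var ctx = DeviceContext()

    # Allocate device buffers (initialization omitted for brevity).
    var lhs = ctx.enqueue_create_buffer[DType.float32](length)
    var rhs = ctx.enqueue_create_buffer[DType.float32](length)
    var result = ctx.enqueue_create_buffer[DType.float32](length)

    # Launch the kernel across ceildiv(length, block_size) thread blocks.
    ctx.enqueue_function[vector_addition](
        lhs.unsafe_ptr(),
        rhs.unsafe_ptr(),
        result.unsafe_ptr(),
        grid_dim=ceildiv(length, block_size),
        block_dim=block_size,
    )
    ctx.synchronize()
```

Nothing in the kernel body is Metal-specific, which is the point: the device-targeting work happens in the compilation pipeline, not in your code.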

Just the beginning

While we’d love help in bringing up Apple Silicon GPU support, some of the infrastructure for introducing support for new AIR intrinsics and compiling them to a .metallib currently requires Modular developers for implementation. We’ll get more of the basics in place before work moves primarily to the open-source standard library and kernels, at which point community members will be able to do a lot more to advance compatibility. Contributions are always welcome, but we don’t want you to hit missing non-public components and get frustrated by being unable to move forward.

We’ll share much more documentation and content on how to work with and optimize for this new hardware family; for now, we’re extremely excited about even these first few steps onto Apple Silicon GPUs. I’ll try to keep this post up to date as we expand functionality.


Excited to try this!


Is there a Docker image / container that configures this? I work on an M4 MacBook Air and it would be fun to test this out, but I use Docker containers as my main driver.

Our Docker containers largely use Linux within them. Due to the requirement for Xcode tooling to build the .metallib from Mojo code, you’d need to have a virtualized macOS environment in the container. We haven’t yet built a container like that.

From some Googling / AI Overlording around, it looks like you can’t even mount the GPU from a Mac into a container anyway. Unless someone knows something I don’t, I’m guessing the guidance for Mojo + Metal is to run on the host.

I went ahead and tried to run vector_addition.mojo within examples/mojo/gpu-functions but faced the following issue:

xcrun: error: sh -c '/Applications/Xcode.app/Contents/Developer/usr/bin/xcodebuild -sdk /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX26.0.sdk -find metallib 2> /dev/null' failed with exit code 17664: (null) (errno=No such file or directory)

If you are facing this, you have to make sure of two things:

1. The right developer directory is selected

Make sure that you have selected the right developer directory:

$> xcode-select -p
/Applications/Xcode.app/Contents/Developer

If the output is different, fix it by running:

sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer

2. The Metal toolchain is available

Make sure that the Metal toolchain is available.

$> xcrun -sdk macosx metal
metal: error: no input files

If the output is instead:

error: error: cannot execute tool 'metal' due to missing Metal Toolchain; use: xcodebuild -downloadComponent MetalToolchain

proceed to install the Metal toolchain:

xcodebuild -downloadComponent MetalToolchain

Once installed, running xcrun -sdk macosx metal again should give you the no input files error.

You should now be able to run Mojo code on the GPU

If everything is set up correctly, you can now run the example code:

$> mojo vector_addition.mojo
Resulting vector: 3.75 3.75 3.75 3.75 3.75 3.75 3.75 3.75 3.75 3.75

Excited to try this! x2


I don’t know how to express the happiness I feel right now.

Found GPU: Apple M1 Pro
LHS buffer:  HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0])
RHS buffer:  HostBuffer([0.0, 0.5, 1.0, ..., 498.5, 499.0, 499.5])
Result vector: HostBuffer([0.0, 1.5, 3.0, ..., 1495.5, 1497.0, 1498.5])

nice!


Recently we’ve been working with an M3 Ultra, and I’ve found it extremely useful for deploying agentic LLMs in one stop with its 512 GB of memory.

I observed great performance serving gpt-oss-120b. As for kernel DSLs, I quickly realized that Lattner used to work at Apple, so why shouldn’t I give Mojo a try?

When it’s hard to find an equivalent of Triton on Apple silicon, Lattner and his team are the answer!
