Apple Silicon GPU support in Mojo

The latest nightly releases of Mojo (and our next stable release) include initial support for a new accelerator architecture: Apple Silicon GPUs!

We know that one of the biggest barriers to programming GPUs is access to hardware. It’s our hope that by making it possible to use Mojo to develop for a GPU present in every modern Mac, we can further democratize developing GPU-accelerated algorithms and AI models. This should also enable new paths of local-to-cloud development for AI models and more.

To get started, you need an Apple Silicon Mac (all M1–M4 series chips are supported) running macOS 15 or newer, with Xcode 16 or newer installed. The version of the Metal Shading Language we use (3.2, AIR bitcode version 2.7.0) requires the macOS 15 SDK, so you’ll get an error about incompatible bitcode versions if you run on an older macOS or use an older version of Xcode that lacks the macOS 15 SDK.

You can clone our modular repository and try out one of our GPU function examples in the examples/mojo/gpu-functions directory. All but the reduction.mojo example should work on Apple Silicon GPUs today in the latest nightlies. Additionally, puzzles 1-15 of the Mojo GPU puzzles should now work on Apple Silicon GPUs with the latest nightly, and we’ve added osx-arm64 as a supported architecture there.
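Assuming git and a current Mojo nightly are already installed, trying one of these examples looks roughly like this (run on the host Mac, not in a container):

```shell
# Clone the repo and run one of the GPU function examples.
git clone https://github.com/modular/modular.git
cd modular/examples/mojo/gpu-functions
mojo vector_addition.mojo
```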

Current capabilities

This is just the beginning of our support for Apple Silicon GPUs, and many pieces of functionality still need to be built out. Known features that don’t work today include:

  • Intrinsics for many hardware capabilities
    • Not all Mojo GPU examples work, such as reduction.mojo and the more complex matrix multiplication examples
    • GPU puzzles 16 and above need more advanced hardware features
  • Basic MAX graphs
  • MAX custom ops
  • PyTorch interoperability
  • Running AI models
  • Serving AI models

I’ll emphasize that even simple MAX graphs, and by extension AI models, don’t yet run on Apple Silicon GPUs. In our Python APIs, accelerator_count() will still return 0 until we have basic MAX graph support enabled. Hopefully, that won’t be long.
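As a quick way to check this from Python, you can query the accelerator count through the MAX driver API. This is a minimal sketch: the `max.driver` import path reflects current MAX releases and may change, and the `ImportError` fallback is only there so the snippet runs in environments without MAX installed.

```python
# Query how many accelerators MAX can see. On Apple Silicon this will
# report 0 until basic MAX graph support is enabled, as noted above.
try:
    from max.driver import accelerator_count
    count = accelerator_count()
except ImportError:
    # MAX isn't installed in this environment; treat as no accelerators.
    count = 0

print(f"Accelerators visible to MAX: {count}")
```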

Next steps

We’ve identified many of the technical blockers to progressively enable the above. The current list of what we plan to work on includes:

  • Handle MAX_THREADS_PER_BLOCK_METADATA and similar aliases (commit)
  • Support GridDim, lane_id (commit, commit)
  • Enable async_copy_*
  • Convert arguments of an array type to a pointer type (internal)
  • Support bfloat16 on ARM devices (commit)
  • Support SubBuffer
  • Enable atomic operations
  • Complete implementation of MetalDeviceContext::synchronize
  • Enable captured arguments
  • Support print and debug_assert

I apologize for some of the cryptic error messages you may get when hitting a piece of missing functionality, or encountering a system configuration we aren’t yet compatible with. We hope to improve the messaging over time, and to provide better guides for debugging failures.

How this works

To learn more about how Mojo code is compiled to target Apple Silicon GPUs, check out Amir Nassereldine’s detailed technical presentation from our recent Modular Community Meeting. He did amazing work in establishing the fundamentals during his summer internship, and we are now building on that to advance Mojo on this new architecture.

In brief, a multi-step process is used to compile and run Mojo code on an Apple Silicon GPU. First, we compile Mojo GPU functions to Apple Intermediate Representation (AIR) bitcode: the code is lowered to LLVM IR and then converted into Metal-compatible AIR.
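For intuition, the back half of this pipeline is roughly analogous to what you would do by hand with Apple’s command-line Metal tools (the file names here are hypothetical, and Mojo drives the equivalent steps internally rather than shelling out to these commands):

```shell
# Compile a Metal source file to AIR bitcode -- the same intermediate
# format that Mojo lowers its GPU functions into:
xcrun -sdk macosx metal -c add_kernel.metal -o add_kernel.air

# Package the AIR bitcode into a .metallib that the GPU can load:
xcrun -sdk macosx metallib add_kernel.air -o add_kernel.metallib
```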

Mojo handles interactions with an accelerator through the DeviceContext type. In the case of Apple Silicon GPUs, we’ve specialized this into a MetalDeviceContext that handles the next stages of compilation and execution.

The MetalDeviceContext uses the Metal-cpp API to compile the AIR representation into a .metallib for execution on device. Once the .metallib is ready, the MetalDeviceContext manages a Metal CommandQueue and queues operations for moving data, running a GPU function, and more. All of this happens behind the scenes, and a Mojo developer doesn’t need to worry about any of it.

Code that you’ve written to run on NVIDIA or AMD GPUs should mostly just work on an Apple Silicon GPU, assuming no device-specific features are used. Obviously, different patterns will be required to get the most performance out of each GPU, and we’re excited to explore this new optimization space on Apple Silicon GPUs with you.
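As a concrete example, here is the shape of a portable GPU function in Mojo, modeled loosely on the repo’s vector_addition.mojo. Treat this as a sketch against recent nightlies: the exact module paths, buffer initialization, and `enqueue_function` signature may differ from what ships.

```mojo
from math import ceildiv
from memory import UnsafePointer
from gpu import block_dim, block_idx, thread_idx
from gpu.host import DeviceContext

alias length = 1024
alias block_size = 256

fn vector_addition(
    lhs: UnsafePointer[Float32],
    rhs: UnsafePointer[Float32],
    result: UnsafePointer[Float32],
):
    # The same thread-indexing logic you'd write for an NVIDIA or AMD GPU.
    var tid = block_dim.x * block_idx.x + thread_idx.x
    if tid < length:
        result[tid] = lhs[tid] + rhs[tid]

def main():
    # DeviceContext selects the available accelerator; on a Mac this is
    # the MetalDeviceContext specialization described above.
    var ctx = DeviceContext()

    # Allocate device buffers (initialization omitted for brevity).
    var lhs = ctx.enqueue_create_buffer[DType.float32](length)
    var rhs = ctx.enqueue_create_buffer[DType.float32](length)
    var result = ctx.enqueue_create_buffer[DType.float32](length)

    # Launch the kernel across ceildiv(length, block_size) thread blocks.
    ctx.enqueue_function[vector_addition](
        lhs.unsafe_ptr(),
        rhs.unsafe_ptr(),
        result.unsafe_ptr(),
        grid_dim=ceildiv(length, block_size),
        block_dim=block_size,
    )
    ctx.synchronize()
```

Nothing in the kernel body is Metal-specific, which is the point: the device-targeting work happens in the compilation pipeline, not in your code.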

Just the beginning

While we’d love help in bringing up Apple Silicon GPU support, some of the infrastructure for introducing support for new AIR intrinsics and compiling them to a .metallib currently requires Modular developers for implementation. We’ll get more of the basics in place before work moves primarily to the open-source standard library and kernels, at which point community members will be able to do a lot more to advance compatibility. Contributions are always welcome, but we don’t want you to hit missing non-public components and get frustrated by being unable to move forward.

We’ll share much more documentation and content on how to work with and optimize for this new hardware family; for now, we’re extremely excited about even these first few steps onto Apple Silicon GPUs. I’ll try to keep this post up to date as we expand functionality.


Excited to try this!


Is there a Docker image / container that configures this? I work on an M4 MacBook Air and it would be fun to test this out, but I use Docker containers as my main driver.

Our Docker containers largely use Linux within them. Due to the requirement for Xcode tooling to build the .metallib from Mojo code, you’d need to have a virtualized macOS environment in the container. We haven’t yet built a container like that.

From some Googling / AI Overlording around, it looks like you can’t even mount the GPU from a Mac into a container anyway. Unless someone knows something I don’t, I’m guessing the guidance for Mojo + Metal is to run on the host.

I went ahead and tried to run vector_addition.mojo within examples/mojo/gpu-functions but faced the following issue:

xcrun: error: sh -c '/Applications/Xcode.app/Contents/Developer/usr/bin/xcodebuild -sdk /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX26.0.sdk -find metallib 2> /dev/null' failed with exit code 17664: (null) (errno=No such file or directory)

If you are facing this, you have to make sure of two things:

1. The right developer directory is selected

Make sure that you have selected the right developer directory:

$> xcode-select -p
/Applications/Xcode.app/Contents/Developer

If the output is different, fix it by running:

sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer

2. The Metal toolchain is available

Make sure that the Metal toolchain is available.

$> xcrun -sdk macosx metal
metal: error: no input files

If the output is instead:

error: error: cannot execute tool 'metal' due to missing Metal Toolchain; use: xcodebuild -downloadComponent MetalToolchain

proceed to install the Metal toolchain:

xcodebuild -downloadComponent MetalToolchain

Once installed, running xcrun -sdk macosx metal again should give you the no input files error.

You should now be able to run Mojo code on the GPU

If everything is set up correctly, you can now run the example code:

$> mojo vector_addition.mojo
Resulting vector: 3.75 3.75 3.75 3.75 3.75 3.75 3.75 3.75 3.75 3.75

Excited to try this! x2


I don’t know how to express the happiness I feel right now.

Found GPU: Apple M1 Pro
LHS buffer:  HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0])
RHS buffer:  HostBuffer([0.0, 0.5, 1.0, ..., 498.5, 499.0, 499.5])
Result vector: HostBuffer([0.0, 1.5, 3.0, ..., 1495.5, 1497.0, 1498.5])

nice!


Recently we’ve been working with an M3 Ultra, and I’ve found it extremely useful for deploying agentic LLMs in one stop with its 512 GB of memory.

I observed great performance serving gpt-oss-120b. As for kernel DSLs, I quickly realized that Lattner used to work at Apple, so why shouldn’t I give Mojo a try?

When it’s hard to find an equivalent of Triton on Apple silicon, Lattner and his team are the answer!
