I burned some Codex quota testing MAX training

Ethan · May 12, 2026, 8:01pm

I had some Codex quota left this week, so I used it on a small experiment: can MAX be made to support PyTorch-like training behavior, and how fast is a compiled train step?

I built a prototype standalone package, max_training, with limited reverse-mode autograd, parameters, Linear, MSE loss, SGD-style updates, compiled train steps, MLIR inspection, and PyTorch comparison benchmarks.
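
For readers unfamiliar with the pieces listed above, here is a minimal pure-Python sketch of what reverse-mode autograd, a linear model, MSE loss, and an SGD update look like together. This is not the max_training API: the `Value` class and `train_step` function are hypothetical names made up for illustration, and everything is scalar-valued to keep it short.

```python
class Value:
    """Scalar graph node (hypothetical; not part of max_training)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._grad_fns = grad_fns  # one local-gradient function per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other),
                     (lambda g: g, lambda g: -g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Post-order DFS gives a topological order (inputs before outputs).
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node._parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk outputs first so each node's grad is complete before it is used.
        for node in reversed(order):
            for parent, fn in zip(node._parents, node._grad_fns):
                parent.grad += fn(node.grad)


def train_step(w, b, xs, ys, lr):
    """One SGD step on mean squared error for the model y = w * x + b."""
    loss = Value(0.0)
    for x, y in zip(xs, ys):
        err = w * Value(x) + b - Value(y)
        loss = loss + err * err
    loss = loss * Value(1.0 / len(xs))
    w.grad = b.grad = 0.0          # reset gradients from the previous step
    loss.backward()
    w.data -= lr * w.grad          # plain SGD update
    b.data -= lr * b.grad
    return loss.data


# Fit y = 2x + 1 from four points.
xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
w, b = Value(0.0), Value(0.0)
first_loss = train_step(w, b, xs, ys, lr=0.05)
for _ in range(999):
    last_loss = train_step(w, b, xs, ys, lr=0.05)
print(w.data, b.data, last_loss)
```

A compiled train step would, roughly speaking, trace the forward pass, backward pass, and update once and run them as a single graph instead of rebuilding the graph in Python on every call as this sketch does.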

RTX 5090 benchmark results:

github.com/YichengDWu/max_training, docs/benchmarks/rtx5090.md (main):

# RTX 5090 Benchmark Results

These results were measured on a Vast.ai 1x RTX 5090 instance:

- GPU: NVIDIA GeForce RTX 5090, compute capability 12.0
- Driver: 580.95.05
- CUDA: 13.0
- Python: 3.12.3
- MAX/Modular: 26.4.0.dev2026051206
- PyTorch: 2.11.0+cu130
- `ptxas`: `/usr/local/cuda/bin/ptxas`

The initial test run passed:

```text
10 passed, 2 warnings in 1020.63s (0:17:00)
```

The long first-run time was from MAX building its interpreter op cache. Later
runs reuse that cache.


For one larger MLP training-step benchmark, I saw:

torch_eager                  0.865 ms
torch_compile[max-autotune]  0.808 ms
max_compile                  0.581 ms

MAX compile time was much higher, around 100 s, so this is only an early feasibility result, not a production claim.
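
To put the compile cost in perspective, here is a rough break-even calculation using the numbers above. It assumes the compile time is exactly 100 s (the post only says "around"), and it ignores whatever compile time `torch.compile` itself needs, since that was not reported. `break_even_steps` is a hypothetical helper name.

```python
import math


def break_even_steps(compile_s, baseline_ms, compiled_ms):
    """Steps needed before per-step savings repay a one-time compile cost."""
    saved_s = (baseline_ms - compiled_ms) / 1000.0
    if saved_s <= 0:
        return None  # the compiled step is not faster, so it never pays off
    return math.ceil(compile_s / saved_s)


COMPILE_S = 100.0  # assumption: "around 100s" taken as exactly 100 s

vs_eager = break_even_steps(COMPILE_S, baseline_ms=0.865, compiled_ms=0.581)
vs_torch_compile = break_even_steps(COMPILE_S, baseline_ms=0.808, compiled_ms=0.581)
print(vs_eager, vs_torch_compile)
```

Under those assumptions the compiled step needs roughly 350,000 steps to catch up with eager PyTorch and roughly 440,000 to catch up with `torch.compile`, which is why short runs will not see a net win.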

Curious if this benchmark setup looks reasonable to people familiar with MAX internals. Are these numbers plausible, or am I accidentally benchmarking an easy/special case?
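
For anyone checking the setup, one common way to measure a steady-state step time is sketched below: discard warm-up iterations (which absorb compilation and cache effects), then take the median over many timed calls. This is a generic sketch, not the harness used in the repo. `time_step` and its `sync` parameter are hypothetical. On a GPU, `sync` has to block until queued work finishes, or the timer only measures launch overhead; for PyTorch that would be `torch.cuda.synchronize`, and the MAX equivalent is left as an assumption.

```python
import statistics
import time


def time_step(step, warmup=10, iters=100, sync=None):
    """Return (median_ms, best_ms) for calling step() after a warm-up phase."""
    for _ in range(warmup):
        step()  # warm-up calls are not timed
    if sync is not None:
        sync()  # drain any queued device work before timing starts
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()  # wait for the device so the full step is measured
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples), min(samples)


# CPU-only stand-in for a train step, just to show the call shape.
median_ms, best_ms = time_step(lambda: sum(range(1000)), warmup=2, iters=20)
print(f"median {median_ms:.4f} ms, best {best_ms:.4f} ms")
```

Reporting the median alongside the best time makes it easier to spot a run where a few slow outliers or an unfinished warm-up skew the average.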

owenhilyard · May 13, 2026, 6:50pm

This seems roughly normal. MAX from PyTorch tends to compile a lot slower than native MAX, since MAX expects slightly higher-level kernels as input. If you are willing to hand-write kernels, as some past experiments have done, you may see fairly dramatic compile-time improvements. Some very old numbers from my 4090 had MAX beating JAX in compile time by nearly 100x (JAX is known for bad compile times) and in runtime performance by ~2x. That was multiple years of very heavy development ago, so it has likely changed.
