Gemma 4 is live on Modular Cloud with day-zero support and the fastest Gemma 4 inference on both NVIDIA and AMD GPUs. MAX delivers 15% higher throughput than vLLM on B200, and we're the only inference provider shipping Gemma 4 on a framework we built ourselves.
Two multimodal models live now:
- Gemma 4 31B: dense, 256K-token context, built for deep reasoning across large inputs
- Gemma 4 26B A4B: mixture-of-experts, 26B total parameters, only 4B active per forward pass
Both handle text, images, and video natively.
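Curious what a multimodal call looks like? Here's a minimal sketch assuming an OpenAI-compatible endpoint; the base URL, API key, and model id are illustrative placeholders, not published values:

```python
from openai import OpenAI

# Placeholder endpoint and key; swap in your Modular Cloud values.
client = OpenAI(
    base_url="https://example.modular.cloud/v1",  # hypothetical URL
    api_key="YOUR_API_KEY",
)

# One request mixing text and an image; the model id is a hypothetical tag.
response = client.chat.completions.create(
    model="gemma-4-31b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this chart in two sentences."},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```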
Modular Cloud runs on MAX, our inference framework that unifies GPU kernels, graph compilation, and high-performance serving in a single hardware-agnostic stack. When a new architecture drops, we don't wait on upstream support or port hand-tuned kernels. We went from new weights to SOTA performance on two hardware platforms in days.
NVIDIA B200 or AMD MI355X. Same stack, same API. Pick the price-performance point that fits your workload.
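What "same stack, same API" means in practice: a sketch assuming per-hardware endpoints (the URLs and model id below are placeholders, not published values). The client code doesn't change between B200 and MI355X:

```python
from openai import OpenAI

# Hypothetical per-hardware endpoints; real URLs come from your deployment.
ENDPOINTS = {
    "b200": "https://b200.example.modular.cloud/v1",
    "mi355x": "https://mi355x.example.modular.cloud/v1",
}

def ask(hardware: str, prompt: str) -> str:
    # Same client, same request shape; only the base URL differs.
    client = OpenAI(base_url=ENDPOINTS[hardware], api_key="YOUR_API_KEY")
    resp = client.chat.completions.create(
        model="gemma-4-26b-a4b",  # hypothetical model id
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

for hw in ("b200", "mi355x"):
    print(hw, "->", ask(hw, "One sentence on why MoE keeps serving cheap."))
```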
→ Try Gemma 4 for free in the playground.
→ Read the full breakdown.
→ Deploy Gemma 4 on a dedicated Modular Cloud endpoint.
Which model are you planning to try first? Let us know in the thread!
