Thanks for the interest! As others have pointed out, a great place to start would be to run or serve the MedGemma-4B-it model using our Python MAX APIs. It shares an architecture with Gemma3-4B-it, which we support natively with an accelerated graph in MAX. Just swap out google/gemma-3-4b-it on that page for google/medgemma-4b-it and you'll get an accelerated model that should run well on your RTX A5000.
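To make that concrete, here's roughly what querying the served model looks like once an endpoint is up, using the OpenAI-compatible API that MAX serving exposes. The host, port, and prompt below are just placeholders I'm assuming for illustration; use whatever address your serving command actually reports:

```python
# Minimal sketch: query a locally served MedGemma model through an
# OpenAI-compatible endpoint. The base_url below is an assumption about
# the local default; substitute the address your server prints on startup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="EMPTY",                      # a local server doesn't need a real key
)

response = client.chat.completions.create(
    model="google/medgemma-4b-it",
    messages=[
        {"role": "user", "content": "Summarize common contraindications for ibuprofen."},
    ],
)
print(response.choices[0].message.content)
```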
While this is with our Python MAX API, under the hood everything is driven by Mojo. These models are built as computational graphs, with each node in that computation defined by one of our (now) open-sourced Mojo kernels. The entire graph is compiled and optimized through our graph compiler in MAX, and all orchestration for graph construction is currently staged using Python.
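If it helps to picture the "staged from Python" part, here's a rough sketch of defining a tiny graph with the Python graph API. This is illustrative only, not the real Gemma 3 definition, and the exact type and constructor names may differ from the current API; the Gemma 3 code in the repository is the authoritative reference:

```python
# Rough sketch: stage a small elementwise graph from Python. Each op here
# becomes a node in the graph, backed by a Mojo kernel and compiled/optimized
# by the MAX graph compiler. Treat names/signatures as approximate.
from max.dtype import DType
from max.graph import Graph, TensorType, ops

with Graph(
    "scale_and_shift",
    input_types=[TensorType(DType.float32, shape=[4])],  # one 1-D float32 input
) as graph:
    x = graph.inputs[0]
    y = ops.add(ops.mul(x, 2.0), 1.0)  # y = 2x + 1, built as graph nodes
    graph.output(y)
```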
We leaned into Python as an orchestration language for a few reasons that I describe elsewhere, but that doesn't mean we won't build a Mojo path for this in the future. Right now, for full model construction, I'd highly recommend a mix: our Python API for defining the model and Mojo for writing the high-performance kernels (with other portions bridged via the new Python → Mojo interoperability).
If you really want to dig into how this works, the full code for our Gemma 3 multimodal architecture is available in the modular repository.