Seeking Guidance: Low-Latency MedGemma Inference with Mojo 🔥

Hi everyone,

I hope you’re doing great!

I’m working on developing a real-time medical inference tool to assist professionals in generating differential diagnoses. For this, I plan to run the MedGemma-4B-it model locally on my machine using Mojo 🔥, focusing on low-latency inference.

:wrench: My system specs:

  • CPU: Intel Core i9-11th gen
  • GPU: NVIDIA RTX A5000 (16 GB VRAM)
  • RAM: 128 GB
  • OS: Ubuntu 22.04

:brain: Goal:

To build an inference engine in Mojo that can serve MedGemma-4B-it with minimal latency, suitable for real-time or near-real-time medical use.

:thinking: What I’m looking for:

  1. Where to start: I’m new to Mojo, and I’m unsure how to begin building an inference engine.
  2. Model loading and runtime: How do I load and serve the MedGemma-4B-it model efficiently in Mojo?
  3. GPU acceleration: How can I make use of Mojo’s capabilities to leverage the GPU (RTX A5000)?
  4. Best practices: Are there examples, templates, or prior projects doing similar work?
  5. Mentorship/volunteer guidance: I’d be extremely grateful for any mentorship, guidance, or even a learning-based internship opportunity related to this area.

:eyes: What I’ve tried so far:

  • Read the basic Mojo docs.
  • Familiar with PyTorch and Hugging Face models.
  • Installed MedGemma and successfully ran it with PyTorch on GPU.

:light_bulb: End Goal:

Build a production-ready prototype that can help clinicians interact with medical images and patient symptoms to receive intelligent, AI-assisted diagnostic suggestions.

I’m passionate about AI for healthcare, and this project means a lot to me. If you have any advice, resources, or could guide me in the right direction—it would mean the world.

Thank you in advance!

Modular already has an inference engine built with Mojo; are you sure you don’t want to use that? It supports over 500 models from Hugging Face.


Actually, that could work, but I wanted to build things from scratch as much as possible so that I understand the internals better.

Thanks!

Welcome to the Modular community. Modular (and many of its community members) has wrestled with the AI stack in the past. It was such an uphill slog that Modular came into existence precisely so that people doing it “from scratch” wouldn’t pull their hair out. This is just one of the many reasons behind @melodyogonna’s suggestion to use the inference engine.

Put another way, there’s plenty to learn from building a car from scratch. Alternatively, you can still learn everything about a car by starting with a working one and breaking it here and there. :slight_smile: Modular has a team of experts who have been building the MAX inference engine from the beginning. The source code for the MAX engine itself can be found here.

That being said, Modular is also built on the ability to go deep, mess around, and do everything you want from scratch if that’s your desire. I would assert it’s hard to end up near your desired destination if you don’t know what it looks like. Therefore, I would suggest getting started with MAX, as it requires no Mojo. At the end of those tutorials, Ehsan, the author, shows you how to load a custom model. If you’d still like to do it from scratch, I would then suggest Get started with GPU programming, which introduces you to Mojo through running code. At the bottom of that page, there are many jumping-off points depending on your interest and skill level.

To reiterate, Modular is literally designed to achieve your goal so that you don’t have to do it from scratch. If you would still like to do it from scratch, I hope I’ve offered a path above. HTH.

Thanks for the interest! As others have pointed out, a great place to start would be to run or serve the MedGemma-4B-it model using our Python MAX APIs. It shares an architecture with Gemma3-4B-it, which we support natively with an accelerated graph in MAX. Just swap out google/gemma-3-4b-it on that page with google/medgemma-4b-it and you’ll get an accelerated model that should run well on your RTX A5000.
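
For anyone following along, here is a minimal sketch of what that can look like once the model is being served locally. It assumes a recent MAX install where serving exposes an OpenAI-compatible endpoint on `localhost:8000` (the exact `max serve` command and flags may differ between releases, so check the serving docs for your version), and it uses the standard `openai` Python client rather than any MAX-specific API:

```python
# Serve the model first (exact command/flags depend on your MAX version), e.g.:
#   max serve --model-path=google/medgemma-4b-it
#
# MAX serving exposes an OpenAI-compatible endpoint, so any OpenAI client works.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # default local serve address (assumed)
    api_key="EMPTY",                      # no key needed for a local server
)

response = client.chat.completions.create(
    model="google/medgemma-4b-it",
    messages=[
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": "List differential diagnoses for acute chest pain."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, you can later swap in a different serving backend or move to a remote host without changing your client code.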

While this is with our Python MAX API, under the hood everything is driven by Mojo. These models are built as computational graphs, with each node in that computation defined by one of our (now) open-sourced Mojo kernels. The entire graph is compiled and optimized through our graph compiler in MAX, and all orchestration for graph construction is currently staged using Python.
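
To make the “computational graph” idea concrete, here is a toy sketch loosely based on the MAX Graph Python API. The module and class names (`max.graph.Graph`, `TensorType`, `ops`, `max.engine.InferenceSession`) exist in MAX, but the exact constructor arguments and `execute` signature have changed between releases (for example, newer versions may require a device on `TensorType`), so treat this as illustrative rather than copy-paste ready and consult the current docs:

```python
# Rough sketch: build a tiny graph, compile it, and run it with MAX.
import numpy as np
from max import engine
from max.dtype import DType
from max.graph import Graph, TensorType, ops

# Define a tiny graph: out = x + y
input_type = TensorType(dtype=DType.float32, shape=(4,))
with Graph("toy_graph", input_types=(input_type, input_type)) as graph:
    x, y = graph.inputs
    graph.output(ops.add(x, y))

# Compile and execute. Each node (here, the add) is ultimately backed by a
# Mojo kernel, and the graph compiler optimizes the whole graph before it runs.
session = engine.InferenceSession()
model = session.load(graph)
result = model.execute(
    np.array([1, -2, 3, -4], dtype=np.float32),
    np.array([1, 1, 1, 1], dtype=np.float32),
)
print(result)
```

A full LLM like MedGemma is the same idea at much larger scale: the Gemma 3 architecture in the modular repository is a graph of such ops, and MAX handles compilation, scheduling, and serving around it.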

We leaned into Python as an orchestration language for a few reasons that I describe elsewhere, but that doesn’t mean that we won’t build a Mojo path for this in the future. Right now, when it comes to full model construction, I do highly recommend looking at a mix of using our Python API for defining the model and Mojo for writing the high-performance kernels (and other portions via the new Python → Mojo interoperability).

If you really want to dig into how this works, the full code for our Gemma 3 multimodal architecture is available in the modular repository.
