Thanks for the interest! As others have pointed out, a great place to start would be to run or serve the MedGemma-4B-it model using our Python MAX APIs. It shares an architecture with Gemma3-4B-it, which we support natively with an accelerated graph in MAX. Just swap out google/gemma-3-4b-it on that page for google/medgemma-4b-it and you'll get an accelerated model that should run well on your RTX A5000.
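To make that concrete, here's roughly what querying the served model looks like once an endpoint is up, using the OpenAI-compatible API that MAX serving exposes. The host, port, and prompt below are just placeholders I'm assuming for illustration; use whatever address your serving command actually reports:

```python
# Minimal sketch: query a locally served MedGemma model through an
# OpenAI-compatible endpoint. The base_url below is an assumption about
# the local default; substitute the address your server prints on startup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="EMPTY",                      # a local server doesn't need a real key
)

response = client.chat.completions.create(
    model="google/medgemma-4b-it",
    messages=[
        {"role": "user", "content": "Summarize common contraindications for ibuprofen."},
    ],
)
print(response.choices[0].message.content)
```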
While this is with our Python MAX API, under the hood everything is driven by Mojo. These models are built as computational graphs, with each node in that computation defined by one of our (now) open-sourced Mojo kernels. The entire graph is compiled and optimized through our graph compiler in MAX, and all orchestration for graph construction is currently staged using Python.
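If it helps to picture the "staged from Python" part, here's a rough sketch of defining a tiny graph with the Python graph API. This is illustrative only, not the real Gemma 3 definition, and the exact type and constructor names may differ from the current API; the Gemma 3 code in the repository is the authoritative reference:

```python
# Rough sketch: stage a small elementwise graph from Python. Each op here
# becomes a node in the graph, backed by a Mojo kernel and compiled/optimized
# by the MAX graph compiler. Treat names/signatures as approximate.
from max.dtype import DType
from max.graph import Graph, TensorType, ops

with Graph(
    "scale_and_shift",
    input_types=[TensorType(DType.float32, shape=[4])],  # one 1-D float32 input
) as graph:
    x = graph.inputs[0]
    y = ops.add(ops.mul(x, 2.0), 1.0)  # y = 2x + 1, built as graph nodes
    graph.output(y)
```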
We leaned into Python as an orchestration language for a few reasons that I describe elsewhere, but that doesn't mean we won't build a Mojo path for this in the future. Right now, for full model construction, I'd highly recommend a mix: our Python API for defining the model and Mojo for writing the high-performance kernels (with other portions bridged via the new Python → Mojo interoperability).
If you really want to dig into how this works, the full code for our Gemma 3 multimodal architecture is available in the modular repository.