Multi GPU support for Gemma 3

zacksiri · April 7, 2025, 3:35am

Hi

I just started experimenting with modular’s ecosystem. I managed to get up and running with llama 3 family of models however my daily driver for my main work is Gemma 3 class of models.

When I tried loading the model by setting the --model-path and --weight-path I get the error that says multi gpu support only works with llama models.

Would it be possible to get multi gpu support with non llama models? Also I use q8_0 or q6_k_l would these quants be supported?

I have 2 A4500 GPU with Nvlink.

BradLarson · April 7, 2025, 4:47pm

Sorry about MAX not having a drop-in solution today for your preferred Gemma 3 model architecture in a multi-GPU configuration. There are two things we need to expand support in MAX for: a native MAX Graph implementation of the Gemma3ForConditionalGeneration architecture family, and broadening multi-GPU support beyond the Llama-family models. Both are being tracked internally, although I can’t promise when they’ll be available.

As I mention in this post, the full Python source code for the multi-GPU DistributedLlama3 model architecture is available. If you really wanted to hack on it yourself, it is possible to extend that architecture to possibly cover the Gemma 3 models, but we are only starting to pull together tutorials and documentation around building your own models in MAX or extending existing ones.

For quantization, we currently support Q6_K_M quantization on CPU only, and have received requests to look into Q8_0 quantization, so those capabilities are also tracked internally. QPTQ quantization is what we’ve favored so far on GPU for MAX models.

Thanks for the requests, that helps us determine priorities for bringup.

nlaanait · April 7, 2025, 7:28pm

@BradLarson Great to hear that model building docs are in the works! Happy to give feedback on the docs once you have a draft.

btw I started the implementation of mamba-2 architecture in MAX: max-mamba.

warshanks · June 24, 2025, 4:57am

Just chiming in to say that I would love multi GPU capability for the Gemma 3 family. I’d love to be able to replace vLLM with Max for my Medgemma 4b deployment.

Topic		Replies	Views
Customize a model - gemma3 Serving	1	49	August 15, 2025
MAX on CPU doubt and request MAX discussion , feature-request , 24_6	3	131	July 9, 2025
Modular: MAX 25.2: Unleash the power of your H200's–without CUDA! Content blog	0	49	March 25, 2025
NVIDIA hardware support in MAX 24.6 MAX discussion , 24_6	13	297	June 16, 2025
MAX Model Repository MAX	3	71	August 6, 2025

Multi GPU support for Gemma 3

Related topics