Multi-GPU support for Gemma 3

Hi

I just started experimenting with Modular's ecosystem. I managed to get up and running with the Llama 3 family of models, but my daily driver for my main work is the Gemma 3 class of models.

When I tried loading the model by setting --model-path and --weight-path, I got an error saying that multi-GPU support only works with Llama models.

Would it be possible to get multi-GPU support for non-Llama models? Also, I use Q8_0 or Q6_K_L quantization; would those quants be supported?

I have 2 A4500 GPUs with NVLink.

Sorry that MAX doesn't have a drop-in solution today for your preferred Gemma 3 model architecture in a multi-GPU configuration. There are two things we need in order to expand support in MAX: a native MAX Graph implementation of the Gemma3ForConditionalGeneration architecture family, and broader multi-GPU support beyond the Llama-family models. Both are being tracked internally, although I can't promise when they'll be available.

As I mentioned in this post, the full Python source code for the multi-GPU DistributedLlama3 model architecture is available. If you really wanted to hack on it yourself, it should be possible to extend that architecture to cover the Gemma 3 models, but we are only starting to pull together tutorials and documentation around building your own models in MAX or extending existing ones.
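To give a rough feel for the shape of such a port, here's a minimal sketch in plain Python. The names below (`Gemma3LayerSpec`, `gemma3_layer_plan`), the interleaving ratio, and the window size are illustrative placeholders for discussion, not the MAX API or verified Gemma 3 hyperparameters; the real starting point would be the DistributedLlama3 source itself.

```python
# Illustrative sketch only -- the class, function, and default values below are
# placeholders for discussion, not the MAX API or verified Gemma 3 settings.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Gemma3LayerSpec:
    """Per-layer settings where Gemma 3 diverges from a Llama 3-style stack."""
    use_sliding_window: bool       # Gemma 3 interleaves local and global attention
    sliding_window: Optional[int]  # window size for the local-attention layers
    rope_theta: float              # local and global layers use different RoPE bases


def gemma3_layer_plan(
    num_layers: int,
    global_every: int = 6,     # placeholder interleaving ratio
    window: int = 1024,        # placeholder sliding-window size
) -> list[Gemma3LayerSpec]:
    """Build an interleaved local/global attention plan for the decoder stack."""
    plan: list[Gemma3LayerSpec] = []
    for i in range(num_layers):
        is_global = (i + 1) % global_every == 0
        plan.append(
            Gemma3LayerSpec(
                use_sliding_window=not is_global,
                sliding_window=None if is_global else window,
                rope_theta=1_000_000.0 if is_global else 10_000.0,
            )
        )
    return plan


# A port would keep DistributedLlama3's tensor-parallel sharding intact and
# thread a plan like this through layer construction, alongside Gemma-specific
# details such as RMSNorm placement and tied embeddings.
```

In principle, the tensor-parallel sharding logic in DistributedLlama3 carries over unchanged; most of the work is in swapping the attention/RoPE configuration and norm details to match Gemma 3, so treat the above as a thought experiment until we have proper docs for extending architectures.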

For quantization, we currently support Q6_K_M quantization on CPU only, and we have received requests to look into Q8_0 quantization, so those capabilities are also tracked internally. GPTQ quantization is what we've favored so far on GPU for MAX models.

Thanks for the requests; that helps us determine priorities for bringup.


@BradLarson Great to hear that model-building docs are in the works! Happy to give feedback on the docs once you have a draft.

By the way, I started an implementation of the Mamba-2 architecture in MAX: max-mamba.
