Multi-GPU support for Gemma 3

Sorry that MAX doesn’t have a drop-in solution today for your preferred Gemma 3 model architecture in a multi-GPU configuration. Two things are needed to expand that support in MAX: a native MAX Graph implementation of the Gemma3ForConditionalGeneration architecture family, and broader multi-GPU support beyond the Llama-family models. Both are being tracked internally, although I can’t promise when they’ll be available.

As I mention in this post, the full Python source code for the multi-GPU DistributedLlama3 model architecture is available. If you really wanted to hack on it yourself, it may be possible to extend that architecture to cover the Gemma 3 models, but we are only just starting to pull together tutorials and documentation around building your own models in MAX or extending existing ones.
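To help gauge the scope of that work, here is a rough, hypothetical sketch in plain Python (deliberately not MAX API; every class, field, and value below is illustrative rather than something MAX ships) of the text-decoder differences you would have to account for when adapting a Llama-style distributed graph to Gemma 3: interleaved local/global attention, per-layer RoPE bases, QK-norm, and tied embeddings. The multimodal Gemma3ForConditionalGeneration variant would additionally need a vision tower and projector on top of this.

```python
from dataclasses import dataclass


@dataclass
class Gemma3TextDeltas:
    """Gemma 3 text-decoder behaviors a Llama-style graph doesn't model.

    All values are illustrative defaults, not a MAX or Gemma 3 spec.
    """
    # Most layers use sliding-window (local) attention, with a full
    # global-attention layer interleaved every few layers.
    sliding_window: int = 1024
    global_attention_every_n_layers: int = 6
    # Local and global layers use different RoPE bases.
    rope_theta_local: float = 10_000.0
    rope_theta_global: float = 1_000_000.0
    # Gemma 3 applies RMSNorm to the query and key projections (QK-norm).
    use_qk_norm: bool = True
    # Input embeddings are tied to the output projection.
    tie_word_embeddings: bool = True


def uses_global_attention(layer_idx: int, deltas: Gemma3TextDeltas) -> bool:
    """Illustrative helper: decide per layer whether attention is global."""
    return (layer_idx + 1) % deltas.global_attention_every_n_layers == 0


if __name__ == "__main__":
    deltas = Gemma3TextDeltas()
    pattern = [
        "global" if uses_global_attention(i, deltas) else "local"
        for i in range(12)
    ]
    print(pattern)  # mostly 'local', with a 'global' layer every few layers
```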

For quantization, we currently support Q6_K_M on CPU only, and we’ve received requests to look into Q8_0, so those capabilities are also being tracked internally. On GPU, GPTQ is the quantization scheme we’ve favored so far for MAX models.

Thanks for the requests; that helps us determine priorities for bringup.
