Multi-GPU support for Gemma 3

Sorry that MAX doesn’t have a drop-in solution today for your preferred Gemma 3 model architecture in a multi-GPU configuration. Two things are needed to expand that support in MAX: a native MAX Graph implementation of the Gemma3ForConditionalGeneration architecture family, and broader multi-GPU support beyond the Llama-family models. Both are being tracked internally, although I can’t promise when they’ll be available.

As I mention in this post, the full Python source code for the multi-GPU DistributedLlama3 model architecture is available. If you really wanted to hack on it yourself, it may be possible to extend that architecture to cover the Gemma 3 models, but we are only just starting to pull together tutorials and documentation around building your own models in MAX or extending existing ones.
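To help gauge the scope of that work, here is a rough, hypothetical sketch in plain Python (deliberately not MAX API; every class, field, and value below is illustrative rather than something MAX ships) of the text-decoder differences you would have to account for when adapting a Llama-style distributed graph to Gemma 3: interleaved local/global attention, per-layer RoPE bases, QK-norm, and tied embeddings. The multimodal Gemma3ForConditionalGeneration variant would additionally need a vision tower and projector on top of this.

```python
from dataclasses import dataclass


@dataclass
class Gemma3TextDeltas:
    """Gemma 3 text-decoder behaviors a Llama-style graph doesn't model.

    All values are illustrative defaults, not a MAX or Gemma 3 spec.
    """
    # Most layers use sliding-window (local) attention, with a full
    # global-attention layer interleaved every few layers.
    sliding_window: int = 1024
    global_attention_every_n_layers: int = 6
    # Local and global layers use different RoPE bases.
    rope_theta_local: float = 10_000.0
    rope_theta_global: float = 1_000_000.0
    # Gemma 3 applies RMSNorm to the query and key projections (QK-norm).
    use_qk_norm: bool = True
    # Input embeddings are tied to the output projection.
    tie_word_embeddings: bool = True


def uses_global_attention(layer_idx: int, deltas: Gemma3TextDeltas) -> bool:
    """Illustrative helper: decide per layer whether attention is global."""
    return (layer_idx + 1) % deltas.global_attention_every_n_layers == 0


if __name__ == "__main__":
    deltas = Gemma3TextDeltas()
    pattern = [
        "global" if uses_global_attention(i, deltas) else "local"
        for i in range(12)
    ]
    print(pattern)  # mostly 'local', with a 'global' layer every few layers
```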

For quantization, we currently support Q6_K_M on CPU only, and we’ve received requests to look into Q8_0, so those capabilities are also being tracked internally. On GPU, GPTQ is the quantization scheme we’ve favored so far for MAX models.

Thanks for the requests; that helps us determine priorities for bringup.
