Where are the SOTA quantized models?

There are multiple kinds of pre-quantized models supported in MAX (documentation here). The framework automatically detects the quantization encoding used in the source weights, and that encoding can vary from weight to weight within a single checkpoint.
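For a concrete look at that per-weight variation, here's a minimal sketch that inspects a GGUF checkpoint with the `gguf` Python package from the llama.cpp project (an assumption for illustration; it's not part of MAX) and prints each tensor's quantization type:

```python
# Sketch: list the quantization encoding of every tensor in a GGUF file.
# Assumes the `gguf` package from llama.cpp (pip install gguf); the file
# name is hypothetical. This is the metadata MAX detects automatically.
from gguf import GGUFReader

reader = GGUFReader("model-q4_k.gguf")
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum: Q4_0, Q4_K, Q6_K, F32, ...
    print(f"{tensor.name}: {tensor.tensor_type.name}")
```

In a typical llama.cpp-style checkpoint you'll usually see a mix, e.g. most matmul weights in q4_k with the output and embedding tensors in q6_k or f32.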

For CPU-based models the following quantization types are supported:

  • q4_0
  • q4_k
  • q6_k

And for GPU-accelerated models, we support:

  • GPTQ
  • AWQ

As for plain numerical datatypes, float32 and bfloat16 are supported natively across GPUs, and we are actively working on fp8 formats.
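To make those choices concrete, here's a back-of-the-envelope comparison of bits per weight. The block layouts are llama.cpp's published formats; the 8B parameter count is just an illustrative example, not any specific model:

```python
# Rough weight-memory footprint for an 8B-parameter model per encoding.
#   q4_0: blocks of 32 weights = one fp16 scale + 16 bytes of nibbles
#         = 18 bytes / 32 weights = 4.5 bits per weight
#   q6_k: 210-byte super-blocks of 256 weights = 6.5625 bits per weight
BITS_PER_WEIGHT = {
    "float32": 32.0,
    "bfloat16": 16.0,
    "fp8": 8.0,
    "q6_k": 210 * 8 / 256,  # 6.5625
    "q4_0": 18 * 8 / 32,    # 4.5
}

params = 8e9  # illustrative parameter count
for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>8}: {params * bits / 8 / 2**30:6.1f} GiB")
```

That works out to roughly 29.8 GiB at float32, 14.9 GiB at bfloat16, and 4.2 GiB at q4_0, which is why quantization matters so much for fitting a model on a single device.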

Which quantization types are supported on which platform comes down purely to which kernels have been brought up and optimized so far. That work has been driven by demand to date, with llama.cpp's quantization formats being popular for small models on CPU and GPTQ being used for larger models on GPUs.
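For anyone curious what one of those kernels actually has to compute, here is a deliberately simple NumPy sketch of q4_0 dequantization following llama.cpp's block layout (each 18-byte block holds one fp16 scale plus 16 bytes of packed 4-bit values for 32 weights). The real Mojo kernels fuse this into the matmul and vectorize it; this only shows the math:

```python
# Sketch of q4_0 dequantization per llama.cpp's block layout:
# dequantized weight = d * (q - 8), where d is the per-block fp16 scale.
import numpy as np

def dequantize_q4_0(raw: bytes) -> np.ndarray:
    """Turn a buffer of 18-byte q4_0 blocks into 32 float weights each."""
    blocks = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 18)
    d = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # scales
    qs = blocks[:, 2:]
    lo = (qs & 0x0F).astype(np.int8) - 8  # weights 0..15: low nibbles
    hi = (qs >> 4).astype(np.int8) - 8    # weights 16..31: high nibbles
    return (d * np.concatenate([lo, hi], axis=1)).reshape(-1)
```

Reading 18 bytes in and getting 32 floats out is also where the 4.5 bits-per-weight figure above comes from.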

This is an area where we'd welcome community contributions: all of our Mojo quantization-aware kernels are open source today, so anyone can build on them and expand support in the directions you want (q4_k on GPU, lower-bit quantizations, etc.). As I mentioned in a previous post, we even have some great items to give away to contributors who help expand our kernel support!