Where are the SOTA quantized models?

There are multiple kinds of pre-quantized models supported in MAX (documentation here). The framework automatically detects the quantization encoding used in the source weights, and that encoding can vary from weight to weight within a single checkpoint.
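For a concrete look at that per-weight variation, here's a minimal sketch that inspects a GGUF checkpoint with the `gguf` Python package from the llama.cpp project (an assumption for illustration; it's not part of MAX) and prints each tensor's quantization type:

```python
# Sketch: list the quantization encoding of every tensor in a GGUF file.
# Assumes the `gguf` package from llama.cpp (pip install gguf); the file
# name is hypothetical. This is the metadata MAX detects automatically.
from gguf import GGUFReader

reader = GGUFReader("model-q4_k.gguf")
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum: Q4_0, Q4_K, Q6_K, F32, ...
    print(f"{tensor.name}: {tensor.tensor_type.name}")
```

In a typical llama.cpp-style checkpoint you'll usually see a mix, e.g. most matmul weights in q4_k with the output and embedding tensors in q6_k or f32.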

For CPU-based models the following quantization types are supported:

  • q4_0
  • q4_k
  • q6_k

And for GPU-accelerated models, we support:

  • GPTQ
  • AWQ

As for plain numerical datatypes, float32 and bfloat16 are supported natively across GPUs, and we are actively working on fp8 formats.
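To make those choices concrete, here's a back-of-the-envelope comparison of bits per weight. The block layouts are llama.cpp's published formats; the 8B parameter count is just an illustrative example, not any specific model:

```python
# Rough weight-memory footprint for an 8B-parameter model per encoding.
#   q4_0: blocks of 32 weights = one fp16 scale + 16 bytes of nibbles
#         = 18 bytes / 32 weights = 4.5 bits per weight
#   q6_k: 210-byte super-blocks of 256 weights = 6.5625 bits per weight
BITS_PER_WEIGHT = {
    "float32": 32.0,
    "bfloat16": 16.0,
    "fp8": 8.0,
    "q6_k": 210 * 8 / 256,  # 6.5625
    "q4_0": 18 * 8 / 32,    # 4.5
}

params = 8e9  # illustrative parameter count
for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>8}: {params * bits / 8 / 2**30:6.1f} GiB")
```

That works out to roughly 29.8 GiB at float32, 14.9 GiB at bfloat16, and 4.2 GiB at q4_0, which is why quantization matters so much for fitting a model on a single device.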

Which quantization types are supported on which platform comes down purely to which kernels have been brought up and optimized so far. That work has been driven by demand to date, with llama.cpp's quantization formats being popular for small models on CPU and GPTQ being used for larger models on GPUs.
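For anyone curious what one of those kernels actually has to compute, here is a deliberately simple NumPy sketch of q4_0 dequantization following llama.cpp's block layout (each 18-byte block holds one fp16 scale plus 16 bytes of packed 4-bit values for 32 weights). The real Mojo kernels fuse this into the matmul and vectorize it; this only shows the math:

```python
# Sketch of q4_0 dequantization per llama.cpp's block layout:
# dequantized weight = d * (q - 8), where d is the per-block fp16 scale.
import numpy as np

def dequantize_q4_0(raw: bytes) -> np.ndarray:
    """Turn a buffer of 18-byte q4_0 blocks into 32 float weights each."""
    blocks = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 18)
    d = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # scales
    qs = blocks[:, 2:]
    lo = (qs & 0x0F).astype(np.int8) - 8  # weights 0..15: low nibbles
    hi = (qs >> 4).astype(np.int8) - 8    # weights 16..31: high nibbles
    return (d * np.concatenate([lo, hi], axis=1)).reshape(-1)
```

Reading 18 bytes in and getting 32 floats out is also where the 4.5 bits-per-weight figure above comes from.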

This is an area where we'd welcome community contributions: all of our Mojo quantization-aware kernels are open source today, so anyone can build on them and expand support in the directions you want (q4_k on GPU, lower-bit quantizations, etc.). As I mentioned in a previous post, we even have some great items to give away to contributors who help expand our kernel support!