Where are the SOTA quantized models?

A quick look at the models catalog, apparently lacking ANY (!?) quantizations, and my enthusiasm for this project is gone.

Arguments? Timeline?
Thanks
G.

I'm actually curious to know the answer to this myself.

It's possible that quantization-ready kernels simply haven't been implemented yet (I don't know!), but it's also possible that quantized models aren't considered conceptually separate from the non-quantized ones. If that's the case, using one or the other would just require different flags on initialisation (Hugging Face Transformers uses this approach to some degree).
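
For illustration, here's roughly what that flag-based approach looks like in Hugging Face Transformers with bitsandbytes (the model ID is just an example, and I'm not claiming MAX works this way):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same checkpoint, different initialisation flags: quantization is a
# loading option, not a separate model entry in the catalog.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # example model ID
    quantization_config=quant_config,    # drop this to load full precision
    device_map="auto",
)
```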

I think that without quantization, whatever speed they boast, the Modular stack isn't competitive, but I'd love to be shown otherwise. And to have some competition for Nvidia.

Off the top of my head: with a 2-bit quant I've been able to run (ok, walk :winking_face_with_tongue:) DeepSeek 671B (a 167 GB download) on a 3435 Xeon with 256 GB RAM and 2x 20 GB RTX 4000 Ada.

Now show me, please, Chris!
Thanks for any enlightenment
G.

There are multiple kinds of pre-quantized models supported in MAX (documentation here). The framework automatically detects the quantization used in the source weights, which can also vary from weight to weight.

For CPU-based models the following quantization types are supported:

  • q4_0
  • q4_k
  • q6_k

And for GPU-accelerated models, we support:

  • GPTQ
  • AWQ

For pure numerical datatypes, float32 and bfloat16 are supported natively across GPUs and we are actively working on fp8 formats.

The limitations on which quantization types are supported on which platform come down purely to which kernels have been brought up and optimized so far. Those to date have been driven by demand, with llama.cpp's quantization formats being popular for small models on CPU and GPTQ being used for larger models on GPUs.
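
If you want to see the auto-detection story for yourself, here's a quick way to inspect what a GGUF file actually contains, using the gguf-py package (the repo and filename are examples; substitute any GGUF checkpoint, and note this is just peeking at source weights, not MAX code):

```python
from huggingface_hub import hf_hub_download  # pip install huggingface_hub gguf
from gguf import GGUFReader

# Example repo/filename; swap in any GGUF you actually use.
path = hf_hub_download(
    repo_id="bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF",
    filename="DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf",
)

# Each tensor carries its own quantization type, which is why a "Q4_K_M"
# file can mix q4_k, q6_k, and float32 weights, and why the encoding is
# detected per weight rather than per model.
reader = GGUFReader(path)
for tensor in reader.tensors[:8]:
    print(f"{tensor.name:40s} {tensor.tensor_type.name}")
```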

This is an area where we'd welcome community contributions, because all of our Mojo quantization-aware kernels are open source today, and anyone can add to them and expand support in the directions you want (q4_k on GPU, lower-bit quantizations, etc.). As I mentioned in a previous post, we even have some great items to give away to contributors who help expand our kernel support!
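
To give a concrete picture of what such a kernel has to do, here is a simplified numpy-only sketch of the Q4_0 scheme: blocks of 32 weights share one float16 scale, and each weight is stored as a 4-bit quant. (The real llama.cpp and Mojo kernels differ in rounding and layout details; this just shows the idea.)

```python
import numpy as np

# Q4_0 stores weights in blocks of 32: one float16 scale per block plus
# 16 bytes packing 32 four-bit quants. Dequantization is w = d * (q - 8).

def quantize_q4_0(w: np.ndarray):
    """w: (n_blocks, 32) float32 -> (float16 scales, packed uint8 nibbles)."""
    d = np.abs(w).max(axis=1) / 8.0                    # scale quants into [-8, 7]
    d = np.maximum(d, 1e-12)                           # guard all-zero blocks
    q = (np.clip(np.round(w / d[:, None]), -8, 7) + 8).astype(np.uint8)
    packed = q[:, :16] | (q[:, 16:] << 4)              # two quants per byte
    return d.astype(np.float16), packed

def dequantize_q4_0(d: np.ndarray, packed: np.ndarray) -> np.ndarray:
    lo = (packed & 0x0F).astype(np.int8) - 8           # first 16 quants
    hi = (packed >> 4).astype(np.int8) - 8             # last 16 quants
    return d[:, None].astype(np.float32) * np.concatenate([lo, hi], axis=1)

# Round-trip a few random blocks to see the quantization error.
w = np.random.randn(4, 32).astype(np.float32)
d, packed = quantize_q4_0(w)
print(np.abs(w - dequantize_q4_0(d, packed)).max())
```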


Thanks a ton for the real answer, Brad!!
Now this looks entirely different, and points to some rrreally bad comms/documentation.
What I gleaned at first sight (after listening to half a dozen YouTube talks by Chris) was the 500+ models, all in bfloat16 and often not exactly hot.

Where is the ref to the quantized models?

I'm not up to the kernel task, but I've done some rrreally complicated docs work (Dynatrace), and once I'm hooked and have read up, I'll be in.

Cheers
G., off searching for the quant models stuff.

For the models, the quantization isn't considered separately from the model. For example, if you look at the entry for the Llama 8B distilled DeepSeek R1 (DeepSeek-R1-Distill-Llama-8B-Q4_K_M Model | MAX Builds), you can see the available quantizations in the drop-down alongside bfloat16 and the other base datatypes. The total model count is the number of distinct models; the quantizations are variants of those.

Sorry if that doesn't come across clearly; our goal has been to make this as transparent as possible, so that people don't need to worry about manual configuration here.

Things look much different from this vantage point, as I now see sizes for CPU/GPU with matching quants.
I was unlucky enough to have run across only bf16 ones at first, like the DeepSeek Qwen distill.

As much as I'm a techie and was hooked by Chris as such, I came to modular.com to run the SOTA models I'm interested in on reasonably priced prosumer HW.

For business customers, where the money is (paid support arm?), maybe the Models page (currently only in the main dropdown) could be featured a bit more prominently amidst all the techie stuff on the landing page.

Some mention of quants in the Filters sidebar would have avoided my confusion.
(For me it's then simple math to figure out what fits on my HW.)

Edit: Maybe bitness would be the better term for the span from 16 bits (or 32, if that still exists) down to 1.58, to give a better feel for the size (per parameter).
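
To make that "simple math" concrete, a tiny sketch (raw weight storage only; quants like Q4_K_M average a bit over 4 bits per weight, and KV cache and activations come on top):

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (decimal); excludes KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9

# DeepSeek R1 at 671B parameters:
for bits in (16, 8, 4.5, 2.0):  # bf16, q8, ~Q4_K_M average, ~2-bit quant
    print(f"{bits:>4} bits/weight -> {weight_gb(671e9, bits):7.1f} GB")
# ~2 bits/weight lands right around the 167 GB download mentioned above.
```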

The model detail pages look fine at first glance.

Thanks
G.