Where are the SOTA quantized models?

A quick look at the models catalog, apparently lacking ANY (!?) quantizations, and my enthusiasm for this project is gone.

Arguments? Timeline?
Thanks
G.

I'm actually curious to know the answer to this myself.

It's possible that quantization-ready kernels simply haven't been implemented yet (I don't know!), but it's also possible that quantized models aren't considered conceptually separate from the non-quantized ones. If that's the case, using one or the other would just require different flags on initialisation (Hugging Face Transformers uses this approach to some degree).
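
For illustration, here's roughly what that flag-based approach looks like in Hugging Face Transformers with bitsandbytes (the model ID is just an example, and I'm not claiming MAX works this way):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same checkpoint, different initialisation flags: quantization is a
# loading option, not a separate model entry in the catalog.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # example model ID
    quantization_config=quant_config,    # drop this to load full precision
    device_map="auto",
)
```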

I think that without quantization, whatever speed they boast, the Modular stack isn't competitive, but I'd love to be shown otherwise. And to have some competition for Nvidia.

Off the top of my head: with a 2-bit quant I've been able to run (ok, walk :winking_face_with_tongue:) DeepSeek 671B (a 167 GB download) on a 3435 Xeon with 256 GB RAM and 2x 20 GB RTX 4000 Ada.

Now show me, please, Chris!
Thanks for any enlightenment
G.

There are multiple kinds of pre-quantized models supported in MAX (documentation here). The framework automatically detects the quantization used in the source weights, which can also vary from weight to weight.

For CPU-based models the following quantization types are supported:

  • q4_0
  • q4_k
  • q6_k

And for GPU-accelerated models, we support:

  • GPTQ
  • AWQ

For pure numerical datatypes, float32 and bfloat16 are supported natively across GPUs and we are actively working on fp8 formats.

The limitations on which quantization types are supported on which platform come down purely to which kernels have been brought up and optimized so far. Those to date have been driven by demand, with llama.cpp's quantization formats being popular for small models on CPU and GPTQ being used for larger models on GPUs.
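
If you want to see the auto-detection story for yourself, here's a quick way to inspect what a GGUF file actually contains, using the gguf-py package (the repo and filename are examples; substitute any GGUF checkpoint, and note this is just peeking at source weights, not MAX code):

```python
from huggingface_hub import hf_hub_download  # pip install huggingface_hub gguf
from gguf import GGUFReader

# Example repo/filename; swap in any GGUF you actually use.
path = hf_hub_download(
    repo_id="bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF",
    filename="DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf",
)

# Each tensor carries its own quantization type, which is why a "Q4_K_M"
# file can mix q4_k, q6_k, and float32 weights, and why the encoding is
# detected per weight rather than per model.
reader = GGUFReader(path)
for tensor in reader.tensors[:8]:
    print(f"{tensor.name:40s} {tensor.tensor_type.name}")
```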

This is an area where we'd welcome community contributions, because all of our Mojo quantization-aware kernels are open source today, and anyone can add to them and expand support in the directions you want (q4_k on GPU, lower-bit quantizations, etc.). As I mentioned in a previous post, we even have some great items to give away to contributors who help expand our kernel support!
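
To give a concrete picture of what such a kernel has to do, here is a simplified numpy-only sketch of the Q4_0 scheme: blocks of 32 weights share one float16 scale, and each weight is stored as a 4-bit quant. (The real llama.cpp and Mojo kernels differ in rounding and layout details; this just shows the idea.)

```python
import numpy as np

# Q4_0 stores weights in blocks of 32: one float16 scale per block plus
# 16 bytes packing 32 four-bit quants. Dequantization is w = d * (q - 8).

def quantize_q4_0(w: np.ndarray):
    """w: (n_blocks, 32) float32 -> (float16 scales, packed uint8 nibbles)."""
    d = np.abs(w).max(axis=1) / 8.0                    # scale quants into [-8, 7]
    d = np.maximum(d, 1e-12)                           # guard all-zero blocks
    q = (np.clip(np.round(w / d[:, None]), -8, 7) + 8).astype(np.uint8)
    packed = q[:, :16] | (q[:, 16:] << 4)              # two quants per byte
    return d.astype(np.float16), packed

def dequantize_q4_0(d: np.ndarray, packed: np.ndarray) -> np.ndarray:
    lo = (packed & 0x0F).astype(np.int8) - 8           # first 16 quants
    hi = (packed >> 4).astype(np.int8) - 8             # last 16 quants
    return d[:, None].astype(np.float32) * np.concatenate([lo, hi], axis=1)

# Round-trip a few random blocks to see the quantization error.
w = np.random.randn(4, 32).astype(np.float32)
d, packed = quantize_q4_0(w)
print(np.abs(w - dequantize_q4_0(d, packed)).max())
```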


Thanks a ton for the real answer, Brad!!
Now this looks entirely different, and points to some rrreally bad comms/documentation.
What I gleaned at first sight (after listening to half a dozen YouTube talks by Chris) was the 500+ models, all in bfloat16 and often not exactly hot.

Where is the ref to the quantized models?

I'm not up to the kernel task, but I've done some rrreally complicated docs work (Dynatrace), and once I'm hooked and have read up, I'll be in.

Cheers
G., off searching for the quant models stuff.

For the models, the quantization isn't considered separately from the model. For example, if you look at the entry for the Llama 8B distilled DeepSeek R1 (DeepSeek-R1-Distill-Llama-8B-Q4_K_M Model | MAX Builds), you can see the available quantizations in the drop-down alongside bfloat16 and the other base datatypes. The total model count is the number of distinct models; the quantizations are variants of those.

Sorry if that doesn't come across clearly; our goal has been to make this as transparent as possible, so that people don't need to worry about manual configuration here.

Things look much different from this vantage point, as I now see sizes for CPU/GPU with matching quants.
I was unlucky enough to have run across only bf16 ones at first, like the DeepSeek Qwen distill.

As much as I'm a techie and was hooked by Chris as such, I came to modular.com to run the SOTA models I'm interested in on reasonably priced prosumer HW.

For business customers, where the money is (paid support arm?), maybe the Models page (currently only in the main dropdown) could be featured a bit more prominently amidst all the techie stuff on the landing page.

Some mention of quants in the Filters sidebar would have avoided my confusion.
(For me it's then simple math to figure out what fits on my HW.)

Edit: Maybe bitness would be the better term for the span from 16 bits (or 32, if that still exists) down to 1.58, to give a better feel for the size (per parameter).
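
To make that "simple math" concrete, a tiny sketch (raw weight storage only; quants like Q4_K_M average a bit over 4 bits per weight, and KV cache and activations come on top):

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB (decimal); excludes KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9

# DeepSeek R1 at 671B parameters:
for bits in (16, 8, 4.5, 2.0):  # bf16, q8, ~Q4_K_M average, ~2-bit quant
    print(f"{bits:>4} bits/weight -> {weight_gb(671e9, bits):7.1f} GB")
# ~2 bits/weight lands right around the 167 GB download mentioned above.
```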

The model detail pages look fine at first glance.

Thanks
G.