MAX on CPU: question and request

As far as I understand (feel free to correct me), most AI inference workloads, especially LLMs, are memory-bound, so running inference on CPU can also be a great idea. So I thought it would be great to also have a MAX server container for CPU.

I’ve also been trying to run some models with MAX on CPU using the modular/max-openai-api container, just removing the GPU option in entrypoint.sh. Although I’ve managed to run Mixtral with it, I can’t run the modularai/llama-3.1 model as in the GPU tutorial: it crashes as soon as I make a request.

I don’t know if that’s because the model only supports GPU (I’ve never heard of such a thing) or if I may be doing something wrong.


If you have reproducible steps for the Llama 3.1 model crashing on CPU, would you mind sharing them here or in a GitHub issue? MAX fully supports running efficiently on CPUs as well as GPUs, and there’s nothing GPU-specific about any of our models. I will say that bfloat16 weights will only work on GPU and on Intel CPUs, because ARM instructions are currently missing for bfloat16 operations.

When running on CPU, I highly recommend looking into using quantized weights. We support the q4_0, q4_k, and q6_k quantization schemes, and MAX delivers state-of-the-art performance for many quantized models on CPU. Our server container should support this if you remove the --use-gpu flag from the model invocation and add --quantization-encoding q4_k when pointing at modularai/llama-3.1 or another Hugging Face repository that has quantized weights in GGUF format.
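To make that concrete, here is a rough sketch of what the container invocation could look like with those changes applied. This is a hypothetical example, not a verbatim command: the exact image entrypoint arguments and port may differ in your MAX version, so treat the flag placement as illustrative.

```shell
# Hypothetical CPU serving sketch: note there is no --use-gpu flag,
# and --quantization-encoding q4_k selects q4_k GGUF weights from the repo.
docker run -p 8000:8000 \
  modular/max-openai-api \
  --model-path modularai/llama-3.1 \
  --quantization-encoding q4_k
```

The key point is simply the pairing: drop the GPU flag and pick a quantization encoding that matches weights actually present in the Hugging Face repository you point at.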

In addition to using our container, you can set up Magic on your local machine, clone our MAX repository, and follow these steps to serve a model on CPU or GPU using Magic and MAX.
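Once a server is up (whether via the container or Magic), you can sanity-check it with an OpenAI-compatible request. A minimal sketch, assuming the server listens on localhost:8000 and exposes the standard chat completions endpoint:

```shell
# Hypothetical smoke test against a locally running MAX OpenAI-compatible server.
# The model name should match whatever the server was started with.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "modularai/llama-3.1",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```

If the CPU path is working, this should return a normal chat completion; if the server crashes on the request instead, that is exactly the kind of reproduction worth capturing in a GitHub issue.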


Thanks for the recommendation, I will keep playing around with it! I’ve also opened a GitHub issue as you requested. I tried with both AMD and Intel chips to check whether the chip was the problem, but both gave me the same error.
