MAX on CPU: question and request

As far as I understand (feel free to correct me), most AI inference workloads, especially LLMs, are memory-bound, so running inference on CPU can also be a great idea. So I thought it would be great to also have a MAX server container for CPU.

I’ve also been trying to run some models with MAX on CPU using the modular/max-openai-api container, just removing the GPU option in entrypoint.sh. Although I’ve managed to run Mixtral with it, I can’t run the modularai/llama-3.1 model as in the GPU tutorial: it crashes as soon as I make a request.

I don’t know if that’s because the model only supports GPU (I’ve never heard of such a thing) or if I may be doing something wrong.


If you have reproducible steps for the Llama 3.1 model crashing on CPU, would you mind sharing them here or in a GitHub issue? MAX fully supports running efficiently on CPUs as well as GPUs, and there’s nothing GPU-specific about any of our models. I will say that bfloat16 weights will only work on GPU and on Intel CPUs, because ARM instructions are currently missing for bfloat16 operations.

When running on CPU, I highly recommend looking into using quantized weights. We support the q4_0, q4_k, and q6_k quantization schemes, and MAX delivers state-of-the-art performance for many quantized models on CPU. Our server container should support this if you remove the --use-gpu flag from the model invocation and add --quantization-encoding q4_k when pointing at modularai/llama-3.1 or another Hugging Face repository that has quantized weights in GGUF format.
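To make that concrete, here is a rough sketch of what the container invocation could look like with those changes applied. This is a hypothetical example, not a verbatim command: the exact image entrypoint arguments and port may differ in your MAX version, so treat the flag placement as illustrative.

```shell
# Hypothetical CPU serving sketch: note there is no --use-gpu flag,
# and --quantization-encoding q4_k selects q4_k GGUF weights from the repo.
docker run -p 8000:8000 \
  modular/max-openai-api \
  --model-path modularai/llama-3.1 \
  --quantization-encoding q4_k
```

The key point is simply the pairing: drop the GPU flag and pick a quantization encoding that matches weights actually present in the Hugging Face repository you point at.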

In addition to using our container, you can set up Magic on your local machine, clone our MAX repository, and follow these steps to serve a model on CPU or GPU using Magic and MAX.
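Once a server is up (whether via the container or Magic), you can sanity-check it with an OpenAI-compatible request. A minimal sketch, assuming the server listens on localhost:8000 and exposes the standard chat completions endpoint:

```shell
# Hypothetical smoke test against a locally running MAX OpenAI-compatible server.
# The model name should match whatever the server was started with.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "modularai/llama-3.1",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```

If the CPU path is working, this should return a normal chat completion; if the server crashes on the request instead, that is exactly the kind of reproduction worth capturing in a GitHub issue.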


Thanks for the recommendation, I will keep playing around with it! I’ve also opened a GitHub issue as you requested. I tried with both AMD and Intel chips to check whether the chip was the problem, but both gave me the same error.
