DeepSeek-OCR inference on CPU

Can I run DeepSeek-OCR inference with MAX, using CPU only?

Among the built-in architectures in MAX, DeepseekV2ForCausalLM and DeepseekV3ForCausalLM are present and in active development. However, these architectures are oriented toward GPU serving; given their size, we have had no requests yet from anyone to run them on CPU.

The relatively new DeepSeek-OCR uses the DeepseekOCRForCausalLM architecture, which differs notably from the above. It’s possible to build it out in MAX, but the Modular team hasn’t yet created a version of it. Furthermore, to be practical on CPU it would need a heavily quantized model, and so far we’ve only optimized CPU-based quantization for the Llama family of models.
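For context, serving a quantized Llama-family model on CPU with MAX looks roughly like the sketch below. The model path is only an illustrative example, and the exact flags may differ between MAX releases, so treat this as an assumption and check the MAX documentation for the current invocation:

```shell
# Sketch (assumed invocation): serve a quantized GGUF Llama model with MAX.
# The model path is an example repo, not a recommendation; consult the MAX
# docs for device selection flags if CPU isn't picked automatically.
max serve --model-path=modularai/Llama-3.1-8B-Instruct-GGUF
```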


Thanks for your feedback, appreciated.

Running DeepSeek-OCR on CPU alone isn’t really practical for inference, even with MAX: the model and its dependencies are designed around GPU acceleration and CUDA support, so CPU-only execution is very slow or may not work at all. Most guides and community discussions recommend an NVIDIA GPU setup with adequate VRAM for reasonable throughput, and advise against CPU-only usage for real-world tasks.

CPUs can reach fairly high memory bandwidth, with AMD’s EPYC Turin able to easily blow past all non-5090 consumer GPUs. Similarly, AVX-512 provides a very reasonable number of flops when you consider that 256 Zen 5 cores have the same aggregate vector width as 128 NVIDIA SMs, meaning that if you are compute-bound (as many convnet models are), a large CPU is not a horrible choice versus a GPU, especially at higher batch counts. CPU instances on public clouds are often quite a bit cheaper as well, so sometimes the economics work out in favor of the CPU.
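As a back-of-the-envelope check of the vector-width comparison, here is a small arithmetic sketch. The per-core pipe count for Zen 5 and the 128 FP32 lanes per SM are my own assumptions about the hardware, not figures stated in this thread:

```python
# Rough per-cycle vector width: 256 Zen 5 cores vs 128 NVIDIA SMs.
# Assumptions (not from the thread): each Zen 5 core has four 512-bit
# FP pipes; each NVIDIA SM has 128 FP32 lanes of 32 bits each.
zen5_cores = 256
zen5_pipes_per_core = 4              # assumed: e.g. 2 FMA + 2 FADD units
zen5_bits = zen5_cores * zen5_pipes_per_core * 512

sm_count = 128
sm_bits = sm_count * 128 * 32        # 128 FP32 lanes per SM

print(zen5_bits, sm_bits)            # both come out to 524288 bits/cycle
```

Under those assumptions the two aggregate widths match exactly, which is the sense in which a large Zen 5 part can be compute-competitive with a mid-sized GPU.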