On DGX Spark GB10 and Unified memory machines, the default settings for MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=90 result in compiler OOMing even for very small models like Qwen/Qwen2.5-0.5B-Instruct.
After spending a whole day trying various debug settings, code paths, finally figured out the following works.
Change to a lower value like export MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=15 to get things to work. Memory spike dropped from 108G → 33G and model loaded with correct output in 96s.
Working example:
docker run -d --gpus=all \ -e HF_HUB_OFFLINE=1 -e HF_HOME=/hf \ -e MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=15 \ -v ~/.cache/huggingface:/hf -p 8000:8000 \ ``docker.modular.com/modular/max-nvidia-full:nightly`` \
--model Qwen/Qwen2.5-0.5B-Instruct --devices gpu:0 \
--max-length 2048 --max-batch-size 1
Hope it helps and saves you time from chasing down different rabbit holes.