Unfortunately, the Blackwell series introduced a pretty strong divergence between the sm_100 family of GPUs (B200 / B300) and sm_12x (RTX 50XX series, GB10 on the DGX Spark). We’ve spent the bulk of our time optimizing the former for our enterprise workloads, and have only recently started enabling the basics for the latter. Many of the optimizations described in our Blackwell blog post series don’t apply to the sm_12x series, because those GPUs lack the hardware for them. They need their own dedicated, different kernels.
That does mean that you’ll currently run into issues with specific kernels on that platform, where we haven’t yet worked through the proper platform checks or built hardware-specific kernels. I’ve started to aggregate reported issues for consumer Blackwell in a GitHub epic, "[Feature Request] [Epic] Extend support for NVIDIA sm_120 / sm_121 consumer Blackwell GPUs" (Issue #6570 in modular/modular), to have a public central location for tracking progress and highlighting reported incompatibilities, but we do need a few more issues filed there to identify known shortcomings. There are also a number of community-contributed fixes we’re a little behind on reviewing, and we will try to get some of those landed to help expand compatibility.
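To illustrate what a platform check of this kind might look like, here is a minimal Python sketch that routes a CUDA compute capability to a kernel family. It is not how MAX does its dispatch. The function names and kernel labels are hypothetical, and the mapping (major version 10 for the sm_100 family, major version 12 for sm_12x) is an assumption read off the architecture names above.

```python
from typing import Optional


def blackwell_family(major: int, minor: int) -> Optional[str]:
    """Map a CUDA compute capability to a Blackwell family label.

    Assumption: major 10 covers the sm_100 family (B200 / B300) and
    major 12 covers sm_12x (RTX 50XX, GB10). Labels are hypothetical.
    """
    if major == 10:
        return "sm_100-family"
    if major == 12:
        return "sm_12x"
    return None  # anything else is outside the scope of this sketch


def select_kernel(major: int, minor: int) -> str:
    """Pick a (hypothetical) kernel variant for the given capability."""
    family = blackwell_family(major, minor)
    if family == "sm_100-family":
        return "datacenter_blackwell_kernel"
    if family == "sm_12x":
        # These GPUs need their own kernels, not the sm_100 ones.
        return "consumer_blackwell_kernel"
    return "generic_fallback_kernel"


if __name__ == "__main__":
    print(select_kernel(12, 1))
```

In practice the `(major, minor)` pair would come from the driver or a framework query, for example `torch.cuda.get_device_capability()` if PyTorch happens to be installed.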
I do hear you on the desire for published numbers and settings to use when comparing against common local LLM inference systems like llama.cpp. We have a little more work to do to expand compatibility on consumer GPUs and tune their performance, but when we’re ready I very much would like to show how MAX compares to llama.cpp on locally-run models. Again, our emphasis to date has been on large-scale deployments, driven by customer demand, which is why our published benchmarks and guides have been largely in that direction.
Thanks @BradLarson for the follow-up. Great to hear your team is looking to prioritize support for consumer GPUs.
The biggest challenge right now for GPUs with unified memory is figuring out the right sequence of steps to get things working, given how unified memory is managed. It took me a while to figure out how to get past the compiler OOM on DGX Spark GB10 / unified memory machines, even for 0.5B-size models.
It would be very helpful to publish a step-by-step guide for getting a medium-sized model (20-40B) working with optimal settings. For example: start by compiling the model; if that fails, turn on these logs or try eager mode; verify that the base kernels work on your GPU (for reference, for a Llama 8B model matmuls should take 20% of the time); check that FA is actually working for the GPU; check that the quantization is supported; check that the model architecture is supported, and if it’s missing, do this…
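As a rough illustration of the kind of ordered checklist I mean, here is a small Python sketch that runs checks in sequence and reports the first one that fails. The check names mirror the steps above, but every check body is a placeholder lambda. None of this calls a real MAX API, and `first_failing_step` is a hypothetical helper.

```python
from typing import Callable, List, Optional, Tuple

# A check is a human-readable name plus a callable returning True on success.
Check = Tuple[str, Callable[[], bool]]


def first_failing_step(checks: List[Check]) -> Optional[str]:
    """Run checks in order; return the name of the first failure, or None."""
    for name, check in checks:
        if not check():
            return name
    return None


# Placeholder results only; real checks would probe the actual system.
checks: List[Check] = [
    ("model compiles", lambda: True),
    ("base kernels run on this GPU", lambda: True),
    ("FA works on this GPU", lambda: False),
    ("quantization is supported", lambda: True),
    ("model architecture is supported", lambda: True),
]

if __name__ == "__main__":
    failed = first_failing_step(checks)
    print(failed if failed is not None else "all checks passed")
```

Stopping at the first failure keeps the order meaningful: there is little point checking quantization support if the base kernels do not run at all.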
Looking forward to seeing Modular’s Mojo help solve these kernel optimization issues in a way that works across different GPUs.