Optimized Kernels for Blackwell: do they work on GB10?

I am seeing very low performance for model inference on the NVIDIA DGX Spark (GB10) with Modular Mojo / max serve. Which parts of the Blackwell kernel series (Modular: Matrix Multiplication on Blackwell) are available for consumer GPUs?

Challenges observed:

  • Flash Attention does not take the optimized path and falls back to a slow generic path.
  • Quantized model support on GPU is not working: FP8 and NVFP4 fail, and qx_y is supported only on CPUs, not on GPUs. I could only get bf16 to work, which keeps any large model from running on GB10 with max serve.
    Quantization encodings on sm_121 GPU:

| Encoding | sm_121 GPU? |
|---|---|
| bfloat16 | :white_check_mark: only working GPU path |
| float8_e4m3fn (fp8) | :cross_mark: SM100-gated |
| float4_e2m1fnx2 (fp4 / NVFP4) | :cross_mark: SM100-gated |
| q4_k / q6_k (GGUF) | :cross_mark: CPU-only |
| GPTQ | :warning: GPU-capable but Llama-arch only |

  • For Gemma4-31B-it, I got the bf16 model working with max serve, but I am only seeing <2 tok/s versus 100+ tok/s with llama.cpp (using MTP with GGUF-quantized versions) on the same GB10. Something is clearly off; I must be missing the right settings for max serve or Mojo compile/autotune.
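
As a rough sanity check on the throughput numbers above, here is a back-of-the-envelope bound in Python. It assumes a dense model whose weights are all read once per generated token, so decode speed is capped by memory bandwidth divided by weight size. The 273 GB/s bandwidth is my assumption for GB10 rather than a measured value, the 0.5 bytes/param figure for 4-bit ignores quantization scales, and the function name is only illustrative.

```python
def decode_tok_per_s_bound(params_billion: float,
                           bytes_per_param: float,
                           bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Assumes every weight is streamed from memory once per generated token
    and ignores KV-cache traffic, activations, and compute time.
    """
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_s / weight_gb


# ASSUMPTION: unified-memory bandwidth for GB10; replace with a measured value.
ASSUMED_BANDWIDTH_GB_S = 273.0

bf16_bound = decode_tok_per_s_bound(31, 2.0, ASSUMED_BANDWIDTH_GB_S)
q4_bound = decode_tok_per_s_bound(31, 0.5, ASSUMED_BANDWIDTH_GB_S)

print(f"bf16:  {31 * 2.0:.1f} GB of weights, at most {bf16_bound:.1f} tok/s")
print(f"4-bit: {31 * 0.5:.1f} GB of weights, at most {q4_bound:.1f} tok/s")
```

Under these assumptions bf16 is capped at a few tok/s while a 4-bit model has about 4x the headroom, so part of the gap to llama.cpp may come from the GGUF quantization itself. Multi-token prediction can also exceed this single-token bound, because several tokens share one pass over the weights.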

It would be great to have an optimization guide for running large models locally: which knobs are available for tuning, which quantized models work on GPUs, and some published reference numbers for key models.

Unfortunately, the Blackwell series introduced a pretty strong divergence between the sm_100 family of GPUs (B200 / B300) and sm_12x (RTX 50XX series, GB10 on the DGX Spark). We’ve spent the bulk of our time optimizing the former for our enterprise workloads, and have only recently started enabling the basics for the latter. Many of the optimizations described in our Blackwell blog post series don’t apply to the sm_12x series, because those GPUs lack the hardware the optimizations rely on. The sm_12x GPUs need their own dedicated kernels.
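
To see which side of that split a given machine falls on, a small Python sketch like the following can help. It assumes `nvidia-smi` is on the PATH and that the driver supports the `compute_cap` query field; the function names and the family labels are illustrative, not part of MAX or Mojo.

```python
import subprocess


def gpu_family(compute_cap: str) -> str:
    """Map a compute capability string such as "12.1" to a coarse family label."""
    major = int(compute_cap.strip().split(".")[0])
    if major == 10:
        return "sm_100"  # datacenter Blackwell family (B200 / B300)
    if major == 12:
        return "sm_12x"  # consumer Blackwell (RTX 50XX, GB10)
    return "other"


def query_compute_caps() -> list[str]:
    """Ask nvidia-smi for each GPU's compute capability; [] if it is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


for cap in query_compute_caps():
    print(f"compute capability {cap} -> {gpu_family(cap)}")
```

A GB10 is expected to land in the sm_12x bucket, which is the family where the dedicated kernels are still being enabled.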

That does mean that you’ll currently run into issues with specific kernels on that platform, where we haven’t yet worked through the proper platform checks or built hardware-specific kernels. I’ve started to aggregate reported issues for consumer Blackwell in a GitHub epic, "[Feature Request] [Epic] Extend support for NVIDIA sm_120 / sm_121 consumer Blackwell GPUs" (Issue #6570 in modular/modular), to have a public central location for tracking progress and highlighting reported incompatibilities, but we do need a few more issues there to identify known shortcomings. There are also a number of community-contributed fixes we’re a little behind on reviewing, and we will try to get some of those landed to help expand compatibility.

I do hear you on the desire for published numbers and settings to use when comparing against common local LLM inference systems like llama.cpp. We have a little more work to do to expand compatibility on consumer GPUs and tune their performance, but when we’re ready I very much would like to show how MAX compares to llama.cpp on locally-run models. Again, our emphasis to date has been on large-scale deployments, driven by customer demand, which is why our published benchmarks and guides have been largely in that direction.

Thanks @BradLarson for the follow-up. Great to hear your team is looking to prioritize support for consumer GPUs.

The biggest challenge right now on GPUs with unified memory is figuring out the right sequence of steps to get things working, given how unified memory is managed. It took me a while to figure out how to get past the compiler OOM on DGX Spark GB10 / unified memory machines, even for 0.5B-size models.

It would be very helpful to publish a step-by-step guide for getting a medium-sized model (20-40B) working with optimal settings. For example: start by compiling the model; if that fails, turn on these logs or try eager mode; verify that the base kernels work on your GPU (for reference, for a Llama 8B model matmuls should take 20% of the time); check that FA is actually working on the GPU; check that the quantization is supported; check that the model architecture is supported, and if it is missing, do this…
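
The "how much time should each kernel take" check in that list could look something like the Python sketch below. It only does the arithmetic on per-category kernel times that you would have to collect yourself with a profiler; the category names and the millisecond values are made-up placeholders for illustration, not measurements from GB10 or output from any MAX tool.

```python
def time_shares(kernel_ms: dict[str, float]) -> dict[str, float]:
    """Convert per-category kernel times (ms) into fractions of total time."""
    total = sum(kernel_ms.values())
    if total <= 0:
        raise ValueError("total kernel time must be positive")
    return {name: ms / total for name, ms in kernel_ms.items()}


def largest_share(kernel_ms: dict[str, float]) -> tuple[str, float]:
    """Return the category that dominates the profile and its share."""
    shares = time_shares(kernel_ms)
    name = max(shares, key=shares.get)
    return name, shares[name]


# HYPOTHETICAL placeholder profile: substitute real profiler totals.
example_profile_ms = {"matmul": 40.0, "attention": 150.0, "other": 10.0}

shares = time_shares(example_profile_ms)
for name, share in shares.items():
    print(f"{name:>10}: {share:.0%}")

dominant, share = largest_share(example_profile_ms)
print(f"dominant category: {dominant} ({share:.0%})")
```

Comparing these shares against a published reference breakdown for the same model would quickly show whether a kernel such as attention is stuck on a slow fallback path.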

Looking forward to seeing Modular Mojo deliver kernel optimizations that work across different GPUs.