Optimized Kernels for Blackwell -- do they work on GB10

gchauhan · June 22, 2026, 1:58am

I am seeing very low performance numbers for model inference on the NVIDIA DGX Spark (GB10) with Modular Mojo / max serve. Which parts of the Blackwell kernel series (Modular: Matrix Multiplication on Blackwell) are available for consumer GPUs?

Challenges observed:

Flash Attention does not go through the optimized path and falls back to the slow generic path.
Quantized model support on GPU: FP8 and NVFP4 are not working, and qx_y only supports CPUs, not GPUs. I could only get bf16 to work, which prevents any large model from running on GB10 with max serve.
Quantization encodings on sm_121 GPU

Encoding	sm_121 GPU?
`bfloat16`	only working GPU path
`float8_e4m3fn` (fp8)	SM100-gated
`float4_e2m1fnx2` (fp4 / NVFP4)	SM100-gated
`q4_k` / `q6_k` (GGUF)	CPU-only
GPTQ	GPU-capable but Llama-arch only
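As a minimal sketch, the table above can be written as a lookup that a launch script consults before picking an encoding. The statuses are transcribed from this post only; the names `SM121_ENCODING_STATUS` and `runs_on_sm121_gpu` are hypothetical and are not part of any MAX API.

```python
# Status of each encoding on sm_121, transcribed from the table in this post.
# The dictionary and function names are illustrative, not a MAX API.
SM121_ENCODING_STATUS = {
    "bfloat16": "gpu",                 # only working GPU path
    "float8_e4m3fn": "sm100_gated",    # fp8
    "float4_e2m1fnx2": "sm100_gated",  # fp4 / NVFP4
    "q4_k": "cpu_only",                # GGUF
    "q6_k": "cpu_only",                # GGUF
    "gptq": "gpu_llama_only",          # GPU-capable, Llama architecture only
}


def runs_on_sm121_gpu(encoding: str, architecture: str = "") -> bool:
    """Return True if the reported status allows a GPU run on sm_121."""
    status = SM121_ENCODING_STATUS.get(encoding.lower())
    if status == "gpu":
        return True
    if status == "gpu_llama_only":
        return architecture.lower().startswith("llama")
    # SM100-gated, CPU-only, and unknown encodings are all treated as unavailable.
    return False
```

For example, `runs_on_sm121_gpu("gptq", "gemma")` is False under this table, which matches the situation described here where only bf16 was usable for Gemma.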

For Gemma4-31B-it, I got the bf16 model working with max serve, but I am only seeing <2 tok/s versus 100+ tok/s with llama.cpp (with MTP, for GGUF-quantized versions) on the same GB10. Something is clearly off; I must be missing the right settings for max serve or Mojo compile/autotune.
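To put the bf16 figure above in context, here is a rough back-of-envelope sketch of a decode-speed ceiling. It assumes single-stream decoding is memory-bandwidth bound, that the model is dense (every weight is read once per token), and that GB10 memory bandwidth is about 273 GB/s. All three are assumptions rather than measurements from this thread, and the function names are hypothetical.

```python
def weight_bytes(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return params_billion * 1e9 * bits_per_weight / 8


def decode_tok_per_s_ceiling(
    params_billion: float, bits_per_weight: float, bandwidth_gb_s: float
) -> float:
    """Upper bound on tokens/s if each token requires reading all weights once.

    Ignores KV-cache traffic, activations, and kernel overheads, so real
    throughput will be lower than this bound.
    """
    return bandwidth_gb_s * 1e9 / weight_bytes(params_billion, bits_per_weight)


# Assumed figure, not taken from this thread.
ASSUMED_BANDWIDTH_GB_S = 273.0

bf16_ceiling = decode_tok_per_s_ceiling(31, 16, ASSUMED_BANDWIDTH_GB_S)
# 4.5 bits per weight is an assumed average for a 4-bit GGUF-style encoding.
q4_ceiling = decode_tok_per_s_ceiling(31, 4.5, ASSUMED_BANDWIDTH_GB_S)

print(f"bf16 ceiling:  {bf16_ceiling:.1f} tok/s")
print(f"4-bit ceiling: {q4_ceiling:.1f} tok/s")
```

Under these assumptions the bf16 ceiling comes out at roughly 4.4 tok/s, so part of the gap would come from the encoding itself rather than from kernel quality. Speculative techniques such as MTP can exceed a per-token ceiling like this one because several tokens are accepted per pass over the weights.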

It would be great to have an optimization guide for running large models locally: which knobs are available for tuning, which quantized models work on GPUs, and some published reference numbers for key models.

BradLarson · June 22, 2026, 3:55pm

Unfortunately, the Blackwell series introduced a pretty strong divergence between the sm_100 family of GPUs (B200 / B300) and sm_12x (RTX 50XX series, GB10 on the DGX Spark). We’ve spent the bulk of our time optimizing the former for our enterprise workloads, and have only recently started enabling the basics for the latter. Many of the optimizations described in our Blackwell blog post series don’t apply to the sm_12x series, because those GPUs lack the hardware for them. The sm_12x GPUs instead require dedicated, different kernels.
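A small sketch of how a script might tell the two families apart from a CUDA compute capability string. The mapping (major version 10 for the sm_100 family, major version 12 for sm_12x) follows the split described in this reply; the function name `blackwell_family` is hypothetical.

```python
def blackwell_family(compute_cap: str) -> str:
    """Classify a compute capability string such as "10.0" or "12.1".

    The labels mirror the split described in this thread; anything else is
    reported as "other" rather than guessed at.
    """
    major_text, _minor_text = compute_cap.strip().split(".")
    major = int(major_text)
    if major == 10:
        return "sm_100 family"  # datacenter Blackwell, e.g. B200 / B300
    if major == 12:
        return "sm_12x"         # RTX 50XX series, GB10 on the DGX Spark
    return "other"


print(blackwell_family("12.1"))
```

On a machine with a recent NVIDIA driver, the input string can usually be obtained with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`. A GB10 should then be classified as sm_12x, which is the family that needs its own kernels.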

That does mean that you’ll currently run into issues with specific kernels on that platform, where we haven’t yet worked through the proper platform checks or built hardware-specific kernels. I’ve started to aggregate reported issues for consumer Blackwell in a GitHub epic here: [Feature Request] [Epic] Extend support for NVIDIA sm_120 / sm_121 consumer Blackwell GPUs · Issue #6570 · modular/modular · GitHub. The goal is a central public location for tracking progress and highlighting reported incompatibilities, but we do need a few more issues filed there to identify known shortcomings. There are also a number of community-contributed fixes we’re a little behind on reviewing, and we will try to get some of those landed to help expand compatibility.

I do hear you on the desire for published numbers and settings to use when comparing against common local LLM inference systems like llama.cpp. We have a little more work to do to expand compatibility on consumer GPUs and tune their performance, but when we’re ready I very much would like to show how MAX compares to llama.cpp on locally-run models. Again, our emphasis to date has been on large-scale deployments, driven by customer demand, which is why our published benchmarks and guides have been largely in that direction.

gchauhan · June 22, 2026, 4:46pm

Thanks @BradLarson for the follow-up. Great to hear your team is looking to prioritize support for consumer GPUs.

The biggest challenge right now for GPUs with unified memory is figuring out the right sequence of steps to get things working, given the complications of unified memory management. It took me a while to figure out how to get past the compiler OOM on DGX Spark GB10 / unified memory machines, even for 0.5B-size models.

It would be very helpful to put out a step-by-step guide for getting a medium-sized model (20-40B) working with optimal settings. For example: start by compiling the model; if it fails, turn on these logs or try eager mode; verify the base kernels work on your GPU (for reference, for a Llama 8B model, matmuls should take 20% of the time); check that FA is actually working on the GPU; check that the quantization is supported; check the model architecture support and, if it is missing, do this…

Looking forward to seeing Modular Mojo deliver kernel optimizations that work across different GPUs.
