Optimized Kernels for Blackwell -- do they work on GB10

gchauhan · June 22, 2026, 4:46pm

Thanks @BradLarson for the follow-up. Great to hear your team is looking to prioritize support for consumer GPUs.

The biggest challenge right now for GPUs with unified memory is figuring out the right sequence of steps to get things working, given how memory is managed on these machines. It took me a while to figure out how to get past the compiler OOM on DGX Spark GB10 / unified memory machines, even for 0.5B-size models.

It would be very helpful to publish a step-by-step guide for getting a medium-sized model (20-40B) working with optimal settings. For example: start by compiling the model; if that fails, turn on these logs or try eager mode; verify that the base kernels work on your GPU (for reference, for a Llama 8B model, matmuls should take 20% of the time); check that FA is actually working on the GPU; check that the quantization is supported; check that the model architecture is supported and, if it is missing, do this…
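To make the "matmuls should take 20% of time" kind of check concrete, here is a minimal Python sketch of how such a sanity check could look once you have per-kernel timings from whatever profiler you use. The kernel names, the timings, and the helper `matmul_share` are all hypothetical and exist only for illustration; nothing here is a MAX or Mojo API, and the 20% figure is just the reference number mentioned above.

```python
def matmul_share(kernel_times_ms: dict[str, float], marker: str = "matmul") -> float:
    """Return the fraction of total kernel time spent in kernels whose
    name contains `marker` (case-insensitive)."""
    total = sum(kernel_times_ms.values())
    if total <= 0:
        raise ValueError("no kernel time recorded")
    matmul = sum(
        t for name, t in kernel_times_ms.items() if marker in name.lower()
    )
    return matmul / total


# Hypothetical per-kernel timings in milliseconds (made-up numbers).
profile = {
    "matmul_qkv": 12.0,
    "matmul_mlp": 8.0,
    "attention": 30.0,
    "rmsnorm": 10.0,
    "other": 40.0,
}

share = matmul_share(profile)
print(f"matmul share: {share:.0%}")

# Compare against the reference figure from the post (an assumption,
# not a measured baseline) and flag large deviations for a closer look.
REFERENCE_SHARE = 0.20
if abs(share - REFERENCE_SHARE) > 0.10:
    print("matmul share is far from the reference; check kernel selection")
```

A guide could pair each step (compile, eager mode, kernel check, quantization check) with a small pass/fail check like this, so it is obvious which stage to debug first.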

Looking forward to seeing how Modular's Mojo helps solve kernel optimization in a way that works across different GPUs.
