Optimized Kernels for Blackwell -- do they work on GB10?

Thanks @BradLarson for the follow-up. Great to hear your team is looking to prioritize support for consumer GPUs.

The biggest challenge right now for GPUs with unified memory is figuring out the right sequence of steps to get things working, given the difficulties that come with unified memory management. It took me a while to figure out how to get past the compiler OOM on DGX Spark GB10 / unified memory machines, even for 0.5B-size models.

It would be very helpful to have a step-by-step guide for getting a medium-sized model (20-40B) working with optimal settings. For example: start by compiling the model; if that fails, turn on these logs or try eager mode; verify the base kernels are working on your GPU (for reference, for a Llama 8B model matmuls should take 20% of the time); check that FlashAttention (FA) is actually working on the GPU; check that the quantization is supported; check that the model architecture is supported, and if it's missing, do this…
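
To make the idea concrete, here is a rough Python sketch of the kind of ordered checklist I have in mind. Everything in it is hypothetical: `run_checklist`, `matmul_share`, the step names, the hints, the 0.5 threshold, and the timing numbers are placeholders I made up for illustration, and none of it calls a real Modular or MAX API. A real guide would swap each placeholder check for an actual command or measurement.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (step name, check function, what to try if the check fails)
Step = Tuple[str, Callable[[], bool], str]


def run_checklist(steps: List[Step]) -> Optional[Tuple[str, str]]:
    """Run checks in order; return (step, hint) for the first failure, else None."""
    for name, check, hint in steps:
        try:
            ok = bool(check())
        except Exception as exc:  # a crashing check counts as a failure
            ok = False
            hint = f"{hint} (check raised {type(exc).__name__}: {exc})"
        if not ok:
            return name, hint
    return None


def matmul_share(kernel_ms: Dict[str, float]) -> float:
    """Fraction of total kernel time spent in kernels whose name contains 'matmul'."""
    total = sum(kernel_ms.values())
    if total <= 0:
        return 0.0
    matmul = sum(ms for name, ms in kernel_ms.items() if "matmul" in name.lower())
    return matmul / total


# Made-up timings purely for illustration, not measurements from any GPU.
example_profile = {"matmul_qkv": 12.0, "matmul_mlp": 8.0, "attention": 50.0, "other": 30.0}

# Placeholder checks: each lambda stands in for a real command or measurement.
steps: List[Step] = [
    ("compile the model", lambda: True, "turn on compiler logs, or try eager mode"),
    ("base kernels run on this GPU", lambda: True, "run a small standalone kernel test"),
    # 0.5 is an arbitrary placeholder threshold, not a recommended value.
    ("matmul share looks sane", lambda: matmul_share(example_profile) <= 0.5,
     "profile the matmul kernels"),
    ("FA is actually used", lambda: False, "check which attention kernel was selected"),
]

failure = run_checklist(steps)
if failure is not None:
    print(f"Stopped at '{failure[0]}': {failure[1]}")
```

The point of stopping at the first failing step is that each later check only makes sense once the earlier ones pass, so the guide can hand the user one concrete next action instead of a long list of things to try.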

Looking forward to seeing how Modular Mojo helps solve kernel optimization issues across different GPUs.