I’m a software engineer with a web dev and Python background, but I’ve always had an itch to understand systems at a deeper level. Consequently, Mojo and Modular’s goal of making GPU programming more accessible really grabbed my attention.
My first experience came when I wanted to run inference on a proper mxfp4 version of GPT-OSS using MAX (rather than the bf16 version). I eventually got the 120b version working on a single H100 with a scalar-based kernel, but that was not a good beginner's project, and it was my only prior experience with Mojo and MAX. While working on it, I came across an Unsloth recruitment notebook. It contained a challenge to implement NF4 in Triton under tight restrictions, with the goal of beating their implementation's speed.
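For context, NF4 packs two 4-bit codebook indices into each byte and attaches one scale to every block of values; dequantization is just the reverse lookup. Here's a minimal NumPy sketch of that layout — the codebook values below are placeholders (the real NF4 table holds normal-distribution quantiles from the QLoRA paper), and the high-nibble-first packing order is an assumption for illustration:

```python
import numpy as np

# Placeholder 16-entry codebook. The real NF4 table contains
# normal-distribution quantiles (see the QLoRA paper).
CODEBOOK = np.linspace(-1.0, 1.0, 16, dtype=np.float32)

def dequant_nf4(packed: np.ndarray, scales: np.ndarray, block: int = 64) -> np.ndarray:
    """Unpack two 4-bit indices per byte, look them up in the codebook,
    and multiply each run of `block` values by its per-block scale.
    Assumes high-nibble-first packing order."""
    idx = np.empty(packed.size * 2, dtype=np.uint8)
    idx[0::2] = packed >> 4    # high nibble -> even positions
    idx[1::2] = packed & 0x0F  # low nibble  -> odd positions
    values = CODEBOOK[idx]
    # One scale per contiguous block of `block` dequantized values.
    return (values.reshape(-1, block) * scales[:, None]).reshape(-1)

# 64 packed bytes -> 128 values -> 2 blocks of 64, one scale each.
packed = np.random.randint(0, 256, size=64, dtype=np.uint8)
scales = np.array([0.5, 2.0], dtype=np.float32)
out = dequant_nf4(packed, scales)
print(out.shape)  # (128,)
```

A GPU kernel does the same work in parallel; the design space is mostly about how threads split the unpacking and how the results are stored (hence the "packed store" variants below).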
I decided to treat this as a personal challenge: could someone without prior GPU-kernel experience pick up Mojo and quickly build something competitive against highly optimized, hand-tuned kernels?
After roughly seven hours of iteration, the results were surprising. I designed two kernel variants: a 2D tiled kernel (1024 threads + packed store) and a warp-per-NF4-block kernel (shared scale + packed store).
The time to beat was 5.33 seconds (the Triton kernel on a T4). On the T4, my 2D tiled kernel clocked in at 4.27 seconds. Counter-intuitively, the warp-per-block kernel maxed out at 4.50 seconds despite, in theory, requiring less computation. The optimized Unsloth kernel still won out at 3.9 seconds.
I was reasonably satisfied with my results and ready to call it a day, but I decided to test on an L4 GPU. This is where things got interesting. On the L4, the 2D tiled kernel finished in 2.63 seconds and the warp-per-block variant hit 2.48 seconds, reversing the winner. Meanwhile, Unsloth's highly optimized version (two kernels written in CUDA and C++) clocked in at just over 3 seconds.
I was shocked to find that, without writing a single line of C++ or CUDA, a single Mojo kernel came in roughly 20% faster than Unsloth's dual-kernel optimized version. In my opinion, this experience validates the claim that Mojo and Modular lower the barrier to entry for GPU programming.
I wanted to share this here in hopes that others will read it and take the plunge. While AI assisted me with brainstorming and code explanations, it was the combination of Mojo's familiarity, Modular's tutorials, and my general coding experience that made this possible.
Here’s the notebook for anyone who’s interested; any feedback is welcome.