I’m a software engineer with a web dev and Python background, but I’ve always had an itch to understand systems at a deeper level. Consequently, Mojo and Modular’s goal of making GPU programming more accessible really grabbed my attention.
My first experience came when I wanted to run inference on a proper mxfp4 version of GPT-OSS using MAX (rather than the bf16 version). I eventually got the 120b version working on a single H100 with a scalar-based kernel, but that was not a good beginner's project, and it was my only prior experience with Mojo and MAX. While working on it, I came across an Unsloth recruitment notebook. It contained a challenge to implement NF4 in Triton under tight restrictions, with the goal of beating their implementation's speed.
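For context, NF4 packs two 4-bit codebook indices into each byte and attaches one scale to every block of values; dequantization is just the reverse lookup. Here's a minimal NumPy sketch of that layout — the codebook values below are placeholders (the real NF4 table holds normal-distribution quantiles from the QLoRA paper), and the high-nibble-first packing order is an assumption for illustration:

```python
import numpy as np

# Placeholder 16-entry codebook. The real NF4 table contains
# normal-distribution quantiles (see the QLoRA paper).
CODEBOOK = np.linspace(-1.0, 1.0, 16, dtype=np.float32)

def dequant_nf4(packed: np.ndarray, scales: np.ndarray, block: int = 64) -> np.ndarray:
    """Unpack two 4-bit indices per byte, look them up in the codebook,
    and multiply each run of `block` values by its per-block scale.
    Assumes high-nibble-first packing order."""
    idx = np.empty(packed.size * 2, dtype=np.uint8)
    idx[0::2] = packed >> 4    # high nibble -> even positions
    idx[1::2] = packed & 0x0F  # low nibble  -> odd positions
    values = CODEBOOK[idx]
    # One scale per contiguous block of `block` dequantized values.
    return (values.reshape(-1, block) * scales[:, None]).reshape(-1)

# 64 packed bytes -> 128 values -> 2 blocks of 64, one scale each.
packed = np.random.randint(0, 256, size=64, dtype=np.uint8)
scales = np.array([0.5, 2.0], dtype=np.float32)
out = dequant_nf4(packed, scales)
print(out.shape)  # (128,)
```

A GPU kernel does the same work in parallel; the design space is mostly about how threads split the unpacking and how the results are stored (hence the "packed store" variants below).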
I decided to treat this as a personal challenge: could someone without prior GPU-kernel experience pick up Mojo and quickly build something competitive against highly optimized, hand-tuned kernels?
After roughly seven hours of iteration, the results were surprising. I designed two kernel variants: a 2D tiled kernel (1024 threads + packed store) and a warp-per-NF4-block kernel (shared scale + packed store).
The time to beat was 5.33 seconds (the Triton kernel on a T4). On the T4, my 2D tiled kernel clocked in at 4.27 seconds. Counter-intuitively, the warp-per-block kernel maxed out at 4.50 seconds despite, in theory, requiring less computation. The optimized Unsloth kernel still won out at 3.9 seconds.
I was reasonably satisfied with my results and ready to call it a day, but I decided to test on an L4 GPU. This is where things got interesting. On the L4, the 2D tiled kernel finished in 2.63 seconds and the warp-per-block variant hit 2.48 seconds, reversing the winner. Meanwhile, Unsloth's highly optimized version (two kernels written in CUDA and C++) clocked in at just over 3 seconds.
I was shocked to find that, without writing a single line of C++ or CUDA, a single Mojo kernel came in roughly 20% faster than Unsloth's dual-kernel optimized version. In my opinion, this experience validates the claim that Mojo and Modular lower the barrier to entry for GPU programming.
I wanted to share this here in hopes that others will read it and take the plunge. While AI assisted me with brainstorming and code explanations, it was the combination of Mojo's familiarity, Modular's tutorials, and my general coding experience that made this possible.
Here’s the notebook for anyone who’s interested; any feedback is welcome.