I’ve been building Tenmo to explore what a fully Mojo-native ML stack looks like. It implements tensors, autograd, broadcasting, slicing, layers, and training pipelines with explicit memory control and SIMD.
On MNIST, it trains ~1.3× faster than PyTorch CPU on the same machine (no BLAS, no Python overhead).
Thanks so much, Chris! That means a lot coming from you.
You’re absolutely right — there’s enormous room for improvement.
I’m currently working through matmul optimization and realizing this is deeper than I initially thought.
Current state:
Small matrices (≤128): 100+ GFLOPS (competitive)
Large matrices (≥256): 12-24 GFLOPS (PyTorch/MKL gets 200+)
Using single-level tiling + SIMD + parallelization
This is new territory for me; I come from application development rather than systems optimization. Multi-level cache blocking seems like the next step, but I'd love any high-level pointers on whether I'm headed in the right direction or missing something fundamental.
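For anyone following along, here is a rough sketch of what I mean by multi-level cache blocking. This is not Tenmo's kernel and not Mojo; it is a plain-Python illustration of the loop structure (hypothetical `block`/`tile` sizes), where the outer blocks aim to keep panels of B resident in L2 and the inner tile keeps a small strip of C hot in L1. In Mojo the innermost j-loop would additionally be SIMD-vectorized.

```python
def blocked_matmul(a, b, n, block=64, tile=8):
    """C = A @ B for n x n matrices stored as flat row-major lists.

    Two-level blocking sketch: outer loops walk block x block panels,
    the inner `tile` loop groups a few rows of A so their partial sums
    into C stay in the fastest cache level.
    """
    c = [0.0] * (n * n)
    for ii in range(0, n, block):          # row panel of A / C
        for kk in range(0, n, block):      # shared-dimension panel
            for jj in range(0, n, block):  # column panel of B / C
                for i in range(ii, min(ii + block, n), tile):
                    for k in range(kk, min(kk + block, n)):
                        for it in range(i, min(i + tile, n)):
                            aik = a[it * n + k]
                            row_c = it * n
                            row_b = k * n
                            # In Mojo this loop would be a SIMD fma.
                            for j in range(jj, min(jj + block, n)):
                                c[row_c + j] += aik * b[row_b + j]
    return c
```

The block sizes here are placeholders; in practice they would be tuned to the target machine's L1/L2 sizes.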
I’ve been avoiding compile-time rank specialization (à la LayoutTensor) to keep the API flexible, but I know that’s likely necessary for GPU work down the line.
That said, any direction or pointers you think would be most impactful, I’d love to explore and implement.
Perfectly stated, and you did an awesome job. Finally, a full machine learning library developed purely in Mojo.
Tenmo is great for machine learning because it means we no longer have to rely on importing Python modules to build AI models.
Ratulb has built the entire stack—from how data is stored (tensors) to how the math is calculated—directly in Mojo. Key technical features include:
Autograd: A system that automatically calculates gradients (essential for training AI).
SIMD (Single Instruction, Multiple Data): This allows the CPU to perform the same operation on multiple data points simultaneously, which is why it’s so fast.
Explicit Memory Control: Unlike Python, which manages memory for you (often slowly), Tenmo gives the developer control, similar to Rust.
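To make the autograd bullet concrete: this is a minimal scalar sketch in plain Python (my own illustration, not Tenmo's actual API) of what "automatically calculates gradients" means. Each value records how it was produced, so `backward()` can apply the chain rule through the recorded graph.

```python
class Value:
    """A scalar that remembers its parents so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():  # d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():  # product rule
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then propagate gradients.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()
```

For example, with `z = x * y + x` where `x = 2` and `y = 3`, calling `z.backward()` yields `x.grad == 4.0` (since dz/dx = y + 1) and `y.grad == 2.0`. Tenmo does the same thing over whole tensors rather than scalars.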
Thanks for the kind words — it’s motivating to hear Tenmo is resonating!
Tenmo is just coming out of the proof-of-concept phase. So far, the focus has been on validating the end-to-end flow on CPU while keeping things lightweight, explicit, and performant. The results suggest the approach is sound. Going forward, it’s about optimization and expanding capabilities (reworking things where necessary).
The GPU path is next, but it introduces an interesting design tension. Mojo’s LayoutTensor is incredibly powerful for compile-time optimization — exactly what production kernels need for peak performance — while Tenmo aims to remain dynamic and flexible, PyTorch-style.
If my hunch is right, the work ahead involves carefully blending a dynamic Shape with Mojo’s compile-time Layout. Layouts are extremely powerful, but their compile-time nature introduces friction when designing runtime-friendly APIs. Still exploring whether traits or other patterns can help bridge that gap.
All of this is possible because of Mojo’s vision to democratize high-performance computing. The fact that an average developer can even attempt something like this is a strong signal that the vision is working.
For now, Tenmo is about learning, experimentation, and exploration — and I’m excited to keep pushing it forward.
Awesome work on Tenmo! I’m currently working on an AI-driven intrusion prevention system for a capstone project, and we’ve been looking at Mojo specifically for this kind of ‘pure’ performance.
The tension between dynamic shapes and compile-time LayoutTensor is a fascinating hurdle. Have you considered using Mojo Traits to create a generic interface that abstracts the Layout? Would love to follow the progress as you move toward the GPU path!