I’ve been building Tenmo to explore what a fully Mojo-native ML stack looks like. It implements tensors, autograd, broadcasting, slicing, layers, and training pipelines with explicit memory control and SIMD.
On MNIST, it trains ~1.3× faster than PyTorch CPU on the same machine (no BLAS, no Python overhead).
Thanks so much, Chris! That means a lot coming from you.
You’re absolutely right — there’s enormous room for improvement.
I’m currently working through matmul optimization and realizing this is deeper than I initially thought.
Current state:
Small matrices (≤128): 100+ GFLOPS (competitive)
Large matrices (≥256): 12-24 GFLOPS (PyTorch/MKL gets 200+)
Using single-level tiling + SIMD + parallelization
This is new territory for me — I come from application dev rather than
systems optimization. Multi-level cache blocking seems like the next step,
but I’d love any high-level pointers on whether I’m headed in the right
direction or missing something fundamental.
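To make the "multi-level cache blocking" idea concrete, here is a minimal sketch in plain Python (chosen only for readability; this is not Tenmo code). It assumes two tile sizes, `outer` and `inner`, both hypothetical placeholders that would need tuning per machine. A production kernel would additionally pack tiles into contiguous buffers and run a SIMD micro-kernel in the innermost loops.

```python
def matmul_naive(a, b):
    """Reference triple loop: C[i][j] += A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                c[i][j] += a_ip * b[p][j]
    return c


def matmul_blocked(a, b, outer=64, inner=8):
    """Two-level tiling: `outer` tiles target a larger cache level,
    `inner` tiles a smaller one. Both sizes are illustrative placeholders."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    # Level 1: walk the matrices in large tiles.
    for i0 in range(0, m, outer):
        i0_end = min(i0 + outer, m)
        for p0 in range(0, k, outer):
            p0_end = min(p0 + outer, k)
            for j0 in range(0, n, outer):
                j0_end = min(j0 + outer, n)
                # Level 2: split each large tile into small tiles.
                for i1 in range(i0, i0_end, inner):
                    for p1 in range(p0, p0_end, inner):
                        for j1 in range(j0, j0_end, inner):
                            # Innermost work on one small tile.
                            for i in range(i1, min(i1 + inner, i0_end)):
                                row_c = c[i]
                                for p in range(p1, min(p1 + inner, p0_end)):
                                    a_ip = a[i][p]
                                    row_b = b[p]
                                    for j in range(j1, min(j1 + inner, j0_end)):
                                        row_c[j] += a_ip * row_b[j]
    return c
```

The loop structure is the point here, not the speed: in Python the blocked version will not be faster, but the same nesting in a compiled language keeps each small tile resident in cache while it is reused.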
I’ve been avoiding compile-time rank specialization (à la LayoutTensor) to keep the API flexible, but I know that’s likely necessary for GPU work down the line.
That said, I’d love to explore and implement whatever direction or pointers you think would be most impactful.
Perfectly stated, you did an awesome job. Finally, a full machine learning library developed purely in Mojo.
Tenmo is quite awesome for machine learning: we no longer have to rely on importing Python modules to build AI models.
Ratulb has built the entire stack—from how data is stored (tensors) to how the math is calculated—directly in Mojo. Key technical features include:
Autograd: A system that automatically calculates gradients (essential for training AI).
SIMD (Single Instruction, Multiple Data): This allows the CPU to perform the same operation on multiple data points simultaneously, which is a big part of why it’s fast.
Explicit Memory Control: Unlike Python, which manages memory for you (often slowly), Tenmo gives the developer control, similar to Rust.
Thanks for the kind words — it’s motivating to hear Tenmo is resonating!
Tenmo is just coming out of the proof-of-concept phase. So far, the focus has been on validating the end-to-end flow on CPU while keeping things lightweight, explicit, and performant. The results suggest the approach is sound. Going forward, it’s about optimization and expanding capabilities (reworking things where necessary).
The GPU path is next, but it introduces an interesting design tension. Mojo’s LayoutTensor is incredibly powerful for compile-time optimization — exactly what production kernels need for peak performance — while Tenmo aims to remain dynamic and flexible, PyTorch-style.
If my hunch is right, the work ahead involves carefully blending a dynamic Shape with Mojo’s compile-time Layout. Layouts are extremely powerful, but their compile-time nature introduces friction when designing runtime-friendly APIs. Still exploring whether traits or other patterns can help bridge that gap.
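One way to picture that blend, sketched in Python purely for illustration (not Tenmo or LayoutTensor code): keep the shape dynamic, derive strides at runtime, and dispatch to a fixed-rank fast path when one exists. The names `FAST_PATHS`, `offset_rank2`, and `offset_dynamic` are hypothetical; in Mojo the fast path would be a compile-time-specialized kernel rather than a Python function.

```python
def row_major_strides(shape):
    """Strides (in elements) for a contiguous row-major layout."""
    strides = [1] * len(shape)
    for axis in range(len(shape) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * shape[axis + 1]
    return tuple(strides)


def offset_dynamic(index, strides):
    """Generic path: works for any rank, loops at runtime."""
    return sum(i * s for i, s in zip(index, strides))


def offset_rank2(index, strides):
    """Fixed-rank path: stands in for a rank-specialized kernel."""
    (i, j), (s0, s1) = index, strides
    return i * s0 + j * s1


# Hypothetical registry: rank -> specialized implementation.
FAST_PATHS = {2: offset_rank2}


def offset(index, shape):
    """Pick the specialized path if the runtime rank has one."""
    strides = row_major_strides(shape)
    kernel = FAST_PATHS.get(len(shape), offset_dynamic)
    return kernel(index, strides)
```

The user-facing API stays dynamic; only the dispatch step knows which ranks have a specialized implementation.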
All of this is possible because of Mojo’s vision to democratize high-performance computing. The fact that an average developer can even attempt something like this is a strong signal that the vision is working.
For now, Tenmo is about learning, experimentation, and exploration — and I’m excited to keep pushing it forward.
"Awesome work on Tenmo! I’m currently working on an AI-driven intrusion prevention system for a capstone project and we’ve been looking at Mojo specifically for this kind of ‘pure’ performance.
The tension between dynamic shapes and compile-time LayoutTensor is a fascinating hurdle. Have you considered using Mojo Traits to create a generic interface that abstracts the Layout? Would love to follow the progress as you move toward the GPU path!"
The GPU path that was “next”: the initial cut is in. But getting there required rethinking some fundamentals first.
Autograd has been redesigned from the ground up. The backward system moved from stateful handler instances to pure static methods dispatched through a jump table keyed by an integer tag on a type-erased BackwardFnArg, with no variant extraction and no redundant copies. Ancestry no longer stores full Tensor copies; each ancestor is now a lightweight handle carrying only what backward actually needs: an id, a gradbox pointer, a Layout, and a ref-counted Storage. The recursive deep-copy explosion on every add_ancestry call is gone.
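For readers who want the shape of that design, here is a toy Python sketch with scalars instead of tensors. It is not Tenmo's implementation: the `Tape` class, the tag constants, and the `data` field (standing in for Layout plus Storage) are assumptions made for illustration, and the real BackwardFnArg layout is not shown in this thread.

```python
from dataclasses import dataclass, field
from typing import Tuple

TAG_ADD, TAG_MUL = 0, 1  # hypothetical integer tags


@dataclass
class GradBox:
    value: float = 0.0


@dataclass
class Ancestor:
    """Lightweight handle: only what backward needs, no full copy."""
    id: int
    gradbox: GradBox  # shared with the original node, not copied
    data: float       # stand-in for Layout + ref-counted Storage


class Backward:
    """Stateless backward functions, one per op."""

    @staticmethod
    def add(upstream, ancestors):
        for anc in ancestors:
            anc.gradbox.value += upstream

    @staticmethod
    def mul(upstream, ancestors):
        lhs, rhs = ancestors
        lhs.gradbox.value += upstream * rhs.data
        rhs.gradbox.value += upstream * lhs.data


JUMP_TABLE = (Backward.add, Backward.mul)  # indexed by tag


@dataclass
class Node:
    id: int
    data: float
    gradbox: GradBox = field(default_factory=GradBox)
    tag: int = -1  # -1 marks a leaf with no backward function
    ancestors: Tuple[Ancestor, ...] = ()


class Tape:
    """Records nodes in creation order, which is a topological order."""

    def __init__(self):
        self.nodes = []

    def leaf(self, data):
        return self._push(Node(len(self.nodes), float(data)))

    def add(self, a, b):
        return self._op(TAG_ADD, a.data + b.data, a, b)

    def mul(self, a, b):
        return self._op(TAG_MUL, a.data * b.data, a, b)

    def _op(self, tag, data, *parents):
        handles = tuple(Ancestor(p.id, p.gradbox, p.data) for p in parents)
        return self._push(Node(len(self.nodes), data, tag=tag, ancestors=handles))

    def _push(self, node):
        self.nodes.append(node)
        return node

    def backward(self, root):
        root.gradbox.value = 1.0
        for node in reversed(self.nodes[: root.id + 1]):
            if node.tag >= 0:
                JUMP_TABLE[node.tag](node.gradbox.value, node.ancestors)
```

For example, with `x = tape.leaf(3)` and `y = tape.leaf(4)`, calling `tape.backward(tape.add(tape.mul(x, y), x))` accumulates 5.0 into `x.gradbox.value` and 3.0 into `y.gradbox.value`.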
GPU support is in. Tensor operations, backward passes, and gradient flow all work on GPU. Gradient flow respects device boundaries — a new stop_grad flag on to_gpu() and to_cpu() lets you control exactly where gradients stop, which makes GPU-native training loops clean and efficient.
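A toy model of that boundary rule, in Python with a string device label and no real GPU involved. The method names `to_gpu`, `to_cpu`, and the `stop_grad` flag come from the post above; everything else (the `source` link, the scalar `grad`, the default `stop_grad=False`) is an assumption for illustration and may not match Tenmo's actual semantics.

```python
class Tensor:
    def __init__(self, value, device="cpu", source=None):
        self.value = value
        self.device = device
        self.source = source  # tensor this one was transferred from, if linked
        self.grad = 0.0

    def to_gpu(self, stop_grad=False):
        return self._move("gpu", stop_grad)

    def to_cpu(self, stop_grad=False):
        return self._move("cpu", stop_grad)

    def _move(self, device, stop_grad):
        # With stop_grad=True the copy keeps no link back to its source,
        # so gradients cannot cross the device boundary.
        return Tensor(self.value, device, None if stop_grad else self)

    def backward(self, upstream=1.0):
        self.grad += upstream
        if self.source is not None:
            self.source.backward(upstream)
```

In this model, `w.to_gpu().backward()` also updates `w.grad` on the CPU side, while `w.to_gpu(stop_grad=True).backward()` leaves `w.grad` untouched.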
For anyone curious about how the forward and backward pass actually work under the hood — the ancestry system, the type-erased BackwardFnArg, the gradbox refcount, and the full CPU<->GPU grad flow rules — I’ve written it all up here:
The foundation feels solid. GPU throughput optimization (kernel tuning, data transfer overhead) is the active front now. Still a lot of road ahead (picking up all the Mojo GPU goodies…), and would love any feedback!
One heads-up: compile times are long. Worth knowing before you dive in.
Awesome build on the autograd system, and I accept your warning. Hey Ratulb, do you have any benchmarks showing whether I could really hit an OOM crash on my Fedora Linux machine?
With Tenmo’s parametric types (Tensor[dtype], NDBuffer[dtype]) instantiated across multiple dtypes, plus all the op specializations, compilation peaks around 7GB RAM and stabilizes there.
Practical guidance:
8GB minimum RAM — you’ll be tight, swap will kick in
16GB recommended — comfortable headroom
Runtime memory is fine — tensor operations are lean, no Python overhead, no hidden copies beyond what the autograd graph needs
The OOM risk is real on 8GB machines during compilation of large test suites — not during training itself. If you’re hitting OOM, try:
Running individual test files rather than the full suite
Closing other applications during compilation
The compiled binary runs fine once built
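A quick pre-flight check along these lines can be sketched in Python. It parses `/proc/meminfo`-style text for `MemAvailable` and compares it against the ~7GB compile peak mentioned above; the `headroom_gib` margin and the function names are hypothetical, and GB versus GiB is treated loosely.

```python
import re


def mem_available_gib(meminfo_text):
    """Return MemAvailable from /proc/meminfo-style text, in GiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found")
    return int(match.group(1)) / (1024 * 1024)


def can_build(meminfo_text, peak_gib=7.0, headroom_gib=1.0):
    """True if available memory covers the compile peak plus a margin."""
    return mem_available_gib(meminfo_text) >= peak_gib + headroom_gib
```

On Linux, pass it the contents of `/proc/meminfo`, for example `can_build(open("/proc/meminfo").read())`, before kicking off the full test suite.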
Happy to share more specific numbers as I profile further; rigorous memory benchmarks are on the roadmap. To exaggerate a little: “you will age while compilation runs” (pun intended!)
Oh, my Linux setup has a total of 96GB of memory, 12 times higher than your estimate. So this dude’s got no worries; it’s just that I sometimes run applications that use up to 68GB on their own, and adding background processes and other applications on top of that might lead to an OOM.
However, I also have 32GB of external NPU memory, 16GB of GPU memory, and 98MB of SRAM.
How did you handle explicit memory control in Tenmo?