Tenmo — A lean tensor + NN library in pure Mojo

ratulb · January 17, 2026, 9:04am

I’ve been building Tenmo to explore what a fully Mojo-native ML stack looks like. It implements tensors, autograd, broadcasting, slicing, layers, and training pipelines with explicit memory control and SIMD.

On MNIST, it trains ~1.3× faster than PyTorch CPU on the same machine (no BLAS, no Python overhead).

I’d love feedback on everything!

Repo: ratulb/tenmo on GitHub (“A fast, lean and from-scratch Tensor library built in Mojo🔥”)

clattner · January 17, 2026, 11:36pm

Very nice! There are so many ways to improve the current tech stack!

ratulb · January 18, 2026, 3:05am

Thanks so much, Chris! That means a lot coming from you.
You’re absolutely right — there’s enormous room for improvement.

I’m currently working through matmul optimization and realizing this is deeper than I initially thought.

Current state:

Small matrices (≤128): 100+ GFLOPS (competitive)
Large matrices (≥256): 12-24 GFLOPS (PyTorch/MKL gets 200+)
Using single-level tiling + SIMD + parallelization

This is new territory for me — I come from application dev rather than
systems optimization. Multi-level cache blocking seems like the next step,
but I’d love any high-level pointers on whether I’m headed in the right
direction or missing something fundamental.
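For readers following along, here is a minimal sketch of what "multi-level cache blocking" means, written in plain Python for readability rather than speed. It is not Tenmo's kernel: the function names, the row-major flat-list layout, and the block sizes `outer` and `inner` are illustrative assumptions. The `gflops` helper uses the usual 2·M·N·K operation count for a matrix multiply.

```python
def matmul_naive(a, b, m, n, k):
    """Reference loop: C (m x n) = A (m x k) @ B (k x n), row-major flat lists."""
    c = [0.0] * (m * n)
    for i in range(m):
        for p in range(k):
            a_ip = a[i * k + p]
            for j in range(n):
                c[i * n + j] += a_ip * b[p * n + j]
    return c


def _inner_tiles(a, b, c, n, k, i0, i_end, p0, p_end, j0, j_end, inner):
    """Second tiling level: walk one outer tile in small inner tiles."""
    for i1 in range(i0, i_end, inner):
        for p1 in range(p0, p_end, inner):
            for j1 in range(j0, j_end, inner):
                for i in range(i1, min(i1 + inner, i_end)):
                    for p in range(p1, min(p1 + inner, p_end)):
                        a_ip = a[i * k + p]
                        for j in range(j1, min(j1 + inner, j_end)):
                            c[i * n + j] += a_ip * b[p * n + j]


def matmul_blocked(a, b, m, n, k, outer=64, inner=8):
    """Two-level tiled matmul. Block sizes are hypothetical, not tuned values."""
    c = [0.0] * (m * n)
    # First tiling level: large tiles meant to stay resident in a bigger cache.
    for i0 in range(0, m, outer):
        i_end = min(i0 + outer, m)
        for p0 in range(0, k, outer):
            p_end = min(p0 + outer, k)
            for j0 in range(0, n, outer):
                j_end = min(j0 + outer, n)
                _inner_tiles(a, b, c, n, k, i0, i_end, p0, p_end, j0, j_end, inner)
    return c


def gflops(m, n, k, seconds):
    """Throughput of one matmul: 2*m*n*k floating-point ops over elapsed time."""
    return 2 * m * n * k / seconds / 1e9
```

The loop order only changes which elements are touched close together in time; the arithmetic is the same as the naive version, so the two functions return identical results.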

I’ve been avoiding compile-time rank specialization (à la LayoutTensor) to keep the API flexible, but I know that’s likely necessary for GPU work down the line.

That said, any direction or pointers you think would be most impactful, I’d love to explore and implement.

trojan_x · January 19, 2026, 8:01pm

Perfectly stated; you did an awesome job. Finally, a full machine learning library developed purely in Mojo.

Tenmo is quite awesome for machine learning, since we no longer have to rely on importing Python modules to build AI models.

Ratulb has built the entire stack—from how data is stored (tensors) to how the math is calculated—directly in Mojo. Key technical features include:

Autograd: A system that automatically calculates gradients (essential for training AI).

SIMD (Single Instruction, Multiple Data): This allows the CPU to perform the same operation on multiple data points simultaneously, which is why it’s so fast.

Explicit Memory Control: Unlike Python, which manages memory for you (often slowly), Tenmo gives the developer control, similar to Rust.

ratulb · January 22, 2026, 11:06am

Thanks for the kind words — it’s motivating to hear Tenmo is resonating!

Tenmo is just coming out of the proof-of-concept phase. So far, the focus has been on validating the end-to-end flow on CPU while keeping things lightweight, explicit, and performant. The results suggest the approach is sound. Going forward, it’s about optimization and expanding capabilities (reworking things where necessary).

The GPU path is next, but it introduces an interesting design tension. Mojo’s LayoutTensor is incredibly powerful for compile-time optimization — exactly what production kernels need for peak performance — while Tenmo aims to remain dynamic and flexible, PyTorch-style.

If my hunch is right, the work ahead involves carefully blending a dynamic Shape with Mojo’s compile-time Layout. Layouts are extremely powerful, but their compile-time nature introduces friction when designing runtime-friendly APIs. Still exploring whether traits or other patterns can help bridge that gap.

All of this is possible because of Mojo’s vision to democratize high-performance computing. The fact that an average developer can even attempt something like this is a strong signal that the vision is working.

For now, Tenmo is about learning, experimentation, and exploration — and I’m excited to keep pushing it forward.

trojan_x · January 22, 2026, 4:03pm

Awesome work on Tenmo! I’m currently working on an AI-driven intrusion prevention system for a capstone project and we’ve been looking at Mojo specifically for this kind of ‘pure’ performance.

The tension between dynamic shapes and compile-time LayoutTensor is a fascinating hurdle. Have you considered using Mojo Traits to create a generic interface that abstracts the Layout? Would love to follow the progress as you move toward the GPU path!

trojan_x · January 22, 2026, 4:23pm

Well then, if the GPU is next, do you have plans and strategies for building microkernels using Mojo in Tenmo, or for expanding the ecosystem?

ratulb · May 1, 2026, 12:15pm

A lot has happened since that last message.

The GPU path that was “next” - the initial cut is in. But getting there required rethinking some fundamentals first.

Autograd has been redesigned from the ground up. The backward system moved from stateful handler instances to pure static methods, dispatched via a jump table over an integer-tagged, type-erased BackwardFnArg, so there is no variant extraction and no redundant copying. Ancestry no longer stores full Tensor copies; each ancestor is now a lightweight handle carrying only what backward actually needs: an id, a gradbox pointer, a Layout, and a ref-counted Storage. The recursive deep-copy explosion on every add_ancestry call is gone.
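To illustrate the general idea of tag-dispatched backward functions, here is a minimal Python sketch. It is not Tenmo's implementation: `Node`, `BACKWARD_TABLE`, the `ADD`/`MUL` tags, and the scalar-only values are all hypothetical names and simplifications. Each graph node stores only an integer tag, references to its parents, and whatever values its backward rule needs; the backward pass looks the rule up by index.

```python
# Integer tags, one per op (hypothetical names).
ADD, MUL = 0, 1


def backward_add(grad_out, saved):
    # d(x + y)/dx = 1 and d(x + y)/dy = 1
    return (grad_out, grad_out)


def backward_mul(grad_out, saved):
    # d(x * y)/dx = y and d(x * y)/dy = x
    x, y = saved
    return (grad_out * y, grad_out * x)


# Jump table: the tag is the index of the stateless backward function.
BACKWARD_TABLE = (backward_add, backward_mul)


class Node:
    """Lightweight graph handle: an id, a grad slot, a tag, and parent links."""
    _next_id = 0

    def __init__(self, value, tag=None, parents=(), saved=()):
        self.id = Node._next_id
        Node._next_id += 1
        self.value = value
        self.grad = 0.0
        self.tag = tag          # None marks a leaf
        self.parents = parents
        self.saved = saved


def add(a, b):
    return Node(a.value + b.value, ADD, (a, b))


def mul(a, b):
    return Node(a.value * b.value, MUL, (a, b), (a.value, b.value))


def backward(root):
    """DFS to get a topological order, then apply rules from the root down."""
    order, seen = [], set()

    def visit(node):
        if node.id in seen:
            return
        seen.add(node.id)
        for parent in node.parents:
            visit(parent)
        order.append(node)

    visit(root)
    root.grad = 1.0
    for node in reversed(order):
        if node.tag is None:
            continue
        grads = BACKWARD_TABLE[node.tag](node.grad, node.saved)
        for parent, g in zip(node.parents, grads):
            parent.grad += g
```

For example, with `x = Node(3.0)`, `y = Node(4.0)` and `z = add(mul(x, y), x)`, calling `backward(z)` accumulates the gradient of `x*y + x` into `x.grad` and `y.grad`.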

GPU support is in. Tensor operations, backward passes, and gradient flow all work on GPU. Gradient flow respects device boundaries — a new stop_grad flag on to_gpu() and to_cpu() lets you control exactly where gradients stop, which makes GPU-native training loops clean and efficient.

For anyone curious about how the forward and backward pass actually work under the hood — the ancestry system, the type-erased BackwardFnArg, the gradbox refcount, and the full CPU<->GPU grad flow rules — I’ve written it all up here:

README_AUTOGRAD.md (ratulb/tenmo on GitHub, main branch)

The foundation feels solid. GPU throughput optimization (kernel tuning, data transfer overhead) is the active front now. There is still a lot of road ahead (picking up all the Mojo GPU goodies…), and I would love any feedback!

One heads-up: compile times are long. Worth knowing before you dive in.

trojan_x · May 4, 2026, 5:36pm

Awesome build on the autograd system, and I accept your warning. Hey Ratulb, do you have any benchmarks showing whether I could really hit an OOM crash on my Fedora Linux machine?

ratulb · May 5, 2026, 1:26pm

On memory — Here’s what I observe:

With Tenmo’s parametric types (Tensor[dtype], NDBuffer[dtype]) instantiated across multiple dtypes, plus all the op specializations, compilation peaks around 7GB RAM and stabilizes there.

Practical guidance:

8GB minimum RAM — you’ll be tight, swap will kick in
16GB recommended — comfortable headroom
Runtime memory is fine — tensor operations are lean, no Python overhead, no hidden copies beyond what the autograd graph needs

The OOM risk is real on 8GB machines during compilation of large test suites — not during training itself. If you’re hitting OOM, try:

Running individual test files rather than the full suite
Closing other applications during compilation
The compiled binary runs fine once built

Happy to share more specific numbers as I profile further; rigorous memory benchmarks are on the roadmap. I exaggerate, but “you will age while compilation runs” (pun intended!)

trojan_x · May 7, 2026, 1:52pm

Oh, my Linux setup has a total of 96GB of memory, 12 times your minimum estimate, so this dude’s got no worries. It’s just that I sometimes run applications that use up to 68GB on their own; adding background processes and other applications on top of that might lead to an OOM.

However, I also have 32GB of external NPU memory, 16GB of GPU memory, and 98MB of SRAM.

How did you handle explicit memory control in Tenmo?

ratulb · July 10, 2026, 8:01pm

Sorry for the silence — turns out the honest answer to this took a full writeup to get right. Tenmo’s memory control comes down to two structures:

Buffer — the flat memory backing every tensor. Unshared by default (deep copy on __init__(*, copy:) — malloc + memcpy). Calling .shared() transforms it in place into a refcounted layout: [Atomic(UInt64) refcount] | [data]. Views (slices, reshapes, transposes) bump the refcount instead of copying; the buffer is only freed once the last reference drops to zero.
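As a rough illustration of the unshared-versus-shared behaviour described above, here is a small Python model. It only imitates the bookkeeping: `Block`, `view()` and `release()` are hypothetical names, a plain integer stands in for the atomic refcount header, and setting `data` to `None` stands in for freeing the allocation.

```python
class Block:
    """One allocation: a refcount header followed by the data."""

    def __init__(self, data):
        self.refcount = 1
        self.data = list(data)  # the copy models malloc + memcpy


class Buffer:
    """Handle onto a Block. Unshared by default; shared() opts in to refcounting."""

    def __init__(self, data=None, block=None):
        self.block = block if block is not None else Block(data)
        self.is_shared = block is not None

    def copy(self):
        # Unshared semantics: a deep copy into a fresh allocation.
        return Buffer(self.block.data)

    def shared(self):
        # Switch this buffer to shared mode in place.
        self.is_shared = True
        return self

    def view(self):
        # Views bump the refcount and reuse the same allocation.
        if not self.is_shared:
            raise ValueError("call shared() before creating views")
        self.block.refcount += 1
        return Buffer(block=self.block)

    def release(self):
        # Drop this handle; free the data when the last reference goes away.
        block = self.block
        if block is None:
            return
        self.block = None
        block.refcount -= 1
        if block.refcount == 0:
            block.data = None  # stands in for free()
```

In this model, writing through a view is visible through the original handle, while writing to a `copy()` is not, and the data survives until every handle has been released.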

Gradbox — same idea, applied to gradients, with its own independent refcount. This is what keeps a gradient alive even after Mojo’s aggressive destruction drops the intermediate Tensor it was attached to — without that, you’d get dangling pointers in the backward graph the moment an intermediate tensor goes out of scope.

Net effect: no hidden allocations beyond what the autograd graph structurally needs, and reshape() is genuinely free — new Shape/Strides on the same Buffer, no copy.
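The "reshape is free" point can be sketched with a strided view over a flat storage. This is an illustrative Python model, not Tenmo's NDBuffer: the `View` class and `contiguous_strides` helper are hypothetical, and the sketch only allows copy-free reshape on contiguous layouts.

```python
from math import prod


def contiguous_strides(shape):
    """Row-major strides, in elements, for the given shape."""
    strides = [1] * len(shape)
    for axis in range(len(shape) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * shape[axis + 1]
    return tuple(strides)


class View:
    """Shape + strides + offset over a flat storage that is never copied here."""

    def __init__(self, storage, shape, strides=None, offset=0):
        self.storage = storage
        self.shape = tuple(shape)
        if strides is None:
            strides = contiguous_strides(self.shape)
        self.strides = tuple(strides)
        self.offset = offset

    def __getitem__(self, idx):
        if isinstance(idx, int):
            idx = (idx,)
        flat = self.offset + sum(i * s for i, s in zip(idx, self.strides))
        return self.storage[flat]

    def reshape(self, new_shape):
        # Only new metadata is created; the storage object is reused.
        if self.strides != contiguous_strides(self.shape):
            raise ValueError("this sketch reshapes contiguous views only")
        if prod(new_shape) != prod(self.shape):
            raise ValueError("element count must not change")
        return View(self.storage, new_shape, offset=self.offset)

    def transpose(self):
        # Reversing shape and strides transposes without moving any data.
        return View(self.storage, self.shape[::-1], self.strides[::-1], self.offset)
```

A 2×3 view, its 3×2 reshape, and its transpose all index into the same six elements; only the shape and stride tuples differ.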

I traced this entire path — one MNIST training step, Buffer → NDBuffer → Gradbox → backward DFS → SGD update — in more detail than fits in a forum reply:

(Also mirrored as a repo doc if you’d rather stay on GitHub: docs/from-bytes-to-gradients.md)

Picked up a decent number while writing it up too: 2.8× faster than PyTorch CPU on the same MNIST MLP — no BLAS, no Python. Full benchmark table’s in the post.
