mojoBLAS (v0.1.0): A pure Mojo implementation of BLAS routines

mojoBLAS: A pure Mojo implementation of BLAS routines. (yes, peak naming creativity :wink:)

I started this while working on numerical backends for NuMojo and ended up going down the rabbit hole of implementing the full set of BLAS routines in Mojo. This will later be embedded into my existing work on SciJo.

Current coverage

  • Level 1: 12 routines
  • Level 2: 16 routines
  • Level 3: 6 routines

What’s included

  • Pure Mojo kernels: no external BLAS dependencies in the core implementation.
  • Generic support for real data types via DType
    • Traditional s/d prefixes are removed since Mojo handles this generically
  • Test coverage across all levels
    • Results validated against OpenBLAS
  • Benchmark scripts comparing performance with system OpenBLAS
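For readers curious what a comparison against a system BLAS looks like in principle, here is a minimal sketch (in Python for illustration; this is not the repo's actual benchmark or test code) that times a naive dot product against NumPy's `np.dot`, which typically dispatches to a linked BLAS such as OpenBLAS, and checks that the results agree:

```python
# Hypothetical sketch of a BLAS comparison (not mojoBLAS's actual scripts):
# time a plain-loop ddot against NumPy's BLAS-backed dot and validate.
import time
import numpy as np

def naive_dot(x, y):
    """Reference ddot: sum of elementwise products via a plain loop."""
    acc = 0.0
    for a, b in zip(x, y):
        acc += a * b
    return acc

n = 100_000
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
y = rng.standard_normal(n)

t0 = time.perf_counter()
ref = naive_dot(x, y)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
fast = float(np.dot(x, y))  # dispatches to the linked BLAS backend
t_blas = time.perf_counter() - t0

assert abs(ref - fast) < 1e-6 * n  # the two results should agree closely
print(f"naive: {t_naive * 1e3:.2f} ms, BLAS: {t_blas * 1e3:.2f} ms")
```

The same shape of harness (fill inputs, run both backends on identical data, validate, then time) applies when the kernels are Mojo routines instead of Python loops.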

What’s not included (yet)

  • Optimized routines
    • Level 2 and Level 3 currently use naive implementations (this is reflected in benchmarks xD)
  • Some Level 1 routines include SIMD vectorisation.
  • Complex number support
    • Still exploring the best way to represent and handle complex types generically in Mojo.
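As a reference for what "naive" means here: the Level 3 workhorse `gemm` computes C ← αAB + βC, and a straightforward triple loop (sketched below in Python for clarity, not taken from the repo) is correct but leaves all cache blocking, vectorisation, and parallelism on the table:

```python
# Naive gemm semantics: C = alpha * (A @ B) + beta * C.
# A plain triple loop like this is the style of unoptimised kernel the
# Level 3 routines currently use (shown in Python purely for illustration).
def naive_gemm(alpha, A, B, beta, C):
    m, k = len(A), len(A[0])
    n = len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]  # inner product of row i and col j
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

Optimised BLAS implementations get their speed from cache blocking, packed panels, SIMD microkernels, and threading on top of exactly this arithmetic, which is where the optimisation rabbit hole below leads.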

Check out the benchmark plots here. There’s plenty of room for optimization, from low-hanging improvements to more hardcore tuning. Contributions are very welcome!!! If you are interested in going down the optimisation rabbit hole, I’ll be happy to take PRs. Also, I’m not super confident in my benchmarking abilities, so if that’s your thing, feel free to take a crack at it :slight_smile:

Happy computing!

Repo: GitHub - shivasankarka/mojoBLAS: Implementation of BLAS routines in pure Mojo 🔥
Reference: https://www.netlib.org/blas/

Cool stuff, love the direction. There was one Level 2 and one Level 3 routine that seemed to be an order of magnitude faster than Accelerate. Is that a bug, or how do you think that’s possible?

Thanks! Great catch!

The initial benchmark results were a bit misleading due to a few subtle bugs in the benchmarking setup, not the algorithms themselves. While trying to improve performance, I ended up finding these bugs in the benchmarking code.

  • I had placed keep() inside the closure function, which somehow caused the compiler to optimize the work away anyway, resulting in inflated numbers.
  • I wasn’t filling the relevant buffers correctly (in the Mojo, OpenBLAS, and Accelerate runs). This led to some weird performance jumps in OpenBLAS and Mojo (GPT5.3 found this).

After fixing these bugs, I was able to get reliable benchmark code working (assuming I didn’t make any other mistakes lol), which is what I’m using right now to benchmark performance as I try to improve these routines.

I have updated the plots in the main branch! The current results make much more sense and seem consistent. Mojo performs well enough given that the Level 2 and 3 routines are still naive, unoptimised implementations (in the main branch). I will drop an update once I improve performance on the Level 2 algorithms.