mojoBLAS (v0.1.0): A pure Mojo implementation of BLAS routines

mojoBLAS: A pure Mojo implementation of BLAS routines. (yes, peak naming creativity :wink:)

I started this while working on numerical backends for NuMojo and ended up going down the rabbit hole of implementing the full set of BLAS routines in Mojo. This will later be embedded into my existing work on SciJo.

Current coverage

  • Level 1: 12 routines
  • Level 2: 16 routines
  • Level 3: 6 routines

What’s included

  • Pure Mojo kernels: no external BLAS dependencies in the core implementation.
  • Generic support for real data types via DType
    • Traditional s/d prefixes are removed since Mojo handles this generically
  • Test coverage across all levels
    • Results validated against OpenBLAS
  • Benchmark scripts comparing performance with system OpenBLAS
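For readers curious what a comparison against a system BLAS looks like in principle, here is a minimal sketch (in Python for illustration; this is not the repo's actual benchmark or test code) that times a naive dot product against NumPy's `np.dot`, which typically dispatches to a linked BLAS such as OpenBLAS, and checks that the results agree:

```python
# Hypothetical sketch of a BLAS comparison (not mojoBLAS's actual scripts):
# time a plain-loop ddot against NumPy's BLAS-backed dot and validate.
import time
import numpy as np

def naive_dot(x, y):
    """Reference ddot: sum of elementwise products via a plain loop."""
    acc = 0.0
    for a, b in zip(x, y):
        acc += a * b
    return acc

n = 100_000
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
y = rng.standard_normal(n)

t0 = time.perf_counter()
ref = naive_dot(x, y)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
fast = float(np.dot(x, y))  # dispatches to the linked BLAS backend
t_blas = time.perf_counter() - t0

assert abs(ref - fast) < 1e-6 * n  # the two results should agree closely
print(f"naive: {t_naive * 1e3:.2f} ms, BLAS: {t_blas * 1e3:.2f} ms")
```

The same shape of harness (fill inputs, run both backends on identical data, validate, then time) applies when the kernels are Mojo routines instead of Python loops.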

What’s not included (yet)

  • Optimized routines
    • Level 2 and Level 3 currently use naive implementations (this is reflected in benchmarks xD)
  • Some Level 1 routines include SIMD vectorisation.
  • Complex number support
    • Still exploring the best way to represent and handle complex types generically in Mojo.
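As a reference for what "naive" means here: the Level 3 workhorse `gemm` computes C ← αAB + βC, and a straightforward triple loop (sketched below in Python for clarity, not taken from the repo) is correct but leaves all cache blocking, vectorisation, and parallelism on the table:

```python
# Naive gemm semantics: C = alpha * (A @ B) + beta * C.
# A plain triple loop like this is the style of unoptimised kernel the
# Level 3 routines currently use (shown in Python purely for illustration).
def naive_gemm(alpha, A, B, beta, C):
    m, k = len(A), len(A[0])
    n = len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]  # inner product of row i and col j
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

Optimised BLAS implementations get their speed from cache blocking, packed panels, SIMD microkernels, and threading on top of exactly this arithmetic, which is where the optimisation rabbit hole below leads.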

Check out the benchmark plots here. There’s plenty of room for optimization, from low-hanging improvements to more hardcore tuning. Contributions are very welcome!!! If you are interested in going down the optimisation rabbit hole, I’ll be happy to take PRs. Also, I’m not super confident in my benchmarking abilities, so if that’s your thing, feel free to take a crack at it :slight_smile:

Happy computing!

Repo: GitHub - shivasankarka/mojoBLAS: Implementation of BLAS routines in pure Mojo 🔥
Reference: https://www.netlib.org/blas/

Cool stuff, love the direction. There was one Level 2 and one Level 3 routine that seemed to be an order of magnitude faster than Accelerate. Is that a bug, or how do you think that’s possible?

Thanks! Great catch!

The initial benchmark results were a bit misleading due to a few subtle bugs in the benchmarking setup, not the algorithms themselves. While trying to improve performance, I ended up finding these bugs in the benchmarking code.

  • I had placed keep() inside the closure function, which somehow caused the compiler to optimize the work away anyway, resulting in inflated numbers.
  • I wasn’t filling the relevant buffers correctly (in the Mojo, OpenBLAS, and Accelerate runs). This led to some weird performance jumps in OpenBLAS and Mojo (GPT5.3 found this).

After fixing these bugs, I was able to get reliable benchmark code working (assuming I didn’t make any other mistakes lol), which is what I’m using right now to benchmark performance as I try to improve these routines.

I have updated the plots in the main branch! The current results make much more sense and seem consistent. Mojo performs well enough given that the Level 2 and 3 routines are still naive, unoptimised implementations (in the main branch). I will drop an update once I improve performance on the Level 2 algorithms.