I wanted to start by saying I really appreciate the incredible work you’re all doing on MAX. I know the size of this undertaking is massive, which is why I was hesitant to even bring this up, but I wanted to share a thought.
I recently asked the documentation chatbot (kapa.ai) and learned that MAX doesn’t yet support sparse matrix formats (such as COO or CSC) or sparse arrays.
While I don’t personally need them right now, I’ve seen in past projects how these formats can deliver orders-of-magnitude faster matrix multiplication and dramatically reduce memory use for very large matrices, which seems especially relevant for GPU workloads. Given that sparse techniques are already used by several AI models I’ve read about (Mistral 7B, low-parameter LLaMA variants, Phi-2), I hope they can be considered as candidates for the roadmap.
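For anyone unfamiliar with the formats, here is a minimal pure-Python sketch of the COO (coordinate) idea: the matrix is stored as three parallel arrays of (row, col, value) for the nonzero entries, so storage and work scale with the number of nonzeros rather than rows × cols. The names here are illustrative, not anything from MAX or Mojo:

```python
def coo_spmv(rows, cols, vals, x, n_rows):
    """Sparse matrix-vector product y = A @ x, with A stored in COO form
    as three parallel lists: row indices, column indices, and values."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]  # only the nonzero entries contribute
    return y

# A 4x4 matrix with only 3 nonzeros:
# [[5, 0, 0, 0],
#  [0, 0, 8, 0],
#  [0, 0, 0, 0],
#  [0, 6, 0, 0]]
rows = [0, 1, 3]
cols = [0, 2, 1]
vals = [5.0, 8.0, 6.0]

y = coo_spmv(rows, cols, vals, [1.0, 2.0, 3.0, 4.0], 4)
print(y)  # [5.0, 24.0, 0.0, 12.0]
```

The same idea is what makes the dense-vs-sparse gap so large: a dense 4x4 stores 16 numbers and does 16 multiplies per SpMV; the COO form stores and touches only 3.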
Thanks again for everything! Keep up the amazing work!
The Mojo roadmap is mostly for language features, so it wouldn’t include things like sparse matrix/array support. I think everything you would need to implement that sparsity is already in Mojo if you want to try your hand at contributing an implementation. “I need this, so I built it” is a lot of how the Mojo stdlib is growing right now, and I think a sparse tensor would be reasonable to include, since in my opinion stdlibs exist for two things:
To hold the basics you need for almost all programs, which don’t really change over time (math, basic data structures, threading, etc.)
To hold ecosystem-wide APIs as a way to standardize things.
This would fall into the latter. If someone else doesn’t need it before you do, your contribution will be much appreciated.
To be honest, as someone migrating from R, I feel a bit intimidated about tackling a foundational implementation of this myself right now. My experience in R relied on wrappers around highly mature, battle-tested libraries (like the Matrix package, which wraps SuiteSparse and METIS). That package let me achieve orders-of-magnitude speedups using a fraction of the memory. While I could theoretically hack something together in Mojo, I worry a naive implementation wouldn’t do justice to the hardware.
I did a bit of digging into what “state-of-the-art” looks like here, and it reinforces my sense that this is a job for experts familiar with matrix multiplication idiosyncrasies:
PyTorch relies on a heavy stack including torch.sparse, FBGEMM (Facebook General Matrix Multiply), and vendor-specific libraries like NVIDIA cuSPARSE/cuSPARSELt and Intel oneMKL.
On the CPU side, libraries like LIBXSMM are heavily depended upon for JIT-optimized kernels.
Just looking at the tutorial for getting SOTA performance on dense matrix multiplication in Mojo, it required half a dozen optimization steps (tiling, vectorization, etc.). I suspect a performant sparse implementation would deliver outsized value for MAX, but it requires that same deep level of hardware-aware engineering.
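To illustrate just one of those steps, here is a hedged pure-Python sketch of the tiling idea (none of this is the tutorial's actual code, which is in Mojo and also vectorizes and parallelizes; plain Python won't show the speedup, only the loop structure):

```python
def matmul_tiled(A, B, tile=2):
    """C = A @ B computed block by block, so each (tile x tile) sub-block
    of A and B is reused while it is still hot in cache."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Accumulate one tile's contribution to the C block.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = 0.0
                        for kk in range(k0, min(k0 + tile, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] += s
    return C

print(matmul_tiled([[1, 2], [3, 4]], [[5, 6], [7, 8]], tile=1))
# [[19.0, 22.0], [43.0, 50.0]]
```

The answer is identical for any tile size; tiling only changes the order in which memory is touched, which is exactly the kind of hardware-aware detail a sparse kernel would need to get right too.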
I just wanted to keep this on the radar for the high-performance gurus here!
Hi @soshsquatch , I’ve been working on implementing SpMV in my free time as a hobby (only when I have spare time between family and work). Maybe we can connect offline if you are interested in this. I have a partial (unfinished) implementation using the SELL-C format, which I think is SOTA. I’d be interested in connecting with other people interested in the subject.
That’s fantastic news! I’m really glad to hear you’re tackling Sparse Matrix-Vector Multiplication (SpMV) and using the SELL-C format.
As I mentioned in my first post, I don’t personally need this right now, but I saw it as a potential gap in the ecosystem where someone could really add value.
I’m interested in following your progress. I think the GitHub route is probably the best way for the community to track and potentially contribute, but I’d be happy to follow along however you choose to share your work (the whole forum, GitHub, or privately).
@juaneco I went down a bit of a rabbit hole after your post, spending 30 minutes researching SELL-C, and stumbled upon a variation called SELL-P. It seems to be an algorithmic tweak that pads rows to make some GPU implementations more efficient: memory accesses line up in a more beneficial way.
It might be worth a quick scan for the implementation you are working on. I found a tech report that introduces and details the algorithm, with benchmarks showing an order-of-magnitude improvement over CSR and SOTA performance compared to two cuSPARSE implementations: https://library.eecs.utk.edu/files/ut-eecs-14-727.pdf . Google Scholar shows it cited 70+ times. Also, in case you are not aware, there is a permissively licensed C++ library called Ginkgo that can give a feel for an implementation, although your Mojo implementation could end up looking very different.
Can’t wait to see what you build and happy to provide feedback or input as best I can.
Thanks @soshsquatch , that is the implementation I am basing myself on. I think the only difference between SELL-C and SELL-P is that the latter adds extra padding to ensure all threads in a warp execute, so the chunks of size b*t are always multiples of 32. However, I don’t think that is necessary on modern GPUs; it is fine to “disable” some threads and avoid the extra padding.
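For anyone following along, here is a toy pure-Python sketch of how I understand the SELL-C layout, including the "disabled thread" slots: rows are grouped into chunks of C, each chunk is padded only to the longest row within that chunk (unlike ELLPACK, which pads to the global maximum), and values are laid out column-major per chunk so consecutive threads would read consecutive memory. All names are illustrative; the real kernel is nothing like this Python:

```python
def to_sell_c(row_entries, C):
    """Build a SELL-C-style layout from per-row lists of (col, val).
    Returns one (start_row, cols, vals) tuple per chunk, with cols/vals
    stored column-major; -1 marks a padding ("disabled thread") slot."""
    chunks = []
    for start in range(0, len(row_entries), C):
        chunk_rows = row_entries[start:start + C]
        width = max((len(r) for r in chunk_rows), default=0)
        cols = [[-1] * len(chunk_rows) for _ in range(width)]
        vals = [[0.0] * len(chunk_rows) for _ in range(width)]
        for local, row in enumerate(chunk_rows):
            for j, (c, v) in enumerate(row):
                cols[j][local] = c
                vals[j][local] = v
        chunks.append((start, cols, vals))
    return chunks

def sell_c_spmv(chunks, x, n_rows):
    """y = A @ x over the SELL-C layout; each inner loop models one
    warp-sized slice of threads touching contiguous memory."""
    y = [0.0] * n_rows
    for start, cols, vals in chunks:
        for j in range(len(cols)):
            for local in range(len(cols[j])):
                c = cols[j][local]
                if c >= 0:  # skip padding: a "disabled" thread slot
                    y[start + local] += vals[j][local] * x[c]
    return y
```

With C=2 and rows of very different lengths, you can see the padding stay local to each chunk, which is the whole point of SELL-C over plain ELLPACK.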
I wanted to share a quick update as I stumbled onto some relevant info while digging deeper into sparse formats and Mojo.
Lately, I ran across the Mixture of Experts (MoE) implementation in MAX. It seems the current implementation takes the efficient path of isolating dense blocks for standard matmuls, dividing the problem into the smaller pieces needed in an inference pass, say a subset of experts. Brendan Duke gave a recent tech talk on MoE serving that showed a slide in which each token in an LLM can potentially be routed, in parallel, to a different GPU or an entirely different compute node (machine).
I also came across MegaBlocks, a sparse matmul library originating from Stanford and now maintained by Databricks under a permissive Apache 2.0 license. It is also interesting because it enables parallel operations on blocks, even across multiple machines.
What really caught my eye was how MegaBlocks moves beyond static formats like SELL-C. Instead, it uses a hybrid approach with dynamic “views” of the data:
It uses a Blocked CSR view for the forward pass and inference.
It utilizes a Block COO (coordinate) view for the backward pass during training.
The extra metadata it maintains allows for efficient transposes, supporting both inference and training. While this might be more complex than what’s strictly needed for pure inference, it offers a way to use the same data structures for both stages. For those interested, here is a quick lecture on MegaBlocks by one of its authors, Trevor Gale (now at Google DeepMind).
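To make the "two views over one payload" idea concrete, here is a hedged pure-Python sketch: the nonzero blocks are stored once, a blocked-CSR-style index serves row-major traversal (the forward matmul), and a COO-style view makes the transpose cheap for the backward pass by just swapping coordinates. All names are illustrative; this is not the MegaBlocks API:

```python
class BlockSparse:
    """Toy block-sparse matrix: one dense block per nonzero block
    coordinate, shared by both index views."""

    def __init__(self, block_coords, blocks, grid_rows):
        self.coords = block_coords   # list of (block_row, block_col)
        self.blocks = blocks         # list of dense blocks (2D lists)
        self.grid_rows = grid_rows   # number of block rows in the grid

    def csr_view(self):
        """Blocked-CSR metadata: row pointers plus column indices over
        the blocks sorted into row-major order (forward pass traversal)."""
        order = sorted(range(len(self.coords)), key=lambda i: self.coords[i])
        indptr = [0] * (self.grid_rows + 1)
        for i in order:
            indptr[self.coords[i][0] + 1] += 1
        for r in range(self.grid_rows):          # prefix-sum into pointers
            indptr[r + 1] += indptr[r]
        return (indptr,
                [self.coords[i][1] for i in order],
                [self.blocks[i] for i in order])

    def coo_transpose_view(self):
        """COO view of the transpose: swap each (row, col) coordinate and
        transpose each dense block; no payload copy of the matrix layout."""
        return [((c, r), [list(col) for col in zip(*blk)])
                for (r, c), blk in zip(self.coords, self.blocks)]
```

The design point is that the expensive part (the dense blocks) exists once, and each pass only pays for the small index metadata it actually needs.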
Exploring this has been a fun way to catch up on the neural network SOTA… it’s been a while since Hinton’s online course.