About the decision to remove the Max Tensor APIs in Mojo

Currently, performing tensor operations in PyTorch is, from a programming perspective, very concise and powerful. For comparison, here is what a kernel for a broadcast add looks like in Mojo on the GPU (without MAX): mojo-gpu-puzzles/problems/p05/p05_layout_tensor.mojo at 763da1d193680afb396e13582397cd9c6673e5e1 · winding-lines/mojo-gpu-puzzles · GitHub

In PyTorch, similar code is much more concise:

import torch

a = torch.tensor([1, 2, 3], device='cuda')
b = torch.tensor([10, 20, 30, 40], device='cuda')
# Broadcast addition: out[i][j] = a[i] + b[j]
out = a[:, None] + b[None, :]  # shape: [3, 4]

The MAX Python API is a great and concise library for working with MAX, and even though I haven’t tried it in depth (I’m more focused on Mojo GPU right now), it seems it can accomplish the same as PyTorch.

But if we want to use Mojo for simple tensor operations, since the Modular team removed the Max Tensor API (e.g. see this commit), we have, AFAIK, no option other than building a GPU kernel.

However, given Modular’s incredible work on the Mojo language and recent improvements in usability, I wonder why we couldn’t create a Tensor struct that we can play with directly in Mojo. In fact, if Mojo is an AI-first language, the stdlib could include such types, which would differentiate Mojo from the rest of the general-purpose languages.
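To make the idea concrete, here is a rough sketch of what a minimal Tensor could look like. Everything here is hypothetical: a rank-2, CPU-only, Float64 toy to show the ergonomics, not an existing API, and exact Mojo syntax shifts between releases.

# Hypothetical sketch, not an existing API: a minimal rank-2,
# CPU-only, Float64 Tensor showing the ergonomics we could aim for.
struct Tensor(Copyable, Movable):
    var data: List[Float64]
    var rows: Int
    var cols: Int

    fn __init__(out self, rows: Int, cols: Int):
        self.rows = rows
        self.cols = cols
        self.data = List[Float64](capacity=rows * cols)
        for _ in range(rows * cols):
            self.data.append(0.0)

    fn __getitem__(self, i: Int, j: Int) -> Float64:
        return self.data[i * self.cols + j]

    fn __setitem__(mut self, i: Int, j: Int, value: Float64):
        self.data[i * self.cols + j] = value

    fn __add__(self, other: Self) -> Self:
        # Elementwise add; a real design would also need
        # broadcasting, dtypes, and device placement.
        var result = Self(self.rows, self.cols)
        for i in range(self.rows):
            for j in range(self.cols):
                result[i, j] = self[i, j] + other[i, j]
        return result

fn main():
    var a = Tensor(2, 2)
    a[0, 0] = 1.0
    var b = Tensor(2, 2)
    b[0, 0] = 10.0
    var c = a + b
    print(c[0, 0])  # 11.0

Even something this small would already cover the notebook-style experimentation I have in mind.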

I believe some of the stdlib contributors (myself included) would be happy to maintain this part of the library, so we wouldn’t have to switch to Python to do cool things concisely, maybe even in a Jupyter notebook.

What do you think?

CC/ @owenhilyard

I agree that a MAX API in Mojo is something we should have. I think that all of the discussion and marketing about “one language for AI” is somewhat undermined when you have to use two languages.

However, I think that these bindings should stay separate from the stdlib. The field of ML is evolving quickly enough that we want the freedom to evolve the tensor type without a stdlib break, especially if we need to change things in response to new hardware. Having it a little off to the side, as MAX is, would probably be beneficial.

I have a limited amount of bandwidth, but I’d be willing to at least try to put in some design work (I can think about designs during my commute), since I think a Mojo-first API for MAX would be very helpful and that linear types or the typestate pattern could massively improve the usability of the API. I’ll try to help out with coding as well, at least for whatever weird things I come up with to make MAX type-safe, since the old MAX API had some of the things I dislike most about Python APIs: poor types and tons of runtime errors.
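To illustrate what I mean by typestate (all names below are hypothetical and have nothing to do with the real MAX API; it’s just a sketch of the pattern): the stage of a graph can live in a compile-time parameter, so misuse becomes a type error instead of a runtime exception.

# Hypothetical typestate sketch, not the MAX API: the graph's stage
# is a compile-time parameter of the struct.
struct Graph[compiled: Bool](Copyable, Movable):
    var num_ops: Int

    fn __init__(out self, num_ops: Int = 0):
        self.num_ops = num_ops

    fn add_op(self) -> Graph[False]:
        # Adding an op always yields an uncompiled graph.
        return Graph[False](self.num_ops + 1)

    fn compile(self) -> Graph[True]:
        return Graph[True](self.num_ops)

fn execute(g: Graph[True]):
    # Only compiled graphs are accepted; passing a Graph[False]
    # fails at compile time rather than raising at runtime.
    print("executing graph with", g.num_ops, "ops")

fn main():
    var g = Graph[False]().add_op().add_op()
    execute(g.compile())
    # execute(g)  # error: expected Graph[True], got Graph[False]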

I think it’s also worthwhile to consider those of us who can’t switch to the Python API. Low-latency inference or general-purpose compute using MAX may exclude the use of Python, and I don’t think many people want to embed both MAX and Python into, for example, a C++ or Java application when the PyTorch C and C++ APIs exist. Embedding a C ABI extension which manages MAX itself would be much more palatable to me, in part because I can’t afford a garbage collector in many of my projects, which categorically excludes Python.

Also cc @TilliFe

What is the performance penalty of trying to use the Python bindings in Mojo?

@owenhilyard , thanks!

To take advantage of what was probably many hundreds of hours of work by the Modular team, why not just revert the best of the removed Max Tensor API, so that instead of designing the API from scratch you design the deltas over that version? I would be happy to implement those deltas in my spare hours.

We had something that worked with no effort, something that others could try to contribute to. The stdlib has been deprecating and breaking things for many months, so I don’t see why we couldn’t break things here.

For me, the performance penalty is in the “I cannot meet an industry-standard SLA with a Python GC active” realm. GC pauses are entirely a non-option for dataplane apps, especially when doing something that creates hundreds of thousands or millions of objects per second. It also forces me to run one process per CPU core in multiprocess mode, due to Python being single-threaded. With current hardware, 64 cores in a server is not that much, which means that having 32 cores waiting on the GIL to run a bit of Python is a fairly likely scenario. When I’m measuring processing times in nanoseconds, having a big mutex that needs to be used by the majority of my cores at any given time is not an option.

Python is simply not a language that belongs near the kinds of applications I write. Even Go is unacceptable unless I were to disable the GC; same with Java and C#.

That might be doable, and I debated making a type-safe layer on top of the MAX API previously. I’ll take a look at it once I’m done with the IO API.

When you’re talking about abstractions over tensor operations, there are two different ways this can be approached: treating these operations as computational graphs (like we do in the Graph API), or performing calculations in Mojo code to be run directly as a function on the CPU or GPU.

For the former, in a separate post, I talk about why we made the difficult decision to open-source and then wind down the Modular distribution of the Mojo Graph and Driver APIs, so I won’t re-hash that here. I will say that if you decided to invest in modernizing that API, I would not use the old max.tensor Tensor as a building block. It is fundamentally incompatible with use on GPUs, and we found that its design posed a lot of problems. It may have elements to its interface that could be ported to other tensor-like types to make them easier to work with, but I wouldn’t recommend carrying its core forward.

Now, if what you want is a better abstraction for tensor operations performed directly in Mojo, there may be an opportunity to enhance the API around LayoutTensor, which we use as our first-class tensor representation for CPU / GPU functions. We’re migrating off of NDBuffer and other data types to focus on LayoutTensor as the primary way to represent tensors inside Mojo GPU functions like the example you point to. Many of these examples are shaped to be familiar to CUDA programmers, but better abstractions for elementwise calculations at the LayoutTensor level could make writing Mojo kernel functions even cleaner.
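As a purely hypothetical sketch of the direction (the names below are invented, and this uses a plain List on the CPU rather than LayoutTensor, whose signatures vary across releases), an elementwise combinator could shrink a kernel body to a one-liner:

# Hypothetical sketch: an elementwise combinator over a CPU stand-in.
# A real version would be parameterized over dtype, rank, and layout,
# and would dispatch to a GPU kernel.
fn map2[op: fn (Float64, Float64) -> Float64](
    a: List[Float64], b: List[Float64]
) -> List[Float64]:
    var result = List[Float64](capacity=len(a))
    for i in range(len(a)):
        result.append(op(a[i], b[i]))
    return result

fn add(x: Float64, y: Float64) -> Float64:
    return x + y

fn main():
    var a = List[Float64](1.0, 2.0, 3.0)
    var b = List[Float64](10.0, 20.0, 30.0)
    var c = map2[add](a, b)  # the "kernel" is now one line
    for i in range(len(c)):
        print(c[i])

Because the operation is a compile-time parameter, the compiler can inline it, which is what makes this kind of abstraction zero-cost in principle.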

Two things here. First, is this still a problem if you use Python as just some sort of exchange format between Mojo and C++? Hmm, thinking about it, this would depend on how many extra features are added on top of the Max graph API.

Second, Python’s cycle GC can be disabled; you only need to make sure you don’t have strong cyclic references.
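For example, from Mojo this can be done through the Python interop (a sketch; note that only the cycle detector is turned off, reference counting still runs, so strong reference cycles will leak):

from python import Python

fn main() raises:
    # Disable only Python's cyclic garbage collector; refcounting
    # still reclaims acyclic objects.
    var gc = Python.import_module("gc")
    _ = gc.disable()
    print(gc.isenabled())  # False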

I was actually reasoning that the process of encoding the data as PythonObjects might impose significant overhead.
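A sketch of the shape I’d worry about (the interop calls are real, the performance concern itself is unmeasured): every element that crosses the boundary becomes a separately boxed, reference-counted PythonObject, so per-element traffic is the expensive pattern and a single bulk handoff the cheap one.

from python import Python, PythonObject

fn main() raises:
    var builtins = Python.import_module("builtins")
    var py_list = builtins.list()
    # One boxing + refcount round trip per element; for a large
    # tensor this per-element cost is exactly the overhead in question.
    for i in range(1000):
        _ = py_list.append(PythonObject(i))
    print(builtins.len(py_list))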