Porting various models to MAX

Hi everyone,

I experimented with Mojo in its early days, tried compiling models to run on CPUs, and ran some benchmarks.

I am looking to explore converting different models to run on MAX. I would like to start with a TorchScript or possibly ONNX model and get a report of the supported operations, along with the ones that still need implementation.
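
As a rough first pass at that report, I imagine just enumerating the op types the ONNX graph uses with the standard onnx Python package and diffing that list against what MAX covers. Something like this sketch (the model path is a placeholder):

```python
from collections import Counter

import onnx

# "model.onnx" is a placeholder; point this at the exported model.
model = onnx.load("model.onnx")

# Count how often each op type appears in the graph. The resulting
# list is what I would diff against the operations MAX supports.
op_counts = Counter(node.op_type for node in model.graph.node)
for op_type, count in sorted(op_counts.items()):
    print(f"{op_type}: {count}")
```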

I would like to implement the unsupported operations in Mojo one by one and incrementally work on porting the whole model to run on MAX.

Ideally, I want to be able to benchmark the performance of both versions using NVIDIA Nsight Systems. I am thinking of using NVTX with C bindings to mark ranges, even though I don’t expect a detailed trace from MAX without CUDA.
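
For the range markers, something like this minimal ctypes sketch on the Python side is what I have in mind (assuming libnvToolsExt.so.1 is on the loader path; nvtxRangePushA and nvtxRangePop are the standard NVTX C API):

```python
import ctypes
from contextlib import contextmanager

# Assumes the NVTX shared library is installed and findable; the
# soname can differ between CUDA toolkit versions.
_nvtx = ctypes.CDLL("libnvToolsExt.so.1")

@contextmanager
def nvtx_range(name: str):
    # nvtxRangePushA opens a named range that Nsight Systems will
    # show on the timeline; nvtxRangePop closes it.
    _nvtx.nvtxRangePushA(name.encode("ascii"))
    try:
        yield
    finally:
        _nvtx.nvtxRangePop()

# e.g. wrap each inference call:
# with nvtx_range("max_inference"):
#     run_model(...)
```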

I doubt I would be able to write kernels with SOTA performance, but I am very interested in creating tooling around this process to make it easier for smarter people.

Any pointers or suggestions would be greatly appreciated!

Thanks in advance.
Mert

You’re catching us at a time when we’re just getting the pieces in place to make this a good experience. For example, the latest MAX nightly adds a new flag to our serving system (--custom-models=folder/path/to/import:my_module) that lets you bring your own LLM-style architecture and serve it using MAX.

If you’re working from a TorchScript or ONNX model, however, you might be looking for something a bit outside of the generative AI space. You could still build it using the MAX Graph operations and the max.nn layer abstractions we’re building on top of those operations. We have some examples of graph construction in the MAX repository to draw from, but if you have a particular shape of model in mind, maybe we could provide a little more directed guidance.

When it comes to the operations themselves, the computational graphs defined in MAX tend to be a little finer-grained than layers in other frameworks. What is a single operation or layer in PyTorch may be composed from several smaller MAX Graph operations, so it can be a little harder to identify a one-to-one correspondence. However, yes, if something is missing you definitely could write your own custom Mojo operation to fill in the gap. You might even do that when the equivalent layer can be constructed from MAX Graph operations, if you think you can squeeze out a little more with a manually fused version. Starting with an unoptimized version of an operation often gives you a reasonable baseline, and you can tune from there.
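
To make that granularity concrete, here’s a rough sketch of building PyTorch’s single SiLU activation out of two smaller MAX Graph operations (the Python API shown is approximate and may differ slightly from the current nightly):

```python
from max.dtype import DType
from max.graph import Graph, TensorType, ops

# torch.nn.SiLU is one layer in PyTorch; here it decomposes into two
# finer-grained graph operations: a sigmoid and an elementwise mul.
with Graph(
    "silu_example",
    input_types=[TensorType(DType.float32, ["batch", 1024])],
) as graph:
    x = graph.inputs[0]
    graph.output(ops.mul(x, ops.sigmoid(x)))
```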

Keep an eye on the open-source MAX repo over the next few days; some things might be appearing there that could be a big help with writing your own operations. I’ll also say that we do a lot of Nsight Compute profiling of our MAX Graph models and Mojo kernels, and we’re working on putting together some documentation and an example of how to do that on your own graph.

Thank you for the thorough response.

I did not realize I had caught you at such a transitional moment; I will certainly give the new flag a try.

I was thinking of working mostly on CNNs, or on other fresh-out-of-R&D models that are not LLMs yet are already 90%+ supported by MAX because they heavily utilize transformers, like some of the popular image/video-to-image/video models out there.

Nsight profiling sounds really exciting, I was not expecting that.

Lastly, I want to say how much I appreciate the incredible work you guys are doing. Being able to work with a 1 GB container that just works, not to mention with multi-vendor support, is just unbelievable after countless hours of debugging and fixing version-mismatch issues while trying to deploy 20 GB+ containers in the past.

@BradLarson, do you have an ETA on when docs/examples will be available for --custom-models?

We have a little more detail about the --custom-architectures flag (we renamed it to be a little broader) in the nightly changelog:

Added support for loading a custom pipeline architecture by module. Using --custom-architectures=folder/path/to/import:my_module will lead to loading architectures from the file. The architectures must be exposed via an ARCHITECTURES variable in the file. Once loaded, a model can be run using the new architectures. The flag can be specified multiple times to load more modules.
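
To give a rough sense of the shape, the module you point the flag at could look something like this (the import and class name are hypothetical; the changelog above only guarantees that an ARCHITECTURES variable must be exposed):

```python
# my_module.py — loaded via
#   --custom-architectures=folder/path/to/import:my_module

# Hypothetical import: however your architecture is defined internally.
from my_package.model import MyCustomArchitecture

# The serving system discovers architectures through this
# module-level variable.
ARCHITECTURES = [MyCustomArchitecture]
```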

We are working on some documentation for this, because we do want to start showing how to use this new functionality to incorporate your new architectures (like the Mamba one you’re working on) into MAX for serving, etc. I can’t promise when this will be available, because we’re first working to get the Mojo GPU libraries and kernels fully into open source. We very much want people to be able to contribute new and better kernels and models, so we’re investing in getting all the support in place for this over the next few weeks.

Makes sense! Looking forward to when those docs land, and of course to all those Mojo GPU kernels.
Many thanks to Modular for prioritizing open source and building this amazing stack.
