Over 10,000 organizations rely on BentoML to ship models to production, including 50+ Fortune 500 companies. We’re bringing together BentoML’s cloud deployment platform with MAX and Mojo’s hardware optimization.
BentoML remains open source (Apache 2.0), and we’re doubling down on our open source commitment, with much more coming later this year.
What this means for production AI:
Code once and run on NVIDIA, AMD, or next-gen accelerators without rebuilding
Optimization and serving in one workflow
Deploy on your own infrastructure with modern performance
Fewer layers, better results
We’re delighted to welcome Chaoyu, Sean, and the entire BentoML team. Together, we’re building the complete AI infrastructure stack.
Join us on February 17th at 9:30-11:30am PT for an Ask Us Anything with Chris Lattner and BentoML Founder Chaoyu Yang here in the Modular forum. We’ll answer questions and share more about our plans! Feel free to share your questions now as a reply to this post.
I would like to know if BentoML can support NPUs, because I want to move past standard GPUs toward the future of AI, which lies in neural engines.
I ask because I heard BentoML supports cloud deployment at scale.
Complementary tech stack: Modular focuses on the optimization layer with Mojo for high-performance development and MAX for hardware-aware model optimization, while BentoML specializes in production deployment and serving infrastructure. Together, they cover the complete journey from model optimization to production deployment.
Shared commitment to open source: Both companies are built on open source foundations: BentoML operates under Apache 2.0, and Modular has deep commitments to open source, with plans to fully open source Mojo by the end of 2026. Open source is in the DNA of both companies.
In terms of timing, we have big plans, but they will take some time to roll out - stay tuned for more later this year.
We’ll share the general goal though - GenAI deployments are always getting more complex: you need full-system optimization, access to heterogeneous hardware, control over application-level optimizations, and all the fancy stuff Modular has been building for years. However, managing cloud resources and integrating with your VPC and other systems is a real production deployment pain. BentoML built an incredible platform for managing all of this, one that makes that pain dissolve away.
We’ve been working together now for some time, but this change allows us to vertically integrate even further. We’re very excited about that!
Hey Trojan! I don’t expect BentoML to affect NPU or other hardware support in Mojo. We’re eager to continue expanding our support there - e.g. check out the expanded support for AMD consumer RDNA GPUs that landed last night! BentoML’s technology is cloud-focused, not edge-focused.
Mammoth is a key part of our technology stack for large-scale distributed inference; it already works great within BentoML, and we’ll be tightening the integration. Our vision is to build a unified Modular Cloud product that makes it super simple to scale and deploy your workloads with full control and performance.
Our commitment to open source does not change. BentoML will remain Apache 2.0. We’ll continue to ship new features and support the community, at the same pace if not faster.
Even with Mojo kernels, an RDNA GPU is still bound by the PCIe bus bottleneck and the overhead of batching packets to keep the compute units busy. In my tests, trying to push raw sockets through a GPGPU architecture ruins the deterministic jitter required for wire-speed capture.
If BentoML is optimized for RDNA’s batch-heavy throughput, how will Modular handle the low-latency, single-packet-deterministic path that could perhaps serve as a replacement for NPUs?
Yes, BentoCloud will continue to be fully supported; nothing will change for our customers.
Joining Modular means we can be even more ambitious with our mission. We’re taking the multi-cloud infrastructure foundation we built in BentoCloud and using it as the basis for Modular Cloud. These two projects will co-evolve, and over time we’ll create a unified offering that gives you the best of both worlds.
What this means is that BentoCloud customers will benefit from continued feature development. You’ll see new capabilities, performance optimizations, and deeper integration with Modular’s inference stack. Expect development to accelerate significantly.
More portable: One stack for NVIDIA, AMD, CPU, or any accelerator
Think of it this way - Mojo/MAX is the high-performance runtime engine that powers inference and executes your models, while BentoML provides deployment orchestration: serving multiple models, handling business logic, and simplifying the deployment operations around inference.
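As a rough illustration of that division of labor, here is a minimal sketch in plain Python. The class names and methods are invented for this example - they are not real MAX or BentoML APIs - but they show how an optimized runtime engine (the MAX role) stays focused on executing a model, while a serving layer (the BentoML role) handles routing across multiple models and business logic:

```python
# Conceptual sketch only - these classes are hypothetical stand-ins,
# not real MAX or BentoML interfaces.

class OptimizedEngine:
    """Stands in for the runtime layer: executes one optimized
    model as fast as possible, nothing else."""

    def __init__(self, model_fn):
        self.model_fn = model_fn

    def run(self, batch):
        # In a real engine this would be hardware-optimized execution.
        return [self.model_fn(x) for x in batch]


class ServingLayer:
    """Stands in for the orchestration layer: serves multiple
    models, applies business logic, and delegates inference."""

    def __init__(self, engines):
        self.engines = engines  # several models behind one service

    def handle(self, model_name, batch):
        # Business logic (validation, routing) lives here,
        # not inside the inference engine.
        if model_name not in self.engines:
            raise KeyError(f"unknown model: {model_name}")
        return self.engines[model_name].run(batch)


# Usage: two "models" deployed behind one serving layer.
doubler = OptimizedEngine(lambda x: 2 * x)
squarer = OptimizedEngine(lambda x: x * x)
service = ServingLayer({"double": doubler, "square": squarer})
print(service.handle("double", [1, 2, 3]))  # [2, 4, 6]
```

The point of the separation is the same one made above: each layer can be improved independently, and integrating them deliberately avoids the glue code you would otherwise write between an optimizer and a serving framework.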
The teams behind Mojo/MAX and BentoML are now working together towards this vision. We’re actively exploring how to make this integration seamless, and it’s one of the most exciting parts of joining forces.
We can’t share all the details just yet, but you can expect increased support for Mojo and MAX in BentoML! The team is working on the integration roadmap right now, and we’ll be sharing more updates as we make progress.
Joining forces allows us to put more resources behind our cloud roadmap and ship faster. If you’re building with Mojo and MAX, you’ll get a much cleaner path from optimization straight to production. Take a model you’ve optimized with MAX and deploy it through BentoML’s infrastructure that’s already been battle-tested by thousands of companies. The real win here is that we can actually design the optimization and serving layers to work together instead of duct-taping them together after the fact. So the performance you’re getting from MAX will make it into production without the usual integration pain.