I’ve been thinking a bit about the Mojo roadmap, in particular the plan for Mojo to become a superset of Python in Phase 3.
While I understand the pragmatism of focusing on what is already in the language, and leaning into Mojo’s strength for high-performance GPU and CPU programming, I get the feeling the baby is being thrown out with the bathwater. There are some good reasons why continuing to advance Mojo as a Python superset in parallel with the roadmap (though perhaps not mentioned in marketing, to avoid misunderstandings of current capabilities) is worthwhile.
It creates the best curve for gradual complexity: being able to bring in Python code and have it simply work as Mojo is absolutely essential for ecosystem adoption.
Given that library interoperability is already here, the first pass at the problem could largely focus on reaching spec compliance for just the Python language.
Tests from CPython can be run against Mojo, leveraging its testing infrastructure to battle-harden the implementation. (I have experience doing this before for a Go compiler I wrote, and I would be willing to spearhead the effort.)
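A minimal sketch of what such a harness could look like (the `INTERPRETER` command and the idea of swapping in a Mojo build are assumptions; it defaults to CPython here so the sketch runs as-is):

```python
import subprocess
import sys

# Hypothetical: swap this for a Mojo interpreter command to replay
# CPython's behavior tests against it. Defaults to CPython so the
# sketch is runnable.
INTERPRETER = [sys.executable]

def conforms(snippet: str, expected_stdout: str) -> bool:
    """Run a snippet under INTERPRETER and compare its output to the
    behavior CPython documents."""
    result = subprocess.run(
        INTERPRETER + ["-c", snippet],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode == 0 and result.stdout.strip() == expected_stdout

# Each (snippet, expected) pair acts like one tiny conformance test.
print(conforms("print(sum(range(10)))", "45"))
```

The real version would drive CPython’s own `Lib/test` suite instead of inline snippets, but the comparison loop is the same idea.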
It forces the memory model to work, by default, without hints from the user about ownership. This removes a major burden from programmers who don’t come from a systems programming background, and addresses the problem head on rather than kicking it down the road. As a personal aside, this final point is also why I’m really excited about Mojo: being able to gradually introduce complexity so non-systems programmers can work in a systems programming language, plus GPU programming, a tight iteration loop, and Python interop from day 1, is spectacular.
I do understand that there are also some fair reasons against putting resources into this, such as the time needed to integrate dynamic types and object-oriented programming paradigms. At the same time, the dynamic features that make Python powerful would also improve Mojo in many use cases. It could also be done in parallel, minimizing the resources taken away from the current roadmap goals, so perhaps it’s worth re-examining?
Please let me know if I’m missing anything in my reasoning, and I welcome all discussions!
The only real option for Mojo to become a strict superset of Python is to write a statically linked Python interpreter (mojopy) and, optionally, one or more JIT compilers.
Modular was trying to do it in parallel, but people (me quite frequently, so you can blame me) kept finding soundness issues. For instance, how do you tell whether a Python object, or Mojo’s old object, implements a trait? Especially given that Python interop will continue to exist. If Mojo’s object and the Python object are the same type, does that mean you must take the GIL in every function that can accept object? There is also the issue of type-level programming and “pure Mojo” code, which may not want to implement __from_object__ on every single type, or may be unable to, as we found in many cases around using the type system to enforce invariants. For example, there are some Mojo types which cannot safely be wrapped in the kind of reference counting that Python uses, and as a result are incompatible with Python-style code.
Being a superset of another language is also quite hard, and I personally think it was too strong of a choice of wording. C++ is not a superset of C, and in fact the only relationships I know of where one language is a true superset of another is Typescript/Javascript and Bash → POSIX shell.
As a result of this, it was determined that making Mojo as good of a systems language as it can possibly be, then working on making it easier to use, would be better. Most of the dynamic things I can think of in Python can be built in Mojo with sufficient operator overloading, and I think we can get close enough for many purposes, but trying to have two extremely different dialects of the same language in one codebase is going to cause some conflicts. For example, things like “write a string to define a bunch of tensor operations using math” can be mostly unrolled at compile time in a manner similar to Evan’s blog post.
For the most part, if you are fine with using extra memory, just copying stuff around is perfectly fine and may still be faster than Python. Given that a lot of scientific code can mostly pass by reference and do most things up front, that resolves a lot of issues. There is work to make Mojo less onerous than Rust, but Mojo also needs to have thread safety, which means disallowing some thread-unsafe things Python may let you do with free-threaded Python. So, we are trying, but moving it back is a recognition that there are very difficult problems that will demand a lot of engineering time to solve. Those problems are things that demand other work be done first, which is why Mojo needs to become a generally useful language before it can take on looking more like Python.
I appreciate the thoughtfulness of your post. I will say upfront that seeing your replies on Discord about Mojo never being a superset pushed me to want to write this, and I’m very glad to read more details about the soundness problems with the idea.
Keep in mind I am coming at this with fresh eyes, so bear with me if I ask a lot of questions.
For instance, how do you tell whether a Python object, or Mojo’s old object, implements a trait? Especially given that Python interop will continue to exist.
What is Mojo’s old object? Is it the prior implementation of PythonObject? My naive thinking is that, worst case, trait information could be passed alongside the object if it can’t be deduced at compile time.
If Mojo’s object and the Python object are the same type, does that mean you must take the GIL in every function that can accept object?
Can you expand on this? I don’t understand why the GIL would need to be invoked within a Mojo file (besides interop with a Python package).
there are some Mojo types which cannot safely be wrapped in the kind of reference counting that Python uses, and as a result are incompatible with Python-style code.
This is with regard to using Mojo code in Python? I don’t doubt there are types that meet this criterion, but I would be curious to read examples.
If this is true, it would make Mojo as a superset valuable, because then only Python → Mojo allows full compatibility.
There is work to make Mojo less onerous than Rust, but Mojo also needs to have thread safety, which means disallowing some thread-unsafe things Python may let you do with free-threaded Python.
I would imagine a very limited amount of Python code relies on this, given it has to be opted into. It is a fair point to bring up though, especially given it will eventually be the default. It might make sense to only support Python up to the last version with the GIL as the default, to remove another layer of non-thread-safety headaches.
So, we are trying, but moving it back is a recognition that there are very difficult problems that will demand a lot of engineering time to solve. Those problems are things that demand other work be done first, which is why Mojo needs to become a generally useful language before it can take on looking more like Python.
Fair point. Perhaps there is a middle-ground approach: take the pragmatic point of view that Mojo may well never become a complete superset. At the same time, make Mojo able to cover more of the Python language as long as doing so pulls minimal engineering resources away from the current trajectory and improves the general utility of the language as a whole.
From a value standpoint, even though it’s a large mountain to climb, having Python code run by the Mojo compiler, and allowing a dev to incrementally switch out parts with Mojo-only variants, is such a good sell. It could support a well-documented subset of Python and would, for all intents and purposes, have the same draw.
I get that having two dialects inside one compiler could be a nightmare. However, I could see it not being so bad if Mojo still thinks of itself as a superset of Python and never gets too far from the Pythonic family; then the features from Mojo can hopefully feel like a gradual and natural extension of Python. If instead Python is treated as a distant cousin, Mojo designs most things on its own, and then tries to reconnect later, that seems likely to turn out worse than the former.
Mojo used to have a “writing types optional” way to write functions with def, so anything that came in would be the object type. It was separate from PythonObject because it didn’t support things like deleting fields from the object.
Also, to further expand on the trait problems: you can’t move or copy a value unless it implements a trait. That causes problems.
Can you expand on this? I don’t understand why the GIL would need to be invoked within a Mojo file.
Interop with a Python package is a big one, but you can also write Python extensions in Mojo. That means Python code can just hand you an object that, when you try to do a dot access on it, runs arbitrary logic. Since it’s logic that involves touching Python code, you need to have the GIL taken in case that version of Python isn’t nogil-safe or a thread-unsafe C extension is included in the program.
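To make the “dot access runs arbitrary logic” point concrete, here is a plain-Python sketch (the class is made up): an embedder has to assume any attribute read may execute arbitrary interpreter code, which is why the GIL must be held around it.

```python
class LazyProxy:
    """Illustrative only: attribute access runs arbitrary Python logic."""

    def __getattr__(self, name):
        # This could just as easily call into a C extension or mutate
        # global interpreter state -- the caller can't know statically.
        return f"computed:{name}"

obj = LazyProxy()
print(obj.anything)  # the "field read" actually executed code
```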
This is with regard to using Mojo code in Python? I don’t doubt there are types that meet this criterion, but I would be curious to read examples.
If you borrow a string slice, that string slice needs to be dropped before the backing string is destroyed. If you move the slice into something refcounted, that might not happen any more, and you can get a use-after-free.
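CPython itself has a runtime version of this rule: a `memoryview` is effectively a borrow of a `bytearray`’s storage, and resizing while the borrow is live is refused at runtime. A borrow moved into a refcounted object would escape even that check. A runnable illustration:

```python
buf = bytearray(b"hello world")
view = memoryview(buf)  # a runtime-checked "borrow" of buf's storage

# Resizing the buffer while a view exists would leave the view dangling,
# so CPython refuses at runtime rather than risk a use-after-free.
try:
    buf.clear()
except BufferError as err:
    print("resize blocked:", err)

view.release()  # drop the borrow...
buf.clear()     # ...now resizing is fine
```

Mojo checks the same property statically, which is exactly what breaks once the slice’s lifetime is handed to a reference counter.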
I would imagine a very limited amount of Python code relies on this, given it has to be opted into. It is a fair point to bring up though, especially given it will eventually be the default. It might make sense to only support Python up to the last version with the GIL as the default, to remove another layer of non-thread-safety headaches.
A limited amount of code relies on that now, but it is expected that Python will start to gain more threading as part of a general push for performance.
From a value standpoint, even though it’s a large mountain to climb, having Python code run by the Mojo compiler, and allowing a dev to incrementally switch out parts with Mojo-only variants, is such a good sell.
Right now, Mojo needs to grow until it can express most of the stuff that’s in Python. Since Mojo is a systems language, it’s far better to work on Mojo independently and then build up things that look more like Python later, since staying too close to Python creates performance issues.
In my opinion, developing Mojo means developing the Python compatibility as well, since a lot of Mojo’s syntax and semantics are influenced, and continue to be influenced, by Python. So, in that way, the compatibility is being developed in parallel; it just won’t get to a stage where you can confidently say “use Mojo rather than Python” until classes are added at least, and that won’t happen until Phase 3. Between now and then, I expect we’ll see a lot of features whose syntax and semantics are influenced by Python.
@owenhilyard I appreciate you running through all of the points; it was very insightful. There are a lot of technical problems to overcome, and I see more clearly why focusing on Mojo independently is the better path.
@melodyogonna That’s a fair take. Do you think Mojo should take steps on the testing side to be more confident in Mojo’s capability to run specification-compliant Python, or is Python’s influence on Mojo’s syntax, and their similarity, enough?
That’s a fair take. Do you think Mojo should take steps on the testing side to be more confident in Mojo’s capability to run specification-compliant Python, or is Python’s influence on Mojo’s syntax, and their similarity, enough?
Python has no specification, it has a language reference which is vague on a lot of details.
To maintain compatibility, CPython runs the unit tests of popular libraries against new Python versions; is this what you mean? I don’t think any existing large Python library would work in Mojo without changes, since Mojo introduces a lot of new keywords.
Thank you for writing such a thoughtful post. As you mention, it is still a goal for Mojo to grow into an effective superset of Python; it will just take time. [1] Many of the design decisions we are making keep this goal in mind; we are just not prioritizing work to enable it.
Upthread, the old object type is brought up as an example - this is a great example of the “bad thing” that was happening before we made this decision to focus. Several people (myself included) wanted to add Pythonic support, but the core type system wasn’t ready yet. As a consequence, things like object were implemented in entirely the wrong way. This led to an unending stream of bug reports and things we just couldn’t fix, because we didn’t have the infrastructure to do so.
It is much better for us to focus on building out the core type system, get it settled, then build on top of it. This approach allows us to avoid creating language tech debt and exposing a false promise of compatibility far ahead of when it will actually work well.
-Chris
[1] FWIW, I think that people overindex on what “superset” means. C++ is considered a superset of C even though it isn’t literally source compatible. I believe that Mojo will end up with the same relationship to Python in the fullness of time.
I don’t think supporting all Python features in .mojo files is the solution. This is not possible because many of Python’s features, such as eval, are highly dynamic.
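For example, `eval` executes source text that only exists at runtime, which no AOT compiler can see ahead of time:

```python
# The source text is data, not code the compiler ever saw.
code = "x * 2 + offset"
env = {"x": 20, "offset": 2}
print(eval(code, {"__builtins__": {}}, env))  # 42
```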
However, it is worth considering importing existing .py or .mojopy files while achieving near-native execution speed.
This could be made possible through AOT compilation combined with a highly configurable JIT that would allow the removal of most of the JIT overhead.
The idea is to use existing Python type annotations, such as .pyi and .mojopyi, to generate code. Optionally, strict typing could be enforced with a decorator like @strict_type.
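`@strict_type` is hypothetical; one way such a decorator could work is to enforce the annotations at call time, so a compiler or JIT could trust them. A sketch under that assumption:

```python
import functools
import inspect

def strict_type(fn):
    """Hypothetical decorator (name from the proposal above): enforce
    simple class annotations at call time. Illustrative only -- it
    ignores generics and the return annotation."""
    sig = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            ann = fn.__annotations__.get(name)
            if isinstance(ann, type) and not isinstance(value, ann):
                raise TypeError(
                    f"{name} must be {ann.__name__}, got {type(value).__name__}"
                )
        return fn(*args, **kwargs)
    return wrapper

@strict_type
def scale(x: int, factor: int) -> int:
    return x * factor

print(scale(6, 7))  # 42
```

With a guarantee like this, a backend could specialize `scale` for machine integers instead of collecting types at runtime.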
Motivation:
Compile statically with the latest Python version and ship immediately: -f python-static.
Inlining across .mojo and .py files and removing the dynamic library overhead.
Support advanced performance features in .py files, such as @always_inline, @jit
Statically selecting the optimal JIT (@jit(2)), which removes type collection and deoptimization overhead.
I think Mojo is uniquely positioned to solve Python’s problems.
Mojopy does not have to be fully compatible; 95% compatibility is sufficient because these features significantly evolve the Python ecosystem.
I fully concur. I think of Phase 3 as the decorative phase of adding nuts/sprinkles after the cake is baked with great care. Nice to have, but the cake is perhaps more delicious without it!
I have come to think that “superset” becomes a slightly detracting term in this case. I would like to think of them as different registers of a language family. MicroPython gets in there at its own unique level. What matters to Mojo is Python interop - like Swift has with Objective-C, and for Mojo to be able to seamlessly extend Python and make itself feel very Pythonic, and for Python to actually be the “dynamic” glue to Mojo instead of Mojo having to become more Python-like to be a full “superset”.
The “dynamic” utility lies predominantly on the side of server side scripting. Python already does that. I think of Mojo as an out & out systems programming language that eventually scales from SBCs to hyperscalers.
PyPy is moving along, and CPython is also taking JIT seriously now, and so I think the real benefits of Mojo lie in its completeness by the end of Phase 2. When it has seamless Python interop, and first class C/C++ interop without bindings, it is already a language and platform that could be the most capable and enjoyable to use from modern smartphones to heterogeneous accelerators to hyperscalers, and perhaps even an “Embedded Mojo” that can go bare metal on tiny devices where MicroPython can’t. Mojo doesn’t need to YAP along (Yet Another Python) for its community libraries to become super useful everywhere.
The JIT would still be sophisticated and offer powerful optimization.
However, I would generally like the JIT to be predictable and configurable, and to speculate less and to deoptimize less.
Optimization should be conservative. The JIT should not penalize typed workloads because of the architectural limitations that come with optimizing untyped workloads.
This would particularly affect the garbage collector and memory usage.
This post is quite old, but I still think it’s quite accurate.
It’s the only large JIT where I haven’t heard anything about massive code debt, the removal of JITs, and the later reintroduction of an improved, more maintainable version. A similar design was later adopted by V8 as well. LLVM was replaced by a self-hosted optimizer, but the architecture is still the same (Introducing the B3 JIT Compiler | WebKit).
They have the same problem as PythonJIT: the language does not offer performance-related features, such as strict types and performance-related decorators, which would make JITs in general much easier and less demanding.
They are essentially hard-coding heuristics optimized for an average workload, such as tiering up when a call counter in the compiler reaches 100,000.
If your workload does not match the average, e.g., if your workload is not numeric but has a lot of branches, then you won’t benefit from the JIT.
Programmers know the expected workload and data, but until the JIT figures it out, millions of cycles may have passed.
In this example, I’m making some random assumptions; it’s just about the principle.
Let’s assume that reaching stage 4 requires:
Execution until the stage 4 tier-up triggers takes 10,000,000 cycles.
Optimizing straight to stage 4 would take 1,500,000 cycles.
With the overhead of the JIT and all previous stages, reaching stage 4 takes 15,000,000 cycles in total.
The type is not known at compile time, but it is always the same at runtime.
The JIT does not know if the types are consistent or how many times the function will be executed.
By the time the JIT knows, it could be too late. What if, after reaching stage 4, numeric_workload is never called again?
You could have a JIT optimized for this specific workload, but if the types changed in two invocations, this would immediately trigger a deoptimization.
JITs generally wait. This is because guessing wrong has a drastic penalty.
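The waiting behavior can be modeled in a few lines of Python (the class, threshold, and implementations are all made up for illustration, echoing the 100,000-call counter mentioned earlier):

```python
TIER_UP_THRESHOLD = 5  # real JITs use values like 100,000; small for demo

class TieredFunction:
    """Toy model of counter-based tiering: stay in the interpreter until a
    call counter proves the function hot, then switch to 'compiled' code."""

    def __init__(self, interp_impl, compiled_impl):
        self.calls = 0
        self.interp_impl = interp_impl
        self.compiled_impl = compiled_impl
        self.tier = "interpreter"

    def __call__(self, x):
        self.calls += 1
        if self.tier == "interpreter" and self.calls >= TIER_UP_THRESHOLD:
            self.tier = "compiled"  # tier-up only after the counter trips
        impl = self.compiled_impl if self.tier == "compiled" else self.interp_impl
        return impl(x)

square = TieredFunction(lambda x: x * x, lambda x: x * x)
for _ in range(10):
    square(3)
print(square.tier)  # "compiled" -- but only after paying interpreter cost
```

Every call before the threshold runs the slow path, which is exactly the warm-up cost a programmer with workload knowledge could skip.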
Understanding these dumpster fire C++ codebases is not necessary to understand the constraints of a JIT and the associated overhead.
The physical constraints of a CPU
A CPU with six execution units can execute a maximum of six instructions per cycle (IPC). For a real workload, it will be much lower because not everything is a scalar instruction, and there are caches, etc.
The basic math is as follows: If you interpret or use a JIT with unoptimized code, and you execute 50 times more instructions, the performance is probably 50 times lower.
You could argue that the time it takes to optimize is not relevant since the jitting can be done on a background thread. However, you don’t have unlimited CPU time, so it’s not free, and it does not work for multithreaded workloads.
JIT FORMULA:
For each phase, tiering up only pays off when cycles_to_optimize < cycles_to_execute (the cycles the code will still spend running); otherwise the JIT loses.
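Plugging in the assumed numbers from the stage-4 example above:

```python
# Assumed numbers from the stage-4 example above -- about the principle,
# not real measurements.
direct_to_stage4 = 1_500_000    # cycles to optimize straight to stage 4
tiered_to_stage4 = 15_000_000   # JIT overhead + all previous stages

# A programmer who already knows the workload could skip the warm-up
# entirely: a 10x difference before any optimized code even runs.
print(tiered_to_stage4 // direct_to_stage4)  # 10
```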
In general, the Python JITs all have different constraints.
Faster CPython is not doing well.
Until Python 3.14, they hadn’t significantly sped up CPython.
Microsoft did not see a great Return On Investment (ROI) in spending x millions of dollars to make CPython f(x)% faster.
PyPy is still on version 3.11.
They cannot use types to make the JIT do less work.
They will never be able to match a JIT that gives programmers control over the JIT.
PyPy is a tracing JIT.
A function-level JIT is generally more predictable and matches the Mojo optimization model.
Why Swift went a different route, and why we need an Even Better Python (EBPY)
In general: never settle if the solution is not perfect.
Unlike Apple, Modular is a company with limited resources. Apple could afford to rewrite its entire ecosystem in Swift.
Modular may consider a more scalable approach than rewriting their entire ecosystem from Python to Mojo.
First, why import Python?
It’s slow, makes the binary less portable, and regularly breaks. Thus, you only do it if you have to and there is no other option.
Was there something worth preserving about Objective-C?
In general, Apple restricts the use of JITs in its ecosystem.
Python’s syntax is fine. Technically, Python is not impressive; they get away with breaking an interpreted language on a regular basis, yet people still like it.
Python has a lot of adoption and a massive existing ecosystem. There are many arguments for preserving Python.
The Python standard library is not the highest quality, but it’s still better than a mix of third-party packages, which gives us the opportunity to say that Python breakages are really their problem, not ours.
Mojo already has a perfectly serviceable JIT coming out of using ORC JIT with the already existing Mojo compiler. We can already hit the maximum performance of the hardware with Mojo (given enough developer effort), so an optimizing JIT likely won’t help Mojo performance.
I’m personally of the opinion that if your hot loop is in Python you’re doing it wrong, and that Python exists to set up computations and then wake up native code to go do all of the actual work. Making Python code much faster is very difficult due to all of the things it lets you do, and an implementation of Python aside from CPython is likely to never gain traction.
I believe Steve Klabnik (involved with Rust at the time) once said something along the lines of:
People disagree on things because they prioritise things in different orders.
The idea is that someone working in Python prioritises speed-of-development, while someone working in Rust prioritises correctness-of-development.
A focus on speed-of-development does not mean that the Python developer does not care about correctness - it just means that the Python developer prioritises speed-of-development above proving correctness of the code.
This is good in the right context.
A focus on correctness-of-development does not mean that the Rust dev does not care about speed-of-development - it just means that the correctness of the code is prioritised above speed-of-development.
This is good in the right context.
Why should the world need more than one programming language?
Because different people prioritise things in different orders.
So, they will use a language that prioritises the things they care about.
Different problems need to be solved in different ways. Sometimes you need performance, then use Rust, for example. Sometimes you need code-adaptability, then use Ruby.
Sometimes you need to write a website-frontend, then use Javascript/HTML.
No perfect programming language exists, in the same way that no perfect spoken language exists. Each programming language is a series of different trade-offs.
@leb-kuchen The reason JS can’t really use LLVM is because it needs to put LLVM in a hot loop, which is a bad idea. Mojo has no need to do that because it can generate C-speed code on the first pass without any guessing, so you just have a startup penalty and then it’s fine.
I agree with @monte, searching for a language which makes everyone happy is a fool’s errand. You want to put the people who wish to write Conway’s Game of Life like this:
And the people who will cause a nuclear reactor meltdown if they do anything wrong, and the people who just want the turtle to move around the screen, and the people who need speed of light performance from the hardware, all under one language?
Yes, AOT compiling Python into a single Mojo executable is undeniably a massive technical undertaking. However, the market value of solving this deployment pain justifies the effort entirely.
There are two main reasons why this is worth pursuing right now:
The Multi-Million Dollar Precedents: The market appetite for this level of tooling is staggering. Modular itself recently raised $250M (bringing its total funding to $380M) to build AI’s unified compute layer. Looking at the JavaScript ecosystem, Anthropic just acquired Bun for a rumoured maximum of $300M. A massive driver of that valuation is Bun’s ability to compile apps into single, dependency-free executables. As AI agents write and execute more code, they require fast, predictable, self-contained environments. Solving Python’s deployment friction is a massive value multiplier.
The AI Era: A transpiler project of this scale would historically take a decade of human brute force. Today, with frontier AI coding agents to assist with the endless edge cases and optimization rules, building this is absolutely feasible.
Mojo has the opportunity to be the “Bun of Python” but with systems-level performance. The engineering mountain is high, but the payoff is monumental.
The soundness issues you refer to stem from the GIL and typed variants.
The unsoundness of trying to combine Python’s dynamic style with Mojo’s statically typed ecosystem made Mojo more like a superset of Rust than of Python.
True or false or perhaps partially true…