I don’t really think there is a difference between AI, human, or mixed code generation. I think the main problem is the rate of change. The standard library is arguably the most important artifact in a language after the language itself: almost everyone depends on it, and it acts as a guide for further development in the community.
So I think the main criterion for code contributions is that they should be deliberate. The rate should allow not just first-pass review but also user evaluation and feedback before moving forward with even more features that depend on the newly introduced ones. This friction should prevent fragmentation of the stdlib, with different parts moving in different directions. Scoping, design discussion, user feedback, and re-evaluation take time. A high rate of additions just does not work with this approach.
I totally agree: ownership, commitment, and value are what’s important. This is why I would like to dissect the examples you brought up.
Let me walk you through the process of Hasher contribution.
- I wrote a proposal https://github.com/modular/modular/pull/2250 laying out the reasoning; there was a discussion with multiple members of the community, and the proposal was accepted
- I implemented a local POC to check if it was feasible
- Then followed a series of PRs from May 11, 2024 till Sep 5, 2025. With every PR I tried to minimize public API surface changes, although in the end the public API changed dramatically
- [stdlib] Introduce Hasher type with all necessary changes by mzaks · Pull Request #2619 · modular/modular · GitHub
- https://github.com/modular/modular/pull/3476
- [stdlib] Adopt AHasher to Hasher trait by mzaks · Pull Request #3604 · modular/modular · GitHub
- [stdlib] Implement _HashableWithHasher trait on all types which implement __hash__ method by mzaks · Pull Request #3615 · modular/modular · GitHub
- [stdlib] Switch to hasher based hashing by mzaks · Pull Request #3701 · modular/modular · GitHub
- https://github.com/modular/modular/pull/4769
- [stdlib] Moving DJBX33A from hash to it's own module by mzaks · Pull Request #4841 · modular/modular · GitHub
- https://github.com/modular/modular/pull/4863
- [stdlib] Adds hasher parameter to Dict by mzaks · Pull Request #4922 · modular/modular · GitHub
- [stdlib] Introduce default hasher type as a workaround by mzaks · Pull Request #5061 · modular/modular · GitHub
- [stdlib] Remove the DJBX33a hasher to address #5270 by mzaks · Pull Request #5275 · modular/modular · GitHub
- And we are still not done: on Feb 5, 2026 I opened an issue to discuss further improvements and follow up on the initial proposal, which is now possible to implement thanks to compiler improvements: [Feature Request] [stdlib] Further generalization of the Hasher trait · Issue #5907 · modular/modular · GitHub. Mind you, I did not create a PR; adding code is easy (with or without AI), maintaining it is hard. Although I think this addition will be beneficial for users of the Mojo standard library, I wanted to get feedback (and buy-in) from the Modular team first. If you have a look at the history/blame of the hasher.mojo file, which contains only one trait definition (Blaming modular/mojo/stdlib/std/hashlib/hasher.mojo at main · modular/modular · GitHub), you can see that the maintainers needed to touch this very simple code multiple times since it was contributed
So here is the thing: as external contributors we have ownership until the feature lands in the repo; after that we are off the hook. We can still feel responsible and supply further contributions (I guess this falls under commitment), but we are not to “blame“ when something does not work. So commitment is voluntary. What stays is value. And here is where things get pretty interesting. Open source contribution has an implicit gamification feature: the more commits you perform, the higher your contributor score. This kind of made sense before AI, as being able to break down a feature into multiple small commits is a good skill. With AI agents, though, I can game the system. I can rise to the top of the contributor leaderboard by making the agent create tons of small PRs which do not necessarily create value for the project.
I am very sorry, but in order to drive the point home, I will have to take the String.capitalize() PR you mentioned as an example: [Stdlib] Add capitalize() and title() to String and StringSlice · Issue #6177 · modular/modular · GitHub
Adding capitalize() and title() methods to String and StringSlice increases the public API surface of these types, and hence automatically increases the maintenance burden. Mojo’s String, like Python’s str, supports the Unicode standard. The Unicode standard has a complex and annually evolving rule set for capitalization, which is, by the way, slightly different from the uppercasing rule set. The PR above implements capitalize and title for ASCII only, meaning that a user who uses the function in this state on a non-ASCII character set will not get the expected result and will file a bug, like I did on Dec 22, 2023 for toupper and tolower: [Feature Request] Support String.toupper and .tolower for non ASCII chars which have upper lower cases · Issue #1543 · modular/modular · GitHub. In order to implement this functionality correctly, we would need to introduce a big PR, similar to what I did to fix toupper and tolower: [stdlib] Add full unicode support for character casing functions by mzaks · Pull Request #3496 · modular/modular · GitHub. But there is actually a general discussion, driven by @martinvuyk, about whether this approach and API are correct, and I generally agree with him. So my PR, which was a fix, can be considered somewhat of a poison pill.
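To make the capitalization-vs-uppercasing distinction concrete, here is a small illustration in Python (chosen since Mojo’s String aims for Python-compatible semantics); the specific code points are standard Unicode examples, not taken from the PR itself:

```python
# Unicode defines a separate *titlecase* mapping, so capitalize()/title()
# are not simply upper() applied to the first character. An ASCII-only
# implementation silently misses all of these cases.

# German sharp s: uppercasing changes the string length.
assert "ß".upper() == "SS"
assert "ß".capitalize() == "Ss"   # titlecase form, not "SS"

# The DZ digraph has three distinct forms: lower ǳ, title ǲ, upper Ǳ.
assert "ǳ".upper() == "Ǳ"    # U+01F1 LATIN CAPITAL LETTER DZ
assert "ǳ".title() == "ǲ"    # U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z

print("titlecase and uppercase diverge")
```

An ASCII-only implementation returns the input unchanged for every one of these, which is exactly the kind of bug report described above.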
What I am trying to say with this example is: code is a liability, and a valuable contribution is one which maximizes functional value with as small an amount of liability as possible. The gamification of open source contributions paired with AI tools can have a net negative effect, as we tend to spend less time on the feature and overlook its complexity and liability implications.
I don’t mind referring to the capitalize() and title() examples, and I agree with your point. But I don’t think this is really about how much AI we allow in contributions.
For example, I wasn’t aware of the Dec 22, 2023 discussion around toupper/tolower. So the realistic alternatives would have been:
a) I spend ~1 hour implementing capitalize()/title() manually, with a bunch of tests, only to have the PR rejected because the constraints weren’t obvious to me.
b) I spend ~5 minutes asking AI to implement it, review the draft PR, and move it to ready for review, with the same rejection outcome.
In both cases, the maintainer still has to review the PR. Having a human in the loop doesn’t remove that burden.
What could actually help is shifting that effort earlier in the process. For example, we could have an AI workflow (or skill) that checks existing discussions and feasibility before implementation. That kind of pre-check could save contributors hours of wasted work and prevent frustration, especially for newcomers who lack context. This is also an automatic tool that could be included in the maintainers’ toolset, via GHA or the same Claude Code skill they could use before the manual review.
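A rough sketch of what such a pre-check could look like. This is purely hypothetical: a real tool would fetch issue titles from the GitHub API and likely use an LLM or embeddings rather than this toy keyword score; the titles below are hard-coded from this thread for illustration.

```python
import re

# Toy pre-check: rank existing issue titles by keyword overlap with a
# proposed feature, so prior discussions surface before any code is written.

STOP = {"a", "an", "and", "the", "for", "to", "of", "in", "on"}

def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words, minus a few stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP}

def overlap(proposal: str, title: str) -> float:
    """Jaccard similarity between the two token sets."""
    p, t = tokens(proposal), tokens(title)
    return len(p & t) / len(p | t) if p and t else 0.0

def related_issues(proposal: str, titles: list[str], threshold: float = 0.05) -> list[str]:
    scored = sorted(((overlap(proposal, t), t) for t in titles), reverse=True)
    return [t for score, t in scored if score >= threshold]

open_issues = [
    "Support String.toupper and .tolower for non ASCII chars",
    "Add full unicode support for character casing functions",
    "Adds hasher parameter to Dict",
]
hits = related_issues("Add capitalize() and title() to String and StringSlice", open_issues)
print(hits)  # the two casing-related issues surface; the Dict issue is filtered out
```

Even a filter this crude would have surfaced the 2023 toupper/tolower discussion before the hour (or five minutes) of implementation work was spent.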
In that sense, leveraging AI and improving the AI-driven contributor workflow would likely reduce review burden, not increase it. Whenever the maintainers detect something that could have been done differently, they can just improve the instructions in Claude.md / Agents.md and/or the skills.
I think there’s a lot of room for improvement there. E.g., the multiple Claude.md files explain how to write Mojo/Max/Kernels code, but they don’t guide the agent on how to approach feature development or do proper research before contributing.
We should also think about how companies like Anthropic handle this. They claim that Claude writes 100% of Claude Code — so shouldn’t they face the same review burden?
How do they solve it?
My guess is that they rely on multiple layers of automation, including on the review side.
While the initial issue brought up was PRs, I do think the point above addresses a core issue: how does someone who would like to give up their time and effort for free get up to speed, so that their contribution becomes literally and figuratively a contribution? What I’m hearing is that there is friction, e.g. design discussions in Discord, proposals on GitHub, or a forum thread, that can frustrate the contributor. How can AI facilitate reducing this friction? As you say, other people are doing it. Framed this way, it seems like a problem that has always existed, though now there actually appears to be a way to solve it.
It is a hypothetical, but I think if you were to implement capitalize() by hand, the first thing you would do is check the code of upper() and lower() to get a feeling for how things are implemented. Then you would directly see that it is implemented in the _unicode module and that there is a rabbit hole behind those functions. But as you rely on the agent to build it, and you trust the agent to do the exploration, and the generated code looks generally legit, you do not spend additional time being paranoid. You are an experienced developer, standard library contributor, and user of AI tools, and still this happened. How often will it happen to less experienced developers?
To be fair, Copilot caught the general faux pas with regard to ASCII-only handling; however, it definitely will not catch the intricacy of capitalized vs. uppercase in the Unicode standard. I just had a discussion with Claude Code about this problem https://claude.ai/share/18edb587-21a7-49f4-b26e-59b7d9d03126 and I had to interrogate it with multiple questions to get a reasonably full picture, which I assume is still partially incorrect.
Regarding Anthropic claiming to write 100% of Claude Code with itself: I would love to hear the insights and workflows from an actual engineer, not the marketing department. And anyway, Anthropic using Claude Code for Claude Code at 100% is equivalent to the Modular standard library team writing standard library code, not to an external contributor creating PRs for the team to review and maintain. There is this factor of ownership and commitment that we as external contributors don’t have.
I have ideas for PRs I’ve been holding off because we first need to communicate and have people agree before pushing for things (this was a bit hard for me to learn since I’ve always been a solo developer). Opening many PRs without previous discussion (the only IMO valid exception is a proof of concept draft PR for something that is very complex/hard to picture) is a fast way to overwhelm reviewers and negatively impact the overall project advancement.
To me this whole issue is more about quality control, and who has to invest the time and energy into it. I could dump 5 PRs in 20 minutes, no LLM assistance needed; but what quality of code would that be? What if I introduced a bunch of bugs and inefficiencies?
I use LLM tools at my job daily, and dumping a ticket description into one and handing off the brainless bits is a great time-saver. But I still have to carefully review every line it produces. For my Mojo contributions I like to use my own brain and invent stuff. The other difference is that quality matters a lot more in a standard library than in any of my clients’ websites or SQL queries.
I personally would not try to lower the barrier to entry for LLM contributions outside of aesthetic or design-pattern-oriented decisions; I would keep it out of performance-sensitive changes and any new additions. I can’t count the number of times I’ve corrected SQL code that is just moronic in structure (because there are so many bad queries out there that it trained on). Any LLM-generated code needs heavy review by the PR author.
There is another issue with lowering the barrier to opening PRs which I don’t think anybody has mentioned before: human feelings. Nobody wants to outright reject a PR and make someone else feel bad, even if the PR’s contents are technically not what the maintainers want. I’m a pretty blunt person and can take criticism or rejection of ideas, but US culture, and especially Modular’s culture, is about not hurting people’s feelings. Just look at how the internet reacts to Linus Torvalds’ takes. Every lowering of the barrier to entry and every new PR that doesn’t follow the guidelines is a PR that takes maintainers’ time in trying not to hurt the contributor’s feelings while educating them on the project’s standards.
I think a good use for LLMs in this case is adding a first filter review that is automatic and checks for things that are mentioned in the CONTRIBUTING.md, and we should probably also add a way to check against open issues or whatever issue is linked in the PR and make the LLM evaluate appropriateness. We could also add tags like “approved for work” or something like that. But of course there are some PRs that are simple and align with overall project priorities that should be accepted.
Making the LLM do the rejections will avoid anybody’s feelings getting hurt. Nobody is affected by a linter other than being frustrated, but having someone nit-pick you to hell over docstrings (I’m that guy) annoys everyone.
First of all, I probably opened too many PRs in a short time to explore how things work. I’m not an expert in low-level Unicode, most of my background is in high-level Python, and I don’t have much time to contribute, which can be frustrating. I do hope we can adopt Mojo at my company so my experience becomes more relevant here.
However, this issue would happen less often if we actually leverage AI to catch these issues.
If the Modular repo were much better prepared for an AI-driven workflow, like the one we use at my company (and you can trust me, I’m not from marketing), where engineers no longer hand-write any code, the AI could become even more “paranoid” than a human: checking existing implementations, surfacing prior discussions, and flagging inconsistencies before a PR is even opened.
But that only works if we go all-in on refining the AI workflow. If we fall back to manual processes every time something slips, we’ll never get there.
AI isn’t “not there yet”, it just needs the right context and iteration to work consistently well. And the only way to reach that point is by dogfooding and continuously improving the workflow.
If there is a consensus on improving the AI workflow rather than restricting it, I could help with my time.
Sorry, I missed this final paragraph. You could point to 2–3 of my PRs that were rejected, but there are many others that have been upstreamed in the last month, as you can see here: https://github.com/modular/modular/pulls?q=is%3Apr+author%3Amsaelices+label%3Aimported-internally
I wouldn’t have believed I could contribute all that by spending only a few human hours in total, while also learning a lot from AI throughout the process.
I strongly believe that fully embracing AI-driven workflows has a much more positive impact than restricting them.
I think this is important to highlight. LLMs are, largely, pretty bad at writing hot-loop code. There are so many examples of “good enough” loops that look very similar to constructs a standard library needs all over the place, and only a few examples of code written with the “speed of the hardware” in mind. I’ve also found that LLMs are generally more hesitant to use some of the powerful type system features in Mojo, such as linear types. To me, this invalidates a lot of cases of using LLMs for design work, since we need to build the stdlib as a body of examples of design patterns for linear types before LLMs can start to “get” them.
I agree; we need to consider extractive contributions, where the effort of shepherding an LLM-written PR (it seems to usually be LLM-written; I’ve never run into this issue with fully human code) is such that it would be more beneficial for the maintainer or contributor doing the shepherding to instead work on other things, be it reviewing other PRs, writing code themselves, or other work inside the project. Throwing that rule at a new contributor’s PR has a pretty high likelihood of driving them away, since it’s fairly heavy-handed. While not always explicitly written down, many open source projects have similar rules or standards under which a contribution can be rejected for not being worthwhile to fix, but in all the ones I have participated in, it is typically treated as a last resort. If we leave this as the only real mechanism for rejecting huge piles of code pushed out by LLMs after loosening the rules, I’m concerned it will be invoked too often.
I think Mojo also needs some classical static analysis tooling here. Part of my hope for my effect system proposals is that they would make it easy to specify that some things really shouldn’t be done in particular scopes. It’s very frustrating to deal with tooling that you cannot run without submitting changes to the PR, and which pings everyone connected to the PR every time you want to test a change. If we go the route of using LLMs as an actual gate for submissions, I think a mechanism to run code through them locally is necessary. This is especially true unless Modular decides to host open models for this purpose, or to use ones small enough that MAX can run them on a reasonable developer system (the LLM cannot be larger than the memory consumed by compiling the standard library), since many closed models can change without warning and thus cannot be used as deterministic tooling even when we can control the seed.
I’ve heard the “LLMs are bad at XYZ” argument many times from coworkers, and when I dug into the specific cases, it almost always ended up being a skill issue on the human side, mainly around how well the developer provided the right context to the model. To be clear, that “almost always” really only applies from late 2025 onwards, with Opus 4.5+ and newer models.
Getting good results requires investing time into building the right workflows: providing rich context, setting clear constraints, validating outputs properly, building benchmarks, and creating a feedback loop of continuous improvement (something like Karpathy’s autoresearch idea applied to your own workspace)
Maybe you’re right and there’s still some handicap when AI is coding Mojo compared to Rust or Python. But what if that’s not a fundamental limitation, and is instead just a reflection that the model still needs better context?
In my experience, once you start fixing those gaps, the results change quite dramatically. Modular’s skills are a great start, but they would need to instruct the agent about benchmarks, hot loops, self-improvement, test coverage, linting, etc. AFAIK all of that is still missing.
I think we often underestimate the handicap an LLM is operating under when we ask it to optimize or implement a feature.
Owen, imagine this scenario: it’s your first day at Modular. Someone wipes your memory (Men in Black style), so you don’t even know what Mojo is. You sit down, get access to the Modular repo, which you’ve never seen before, and the only thing you’re allowed to read is a single CLAUDE.md / AGENTS.md file. Then, within a couple of minutes, you’re asked to implement a feature that requires a hot loop optimization using linear types in Mojo. In two minutes the feature needs to be ready.
You’re an incredible programmer, but I’d be very surprised if you could pull that off under those conditions.
And yet, that’s basically the default setup for most AI agents today. You just type claude or opencode, and the model is dropped into that exact situation: minimal context, unfamiliar codebase, tight constraints, and in Mojo’s case, no idea of the programming language, yet expected to perform at a high level.
To me, what LLMs are achieving these days is not just what a great programmer could do, they’re operating at a superhuman level.
I have most of that set up, including a harness that’s derived from formal verification tooling (TLA+) which acts as a ground truth for externally visible behavior. As for context, I have a dataset that I convert into a RAG db for new models which has most of the information I personally reference when trying to write high performance code, as well as all of the notes from my undergrad and research. It’s enough information that I think I could reasonably expect a human to learn to write high performance code from it.
It’s possible it’s because I am mixing multiple areas which are poorly represented in the training data (network drivers, a custom network stack, large-scale distributed systems, very tight integration with relatively new hardware, and taking advantage of hardware capabilities that most OSes don’t actually expose), combined with a few other things that make the problem a bit more tricky, and that’s what throws LLMs off. However, even getting LLMs to do something simple like handle endianness properly is like pulling teeth at times, because most of the examples of the task in the training data are actually incorrect, and I have a feeling “network stack development” doesn’t see as much fine-tuning as ReactJS does from closed models. I’ve even gone to the extent of fine-tuning my own models which, at least subjectively, despite running on consumer hardware, seem to perform much better, although the brute-force approach I use tends to ruin the models’ ability to write JS and Python code.
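To illustrate the endianness pitfall with a minimal Python example (the packet bytes are made up for illustration, not taken from the discussion): parsing a 32-bit big-endian length field from a wire format.

```python
import struct

packet = b"\x00\x00\x01\x00"  # 32-bit length field, value 256, big-endian on the wire

# Portable: byte order is stated explicitly, independent of the host CPU.
length_portable = int.from_bytes(packet, "big")

# Equivalent explicit-shift form, common in C-style network code.
length_shifts = (packet[0] << 24) | (packet[1] << 16) | (packet[2] << 8) | packet[3]

# The frequent training-data mistake: reinterpreting the bytes in *native*
# order, which only happens to work on big-endian hosts (it yields 65536
# on little-endian x86/ARM machines).
length_native = struct.unpack("=I", packet)[0]

assert length_portable == length_shifts == 256
print(length_portable, length_native)
```

The first two forms are what a careful reviewer should insist on; the third is the kind of “looks legit” code that survives review until it meets real traffic on a little-endian host.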
This isn’t a Mojo specific complaint, I find most models have problems writing Rust and C++ at what I consider “speed of the hardware” too. In particular, they really like not doing null pointer checks, and when they do null pointer checks on a buffer they almost never vectorize it.
I probably could fix a lot of these with LLMs, but I can write the correct implementation faster than I can prod the LLM into fixing it.
I think you are being too pessimistic about AI usage. I’m referring to polishing and synthesizing PRs using LLMs; I think the human in the loop still has to define the architecture of the PRs.
I don’t think optimism or pessimism about the efficacy of AI tools is the core question here. @mzaks your point about the burden of ongoing responsibility for maintenance is appreciated.
I agree that the efficacy is off topic, the core question is about whether we should open the floodgates. The point I was attempting to make is that almost everything in the Mojo stdlib is going to require careful review and consideration, and that means everyone who is trying to help maintain Mojo has a heightened maintenance burden. If you combine that with the ability to have people easily sling around 10k line PRs, I think that’s a recipe to grind the already stretched process to a halt.
Considering that @mzaks brought up the 161-year-old Jevons paradox in comparison with how an influx of AI-generated PRs can be reviewed is quite excellent, though a bit illogical to me.
Here is the case for what happens when you compare 1800s coal to 2026 technology.
“I agree that AI-automated PRs will flood in, but if they flood, then construct an efficient reservoir.”
What I’m implying is that if reviewing AI-automated PRs gets tough, then apply AI-assisted reviewing.
There is a classic difference between vibe coding and AI-assisted engineering.
Note: I see AI as a tool that pushes my logic and capabilities to 100% and even beyond, because I don’t use AI to think for me; I use it to boost my thinking.
I know you guys are still gonna be skeptical, but look at this YouTube video:
Hope you’ll understand what I’m referring to. In any case, I’ll still comply with further corrections.
I think a suiting metaphor to what you are suggesting is “fight fire with fire”.
This strategy works in already unmanageable situations. E.g., the forest ecosystem is already so broken that we need to set small controlled fires in order to prevent an apocalyptic event.
My questions are: do we have to put ourselves into such an unmanageable situation? What are the benefits of going in a direction where such an event can occur?
Again I am not against assisted AI coding. Maintainers can do whatever helps them. I am concerned with contributors vibe coding and gaming the open source reward system.
Bro, I understand what you imply. But “unmanageable” is a heavy term. I’m not saying let the AI do all the reviewing, but allow the maintainer to review the top 1%.
The greatest threat to this is hallucinations, which definitely implies we are still at a beta stage for applying automated PRs and automated reviewing. But look at what’s happening deep in the world of AI automation.
Under “Best practices for AI-enhanced pull requests, combining rules, templates, CI checks, and human review to deliver faster, safer code”
Another look into that is:
I believe we still have measures that we can apply for this maneuver.
But it’ll still be a bit odd; however, I can explain further measures for this, even though we can’t do AI-automated PRs at scale right now.
Some experimentation on my end has found even some of the most capable current models to be quite error-prone in evaluating the quality and relevancy of review candidates. A professor of mine from back in uni has been doing ongoing research in this area and found, at least as a micro-takeaway, that LLMs tend to bias toward higher rankings for LLM output in a way that diverges from the quality assessments and rankings that come from human review (though I’ll want to talk to her more about how strongly this is shown). This suggests relatively foundational alignment issues with LLM review of PRs, especially from a unified queue that would contain LLM-generated submissions.
This has mirrored some experiments I’ve done using LLMs for exploratory work surfacing community pain points on some topics: they would generally very strongly surface LLM-generated results that were often not terribly strong candidates for review, while burying other candidates with a much higher signal-to-noise ratio that happened to be human-written but did not have as much superficial detail.
Well, then I’ll have to tell you that I have no university background associated with research on LLMs, because I’ve never gone to any university or college; I’m purely self-taught.
As for your professor, I guess she’s really determined, Sawyer.
But since you noted LLMs are error-prone and hallucinate: that’s not because they are wrong, it’s because of how they are trained, and mainly because they are probabilistic. If you want an LLM whose reasoning is mathematically proven, check out:
Source: Axiom Math
Axiom uses the Lean programming language as a mathematical prover, which makes the model’s reasoning logic deterministic instead of probabilistic.
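For context, here is a minimal Lean example of the kind of machine-checked guarantee being described (a trivial theorem, not from Axiom itself): the kernel either accepts the proof term or rejects the file, with no probabilistic step involved.

```lean
-- A trivial machine-checked theorem: Lean's kernel verifies this proof
-- deterministically; there is no sampling or confidence score involved.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Note, though, that this guarantee applies only to statements that have been formalized; it says nothing about whether an LLM picks the right statement to prove in the first place.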
Has your lecturer experimented with anything like this?
What do you think? I think this is the way we can make AI reasoning fit best.
Note: I’m not a researcher, just an enthusiastic architect who never took the traditional academic path. It doesn’t feel so bad to be self-taught (high school grad); I’ve got no limits.
I think if we borrow the logic from Axiom, we can address this dissatisfaction with error-prone output.