EmberRegex: An Exercise in Vibe Coding in Mojo

Historically I’ve admittedly been a bit of a pessimist about AI-generated code: “Sure, it’s impressive and fun for messing around, but it doesn’t really scale to complex projects”. In recent times it’s safe to say I’ve been proven wrong. Mojo in particular has been difficult to use with LLMs due to limited data in their training sets compared to languages like Python and JavaScript. So how far can you get with a fairly minimal setup (no special extra training or context other than the newly released skills)? Well, it turns out the answer is actually pretty far.

I’ve had great success employing LLMs in EmberJSON to find and fix tough bugs, and to implement small features I couldn’t be bothered to write myself. Over the past few days I decided to dive in and try to “vibe code” an entire project from scratch. I landed on a regex library, as it satisfied a few key criteria.

  1. Very unit testable
    • Having good unit tests acts as guardrails against regressions caused by the somewhat unpredictable nature of LLM generated code
  2. Easy to benchmark
    • Acts as hard evidence to guide optimization efforts
  3. Something I know nothing about and probably wouldn’t take the time to implement myself
    • A suboptimal version of something is better than nothing

So, introducing EmberRegex! A pure Mojo implementation of regular expressions, created almost exclusively by the grace of Claude Opus 4.6. For more in-depth examples you can refer to the (also entirely AI-generated) README in the repo.

from emberregex import compile

def main() raises:
    var re = compile("[a-z]+")
    var result = re.search("hello world")
    if result:
        print(result)  # MatchResult(start=0, end=5)

The repo also includes a fairly extensive benchmark suite comparing performance to Python’s built-in re module.

════════════════════════════════════════════════════════════════════════
  Results  (lower µs = faster;  ratio = Python ÷ Ember, >1x = Ember wins)
════════════════════════════════════════════════════════════════════════

  Benchmark                           Ember (µs)  Python (µs)   Ratio  Bar (10x = full)
  ────────────────────────────────────────────────────────────────────────────────
  throughput_literal_100B                  0.030        0.233    7.8x  ███████████████░░░░░
  throughput_literal_10KB                  0.230        4.213   18.3x  ████████████████████
  throughput_literal_100KB                 1.900       24.815   13.1x  ████████████████████
  throughput_literal_1MB                  17.410      247.505   14.2x  ████████████████████
  throughput_class_10KB                    3.910       72.085   18.4x  ████████████████████
  throughput_nomatch_100KB                 1.770       24.789   14.0x  ████████████████████
  anchor_bol                               0.020        0.219   11.0x  ████████████████████
  anchor_eol                               0.030        0.221    7.4x  ██████████████░░░░░░
  anchor_word_boundary                     0.080        0.289    3.6x  ███████░░░░░░░░░░░░░
  anchor_word_boundary_miss                0.040        0.317    7.9x  ███████████████░░░░░
  anchor_bol_miss_10KB                     0.010        0.079    7.9x  ███████████████░░░░░
  multiline_bol_findall_100_lines          3.650       15.303    4.2x  ████████░░░░░░░░░░░░
  multiline_eol_findall_100_lines          5.190       59.882   11.5x  ████████████████████
  dotall_multiline_body                    0.040        0.103    2.6x  █████░░░░░░░░░░░░░░░
  named_group_date                         0.070        0.145    2.1x  ████░░░░░░░░░░░░░░░░
  named_group_email                        0.100        0.169    1.7x  ███░░░░░░░░░░░░░░░░░
  positional_group_email                   0.100        0.167    1.7x  ███░░░░░░░░░░░░░░░░░
  neg_lookahead                            0.140        0.253    1.8x  ███░░░░░░░░░░░░░░░░░
  neg_lookbehind                           0.120        0.252    2.1x  ████░░░░░░░░░░░░░░░░
  password_validation_lookahead            0.230        0.355    1.5x  ███░░░░░░░░░░░░░░░░░
  alternation_4                            0.010        0.099    9.9x  ███████████████████░
  alternation_16                           0.010        0.101   10.1x  ████████████████████
  alternation_16_miss                      0.020        0.083    4.1x  ████████░░░░░░░░░░░░
  findall_3_matches                        0.210        0.217    1.0x  ██░░░░░░░░░░░░░░░░░░
  findall_100_matches                      2.570       12.071    4.7x  █████████░░░░░░░░░░░
  findall_500_dot_matches                  9.120       14.015    1.5x  ███░░░░░░░░░░░░░░░░░
  replace_50_matches                       2.390        6.914    2.9x  █████░░░░░░░░░░░░░░░
  replace_named_backref                    0.540        0.295    0.5x  █░░░░░░░░░░░░░░░░░░░
  split_100_parts                          2.430        8.666    3.6x  ███████░░░░░░░░░░░░░
  pathological_optional_16                 0.030     1256.715  41890.5x  ████████████████████
  pathological_dotstar_anchored_5K         1.610        1.349    0.8x  █░░░░░░░░░░░░░░░░░░░
  pathological_dotstar_miss_5K             1.640        3.186    1.9x  ███░░░░░░░░░░░░░░░░░
  pathological_triple_backref              0.120        0.112    0.9x  █░░░░░░░░░░░░░░░░░░░
  realworld_url_parse                      0.220        0.526    2.4x  ████░░░░░░░░░░░░░░░░
  realworld_phone                          0.020        0.297   14.8x  ████████████████████
  realworld_hex_color                      0.020        0.222   11.1x  ████████████████████
  realworld_semver                         0.100        0.394    3.9x  ███████░░░░░░░░░░░░░
  realworld_key_value_findall              0.430        0.844    2.0x  ███░░░░░░░░░░░░░░░░░
  realworld_html_tag_findall               0.470        0.717    1.5x  ███░░░░░░░░░░░░░░░░░
  realworld_ws_normalize                   0.240        0.743    3.1x  ██████░░░░░░░░░░░░░░
  realworld_log_search_1000_lines          4.610       10.073    2.2x  ████░░░░░░░░░░░░░░░░
  inline_ignorecase                        0.020        0.239   12.0x  ████████████████████
  inline_multiline_search                  0.080        0.310    3.9x  ███████░░░░░░░░░░░░░
  engine_dfa_no_capture                    0.010        0.264   26.4x  ████████████████████
  engine_pike_with_capture                 0.070        0.267    3.8x  ███████░░░░░░░░░░░░░
  engine_backtrack_with_backref            0.130        0.239    1.8x  ███░░░░░░░░░░░░░░░░░
  compile_wide_char_class                  4.640       13.047    2.8x  █████░░░░░░░░░░░░░░░
  compile_8_groups                        34.670       40.544    1.2x  ██░░░░░░░░░░░░░░░░░░
  compile_nested_alternation              24.540       33.997    1.4x  ██░░░░░░░░░░░░░░░░░░
  ────────────────────────────────────────────────────────────────────────────────
  EmberRegex faster: 46  |  slower: 3

  Times are µs per operation (best of 5 × 1000 iterations).
════════════════════════════════════════════════════════════════════════

For those interested, I figured I would also document my process for creating this library so far.

Initial Implementation

I had Claude create a 7-step plan for implementing the library. This simplified the work to be done at each step, and allowed tests to be implemented progressively to avoid major regressions. I’m not knowledgeable enough about these systems to make any other claims about the benefits of this approach, but I do think Claude has a much easier time implementing things this way than if I had asked it to tackle everything in one attempt.

Performance Optimizations

After all the features were done, it was time to start optimizing! The initial performance was quite poor, losing over 50% of the benchmark cases to Python. I gave Claude one pass of “make this thing faster”, which improved a few cases but was generally too unspecific to produce good results. Afterwards, I identified groups of 1-3 cases I figured were related and asked it to fix the performance of those particular cases. This approach was highly successful, as it allowed Claude to trace the logic paths those inputs would take and identify specific bottlenecks. There were still a handful of issues I ended up fixing myself (such as copying a value instead of moving it when a move was possible), but I’ll chalk that up to LLMs still not having a perfect picture of what idiomatic Mojo looks like.

After enough iterations of this, only 3 cases remain where we are still slower than re, and with additional effort (or rather, tokens) I imagine it will soon be faster in all cases.

Going forward

Perhaps the barrier between “vibe coded slop” and actually usable software is simply how much money you have to give Anthropic. Surely this effort could have been completed in a few hours had I not been constantly hitting session usage limits on the base-level Claude Code plan. For now I will continue pushing this project forward, and look forward to any feedback and input from the community!

Any thoughts on applying The Impossible Optimization, and the Metaprogramming To Achieve It?

That’s a great idea; I unfortunately forgot about that article.

That’s awesome!

Nice. Did you use Modular’s new Mojo skills at all while building it?

I believe the mojo-syntax skill ended up being fairly helpful.

For me, the way I use LLMs is quite different: I don’t just use one LLM, I use several. For example, my main ones are Google Gemini 3 Pro and Anthropic’s Claude.

One thing I noticed is that Claude is very impressive at utilizing previously trained data to generate amazing results; for example, the Claude C compiler written in Rust seems like a copy of MLIR.

On the other hand, Gemini is quite awesome at generating new patterns based on previous information. So if you combine their intelligence, it’s massive.

Good point; this may also fall under the umbrella of “the limit is how much money I’m willing to spend on tokens”. All the different models have their strengths and tendencies, so having access to multiple different ones is useful.

Spending money on tokens isn’t a pain; in the case of LLMs, it’s a merit.

But the greatest threat is hallucinations; however, there may be some measures we can take against that.

You can look at the solutions I gave and how we can implement them in systems like EmberRegex.

In the spirit of the project, I gave Claude the article and asked it to implement it. It doesn’t necessarily inline everything, since some patterns still create recursive cycles in the current implementation, but that can probably be fixed eventually. There does seem to be a significant speedup in some cases, however, due to eliminating all those dead branches.

Thanks again to @Verdagon for writing that post!

That’s great to see!

I was thinking that if you used a Pike VM instead of an NFA (nondeterministic finite automaton), you might achieve better performance in emberregex/static_backtrack.mojo.

Standard object graphs involve heavy pointer-chasing, which triggers CPU cache misses. By using a Pike VM approach, you flatten the NFA into a contiguous array of instructions.

Speed: Contiguous memory allows the CPU (or NPU) to prefetch instructions efficiently.

Efficiency: A smaller memory footprint (bytes per state vs. objects) keeps the entire state machine in the L1/L2 cache.

Information source: Google Gemini 3.1 Pro
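To make the flattening idea concrete, here is a toy Pike VM sketch in Python (purely illustrative; this is not EmberRegex’s actual implementation, and the pattern a+b is hardcoded as an example): the NFA is compiled into a contiguous list of instructions, and matching steps a set of “threads” through that list.

```python
# Toy Pike VM (illustrative sketch, not EmberRegex's implementation).
# The NFA for the pattern "a+b" is flattened into a contiguous list of
# instruction tuples; matching walks a set of "threads" (program counters)
# through the list, so there is no pointer-chasing through node objects.

# Instruction forms: ("char", c), ("split", x, y), ("jmp", x), ("match",)
PROG = [
    ("char", "a"),    # 0: consume one 'a'
    ("split", 0, 2),  # 1: either loop back for more 'a's, or fall through
    ("char", "b"),    # 2: consume 'b'
    ("match",),       # 3: accept
]

def add_thread(threads, seen, pc):
    """Follow epsilon edges (split/jmp) eagerly, deduplicating by pc."""
    if pc in seen:
        return
    seen.add(pc)
    op = PROG[pc]
    if op[0] == "jmp":
        add_thread(threads, seen, op[1])
    elif op[0] == "split":
        add_thread(threads, seen, op[1])
        add_thread(threads, seen, op[2])
    else:
        threads.append(pc)

def fullmatch(text):
    """Anchored whole-string match; O(len(text) * len(PROG)) worst case."""
    threads, seen = [], set()
    add_thread(threads, seen, 0)
    for ch in text:
        next_threads, next_seen = [], set()
        for pc in threads:
            op = PROG[pc]
            if op[0] == "char" and op[1] == ch:
                add_thread(next_threads, next_seen, pc + 1)
        threads, seen = next_threads, next_seen
    return any(PROG[pc][0] == "match" for pc in threads)
```

A real Pike VM also carries capture-offset slots per thread (exactly the copying overhead mentioned later in this thread), but the memory-layout point stands: the whole program is one flat array that stays cache-resident.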

Try it this way?

I have been experimenting with that, actually, for certain pathological cases that would otherwise invoke exponential behaviour in the NFA. However, the Pike VM implementation I have is currently much slower in every case I’ve tried than the compile-time-optimized NFA in StaticRegex, due to the Pike VM’s high constant overhead, though there may be some optimizations to be made using the extra compile-time information we have in this case.

Since it solves those pathological cases that would otherwise blow up, experimenting with it seems like a worthwhile approach.

If you’ve experimented separately on EmberRegex, I’d like to see the code base.

May I have the link for the updates, or is there nothing yet?

I’ve been spinning through IEEE research papers and websites, mostly because LLMs couldn’t give me enough intelligence on this Pike VM.

I’ve been looking into optimizations; here is what I extracted.

While the Pike VM is slower than a standard NFA or DFA because it must track and copy capture-group offsets for every “thread”, several optimizations can significantly close the gap:

I think the most effective optimization is an initial DFA pass: use a fast, capture-free DFA first.

Source: GeeksforGeeks, “Difference between DFA and NFA”
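Here is a rough sketch of that two-phase idea in Python (hypothetical dispatch, not EmberRegex’s actual API; the DFA table is hand-built for the example pattern a+b, and the stdlib re module stands in for the slow capture-tracking engine):

```python
# Hypothetical two-phase dispatch (not EmberRegex's API): a capture-free DFA
# answers "does this match at all?" cheaply, and the slower capture-tracking
# engine runs only when the DFA says yes.
import re  # stands in for the capture-tracking engine in this sketch

# Hand-built DFA for "a+b": state 0 = start, 1 = seen "a+", 2 = accept.
DFA = {
    (0, "a"): 1,
    (1, "a"): 1,
    (1, "b"): 2,
}

def dfa_fullmatch(text):
    """Fast path: a transition-table walk with no capture bookkeeping."""
    state = 0
    for ch in text:
        state = DFA.get((state, ch))
        if state is None:
            return False  # dead state: reject immediately
    return state == 2

def match_with_captures(text):
    if not dfa_fullmatch(text):
        return None  # rejected without ever touching capture machinery
    # Slow path: only now pay the cost of tracking capture groups.
    return re.fullmatch("(a+)(b)", text).groups()
```

Most inputs in a typical workload don’t match, so they exit through the cheap DFA path and never pay the per-thread capture-copying cost.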

Another thing I found is this:

How do you see this approach, dude?

I generated a benchmark comparison between my StaticRegex and the PCRE2 JIT-compiled implementation. I don’t know much about PCRE2, so I’m assuming this is a fair comparison, but it appears it’s not only competitive but beats it in many cases!