Llama2.🔥 performance degradation after updating code to the latest Mojo compiler

Hi everyone.

I haven’t touched my llama2.:fire: GitHub repo in a long time. More precisely, since March 2024.

I finally refactored this legend today. The Mojo compiler version is 0.25.7.0.

https://github.com/tairov/llama2.mojo

What I immediately noticed is significantly degraded performance :eyes:

On the stories15M.bin model on my Mac M1, it shows ~170 tokens/sec throughput.

On Mojo version 24.3, though, it shows ~1000 tok/sec (yes, I still have the Mojo compiler from March 2024).

Of course, some parts of the calculations might not be done in an optimal way; I’m not sure and still need to investigate.

Most of the SIMD calculations are done over UnsafePointers, avoiding any heavy data copying, so it essentially reproduces the approach I used in the first versions of llama2.mojo with a custom Matrix struct.
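For context, the hot loops look roughly like this — a minimal sketch, not the actual repo code; the function name, width, and pointer API details are illustrative assumptions:

```mojo
from memory import UnsafePointer

# Hypothetical sketch: read lanes directly through an UnsafePointer,
# accumulate in a SIMD register, no intermediate copies.
fn dot[width: Int = 8](
    a: UnsafePointer[Float32], b: UnsafePointer[Float32], n: Int
) -> Float32:
    var acc = SIMD[DType.float32, width](0)
    var i = 0
    while i + width <= n:
        acc += a.load[width=width](i) * b.load[width=width](i)
        i += width
    var total = acc.reduce_add()
    while i < n:  # scalar tail for leftover elements
        total += a[i] * b[i]
        i += 1
    return total
```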

So I don’t see any reason why the code itself would be suboptimal.

If anyone from the Mojo compiler team could take a look and share insights on why there’s such a severe degradation (and, ideally, how to fix it :smiley:), I’d really appreciate it.

Just in case, the older version of the code is here: GitHub - tairov/llama2.mojo at old-mojo-24.3


After doing a deep round of profiling on macOS with Instruments (which was actually new to me; I didn’t know it existed), I finally tracked down the issue.
With a mix of manual profiling and a few back-and-forths with Gemini 3.0, I found the culprit: an Accumulator struct that was being allocated on the heap and tanking performance.
After switching it to stack_allocation, the performance actually ended up higher than before :fire: :clap:
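The shape of the change was roughly the following — a sketch under my assumptions, not the repo’s actual Accumulator struct; the function names and buffer size are made up for illustration, and `stack_allocation` is assumed to come from Mojo’s `memory` module:

```mojo
from memory import UnsafePointer, stack_allocation

alias WIDTH = 8

# Before (slow): scratch buffer hits the heap on every call,
# putting malloc/free traffic inside the hot token loop.
fn sum_heap(x: UnsafePointer[Float32], n: Int) -> Float32:
    var buf = UnsafePointer[Float32].alloc(WIDTH)
    for i in range(WIDTH):
        buf[i] = 0
    for i in range(n):
        buf[i % WIDTH] += x[i]
    var total: Float32 = 0
    for i in range(WIDTH):
        total += buf[i]
    buf.free()
    return total

# After (fast): same scratch space, but carved out of the current
# stack frame — no allocator calls, no free needed.
fn sum_stack(x: UnsafePointer[Float32], n: Int) -> Float32:
    var buf = stack_allocation[WIDTH, DType.float32]()
    for i in range(WIDTH):
        buf[i] = 0
    for i in range(n):
        buf[i % WIDTH] += x[i]
    var total: Float32 = 0
    for i in range(WIDTH):
        total += buf[i]
    return total
```

The key point is that the buffer’s lifetime is bounded by the function frame anyway, so there was never a reason for it to live on the heap.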


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.