As the title states, I have discovered an efficiency anomaly in the Mojo compiler, where a function is about 100 times slower than expected. How should I proceed?
I can create a small repro case to illustrate it, so you compiler people can have a closer look at it, and hopefully fix it!
Is it alright to post source code here, inlined in a message? Or should I attach it somehow? It is just a few hundred lines.
At a few hundred lines, a git repo link is probably a better idea so we can easily get it onto our own computers and run a profiler.
I'll figure out how to create one, thanks! And once I have it and can share it, should I create a new topic?
Put the link in this thread.
Let me know if it works (it's the first time I've shared a GitHub link).
Have a look at the benchmark timings of distance_Env3x3_L1() versus distance_Env4x4_L1(). The latter is incredibly slow compared to the former, while it is in fact quite similar code. It is about 67 times slower…
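For anyone reproducing this, timings like these can be gathered with the stdlib benchmark module. A minimal sketch with a stand-in workload (the real workloads, distance_Env3x3_L1 and distance_Env4x4_L1, live in the linked repo, so they are not reproduced here):

```mojo
import benchmark
from benchmark import keep

# Stand-in workload; in the real benchmark this would call
# distance_Env3x3_L1 / distance_Env4x4_L1 from the repo.
fn workload():
    var acc: Float32 = 0
    for i in range(1000):
        acc += Float32(i) * 0.5
    keep(acc)  # prevent the optimizer from eliding the work

def main():
    var report = benchmark.run[workload]()
    report.print()  # mean/min/max timings
```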
I finally got some time to work on it, and making things a little more high-level helped a lot:
```mojo
#### RGBA8 #####################################################################

@value
@register_passable("trivial")
struct RGBA8:
    var rgba: SIMD[DType.uint8, 4]

    fn __init__(
        out self, r: UInt8 = 0, g: UInt8 = 0, b: UInt8 = 0, a: UInt8 = 0
    ):
        self.rgba = SIMD[DType.uint8, 4](r, g, b, a)

    fn __str__(self) -> String:
        txt = String()
        txt += "r" + String(self.rgba[0])
        txt += "g" + String(self.rgba[1])
        txt += "b" + String(self.rgba[2])
        txt += "a" + String(self.rgba[3])
        return txt


#### Env #######################################################################

@value
struct Env[x: Int, y: Int]:
    alias size = x * y
    var env: InlineArray[RGBA8, size = Self.size]

    fn __init__(out self, fill: RGBA8 = RGBA8(0, 0, 0, 255)):
        self.env = InlineArray[RGBA8, size = Self.size](fill=fill)

    fn __str__(self) -> String:
        txt = String()
        for i in range(len(self.env)):
            txt += String(self.env[i])
            if i != len(self.env) - 1:
                txt += "\n"
        return txt

    fn __getitem__(self, id: UInt) -> RGBA8:
        return self.env[id]


alias Env3x3 = Env[3, 3]
alias Env4x4 = Env[4, 4]


#### utilities #################################################################

# NOTE: this version is flawed (the abs is missing); see the correction below.
fn abs_delta[width: Int](
    owned a: SIMD[DType.uint8, width], owned b: SIMD[DType.uint8, width]
) -> SIMD[DType.uint8, width]:
    return (a.cast[DType.int16]() - b.cast[DType.int16]()).cast[DType.uint8]()


#### distance ##################################################################

fn distance_Env_L1[x: Int, y: Int](a: Env[x, y], b: Env[x, y]) -> Float32:
    var d: SIMD[DType.uint64, 4] = 0

    @parameter
    for e in range(x * y):
        d += abs_delta(a[e].rgba, b[e].rgba).cast[DType.uint64]()

    # normalize: divide by x * y * 4 * 255 (9180 for a 3x3 env)
    alias norm_fac = 1.0 / Float32(x * y * 4 * 255)
    return norm_fac * Float32(d.reduce_add())
```
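For reference, a minimal way to exercise the generic version (the main below is illustrative, not part of the original post):

```mojo
def main():
    var a = Env3x3(RGBA8(10, 20, 30, 255))
    var b = Env3x3(RGBA8(12, 18, 33, 255))
    print(distance_Env_L1(a, b))  # one generic function covers both sizes
    var c = Env4x4()
    var d = Env4x4(RGBA8(255, 255, 255, 255))
    print(distance_Env_L1(c, d))
```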
That version is as fast as the SIMD versions on my Zen 4 laptop, and the abs_delta is now branchless.
Thanks for this interesting generalization that Mojo makes possible! It is certainly surprising to see that there is no abstraction cost here.
The suggested code doesn't produce correct answers; at first glance I think it is because the abs_delta() function isn't doing what it is supposed to do.
I'll figure this out.
The point I wanted to make is that the Mojo compiler can generate code with large efficiency differences for functions that are very similar, not so much how to achieve the fastest possible code.
For those interested, the proper abs_delta() implementation is:
```mojo
fn abs_delta[width: Int](
    owned a: SIMD[DType.uint8, width], owned b: SIMD[DType.uint8, width]
) -> SIMD[DType.uint8, width]:
    vmin = min(a, b)
    vmax = max(a, b)
    return vmax - vmin
```
This runs at a good speed (similar to the flawed implementation).
You're correct, I forgot an abs:
```mojo
fn abs_delta[width: Int](
    owned a: SIMD[DType.uint8, width], owned b: SIMD[DType.uint8, width]
) -> SIMD[DType.uint8, width]:
    return abs(a.cast[DType.int16]() - b.cast[DType.int16]()).cast[DType.uint8]()
```
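A quick sanity check (the values and main here are illustrative) that this now agrees with the min/max version:

```mojo
def main():
    var a = SIMD[DType.uint8, 4](10, 200, 0, 255)
    var b = SIMD[DType.uint8, 4](12, 100, 5, 250)
    # Either correct variant should print [2, 100, 5, 5].
    print(abs_delta(a, b))
```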
I think what was causing the performance issues was the large amount of branching, which meant that a lot of optimization passes couldn't match on the code you were writing. It's possible that the optimizer couldn't deal with the sparse access pattern for 4 items but has a pass that deals with (x, y, z) or (r, g, b) specifically.
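The original repo code is not reproduced in this thread, so as an illustration only: a scalar per-channel delta of the following shape has a data-dependent branch per element, which tends to block the vectorizer, unlike the branchless min/max form above.

```mojo
# Hypothetical scalar shape (not the actual repo code): one branch per
# channel, which the optimizer often fails to turn into SIMD min/max.
fn abs_delta_scalar(a: UInt8, b: UInt8) -> UInt8:
    if a > b:
        return a - b
    return b - a
```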
Yes, I was thinking along similar lines. What worried me was when I compared the Mojo code to equivalent Go code, which didn't suffer from the slow execution. Go is ~3x slower in this test, but it doesn't have an anomalous difference in the timing of distance_Env3x3_L1 and distance_Env4x4_L1.
Many times we will need plain scalar code, so it is important that the "slow paths" are still reasonably fast.
distance_Env4x4_L1 is the normal speed; that's what you get when the compiler doesn't have opt passes which match what you're doing. The 3x3 case is going to be special-cased to some degree due to 3D math optimizations. Mojo isn't aiming to magically improve performance in all cases; as you said, distance_Env4x4_L1 is still faster than Go, so it did its job. If you work with the compiler a bit more, you get big speedups. distance_Env4x4_L1 is very un-idiomatic, so I don't think Mojo-specific passes would easily catch it either. But, as I showed, you can have fully generic versions which are equally fast for less effort; that is what the compiler expects you to do, and it can optimize for it very well.
I agree that Mojo has great efficiency potential, of which I have only scratched the surface. I need more practice.
This was my first coding exercise using the language: to evaluate it, I took a tiny real-world problem I have to solve and wrote it in two different languages.
One of Mojo's goals is to avoid "fragile" optimizations, and I think you may have found one here. If you want to open a bug with a more stripped-down version of the code, I think the compiler is failing somewhere, since the optimization breaking going from 3 → 4 is very odd.
Yes, it is definitely odd, and what's most surprising is the super slow speed, which seems to be even slower than totally non-optimized code?
I'll keep an eye on this. If I have time I might be able to find a minimal repro case.
The compiler is doing a lot of optimizations; if you look at the disassembly, it's partially vectorizing everything. It just misses a few for that one case for some reason.
How can I see the disassembly?
compile._internal_compile_code
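A rough usage sketch; this is an internal, undocumented API, so the import path and call shape below are assumptions:

```mojo
# Assumed import path and call shape for the internal API named above.
from compile import _internal_compile_code

fn add_one(x: Int) -> Int:
    return x + 1

def main():
    # Assumption: the target function is passed as a compile-time parameter
    # and the result prints as the generated assembly.
    print(_internal_compile_code[add_one]())
```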
Thanks!
I couldn't easily find this kind of functionality in the documentation. I'll have a look at it.