I have discovered a suspected efficiency anomaly in the Mojo compiler, how to proceed?

As the title states, I have discovered an efficiency anomaly in the Mojo compiler, where a function is about 100 times slower than expected. How should I proceed?
I can create a small repro case to illustrate it, so you compiler people can have a closer look at it and hopefully fix it!
Is it alright to post source code here, inlined in a message? Or should I attach it somehow? It is just a few hundred lines.

At a few hundred lines, a git repo link is probably a better idea, so we can easily get it onto our own machines and run a profiler.

I’ll figure out how to create one, thanks! And once I have it and can share it, should I create a new topic?

Put the link in this thread.

Let me know if it works (it is my first time sharing a GitHub link).

Have a look at the benchmark timings of distance_Env3x3_L1() versus distance_Env4x4_L1(). The latter is incredibly slow compared to the former, even though the code is in fact quite similar. It is about 67 times slower…
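In outline, the benchmark harness does something like this (a minimal sketch using the stdlib benchmark module; the Env3x3/RGBA8 types and the distance functions are the ones from the linked repo, not reproduced here):

import benchmark
from benchmark import keep


fn main():
    var a = Env3x3()
    var b = Env3x3(fill=RGBA8(255, 255, 255, 255))

    @parameter
    fn bench_3x3():
        # keep() stops the optimizer from discarding the measured result
        keep(distance_Env3x3_L1(a, b))

    var report = benchmark.run[bench_3x3]()
    print("distance_Env3x3_L1 mean (s):", report.mean())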


I finally got some time to work on it, and making things a little more high-level helped a lot:

#### RGBA8 #####################################################################


@value
@register_passable("trivial")
struct RGBA8:
    var rgba: SIMD[DType.uint8, 4]

    fn __init__(
        out self, r: UInt8 = 0, g: UInt8 = 0, b: UInt8 = 0, a: UInt8 = 0
    ):
        self.rgba = SIMD[DType.uint8, 4](r, g, b, a)

    fn __str__(self) -> String:
        txt = String()
        txt += "r" + String(self.rgba[0])
        txt += "g" + String(self.rgba[1])
        txt += "b" + String(self.rgba[2])
        txt += "a" + String(self.rgba[3])
        return txt


#### Env3x3 ####################################################################


@value
struct Env[x: Int, y: Int]:
    alias size = x * y
    var env: InlineArray[RGBA8, size = Self.size]

    fn __init__(out self, fill: RGBA8 = RGBA8(0, 0, 0, 255)):
        self.env = InlineArray[RGBA8, size = Self.size](fill=fill)

    fn __str__(self) -> String:
        txt = String()
        for i in range(len(self.env)):
            txt += String(self.env[i])
            if i != len(self.env) - 1:
                txt += "\n"
        return txt

    fn __getitem__(self, id: UInt) -> RGBA8:
        return self.env[id]


alias Env3x3 = Env[3, 3]
alias Env4x4 = Env[4, 4]


#### utilities #################################################################


fn abs_delta[width: Int](
    owned a: SIMD[DType.uint8, width], owned b: SIMD[DType.uint8, width]
) -> SIMD[DType.uint8, width]:
    # NB: as discussed below, this version is missing an abs() and is not correct.
    return (a.cast[DType.int16]() - b.cast[DType.int16]()).cast[DType.uint8]()


#### distance ##################################################################


fn distance_Env_L1[x: Int, y: Int](a: Env[x, y], b: Env[x, y]) -> Float32:
    var d: SIMD[DType.uint64, 4] = 0
    @parameter
    for e in range(x * y):
        d += abs_delta(a[e].rgba, b[e].rgba).cast[DType.uint64]()

    # normalize (divide by x*y*4*255; for the 3x3 case that is 9*4*255 = 9180)
    alias norm_fac = 1.0 / Float32(x * y * 4 * 255)
    return norm_fac * Float32(d.reduce_add())

That version is as fast as the SIMD versions on my Zen 4 laptop, and the abs_delta is now branchless.
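For completeness, here is a minimal usage sketch of the generic version:

fn main():
    var black = Env3x3()  # fill defaults to RGBA8(0, 0, 0, 255)
    var white = Env3x3(fill=RGBA8(255, 255, 255, 255))
    # x and y are inferred from the argument types, so the same function
    # handles Env3x3, Env4x4, or any other Env[x, y]
    print(distance_Env_L1(black, white))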

Thanks for this interesting generalization that Mojo makes possible! It is certainly surprising to see that there is no abstraction cost here.

The suggested code doesn’t produce correct answers. At first glance I think it is because the abs_delta() function isn’t doing what it is supposed to do.
I’ll figure this out.

The point I wanted to make is that the Mojo compiler can generate code with large efficiency differences for functions that are very similar; it was not so much about how to achieve the fastest possible code.

For those interested, the proper abs_delta() implementation is:

fn abs_delta[width: Int](
    owned a: SIMD[DType.uint8, width], owned b: SIMD[DType.uint8, width]
) -> SIMD[DType.uint8, width]:
    # |a - b| without widening: max(a, b) - min(a, b) never underflows
    vmin = min(a, b)
    vmax = max(a, b)
    return vmax - vmin

This runs at a good speed (similar to the flawed implementation).

You’re correct, I forgot an abs:

fn abs_delta[width: Int](
    owned a: SIMD[DType.uint8, width], owned b: SIMD[DType.uint8, width]
) -> SIMD[DType.uint8, width]:
    # widen to a signed type so the subtraction cannot wrap, then take abs()
    return abs(a.cast[DType.int16]() - b.cast[DType.int16]()).cast[DType.uint8]()
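As a quick sanity check, the two correct formulations can be compared on a sample vector. A minimal sketch; abs_delta_minmax and abs_delta_widen are hypothetical names for the two versions above, since both were posted as abs_delta:

from testing import assert_equal


fn main() raises:
    var a = SIMD[DType.uint8, 4](10, 200, 0, 255)
    var b = SIMD[DType.uint8, 4](30, 100, 255, 0)
    # both formulations compute |a - b| per channel and should agree
    assert_equal(abs_delta_minmax(a, b), abs_delta_widen(a, b))
    print("abs_delta implementations agree:", abs_delta_minmax(a, b))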

I think what was causing the performance issues was the large amount of branching, which meant that a lot of optimization passes couldn’t match on the code you were writing. It’s possible that the optimizer couldn’t deal with the sparse access pattern for 4 items, but has a pass that deals with (x, y, z) or (r, g, b) specifically.

Yes, I was thinking along similar lines. What worried me was when I compared the Mojo code to equivalent Go code, which didn’t suffer from the slow execution. Go is ~3x slower in this test, but it doesn’t have an anomalous difference in the timing of distance_Env3x3_L1 and distance_Env4x4_L1.
Many times we will need plain scalar code, so it is important that the ‘slow paths’ are still reasonably fast.

distance_Env4x4_L1 is the normal speed; that’s what you get when the compiler doesn’t have opt passes which match what you’re doing. The 3x3 case is going to be special-cased to some degree due to 3D math optimizations. Mojo isn’t aiming to magically improve performance in all cases, and as you said, distance_Env4x4_L1 is still faster than Go, so it did its job. If you work with the compiler a bit more, you get big speedups. distance_Env4x4_L1 is very unidiomatic, so I don’t think that Mojo-specific passes would easily catch it either. But, as I showed, you can have fully generic versions which are equally fast for less effort; that is what the compiler expects you to write, and it can optimize for it very well.

I agree that Mojo has great efficiency potential, of which I have only scratched the surface. I need more practice 🙂
This was my first coding exercise in the language: to evaluate it, I took a tiny real-world problem I have to solve and wrote it in two different languages for comparison.

One of Mojo’s goals is to avoid “fragile” optimizations, and I think you may have found one here. If you want to open a bug with a more stripped-down version of the code, I think the compiler is failing somewhere, since the optimization breaking when going from 3 → 4 is very odd.

Yes, it is definitely odd, and what’s most surprising is the extremely slow speed, which seems to be even slower than totally non-optimized code.

I’ll keep an eye on this. If I have time I might be able to find a minimal repro case.

The compiler is doing a lot of optimizations; if you look at the disassembly, it’s partially vectorizing everything. It just misses a few optimizations for that one case, for some reason.

How can I see the disassembly?

compile._internal_compile_code
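Roughly like this; note that this is an underscored internal API, so the call shape shown here (the emission_kind parameter in particular) is an assumption based on the post linked below and may change between releases:

from compile import _internal_compile_code


fn foo(x: Int) -> Int:
    return x + 1


fn main():
    # emission_kind is an assumption; the linked post describes the
    # available emission options, which may differ in your release
    print(_internal_compile_code[foo, emission_kind="asm"]())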

@notnot Seeing the MLIR/LLVM or ASM code generated by Mojo - #5 by sora


Thanks!
I couldn’t easily find this kind of functionality in the documentation. I’ll have a look at it.