As the title states, I have discovered an efficiency anomaly in the Mojo compiler, where a function is about 100 times slower than expected. How should I proceed?
I can create a small repro case to illustrate it, so you compiler people can have a closer look at it, and hopefully fix it!
Is it alright to post source code here, inlined in a message? Or should I attach it somehow? It is just a few hundred lines.
At a few hundred lines, a git repo link is probably a better idea so we can easily get it onto our own computers and run a profiler.
I'll figure out how to create one, thanks! And once I have it and can share it, should I create a new topic?
Put the link in this thread.
Let me know if it works (it's the first time I've shared a GitHub link).
Have a look at the benchmark timings of distance_Env3x3_L1() versus distance_Env4x4_L1(). The latter is incredibly slow compared to the former, while it is in fact quite similar code. It is about 67 times slower…
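For anyone reproducing this, timings like these can be gathered with the stdlib benchmark module. A minimal sketch with a stand-in workload (the real workloads, distance_Env3x3_L1 and distance_Env4x4_L1, live in the linked repo, so they are not reproduced here):

```mojo
import benchmark
from benchmark import keep

# Stand-in workload; in the real benchmark this would call
# distance_Env3x3_L1 / distance_Env4x4_L1 from the repo.
fn workload():
    var acc: Float32 = 0
    for i in range(1000):
        acc += Float32(i) * 0.5
    keep(acc)  # prevent the optimizer from eliding the work

def main():
    var report = benchmark.run[workload]()
    report.print()  # mean/min/max timings
```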
I finally got some time to work on it, and making things a little more high-level helped a lot:
```mojo
#### RGBA8 #####################################################################

@value
@register_passable("trivial")
struct RGBA8:
    var rgba: SIMD[DType.uint8, 4]

    fn __init__(
        out self, r: UInt8 = 0, g: UInt8 = 0, b: UInt8 = 0, a: UInt8 = 0
    ):
        self.rgba = SIMD[DType.uint8, 4](r, g, b, a)

    fn __str__(self) -> String:
        txt = String()
        txt += "r" + String(self.rgba[0])
        txt += "g" + String(self.rgba[1])
        txt += "b" + String(self.rgba[2])
        txt += "a" + String(self.rgba[3])
        return txt


#### Env #######################################################################

@value
struct Env[x: Int, y: Int]:
    alias size = x * y
    var env: InlineArray[RGBA8, size = Self.size]

    fn __init__(out self, fill: RGBA8 = RGBA8(0, 0, 0, 255)):
        self.env = InlineArray[RGBA8, size = Self.size](fill=fill)

    fn __str__(self) -> String:
        txt = String()
        for i in range(len(self.env)):
            txt += String(self.env[i])
            if i != len(self.env) - 1:
                txt += "\n"
        return txt

    fn __getitem__(self, id: UInt) -> RGBA8:
        return self.env[id]


alias Env3x3 = Env[3, 3]
alias Env4x4 = Env[4, 4]


#### utilities #################################################################

# NOTE: this version is flawed (the abs is missing); see the correction below.
fn abs_delta[width: Int](
    owned a: SIMD[DType.uint8, width], owned b: SIMD[DType.uint8, width]
) -> SIMD[DType.uint8, width]:
    return (a.cast[DType.int16]() - b.cast[DType.int16]()).cast[DType.uint8]()


#### distance ##################################################################

fn distance_Env_L1[x: Int, y: Int](a: Env[x, y], b: Env[x, y]) -> Float32:
    var d: SIMD[DType.uint64, 4] = 0

    @parameter
    for e in range(x * y):
        d += abs_delta(a[e].rgba, b[e].rgba).cast[DType.uint64]()

    # normalize: divide by x * y * 4 * 255 (9180 for a 3x3 env)
    alias norm_fac = 1.0 / Float32(x * y * 4 * 255)
    return norm_fac * Float32(d.reduce_add())
```
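For reference, a minimal way to exercise the generic version (the main below is illustrative, not part of the original post):

```mojo
def main():
    var a = Env3x3(RGBA8(10, 20, 30, 255))
    var b = Env3x3(RGBA8(12, 18, 33, 255))
    print(distance_Env_L1(a, b))  # one generic function covers both sizes
    var c = Env4x4()
    var d = Env4x4(RGBA8(255, 255, 255, 255))
    print(distance_Env_L1(c, d))
```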
That version is as fast as the SIMD versions on my Zen 4 laptop, and the abs_delta is now branchless.
Thanks for this interesting generalization that Mojo makes possible! It is certainly surprising to see that there is no abstraction cost here.
The suggested code doesn't produce correct answers; at first glance I think it is because the abs_delta() function isn't doing what it is supposed to do.
I'll figure this out.
The point I wanted to make is that the Mojo compiler can generate code with large efficiency differences for functions that are very similar, not so much how to achieve the fastest possible code.
For those interested, the proper abs_delta() implementation is:
```mojo
fn abs_delta[width: Int](
    owned a: SIMD[DType.uint8, width], owned b: SIMD[DType.uint8, width]
) -> SIMD[DType.uint8, width]:
    vmin = min(a, b)
    vmax = max(a, b)
    return vmax - vmin
```
This runs at a good speed (similar to the flawed implementation).
You're correct, I forgot an abs:
```mojo
fn abs_delta[width: Int](
    owned a: SIMD[DType.uint8, width], owned b: SIMD[DType.uint8, width]
) -> SIMD[DType.uint8, width]:
    return abs(a.cast[DType.int16]() - b.cast[DType.int16]()).cast[DType.uint8]()
```
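A quick sanity check (the values and main here are illustrative) that this now agrees with the min/max version:

```mojo
def main():
    var a = SIMD[DType.uint8, 4](10, 200, 0, 255)
    var b = SIMD[DType.uint8, 4](12, 100, 5, 250)
    # Either correct variant should print [2, 100, 5, 5].
    print(abs_delta(a, b))
```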
I think what was causing the performance issues was the large amount of branching, which meant that a lot of optimization passes couldn't match on the code you were writing. It's possible that the optimizer couldn't deal with the sparse access pattern for 4 items but has a pass that deals with (x, y, z) or (r, g, b) specifically.
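The original repo code is not reproduced in this thread, so as an illustration only: a scalar per-channel delta of the following shape has a data-dependent branch per element, which tends to block the vectorizer, unlike the branchless min/max form above.

```mojo
# Hypothetical scalar shape (not the actual repo code): one branch per
# channel, which the optimizer often fails to turn into SIMD min/max.
fn abs_delta_scalar(a: UInt8, b: UInt8) -> UInt8:
    if a > b:
        return a - b
    return b - a
```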
Yes, I was thinking along similar lines. What worried me was when I compared the Mojo code to equivalent Go code, which didn't suffer from the slow execution. Go is ~3x slower in this test, but it doesn't have an anomalous difference in the timing of distance_Env3x3_L1 and distance_Env4x4_L1.
Many times we will need plain scalar code, so it is important that the "slow paths" are still reasonably fast.
distance_Env4x4_L1 is the normal speed; that's what you get when the compiler doesn't have opt passes which match what you're doing. The 3x3 case is going to be special-cased to some degree due to 3D math optimizations. Mojo isn't aiming to magically improve performance in all cases; as you said, distance_Env4x4_L1 is still faster than Go, so it did its job. If you work with the compiler a bit more, you get big speedups. distance_Env4x4_L1 is very un-idiomatic, so I don't think Mojo-specific passes would easily catch it either. But, as I showed, you can have fully generic versions which are equally fast for less effort; that is what the compiler expects you to do, and it can optimize for it very well.
I agree that Mojo has great efficiency potential, of which I have only scratched the surface. I need more practice.
This was my first coding exercise using the language: to evaluate it, I took a tiny real-world problem I have to solve and wrote it in two different languages.
One of Mojo's goals is to avoid "fragile" optimizations, and I think you may have found one here. If you want to open a bug with a more stripped-down version of the code, I think the compiler is failing somewhere, since the optimization breaking going from 3 → 4 is very odd.
Yes, it is definitely odd, and what's most surprising is the super slow speed, which seems to be even slower than totally non-optimized code?
I'll keep an eye on this. If I have time I might be able to find a minimal repro case.
The compiler is doing a lot of optimizations; if you look at the disassembly, it's partially vectorizing everything. It just misses a few for that one case for some reason.
How can I see the disassembly?
compile._internal_compile_code
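A rough usage sketch; this is an internal, undocumented API, so the import path and call shape below are assumptions:

```mojo
# Assumed import path and call shape for the internal API named above.
from compile import _internal_compile_code

fn add_one(x: Int) -> Int:
    return x + 1

def main():
    # Assumption: the target function is passed as a compile-time parameter
    # and the result prints as the generated assembly.
    print(_internal_compile_code[add_one]())
```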
Thanks!
I couldn't easily find this kind of functionality in the documentation. I'll have a look at it.