Mojo auto-vectorization for generated DSP code

Hi everyone, I am developing a Mojo backend for FAUST, a functional DSL for audio DSP with mature C++
codegen, and studying which Mojo code shapes optimize best.

FAUST has scalar and vec modes. vec mode restructures code for target-compiler auto-vectorization; it does
not emit SIMD intrinsics.

In my benchmarks, Clang often auto-vectorizes scalar mode and also exploits vec mode. Scalar mode can
outperform vec mode because it avoids FAUST vec overhead. In Mojo, neither mode is auto-vectorized: vec
mode pays its restructuring overhead without getting packed SIMD in return, and scalar mode stays scalar.

  • Why is Mojo not auto-vectorizing these loops?
  • Am I generating a code shape that prevents Mojo or LLVM from recognizing the loop as vectorizable?
  • Is explicit SIMD the intended path in Mojo for this kind of performance-sensitive code?

More broadly, is Mojo expected to develop stronger auto-vectorization over time, or is the current design
direction to prefer explicit SIMD? I am trying to understand whether this is a compiler limitation or
whether the generated backend shape should be changed.

Here is a simplified Mojo compute method, the hot loop of a generated FAUST DSP (in scalar mode).

comptime dfaust = get_defined_dtype["DType.float32"]()
comptime FaustFloat = SIMD[dfaust, 1]
def compute(
    mut dsp,
    var count: S32,
    var inputs: UnsafePointer[UnsafePointer[FaustFloat, ReadUntrackedOrigin], ReadUntrackedOrigin],
    var outputs: UnsafePointer[UnsafePointer[FaustFloat, MutUntrackedOrigin], MutUntrackedOrigin]
) -> None:
    var input0 = inputs[S32(0)]
    var output0 = outputs[S32(0)]
    var output1 = outputs[S32(1)]
    var slow0 = F64(dsp.hslider0)
    for var i0 in range(S32(0), count):
        var temp0 = F64(input0[i0])
        var temp1 = (temp0) * ((1.0) - ((slow0) * (pow_unrolled[2](temp0))))
        output0[i0] = FaustFloat(temp1)
        output1[i0] = FaustFloat(temp1)

FaustFloat is the external driver precision, typically f32. The internal DSP precision can be either
single or double (double in this example).

The equivalent C++ scalar loop is auto-vectorized by Clang. Here is a sample from the resulting assembly:

ldr     q3, [x16], #16
fcvtl   v4.2d, v3.2s
fcvtl2  v3.2d, v3.4s
fmul.2d v5, v3, v3      ; packed 2xf64 mul
fmla.2d v6, v5, v1      ; packed 2xf64 fma
fcvtn   v4.2s, v4.2d
fcvtn2  v4.4s, v3.2d
str     q4, [x15], #16

So Clang auto-vectorizes the scalar loop, just as it does for the vec version of the same source.
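
For reference, the C++ loop under discussion has roughly the following shape. This is a hand-written sketch that mirrors the Mojo compute method above, not actual FAUST output; the `Dsp` struct and the exact signature are assumptions made for illustration.

```cpp
// Hypothetical stand-in for the generated DSP class; only the slider value matters here.
struct Dsp {
    float hslider0 = 0.5f;
};

using FaustFloat = float;  // external driver precision, as in the Mojo version

// Scalar-mode loop shape: one sample per iteration, double-precision internal math.
void compute(const Dsp& dsp, int count,
             const FaustFloat* const* inputs, FaustFloat* const* outputs) {
    const FaustFloat* input0 = inputs[0];
    FaustFloat* output0 = outputs[0];
    FaustFloat* output1 = outputs[1];
    const double slow0 = static_cast<double>(dsp.hslider0);
    for (int i0 = 0; i0 < count; ++i0) {
        const double temp0 = static_cast<double>(input0[i0]);
        // temp0 * temp0 plays the role of pow_unrolled[2](temp0)
        const double temp1 = temp0 * (1.0 - slow0 * (temp0 * temp0));
        output0[i0] = static_cast<FaustFloat>(temp1);
        output1[i0] = static_cast<FaustFloat>(temp1);
    }
}
```

The loop has a simple trip count and no loop-carried dependency, which is the kind of shape a loop vectorizer can handle. Because the pointers are not marked as non-aliasing, a vectorizer would typically have to guard the packed path with runtime overlap checks.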

The Mojo assembly for the same loop remains scalar and is slower, despite being very small:

LBB0_1:
    ldr  s2, [x9, x8]
    fcvt d2, s2
    fmul d3, d2, d2
    fmsub d3, d3, d0, d1
    fmul d2, d3, d2
    fcvt s2, d2
    str  s2, [x10, x8]
    str  s2, [x11, x8]
    add  x8, x8, #4
    cmp  x8, #256
    b.ne LBB0_1

In vec mode, the generated shape does not auto-vectorize either; instead, it increases memory traffic and worsens performance. I can share further evidence if useful.
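
To make the memory-traffic point concrete, here is a hand-written C++ sketch of a chunked loop shape of the kind vec mode aims for. It is an illustration only, not FAUST output: the chunk size of 32, the buffer names, and the way the expression is split into separate inner loops are all assumptions.

```cpp
#include <algorithm>

using FaustFloat = float;
constexpr int kChunk = 32;  // assumed chunk size for this sketch

// Chunked shape: the expression is split into small inner loops that
// communicate through intermediate buffers.
void compute_chunked(double slow0, int count, const FaustFloat* input0,
                     FaustFloat* output0, FaustFloat* output1) {
    double squared[kChunk];  // intermediate buffer: x * x
    double result[kChunk];   // intermediate buffer: x * (1 - slow0 * x * x)
    for (int base = 0; base < count; base += kChunk) {
        const int n = std::min(kChunk, count - base);  // last chunk may be partial
        for (int i = 0; i < n; ++i) {
            const double x = static_cast<double>(input0[base + i]);
            squared[i] = x * x;
        }
        for (int i = 0; i < n; ++i) {
            const double x = static_cast<double>(input0[base + i]);
            result[i] = x * (1.0 - slow0 * squared[i]);
        }
        for (int i = 0; i < n; ++i) {
            output0[base + i] = static_cast<FaustFloat>(result[i]);
            output1[base + i] = static_cast<FaustFloat>(result[i]);
        }
    }
}
```

Each intermediate value makes a round trip through a buffer instead of staying in a register. That cost pays off only if every inner loop is turned into packed SIMD; if the loops stay scalar, the extra loads and stores are pure overhead compared with the single fused scalar loop.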

Thanks for reading, and sorry for the long post. I am not an expert in this area, and I am still learning Mojo, so please be patient if I missed something. I hope this can start a useful discussion.

Mojo intentionally disables LLVM’s autovectorizer, since Mojo ships with a portable SIMD library in the stdlib. This makes SIMD code far less brittle and often more portable. In compute kernels, one reason people drop to assembly is that the autovectorizer can get in the way of high-performance code by vectorizing the scalar path when the vector ALUs are already occupied. It is also much easier for the compiler to break up explicit SIMD ops than to autovectorize, so this helps compile times as well.

Here’s how I would write the code you provided:

def compute[dtype: DType](
    mut dsp,
    var count: S32,
    var inputs: UnsafePointer[UnsafePointer[FaustFloat, ReadUntrackedOrigin], ReadUntrackedOrigin],
    var outputs: UnsafePointer[UnsafePointer[FaustFloat, MutUntrackedOrigin], MutUntrackedOrigin]
) -> None:
    comptime width = simd_width_of[dtype]()
    var input0 = inputs[S32(0)]
    var output0 = outputs[S32(0)]
    var output1 = outputs[S32(1)]
    var slow0 = Float64(dsp.hslider0)
    
    # Assumes count % width == 0; otherwise add a drain loop as usual,
    # or use `SIMD.select` to get a masked operation.
    for var i0 in range(S32(0), count, width):
        var temp0 = input0.load[width=width](i0).cast[DType.float64]()
        var temp1 = (temp0) * ((1.0) - ((slow0) * (pow_unrolled[2](temp0))))
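        # Cast back to the external precision before storing; the output pointers hold FaustFloat.
        output0.store(i0, temp1.cast[FaustFloat.dtype]())
        output1.store(i0, temp1.cast[FaustFloat.dtype]())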
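
The drain loop mentioned in the comment is the usual "main loop plus scalar tail" structure. Here is a minimal sketch of that structure in C++ rather than Mojo, with a fixed-size inner loop standing in for one SIMD operation. The lane count of 4 and all names are assumptions for illustration, and `count` is assumed to be non-negative.

```cpp
using FaustFloat = float;
constexpr int kWidth = 4;  // assumed lane count; stands in for the compile-time SIMD width

// Per-sample computation shared by the main loop and the tail.
inline FaustFloat process_sample(FaustFloat in, double slow0) {
    const double x = static_cast<double>(in);
    return static_cast<FaustFloat>(x * (1.0 - slow0 * (x * x)));
}

void compute_with_drain(double slow0, int count, const FaustFloat* input0,
                        FaustFloat* output0, FaustFloat* output1) {
    // Largest multiple of kWidth that fits in count.
    const int main_count = count - (count % kWidth);
    int i = 0;
    // Main loop: full blocks of kWidth samples (one SIMD load/compute/store each).
    for (; i < main_count; i += kWidth) {
        for (int lane = 0; lane < kWidth; ++lane) {
            const FaustFloat y = process_sample(input0[i + lane], slow0);
            output0[i + lane] = y;
            output1[i + lane] = y;
        }
    }
    // Drain loop: the remaining count % kWidth samples, one at a time.
    for (; i < count; ++i) {
        const FaustFloat y = process_sample(input0[i], slow0);
        output0[i] = y;
        output1[i] = y;
    }
}
```

With `count = 10` and a width of 4, the main loop covers samples 0 to 7 and the drain loop covers samples 8 and 9, so no access goes past the end of the buffers.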

I’m not familiar with Faust, but it looks like it’s lowering from relatively high-level declarative code, so doing this vectorize transform should be relatively simple at some level of the compiler. If you want to use existing helpers in Mojo, you could also write it like this, since it might be easier to emit:

def compute(
    mut dsp,
    var count: S32,
    var inputs: UnsafePointer[UnsafePointer[FaustFloat, ReadUntrackedOrigin], ReadUntrackedOrigin],
    var outputs: UnsafePointer[UnsafePointer[FaustFloat, MutUntrackedOrigin], MutUntrackedOrigin]
) -> None:
    comptime simd_width = simd_width_of[FaustFloat.dtype]()
    var input0 = inputs[S32(0)]
    var output0 = outputs[S32(0)]
    var output1 = outputs[S32(1)]
    var slow0 = Float64(dsp.hslider0)
    
    def loop_body[width: Int](i: Int) {mut}:
        var temp0 = input0.load[width=width](i).cast[DType.float64]()
        var temp1 = (temp0) * ((1.0) - ((slow0) * (pow_unrolled[2](temp0))))
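        # Cast back to the external precision before storing; index with `i`, the closure argument.
        output0.store[width=width](i, temp1.cast[FaustFloat.dtype]())
        output1.store[width=width](i, temp1.cast[FaustFloat.dtype]())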
    
    vectorize[simd_width](count, loop_body)  # handles the drain loop