Vectorize and remainder loops

It seems like vectorize in the standard library doesn’t vectorize the remainder loops. I’m just wondering why that is and whether it would make sense to vectorize them as well.

Take the following code for example:

from algorithm.functional import vectorize

fn main():
    @parameter
    fn print_width[width: Int](i: Int):
        print(width, end=" ")

    vectorize[print_width, 16](31)

This prints out “16 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1”, which shows that the 15 remainder iterations are executed with a width of 1.

It seems like it should be possible to do something like:

from bit.bit import bit_width

fn myvectorize[
    func: fn[Int] (Int) capturing -> None, simd_width: Int
](size: Int):
    alias pow_of_2 = bit_width(simd_width) - 1
    var offset = 0

    @parameter
    for i in range(pow_of_2, -1, -1):
        alias w = 2**i
        # The full simd_width chunk runs size // simd_width times; each
        # smaller width runs at most once, covering the remainder.
        for _ in range((size - offset) // w):
            func[w](offset)
            offset += w

fn main():
    @parameter
    fn print_width[width: Int](i: Int):
        print(width, end=" ")

    myvectorize[print_width, 16](31)

which would print “16 8 4 2 1” showing that the remainder loops are still vectorized – just with smaller widths.

I guess it’s a tradeoff between a bit of efficiency (which becomes negligible as the size grows) and binary size?

Hello! Just to clarify, this is intended behavior, as per the vectorize docs. There are two function overloads (a quick example follows the list):

  1. size passed at runtime: vectorize[func, simd_width, unroll_factor](size)

    • Remainders processed separately (width=1).
  2. size passed at compile-time: vectorize[func, simd_width, size, unroll_factor]()

    • Processes strictly power-of-2 remainders in a single iteration.
    • When the remainder is not a power of 2, the remainder loop is unrolled for performance.
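
For example, with size = 20 and simd_width = 16 (just a sketch of both overloads; the exact spelling of the size parameter may differ between Mojo versions, and the expected output follows from the behavior described above):

from algorithm.functional import vectorize

fn main():
    @parameter
    fn print_width[width: Int](i: Int):
        print(width, end=" ")

    # Runtime size: the remainder of 4 is processed one element at a time.
    vectorize[print_width, 16](20)  # expected: 16 1 1 1 1
    print()

    # Compile-time size: the power-of-2 remainder of 4 is a single iteration.
    vectorize[print_width, 16, size=20]()  # expected: 16 4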

That does work for your particular loop, but what about vectorize[print_width, 16](20)? That should only activate the 4-wide case. In a lot of cases, unless you have more information, it’s very difficult to generalize this without tons of branches on AVX2 or NEON-like vector ISAs. Now, if you have AVX-512 or SVE, in a lot of cases masks will work, which lets you do the drain loop in one step very cheaply. Almost all new-ish datacenter CPUs have AVX-512, and so does a decent share of consumer hardware (the last two generations of AMD), so for most number-crunching code making the drain loop optimal is a bit lower priority: for non-masked ISAs it costs more to figure it out, and for masked ISAs you would align_down and do the last remainder in a single pass.

I’m not 100% sure what you mean by “activate”, but my myvectorize[print_width, 16](20) does print “16 4”. If you mean “activate” as in the branch (size - offset) // w == 0 needs to be checked for widths 8, 4, 2, and 1 in the remainder loops (really, remainder statements, since there would be no loop left after compilation), then I think I understand.
And then your point is that on AVX2 and NEON-like vector ISAs, the log2(w) branches required for the remainder statements have the potential to cost more than the (up to) w remainder loop iterations?

When you say “for masked ISAs you would align_down and do the last remainder in a single pass”, do you mean that the Mojo compiler will most likely just optimize the remainder loop to do this?

I’ll also note that in the case where the size is known at compile time, no branches would be necessary to run the remainder statements. Those smaller-width remainder statements must surely be better than a linearly unrolled width-1 loop in that case, no?
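
Something like this is what I have in mind for that compile-time case (just a sketch; myvectorize_static is a made-up name reusing the signature from my snippet above). With @parameter if, every chunk and offset is decided during compilation, so no runtime branches are emitted:

from bit.bit import bit_width

fn myvectorize_static[
    func: fn[Int] (Int) capturing -> None, simd_width: Int, size: Int
]():
    # Full-width chunks (fully unrolled here for simplicity).
    @parameter
    for j in range(size // simd_width):
        func[simd_width](j * simd_width)

    # Remainder: at most one chunk per power of 2 below simd_width.
    # Every check and offset is resolved at compile time.
    alias pow_of_2 = bit_width(simd_width) - 1

    @parameter
    for i in range(pow_of_2 - 1, -1, -1):
        alias w = 2**i

        @parameter
        if size & w != 0:
            func[w](size & ~(2 * w - 1))  # align_down(size, 2 * w)

Calling myvectorize_static[print_width, 16, 31]() should then print “16 8 4 2 1” with no branching at runtime.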

If the size is unknown at compile time and you have something like AVX-512-width vectors, you need to check every single power of 2 on the way down the drain loop. That’s up to 7 branches for byte handling, which is a lot of branches to put in a hot loop. At some point, it’s better to just let the scalar ALUs take the problem rather than eat all of those compares and branches. If the size is known at compile time, go ahead and vectorize it all perfectly. But when you start adding up the cost of 7 compares and branches, plus the deep data dependency that creates on the “current index” variable, it stops being a clear win. With AVX-512 you get masks, so it’s not an issue there. For AVX2, you get 256 bits, so up to 6 compares, but only up to 32 scalar ops. That already gets a bit iffy on some uarches, especially if the rest of your code is vector heavy and you’re likely right behind a load of vector ops. You might be able to finish FASTER by doing a jump and scalar ops, depending on pipeline depth.

Mojo is not particularly good at using masked vector ISAs except explicitly, so that would probably need to be written by hand.

Completely agree – just making sure those were the branches you were talking about.

Yup, I agree again. I was originally thinking of the case where unroll_factor=1 and the conditionals could just check whether the ith bit is set in the original length Int, so you would avoid a deep dependency (I’ve sketched this below), but I can also see how this would get tricky.
I still think it could be useful when someone wants to vectorize a function over the same range over and over again, in which case I’m guessing the branch predictor would help with the “extra” compares.
But I can understand why Mojo may not want that behavior by default.
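
For concreteness, the bit-test version I was picturing looks roughly like this (again just a sketch; vectorize_bit_drain is a made-up name, and it reuses the same signature as before). Each remainder chunk’s offset is computed directly from size, so the compares depend only on size instead of chaining through a running offset:

from bit.bit import bit_width

fn vectorize_bit_drain[
    func: fn[Int] (Int) capturing -> None, simd_width: Int
](size: Int):
    # Main loop over full simd_width chunks, with unroll_factor=1.
    for j in range(size // simd_width):
        func[simd_width](j * simd_width)

    # Drain: one chunk per set bit of size below simd_width.
    alias pow_of_2 = bit_width(simd_width) - 1

    @parameter
    for i in range(pow_of_2 - 1, -1, -1):
        alias w = 2**i
        if size & w != 0:
            func[w](size & ~(2 * w - 1))  # align_down(size, 2 * w)

vectorize_bit_drain[print_width, 16](31) would also print “16 8 4 2 1”, and vectorize_bit_drain[print_width, 16](20) would print “16 4”.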

Also, as a check, I tried looking at what GCC and Clang do for a memcpy-like function in C; it looks like they also emit a single wide copy loop followed by a scalar remainder loop.

But part of the reason I was asking about this is that I remember seeing a Julia package a while back that claimed to beat LLVM in dot-product benchmarks specifically because LLVM wasn’t vectorizing the remainder loops: GitHub - JuliaSIMD/LoopVectorization.jl: Macro(s) for vectorizing loops.