How to iterate over a `List` using `SIMD` in Mojo

from sys.info import simdwidthof


fn main():
    values = List[UInt8](1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    length = len(values)  # 10

    for element in values:
        _ = element  # iterates per element

    ptr = values.unsafe_ptr()

    for i in range(length // 4):  # 2
        var vec = (ptr + i * 4).load[width=4]()
        _ = vec  # it's a SIMD[DTYpe.uint8, 4]

    alias width = simdwidthof[UInt8]()

    for i in range(length // width):  # depends on the CPU
        var vec = (ptr + i * width).load[width=width]()
        _ = vec  # it's a SIMD[DTYpe.uint8, width]

    # the rest of the data i.e. length % width can be processed sequentially
2 Likes

Does it make sense to try to make an API for this which produces a SIMD list iterator, leaving the drain loop to the user? I’m not sure if we want to keep using unsafe for this.

I think this will be better handled by adding map, filter, sum, reduce, etc. It feels weird to give an incomplete iterator, but maybe we can if they aren’t achievable with those APIs we could offer a helper function that does that :man_shrugging: (I wouldn’t want to add it to the type’s public API though)

2 Likes

You’re right) Less room for mistakes this way)

For iterators, would we want a SIMDIterator which takes a tuple of functions and implement something which takes a width parameter on top of that? I’m thinking of masked processing (like SVE), letting you process a variable number of elements and do the drain loop in a single iteration with different intrinsics in some cases.

That feels like a very powerful abstraction, many types can yield SIMD vectors. I’m thinking we could simply make SpanIterator do that (for now, or maybe permanently).

iterator = iter(span)
for v in iterator.vectors():
    _ = v # SIMD[DType.?, simdwidthof[DType.?]()]
for s in iterator.scalars():
    _ = s # Scalar[DType.?]
1 Like

The reason I’m leaning towards asking for multiple functions is to handle drain loops inline, or deal with platforms that natively have masks and will let you write one function for all of it like SVE.

1 Like

wouldn’t that be a low-level optimization that we could bake into Span.map()'s implementation ?

We can have a version which takes a scalar and drops it into algorithm.functional.vectorize, but part of Mojo’s appeal is letting people use those low-level optimizations. If we don’t expose it in some way, we violate the zero-overhead principle because someone could write a better one themselves. From the “tuple/comptime list of functions” version, we can implement more friendly and easier to use ones, like handling one which is generic over width or taking a single vector and a scalar, or just a scalar.