Proposal: Bounds checking and removing negative indexing

Hi all,

See the proposal for bounds checking defaults and removing negative indexing here.

I understand there will be users who like using e.g. x[-1] instead of x[len(x) - 1] and would like to remain closer to Python. Please have a read to understand our reasoning for making this tradeoff in the systems and GPU programming context. And if you still have concerns or comments, please raise them here.

Thanks!


I like the proposal.

The ergonomic cost is minimal: x[-1] becomes x[len(x) - 1], and we can add a .last() or .from_end(i) method if desired.

As discussed on Discord, it would be nice to have a .last() method. x[-1] is the most common use case for negative indexing, used roughly 26 times in the stdlib.
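For readers weighing the ergonomic cost, here is a Python sketch of what the suggested replacements could look like (the `last`/`from_end` names come from the post above; the implementations are illustrative, not a real API):

```python
def last(xs):
    """Checked equivalent of xs[-1]: the final element."""
    if not xs:
        raise IndexError("last() on empty sequence")
    return xs[len(xs) - 1]

def from_end(xs, i):
    """Element i positions from the end; from_end(xs, 0) == last(xs)."""
    idx = len(xs) - 1 - i
    if idx < 0 or idx >= len(xs):
        raise IndexError(f"from_end({i}) out of range for length {len(xs)}")
    return xs[idx]

values = [10, 20, 30]
assert last(values) == values[-1] == 30
assert from_end(values, 0) == 30
assert from_end(values, 1) == 20
```

Unlike raw negative indexing, these names make the "count from the end" intent explicit and keep the bounds check in one place.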


I agree that bounds checking by default is good, and that removing negative indexing in pursuit of that is sensible.

On GPU it’s actively harmful: GPU architectures execute in lockstep warps/wavefronts, and divergent branches cause both paths to execute serially. An extra branch on every element access in a tight kernel is a measurable performance regression.

More critically, on GPU we disable bounds checking for this exact reason: branching overhead is unacceptable on hot paths.

Here, I disagree. If you get divergence in bounds checking on a GPU, it means one thread is going off the end of a buffer and your code is incorrect. Full stop, do not pass go, do not collect 200 tokens. At that point, you’re going to be tearing down the kernel and shipping what debug information you can to the user, so performance is no longer a concern.

Additionally, I think that part of democratizing GPU programming means bringing some of the “creature comforts” of CPUs over to GPUs, potentially at performance cost, to make programming GPUs easier. There is a decent variety of mistakes a new user could easily make when first learning GPU programming that would lead the first or last threads of a kernel to run off the end of a buffer, and I think the somewhat unfamiliar programming model may actually make that more likely for new users. On a consumer GPU there’s a decent chance that the memory on the other side of that overrun is mapped, given limited VRAM and Mojo seeming to generally prefer not to use the virtual addressing APIs. That same memory pressure also means that classic mitigations, such as no-read, no-write guard pages, will be difficult to implement.

This leaves the question of how to implement the fast path, and I see two options. First, the exact same thing as is done on CPU: .unsafe_get(...). This makes the language more consistent, since each bit of hardware doesn’t need to be picked as “more CPU-like” or “more GPU-like” for bounds safety based on the likely subjective determination made by whoever brings up the architecture. Here are a few examples of things which heavily blur that line, to show why I think this is subjective at best:

  • The Bolt Graphics Zeus “GPU”
This is a GPU built around an out-of-order RISC-V core with 6-10-wide dispatch, very substantial raytracing acceleration, and a 2048-bit VLEN (512-bit DLEN). It also natively runs Linux. What I have been shown about the chip leads me to believe that the claims of beating a 5090 in raytracing are credible. So: does Mojo leave bounds checking on, since it’s RISC-V? Is it disabled, since it has a large (presented) vector width? Or is it back on, since it’s out of order?
  • The AiNekko ET-SoC-1 (formerly from Esperanto)
1088 cores of in-order RISC-V, each with a 256-bit vector unit and 512-bit tensor unit, with GPU-like scratchpad memory and caches requiring GPU-like explicit synchronization. There’s work to make it act as an “llvmpipe offload” to turn it into an open-hardware GPU, since the RTL is also in the process of being open sourced.
  • Intel’s Larrabee in the original “it has display out” config (Later turned into Xeon Phi)
It’s a huge pile of Atom cores, now called E-cores. The lineage out to Xeon Phi definitely falls under the “accelerator” category, but it’s also just a big pile of x86 cores that can run Linux. These cores were in-order with 4-way SMT and AVX-512, so getting somewhat GPU-like.
  • I can produce more examples if that is insufficient, but I think my point is made.

All of those listed examples are places where I’d want to be able to re-use GPU SIMT kernels, especially given the highly composable direction of newer kernels. This means that I’d want the same unchecked access that GPUs want, so that indexing operations map more closely to scatter/gather. I also think that this would harshly limit the ability to grab libraries designed for CPU and throw them on GPUs for GPGPU work, especially for cases where the iGPU would otherwise go unused.

The other option is to lean hard into dependent types and do this the way Ada does, which provides safety and runtime performance at compile-time cost. This would mean that comptime_range[0, 10]() would produce type Integer[min=0, max=10], which can be shown at compile time to be a valid index for a TileTensor[DType.bfloat16, Layout.row_major(32), ...] without any need for bounds checking. I see this as the ideal way to handle this problem for performance-sensitive code, since it provides both performance and correctness benefits and unlocks a lot of other neat tricks.
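As a rough illustration of the Ada-style idea (a Python sketch; `RangedInt` and `Buffer` are hypothetical stand-ins, not Mojo APIs): the range check happens once, when the index value is constructed, so element access through it needs no per-access bounds check.

```python
class RangedInt:
    """An integer proven at construction to lie in [lo, hi]."""
    def __init__(self, value, lo, hi):
        if not (lo <= value <= hi):
            raise ValueError(f"{value} not in [{lo}, {hi}]")
        self.value, self.lo, self.hi = value, lo, hi

class Buffer:
    def __init__(self, data):
        self._data = list(data)
    def get(self, idx: RangedInt):
        # In a dependently-typed language this precondition is verified
        # statically; here it stands in for the compile-time proof that
        # the index's whole range fits inside the buffer.
        assert idx.lo >= 0 and idx.hi < len(self._data)
        return self._data[idx.value]  # no runtime bounds check needed

buf = Buffer([1.0, 2.0, 3.0, 4.0])
i = RangedInt(2, 0, 3)  # checked once, here
assert buf.get(i) == 3.0
```

The payoff is that a hot loop indexing with such a type pays the check once per value (or never, when the range is a compile-time constant), rather than once per access.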

Here, I disagree. If you get divergence in bounds checking on a GPU, it means one thread is going off the end of a buffer and your code is incorrect. Full stop, do not pass go, do not collect 200 tokens. At that point, you’re going to be tearing down the kernel and shipping what debug information you can to the user, so performance is no longer a concern.

That’s a really good point, and running a benchmark supports it: it’s about a 0.66% perf loss for a kernel that just does 10,000 accesses. That’s pretty close to CPU studies like this one showing 0.881%. I’ll reword the design doc to account for that.

Additionally, I think that part of democratizing GPU programming means bringing some of the “creature comforts” of CPUs over to GPUs, potentially at performance cost, to make programming GPUs easier. There is a decent variety of mistakes a new user could easily make when first learning GPU programming that would lead the first or last threads of a kernel to run off the end of a buffer.

Yeah, currently it’s a little awkward, but you can turn on GPU bounds checking for debugging and testing with -D ASSERT=all. @Joe has been experimenting with adding build flags to make it more sane and obvious; we’ll have to figure out the best defaults and how much granularity to expose. For example, do we have bounds checking on by default for both CPU and GPU, with a --release flag or something similar to turn it off?

This leaves the question of how to implement the fast path, and I see two options. First, the exact same thing as is done on CPU: .unsafe_get(…).

I don’t think making .unsafe_get() the main way to turn off bounds checking in kernels is the right way to go. We want SOTA perf for any accelerator we support, so it would end up littered heavily throughout the codebase. I think controlling this through build flags is the better option.

This makes the language more consistent, since each bit of hardware doesn’t need to be picked as “more CPU-like” or “more GPU-like”

Yeah, consistency would be great if mojo build and mojo run, for example, have bounds checking on by default for both CPU and GPU, with build flags for turning it off. Everyone has a different take on the best defaults and granularity of build flags, so we’ll have to figure that out in a separate design. But while we don’t yet have these build flags, CPU default-on and GPU default-off is the best way to go.

The other option is to lean hard into dependent types and do this the way Ada does it, which provides safety and runtime performance at compile-time cost.

We do have a comptime bounds check on InlineArray if the index is a parameter, via __getitem_param__. We could potentially add comptime bounds checking to TileTensor when shape_known is true; I’ll give it a go and see if it works.

I think we can take some inspiration from Zig here. Zig has Debug, ReleaseSafe, ReleaseFast, and ReleaseSmall. ReleaseSafe is effectively -O3 -D ASSERT=all, and I think it’s a reasonable default, especially if Mojo can be made to print something like “Building in Release mode with lightweight runtime safety checks” to let users know that those checks can be turned off. ReleaseSmall (-Oz) is very useful for targets with small op caches, or some kinds of dataflow accelerators. I think that having a flag on graph builds to ask MAX to strip the bounds checking would be entirely reasonable for something like the MAX server where the kernels are known to be heavily tested and correct.

I think we can mitigate the annoyance of writing it by adding an LSP action to do the replacement (ideally in both directions), which would also make the migration easier. I see where you’re coming from on the annoyance, but I’m of the opinion that the default should be safe and then you opt in to unsafety for more performance. If that’s a problem for MAX, then using an escape hatch function to change the default for it but not other code seems sensible.

Alternatively, I have an idea that should work for this. This is actually the kind of problem this was designed for: just replace the ErrorHandling example with BoundsChecking.

I think it should be doable even without __getitem_param__.

struct IntRange[min: Int, max: Int](CoordLike, Movable, ImplicitlyCopyable, ...): # traits that SIMD has
    var _value: Int

    fn __init__(out self, var value: Int) raises:
        if value < Self.min or value > Self.max:
            raise RangeError()

        self = Self(value, unchecked=True)

    fn __init__(out self, var value: Int, *, unchecked: Bool):
        # No default for `unchecked`, so this overload never collides
        # with the raising constructor above.
        self._value = value

    comptime CanExpressRange[dtype: DType]: Bool = Scalar[dtype].MIN <= Self.min and Scalar[dtype].MAX >= Self.max
    comptime IsIntegral[dtype: DType]: Bool = dtype.is_integral()

    comptime VariadicType = Variadic.filter_types[CanExpressRange, Variadic.filter_types[IsIntegral, DType.all]]

    comptime static_value: Int = -1
    """The compile-time value if statically known, -1 otherwise."""

    comptime is_static_value = False
    """True if the value is known at compile time."""

    comptime is_tuple = False
    """True if this is a tuple type (`Coord`), False for scalar values."""

    comptime is_value = not Self.is_tuple
    """True if this is a scalar value, False for tuple types."""

    comptime DTYPE = VariadicType[0]
    """The data type for runtime values, or invalid for compile-time values."""

    # Note that unlike the __len__() from Sized, this is a static method.
    @staticmethod
    def __len__() -> Int:
        """Get the number of elements in this type.

        Returns:
            The number of elements (1 for single values, >1 for tuples).
        """
        return 1

    def value(self) -> Int:
        """Get the integer value of this coordinate.

        Only valid for value types (not tuples).

        Returns:
            The integer value.
        """
        return self._value

    def tuple(var self) -> Coord[*Self.VariadicType]:
        """Get this coordinate as a `Coord` tuple.

        Only valid for tuple types.

        Returns:
            The coordinate as a `Coord` tuple.
        """
        return Coord(self._value)

    def product(self) -> Int:
        """Calculate the product of all elements.

        Returns:
            The product of all elements.
        """
        return self._value

    def sum(self) -> Int:
        """Calculate the sum of all elements.

        Returns:
            The sum of all elements.
        """
        return self._value    

And then you add a few more required comptime members to check the value ranges of the individual members, and of the product and the sum, at compile time. If the value ranges of the coord can’t exceed what the buffer can safely be indexed by, then there’s no reason to do a runtime bounds check. This doesn’t need to be as strict as “comptime only” and also doesn’t require unrolling loops to elide bounds checks. It might require a special version of range with comptime-known bounds, but that shouldn’t be too bad.
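The range propagation described above can be sketched with simple interval arithmetic (Python; the helper names are hypothetical): compute the possible range of a sum or product of coordinates, and elide the runtime check only when that range provably fits in the buffer.

```python
def sum_range(ranges):
    """Bounds of a sum of values, each constrained to its own (lo, hi)."""
    return (sum(lo for lo, _ in ranges), sum(hi for _, hi in ranges))

def product_range(ranges):
    """Bounds of a product, assuming all ranges are non-negative."""
    lo = hi = 1
    for r_lo, r_hi in ranges:
        lo, hi = lo * r_lo, hi * r_hi
    return (lo, hi)

def needs_bounds_check(index_range, buffer_len):
    """True unless the whole range is provably a valid index."""
    lo, hi = index_range
    return not (0 <= lo and hi < buffer_len)

# Two coordinates, each known to lie in [0, 15]; the flat index is
# their product, so its range is [0, 225].
coords = [(0, 15), (0, 15)]
assert product_range(coords) == (0, 225)
assert not needs_bounds_check(product_range(coords), 256)  # 225 < 256: elide
assert needs_bounds_check(product_range(coords), 128)      # may overflow: keep
```

In the comptime version the same arithmetic would run on the struct's parameters, so the decision costs nothing at runtime.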

Another thing to consider is that the use cases of List and Tensor are very different. Since List can reallocate, negative indexing is a lot more dynamic than on a comptime-sized Tensor.

The pattern could be that when you want maximally performant parallel code, you use List as the backing store and drop into Tensor mode. Then document that List supports negative indexing and Tensor does not (a thing easily checked at runtime in DEBUG builds). The closer a struct gets to the metal, the less likely negative indexing is supported.

Obvious downside: it’s potentially confusing for it to be supported in some places and not others.
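A Python sketch of that split (the class names are hypothetical stand-ins for List and Tensor): the dynamic container accepts negative indices, while the fixed-size one rejects them outright.

```python
class DynList:
    """Dynamic, reallocatable container: Python-style negative indexing."""
    def __init__(self, data):
        self._data = list(data)
    def __getitem__(self, i):
        if i < 0:
            i += len(self._data)  # wrap once, then bounds-check
        if not (0 <= i < len(self._data)):
            raise IndexError(i)
        return self._data[i]

class FixedTensor:
    """Fixed-size container: non-negative, bounds-checked indices only."""
    def __init__(self, data):
        self._data = tuple(data)
    def __getitem__(self, i):
        if not (0 <= i < len(self._data)):
            raise IndexError(
                f"index {i} out of range (negative indexing unsupported)")
        return self._data[i]

xs = DynList([1, 2, 3])
assert xs[-1] == 3          # fine on the dynamic container

t = FixedTensor([1, 2, 3])
try:
    t[-1]                   # rejected on the close-to-metal container
    assert False
except IndexError:
    pass
```

The check the dynamic container pays (the wrap branch) is exactly what the fixed-size container avoids on hot paths.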

Yeah, I quite like Zig’s build options, but there are many different opinions; we’ll figure out build flags in a different proposal.

I think we’re on the same page that it’s good to have an unsafe_get that never asserts. I thought you meant that should be the only way to turn off bounds checks, including on GPU, which wouldn’t be practical. I was just saying that it’s better to control subscripting bounds checks through build flags. Most of the collections have an unsafe_get, and we’ll leave them there.

Obvious downside: it’s potentially confusing for it to be supported in some places and not others.

Yeah, that’s the trouble; we want things to be consistent across collections. But you can build your own collection that supports negative indexing!