Proposal: Bounds checking and removing negative indexing

Hi all,

See the proposal for bounds checking defaults and removing negative indexing here.

I understand there will be users who like using e.g. x[-1] instead of x[len(x) - 1], and who would like to remain closer to Python. Please have a read to understand our reasoning for wanting to make this tradeoff in the systems and GPU programming context, and if you still have concerns or comments, please raise them here.

Thanks!

6 Likes

I like the proposal.

The ergonomic cost is minimal: x[-1] becomes x[len(x) - 1], and we can add a .last() or .from_end(i) method if desired.

As discussed on Discord, it would be nice to have a .last() method. x[-1] is the most common use case for negative indexing and is used approx. 26 times in the stdlib.
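
For illustration, a minimal sketch of what such a helper could look like, written as a free function (the name, trait bounds, and raising behavior here are assumptions, not a settled API):

fn last[T: Copyable & Movable](xs: List[T]) raises -> T:
    # Equivalent to xs[len(xs) - 1], but raises on an empty list
    # instead of reading out of bounds.
    if len(xs) == 0:
        raise Error("last() called on an empty list")
    return xs[len(xs) - 1]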

6 Likes

I agree that bounds checking by default is good, and that removing negative indexing in pursuit of that is sensible.

On GPU it’s actively harmful: GPU architectures execute in lockstep warps/wavefronts, and divergent branches cause both paths to execute serially. An extra branch on every element access in a tight kernel is a measurable performance regression.

More critically, on GPU we disable bounds checking for this exact reason: branching overhead is unacceptable on hot paths.

Here, I disagree. If you get divergence in bounds checking on a GPU, it means one thread is going off the end of a buffer and your code is incorrect, full stop, do not pass go, do not collect 200 tokens. At that point, you’re going to be tearing down the kernel and shipping what debug information you can to the user, so performance is no longer a concern.

Additionally, I think that part of democratizing GPU programming means bringing some of the “creature comforts” of CPUs over to GPUs, potentially at a performance cost, to make programming GPUs easier. There is a decent variety of mistakes a new user could easily make when first learning GPU programming that would cause the first or last threads of a kernel to run off the end of a buffer, and I think the somewhat unfamiliar programming model may actually make that more likely for new users. On a consumer GPU there’s also a decent chance that the memory on the other side of the buffer is mapped, due to a lack of VRAM and Mojo seeming to generally prefer not to use the virtual addressing APIs, so such accesses may not even fault. This memory pressure also means that classic ways of dealing with this, such as no-read, no-write guard pages, will be difficult to implement.

This leaves the question of how to implement the fast path, and I see two options. First, the exact same thing as is done on CPU: .unsafe_get(...). This makes the language more consistent, since each bit of hardware doesn’t need to be classified as “more CPU-like” or “more GPU-like” for bounds safety, based on the likely subjective determination made by whoever brings up the architecture. Here are a few examples of things which heavily blur that line, to show why I think this is subjective at best:

  • The Bolt Graphics Zeus “GPU”
    • This is a GPU that uses an out-of-order RISC-V core with 6-10-wide dispatch, has a very substantial amount of raytracing acceleration, and has a 2048-bit VLEN (512-bit DLEN). It also natively runs Linux. What I have been shown about the chip leads me to believe that the claims of beating a 5090 in raytracing are credible. So, does Mojo leave bounds checking on, since it’s RISC-V? Or is it disabled, since it has a large (presented) vector width? Or is it back on, since it’s out of order?
  • The AiNekko ET-SoC-1 (formerly from Esperanto)
    • 1088 cores of in-order RISC-V, each with a 256-bit vector unit and a 512-bit tensor unit, with GPU-like scratchpad memory and caches requiring GPU-like explicit synchronization. There’s work to make it act as an “llvmpipe offload” to turn it into an open-hardware GPU, since the RTL is also in the process of being open sourced.
  • Intel’s Larrabee in the original “it has display out” config (later turned into Xeon Phi)
    • It’s a huge pile of Atom cores, now called E-cores. The lineage out to Xeon Phi definitely falls under the “accelerator” category, but it’s also just a big pile of x86 cores that can run Linux. These cores were in-order with 4-way SMT and AVX-512, so getting somewhat GPU-like.
  • I can produce more examples if that is insufficient, but I think my point is made.

All of those listed examples are places where I’d want to be able to re-use GPU SIMT kernels, especially given the highly composable direction of newer kernels. This means that I’d want the same unchecked access that GPUs want, so that indexing operations map more closely to scatter/gather. I also think that this would harshly limit the ability to grab libraries designed for CPU and throw them on GPUs for GPGPU work, especially in cases where the iGPU would otherwise go unused.
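
For reference, here is roughly what that fast path looks like with today’s stdlib (unsafe_get does exist on List, though the exact signature may differ):

var xs = [10, 20, 30]
var a = xs[2]             # checked access: asserts/raises when out of bounds
var b = xs.unsafe_get(2)  # unchecked fast path: UB if the index is invalid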

The other option is to lean hard into dependent types and do this the way Ada does it, which provides safety and runtime performance at a compile-time cost. This would mean that comptime_range[0, 10]() would produce the type Integer[min=0, max=10], which can be shown at compile time to be valid for indexing a TileTensor[DType.bfloat16, Layout.row_major(32), ...] without any need for bounds checking. I see this as the ideal way to handle this problem for performance-sensitive code, since it provides both performance and correctness benefits and unlocks a lot of other neat tricks.

1 Like

Here, I disagree. If you get divergence in bounds checking on a GPU, it means one thread is going off the end of a buffer and your code is incorrect, full stop, do not pass go, do not collect 200 tokens. At that point, you’re going to be tearing down the kernel and shipping what debug information you can to the user, so performance is no longer a concern.

That’s a really good point, and running a benchmark supports it: it’s about a 0.66% perf loss for a kernel that just does 10,000 accesses. That’s pretty close to CPU studies like this one showing 0.881%. I’ll reword the design doc to account for that.

Additionally, I think that part of democratizing GPU programming means bringing some of the “creature comforts” of CPUs over to GPUs, potentially at a performance cost, to make programming GPUs easier. There is a decent variety of mistakes a new user could easily make when first learning GPU programming that would cause the first or last threads of a kernel to run off the end of a buffer.

Yeah, currently it’s a little awkward, but you can turn on GPU bounds checking for debugging and testing with -D ASSERT=all. @Joe has been experimenting with adding build flags to make it more sane and obvious; we’ll have to figure out the best defaults and how much granularity to expose. For example, do we have bounds checking on by default for both CPU and GPU, with a --release flag or something similar to turn it off?
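
For reference, enabling the checks today looks something like this (the file name is a placeholder, and the exact command shape may evolve with the build-flag work):

mojo build -D ASSERT=all my_kernel.mojo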

This leaves the question of how to implement the fast path, and I see two options. First, the exact same thing as is done on CPU: .unsafe_get(…).

I don’t think making .unsafe_get() the main way to turn off bounds checking in kernels is the right way to go. We want SOTA perf for any accelerators we support, and so it’d be littered heavily throughout the codebase. I think controlling this through build flags is the better option.

This makes the language more consistent, since each bit of hardware doesn’t need to be classified as “more CPU-like” or “more GPU-like”

Yeah, consistency would be great if, e.g., mojo build and mojo run have bounds checking on by default for both CPU and GPU, and we have build flags for turning it off. Everyone has a different take on what’s best in terms of the defaults and granularity of build flags, so we’ll have to figure it out in a separate design. But until we have these build flags, CPU default on and GPU default off is the best way to go.

The other option is to lean hard into dependent types and do this the way Ada does it, which provides safety and runtime performance at a compile-time cost.

We do have a comptime bounds check on InlineArray if the index is a parameter, via __getitem_param__. We could potentially add comptime bounds checking to TileTensor when shape_known is true; I’ll give it a go and see if it works.
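
To sketch the idea (a hypothetical struct and method, not the actual InlineArray or TileTensor API): when the index is a parameter, constrained can reject out-of-range accesses at compile time, so no runtime check is needed.

@fieldwise_init
struct FixedVec[size: Int]:
    var data: InlineArray[Int, size]

    fn get[i: Int](self) -> Int:
        # Rejected at compile time if the index parameter is out of range;
        # no runtime check is emitted for valid indices.
        constrained[0 <= i and i < size, "index out of bounds"]()
        return self.data[i]

# v.get[2]() on a FixedVec[4] compiles; v.get[9]() fails to compile.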

I think we can take some inspiration from Zig here. Zig has Debug, ReleaseSafe, ReleaseFast, and ReleaseSmall. ReleaseSafe is effectively -O3 -D ASSERT=all, and I think it’s a reasonable default, especially if Mojo can be made to print something like “Building in Release mode with lightweight runtime safety checks” to let users know that those checks can be turned off. ReleaseSmall (-Oz) is very useful for targets with small op caches, or for some kinds of dataflow accelerators. I think that having a flag on graph builds to ask MAX to strip the bounds checking would be entirely reasonable for something like the MAX server, where the kernels are known to be heavily tested and correct.

I think we can mitigate the annoyance of writing it by adding an LSP action to do the replacement (ideally in both directions), which would also make the migration easier. I see where you’re coming from on the annoyance, but I’m of the opinion that the default should be safe and then you opt in to unsafety for more performance. If that’s a problem for MAX, then using an escape hatch function to change the default for it but not other code seems sensible.

Alternatively, I have an idea that should work for this. This is actually the kind of problem this was designed for; just replace the ErrorHandling example with BoundsChecking.

I think it should be doable even without __getitem_param__.

struct IntRange[min: Int, max: Int](CoordLike, Movable, ImplicitlyCopyable, ...): # traits that SIMD has
    var _value: Int

    fn __init__(out self, var value: Int) raises:
        if value < Self.min or value > Self.max:
            raise RangeError()

        self = Self(value, unchecked=True)

    # Keyword-only marker; no default, so Self(value) unambiguously
    # resolves to the checked __init__ above.
    fn __init__(out self, var value: Int, *, unchecked: Bool):
        self._value = value

    comptime CanExpressRange[dtype: DType]: Bool = Scalar[dtype].MIN <= Self.min and Scalar[dtype].MAX >= Self.max
    comptime IsIntegral[dtype: DType]: Bool = dtype.is_integral()

    comptime VariadicType = Variadic.filter_types[CanExpressRange, Variadic.filter_types[IsIntegral, DType.all]]  # DTypes.all

    comptime static_value: Int = -1
    """The compile-time value if statically known, -1 otherwise."""

    comptime is_static_value = False
    """True if the value is known at compile time."""

    comptime is_tuple = False
    """True if this is a tuple type (`Coord`), False for scalar values."""

    comptime is_value = not Self.is_tuple
    """True if this is a scalar value, False for tuple types."""

    comptime DTYPE = VariadicType[0]
    """The data type for runtime values, or invalid for compile-time values."""

    # Note that unlike the __len__() from Sized, this is a static method.
    @staticmethod
    def __len__() -> Int:
        """Get the number of elements in this type.

        Returns:
            The number of elements (1 for single values, >1 for tuples).
        """
        return 1

    def value(self) -> Int:
        """Get the integer value of this coordinate.

        Only valid for value types (not tuples).

        Returns:
            The integer value.
        """
        return self._value

    def tuple(var self) -> Coord[*Self.VariadicType]:
        """Get this coordinate as a `Coord` tuple.

        Only valid for tuple types.

        Returns:
            The coordinate as a `Coord` tuple.
        """
        return Coord(self._value)

    def product(self) -> Int:
        """Calculate the product of all elements.

        Returns:
            The product of all elements.
        """
        return self._value

    def sum(self) -> Int:
        """Calculate the sum of all elements.

        Returns:
            The sum of all elements.
        """
        return self._value    

And then you add a few more required comptime members to check the value ranges of the individual members, and the product and the sum, at compile time. If the value ranges of the coord can’t exceed what the buffer can safely be indexed by, then you don’t have any reason to do a runtime bounds check. This doesn’t need to be as strict as “comptime only” and also doesn’t require unrolling loops to elide bounds checks. It might require making a special version of range that needs comptime-known bounds, but that shouldn’t be too bad.
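
A hedged sketch of what that elision buys you, building on the IntRange above (a hypothetical helper; the Float32 buffer type is arbitrary):

def load[n: Int, min: Int, max: Int](
    buf: InlineArray[Float32, n], idx: IntRange[min, max]
) -> Float32:
    # If the range provably fits the buffer, no runtime check is needed;
    # otherwise this fails to compile instead of silently checking at runtime.
    constrained[min >= 0 and max < n, "index range may exceed the buffer"]()
    return buf[idx.value()]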

Another thing to consider is that the use cases of List and Tensor are very different. Since List can reallocate, negative indexing is a lot more dynamic than with a comptime-sized Tensor.

The pattern could be that when you want maximally performant parallel code, you just use List as the backing store and drop into Tensor mode. Then document that List supports negative indexing and Tensor does not (a thing easily checked at runtime in DEBUG builds). The closer a struct gets to the metal, the less likely negative indexing is supported.

Obvious downside: it’s potentially confusing for it to be supported in some places and not others.

Yeah, I quite like Zig’s build options. There are many different opinions though; we’ll figure out build flags in a different proposal.

I think we’re on the same page that it’s good to have an unsafe_get that never asserts. I thought you meant that should be the only way to turn off bounds checks, including on GPU, which wouldn’t be practical. I was just saying that it’s better to control subscripting bounds checks through build flags. Most of the collections have an unsafe_get, and we’ll leave them there.

Obvious downside: it’s potentially confusing for it to be supported in some places and not others.

Yeah, that’s the trouble; we want things to be consistent across collections. But you can build your own collection that supports negative indexing!

I really like this proposal.

That said, in terms of priorities, I’d much rather see Mojo bring back runtime bounds checking for List. Its current behavior feels so unsafe.

1 Like

My 2c, coming from a primarily Python application dev with an interest in performance and efficiency but also in safe code (computer vision work, so C++ is also involved, hopefully to be replaced with Mojo): I very much dislike the proposal and the drift from Python conventions and semantics.

At the same time, I understand the interest in very high performance, which pretty much comes at the expense of safe code and convenience. However, you have to acknowledge that, long term, most developers won’t do this kind of work in Mojo.

I would like the “default”, straightforward, simple code to be safe and to have behavior and semantics as close as possible to Python. For the high-performance stuff, a compromise in “user-friendliness” can be made. Concretely, I would propose something like x[^i^] to be the non-bounds-checked, non-negative-index-allowed version. For the multi-dimensional case it would be [^i,j,k^], essentially turning [^...^] into an operator (dunder __unsafegetitem__ and __unsafesetitem__). Alternatively, the sigil could act as a “decorator” for the indexer, resulting in x[^i, ^j, ^k], but I don’t know exactly how feasible it is to implement something like this, and it is also harder to type.

In a similar vein, I think integer division a / b should act as it does in Python (i.e. return a float), and I propose introducing a ^/ division operator (a ^/ b) to return the same type as the operands.

Of course, there are other symbol options instead of ^, e.g. # or !

As far as I’m concerned this is a really exciting improvement!
I do think too that indexing from the end should be a separate calculation.
A default behavior that ensures top performance, which you can build on top of, seems super important. :+1:

1 Like

Hi Turcik,

Thanks for the interesting suggestion. Having a ^..^ or other special symbol to opt in to indexing performance would mean having it littered through kernel code, which is the majority of Mojo in the wild today. We’d have to update a huge chunk of our codebase if we went down that path, and educate people on what this special symbol means, as Mojo programmers would encounter it very early. I don’t think I’d be able to convince our kernel programmers that this was the right direction, and they’re a very important audience, both internally and externally.

At the same time, I understand the interest in very high performance, which pretty much comes at the expense of safe code and convenience

Here we are making a slight tradeoff in convenience, in that x[-1] will need to be written as x[len(x) - 1], but not at the cost of safety, as this allows bounds checking to be turned on by default on CPU. What we get back for the convenience tradeoff is simplicity and performance by default. The other tradeoff is Python similarity, and I understand that concern, but I hope that the compile-time errors make it easy to transition Python code to Mojo. In our codebase there are only a dozen or so places where we actually used negative indexing, and it was very easy and fast to make the updates for the draft PR.

I totally understand your concerns with drifting from Python. However, we really want out-of-bounds indexing on CPU to throw assertion errors, and this was the best set of tradeoffs, given Mojo’s focus on performance, to make it happen. You can absolutely build your own package with collections that are closer to Python, though, and I’d love to see that!

The ergonomic cost does bother me. When I have to write it out in other languages, I miss Python. But as the proposal points out, negative indexes are almost always used in the literal syntax. x[-y] is usually just shorthand for x[len(x) - y] in the same way that x += y is for x = x + y, and I think this is really its main benefit. So rather than having __getitem__ handle a negative value, could the compiler handle the negative sign as part of the bracket syntax for __getitem__ and automatically compile x[-y] as x[len(x) - y]? This would be simply on the basis of the negative sign being the first character after the opening bracket or after a slice colon, nothing about math within the brackets. Then the typical, idiomatic use cases of negative offsets could remain valid, even if y = -1; x[y] would not. I think losing that latter case would hardly be noticed, and this would be among those differences between Mojo and Python where it mostly works out the same despite being very different under the hood.

The negative-sign syntactic sugar might have to be restricted to where a variable is being indexed, because automatically rewriting foo()[-1] as foo()[len(foo()) - 1] would, of course, be very bad. Or maybe this could still be handled well by the compiler, treating it as something like temp = foo(); temp[len(temp) - 1]. But if not, the restriction wouldn’t be so bad.

The compiler would need to know from the collection type whether this handling of the negative sign is to be done, since __getitem__ is not only for 0-based offsets. There could be a boolean class attribute to flag this, or an alternative name to __getitem__, maybe __getoffset__, that preempts __getitem__ for implementing []; this would be the way to opt in to the negative-sign handling. Then, using __getoffset__, you could have bounds checking with hardly any cost for users.
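
To make the proposed rewrite concrete (purely illustrative, none of this is implemented):

var x = [10, 20, 30]
var a = x[len(x) - 1]  # the explicit form under the proposal
# With the suggested sugar, the literal form would be accepted and
# rewritten at compile time:
#   x[-1]   -->  x[len(x) - 1]
# but a runtime negative value would still be an error:
#   var y = -1
#   x[y]    # out-of-bounds error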

I think we should add a trait that is something like this:

trait Indexable(Sized):
  comptime IndexElement: AnyType
  comptime IndexingException: AnyType = Never

  def __getitem__(
    self, idx: Some[Indexer]
  ) raises Self.IndexingException -> Self.IndexElement: ...

  def __getitem__(
    self, *, neg: Some[Indexer]
  ) raises Self.IndexingException -> Self.IndexElement:
    return self[len(self) + Int(neg)]  # neg is Python-style, e.g. -1 means the last element

  def last(self) -> Self.IndexElement:
    return self[len(self) - 1]

that way we allow a simple way to opt into negative indexing (arr[neg=-2]) and avoid having to reimplement the .last() method everywhere.
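
A quick usage sketch, assuming a collection conforms to the trait above and keyword subscripts work as intended:

var arr = [10, 20, 30]
print(arr[neg=-2])  # 20, i.e. arr[len(arr) - 2]
print(arr.last())   # 30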

Implementing multi-dimensional negative indexing would also be achievable with something like

  def __getitem__(
    self, *, neg: VariadicPack[element_trait=Indexer, ...] # or using Tuple
  ) raises Self.IndexingException -> Self.IndexElement: ...

1 Like

Interesting idea, thanks @Derek, we’ll discuss with the team.

Yeah, something like this would be nicer than .last() or .from_back(x), etc.

I like it. Just the keyword should be more expressive, as neg only refers to how it would read in Python and not to what it does. Maybe rev for “reverse” indexing would be better?

I understand that reinterpreting the - is tempting, but I am concerned that this is the kind of pitfall you really need to know about, yet could easily miss.

This would be expected to work but fail in practice:

if condition:
    idx = -1
else:
    idx = 0
value = collection[idx]  # idx == -1 at runtime: there is no literal for the sugar to rewrite, so this is an out-of-bounds error

This is also an example where the simple keyword fails (though at least it is obvious there).

list[-1:][0] slicing does the job :wink:

var list = [1, 2, 3]
for idx in range(-3, 3):
    print(list[idx:][0])
# prints 1 2 3 1 2 3 (negative and positive indexing in the same loop)

I still like list[bidi: idx] because it would also work with positive and negative indexing. bidi could be replaced by any keyword that indicates bidirectional indexing. But I’m not sure whether it’s worth it, because in most use cases negative indexing is only used for -1, and list.last() would do this job.

I actually much prefer .first() and .last() over [0] / [-1].
So many languages have them, and it’s very expressive, with clear intent :nerd_face:

Of course, it’s still totally possible to have all of those for people to pick their favorite. But sometimes less is more :smiley:

1 Like

Kind of weird that negative indexing works for slices but not for plain indices. But actually, for slicing, I find negative indexing particularly important.

Classic example:

m = (a[1:] + a[:-1]) / 2
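
For contrast, without negative slice bounds the same expression would have to spell out the length:

m = (a[1:] + a[:len(a) - 1]) / 2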