`Indexer` and `Int` vs `UInt`

@clattner has expressed a desire to get rid of the Indexer trait and standardize on Int as the type used for indexing in Mojo. I disagree. At a minimum, we should standardize on UInt, and since preserving negative indexing may be worth it for some use-cases, the Indexer trait may be worth keeping.

I’ll start by saying that I’m a big fan of “make invalid states unrepresentable”. It gives the compiler more information to work with, and helps developers use an API correctly. I also think that allowing negative numbers near indexing code which isn’t explicitly designed to handle them is a bad idea, since it often leads to checks like index < length followed by a load or store at that offset from a pointer. I think it’s also relevant to look at large projects written in languages which have both signed and unsigned types, to determine which kinds of indexing are most common:

| Project | Count of Int | Count of UInt | Int Ratio | UInt Ratio |
| --- | --- | --- | --- | --- |
| Linux | 22856 | 52770 | 0.302224 | 0.697776 |
| DPDK | 154 | 4641 | 0.032117 | 0.967883 |
| The LLVM Monorepo | 676 | 189792 | 0.003549 | 0.996451 |
| swiftlang/swift | 72 | 4278 | 0.016552 | 0.983448 |
| AMD’s ‘TheRock’ and dependencies | 3853 | 308603 | 0.012331 | 0.987669 |
| VLC Media Player | 798 | 5861 | 0.119838 | 0.880162 |
| Chromium | 1446 | 122794 | 0.011639 | 0.988361 |
| rust-lang/rust | 9373 | 30989 | 0.232223 | 0.767777 |
| nvidia/TensorRT | 1 | 1439 | 0.000694 | 0.999306 |
| Postgres | 267 | 2052 | 0.115136 | 0.884864 |
| MongoDB | 702 | 96109 | 0.007251 | 0.992749 |

Note: for C/C++, Int is ssize_t and UInt is size_t; for Rust, Int is isize and UInt is usize.

For C/C++, run this bash command in the repository root:

```bash
SSIZE=$(rg "ssize_t" | wc -l); BOTH=$(rg "size_t" | wc -l); printf "ssize_t=%d, size_t=%d, ssize_t_ratio=%f, size_t_ratio=%f\n" $SSIZE $(( $BOTH - $SSIZE )) $(awk "BEGIN { print $SSIZE / $BOTH }") $(awk "BEGIN { print ($BOTH - $SSIZE) / $BOTH }")
```

For Rust:

```bash
SSIZE=$(rg "isize" | wc -l); BOTH=$(rg "isize|usize" | wc -l); printf "isize=%d, usize=%d, isize_ratio=%f, usize_ratio=%f\n" $SSIZE $(( $BOTH - $SSIZE )) $(awk "BEGIN { print $SSIZE / $BOTH }") $(awk "BEGIN { print ($BOTH - $SSIZE) / $BOTH }")
```

While not the most scientific survey of programmer habits, it still comes down staggeringly on one side. To me, what this says is that if we must have a default, it should be UInt. I’m open to adding more projects to the list, but these are fairly varied projects. Even Rust, which only lets you index with usize (indexing with anything else requires an explicit cast), still has a strong majority of usize, even though its codebase contains more i32 than u32 and more i64 than u64.

There are substantial ergonomics issues with using Int for indexing:

  • Every time a library or application calls len on a type it does not own, and which does not specify that it will never have a negative length, it must check for a negative length. This also means that anything generic over Sized must check for negative lengths.
  • A user will get an abort or other runtime error from passing in a negative value, instead of LSP feedback that the type does not support negative indexing.
  • Losing half of your address range on an NPU, where you may very well have a single array which takes up >50% of your address space since many are 32 bit, is distinctly not good.

I think that there will also be performance impacts from checking for negative values, so I’d like to propose an experiment: take everywhere in LayoutTensor and NDBuffer that currently takes an unsigned integer or an Indexer and convert it to an Int, add negative index handling to every public method of those types which takes an Int for indexing (including those that take IndexLists or similar), and then use Modular’s benchmark suite to observe the performance impact. I looked into doing it myself, but I’m not familiar enough with that part of MAX, especially the internals of tiling, to do it properly.

6 Likes

cc @gryznar @martinvuyk @sora (because I know you have opinions here) and @joe since you were part of the last time we had this debate.

1 Like

Regarding UInt: if we can use it, why not do so where appropriate? It is strange to me to have such a powerful tool and still stick to Int even where it makes no sense (like with sizes). Normalizing everything to use Int looks to me like trying to solve every problem with only a hammer, ignoring the other tools.

3 Likes

I assume you meant index >= 0.

I’m not sure you can deduce a good design for indexing by gathering UInt usage data from other languages. Other languages (and their stdlib) have tons of flaws. Who’s to say that their design for indexing is something Mojo should copy?

Yeah, I can see this being a massive pain. For example, imagine a library that wants to merge two collections. It might:

  • Get the lengths len1 = len(c1) and len2 = len(c2)
  • Allocate a buffer for len1 + len2 elements
  • Run a loop with range(len1) to initialize the first len1 elements
  • Run a loop with range(len2) to initialize the remaining len2 elements

The catch is: This algorithm will suffer from a buffer overflow if either of the lengths is negative. For example, if len2 = -3 then the sum of the lengths will be 3 elements shorter than the length of the first collection, so the loop that runs range(len1) will write past the end of the buffer.

This is a very subtle bug, and it’s one that would have been impossible if the lengths were UInts.
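To make the failure mode concrete, here is a minimal Python sketch of that merge routine (the function and names are hypothetical; a Python list stands in for a raw buffer, so the out-of-bounds write surfaces as an IndexError rather than silent memory corruption):

```python
# Hypothetical merge that trusts the reported lengths, as in the scenario above.
def merge(c1, len1, c2, len2):
    buf = [None] * (len1 + len2)  # undersized if either length is negative
    for i in range(len1):         # writes past the end when len2 < 0
        buf[i] = c1[i]
    for i in range(len2):
        buf[len1 + i] = c2[i]
    return buf

print(merge([1, 2, 3], 3, [4, 5], 2))  # well-behaved lengths: [1, 2, 3, 4, 5]

# A buggy __len__ reports -3: the buffer gets 6 + (-3) = 3 slots,
# but the first loop tries to write 6 elements.
try:
    merge([1, 2, 3, 4, 5, 6], 6, [], -3)
except IndexError as err:
    print("out-of-bounds write:", err)
```

In a language with raw pointers, the second call would scribble past the allocation instead of raising an error.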

Update: I just remembered that Python’s len() function already validates that the value returned from __len__ is ≥ 0. If it isn’t, a ValueError is raised. So if Mojo mimics Python in this way then the above bug cannot happen. Unfortunately though, this will incur runtime overhead on every use of len(), not to mention the risk of crashing/raising. So this is still not a great approach.

That would be an annoying source of bugs, for sure! Especially for users coming from Python.

This scenario seems super niche, and you could work around it by using unsafe escape hatches etc. Therefore, I’m not convinced we need to find a design that accounts for this scenario. Making Mojo convenient for >99% of users is more important.

1 Like

What exactly is the plan?

  1. Support negative indexes, rely on outlining, and detect patterns like UInt63-to-Int casts.
  2. Don’t support negative indexes and deviate from Python.

1 Like

Aside: it’s confusing to just limit numbers to powers of 2. If you say it’s important to limit the range of a number, then the range should probably be arbitrary. I guess this would mostly be a compile-time thing, apart from layout optimizations.

```mojo
Scalar[start: Int, end: Int]
```

I also think that allowing negative numbers near indexing which isn’t explicitly designed to handle it is a bad idea, since it often leads to checks like index < length

I assume you meant index >= 0.

What I was trying to say is that it is likely that people will only check index < length and forget about negatives, not doing the check you recommended.

think it’s also relevant to look at large projects written in languages which have both signed and unsigned types. […] To me, what this says is that if we must have a default, it should be UInt.

I’m not sure you can deduce a good design for indexing by gathering UInt usage data from other languages. Other languages (and their stdlib) have tons of flaws. Who’s to say that their design for indexing is something Mojo should copy?

I am attempting to dispel the notion that using Int/ssize_t is the default that programmers will be familiar with. At a minimum, UInt/size_t can’t be dismissed as a default because of a lack of familiarity. These are all large projects, all of which have had plenty of time to run into various use-cases, and the overwhelming majority use UInt/size_t.

Update: I just remembered that Python’s len() function already validates that the value returned from __len__ is ≥ 0. If it isn’t, a ValueError is raised. So if Mojo mimics Python in this way then the above bug cannot happen. Unfortunately though, this will incur runtime overhead on every use of len(), not to mention the risk of crashing/raising. So this is still not a great approach.

To me, this cuts against compile-time correctness, and functionally means that Python returns a UInt from len already, which is a compelling case for Mojo to use a UInt there.

Losing half of your address range on an NPU

This scenario seems super niche, and you could work around it by using unsafe escape hatches etc. Therefore, I’m not convinced we need to find a design that accounts for this scenario. Making Mojo convenient for >99% of users is more important.

The main problem is that we are essentially saying “you need to use another language” if you’re on a 32-bit NPU with 4 GB of attached memory doing ML things, for the purposes of making an indexing scheme which may actually be less ergonomic for people who care about robustness and memory safety and thus have to do negative checks on their inputs.

4 Likes

I agree, using UInt helps a lot with making invalid states unrepresentable.

Using the Indexer trait, we can support negative indexing with a guaranteed zero runtime cost to people using UInt. To me, this is the best of both worlds, if slightly annoying for library authors writing collections. It also means that if someone forgets to account for negative indexes and uses UInt, developers will have the LSP telling them “you need to make sure this isn’t negative” (by casting to UInt) instead of mysterious runtime crashes they need to debug.
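As a rough model of how that zero-cost property works, here is a hypothetical Python sketch (in Mojo the signed/unsigned distinction would be resolved at compile time via the Indexer trait, so the branch disappears entirely for UInt rather than being a runtime flag as it is here):

```python
def normalize_index(idx, length, signed):
    # For signed index types, a negative index counts from the end.
    if signed and idx < 0:
        idx += length
    # For unsigned index types this function is just the identity:
    # a compiler specializing on the index type emits no branch at all.
    return idx

print(normalize_index(-1, 10, signed=True))   # prints 9
print(normalize_index(3, 10, signed=False))   # prints 3
```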

5 Likes

The main reason to limit to powers of 2 is that that’s how hardware works. However, there is a strong argument for the type of construct you’re describing to exist. Ada has it: it lets you create a ranged scalar, and operations on it yield another type. For example, Scalar[0, 5] + Scalar[10, 11] returns a Scalar[10, 16]. This is very powerful for making users aware of overflow, since the language can force you to do checked, wrapping, or saturating math when overflow or underflow may occur. However, such types are typically backed by a known-bit-width type.
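A tiny Python sketch of that Ada-style idea (the class name is invented, and real Ada ranged types also insert runtime checks on assignment, which this omits): adding two ranged values yields a value whose bounds are the interval sum of the operands’ bounds, so the type system always knows when a result might exceed its storage type.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ranged:
    lo: int  # inclusive lower bound of representable values
    hi: int  # inclusive upper bound

    def __add__(self, other):
        # The result's bounds are the interval sum of the operands' bounds.
        return Ranged(self.lo + other.lo, self.hi + other.hi)

print(Ranged(0, 5) + Ranged(10, 11))  # Ranged(lo=10, hi=16)
```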

Update: I just remembered that Python’s len() function already validates that the value returned from __len__ is ≥ 0. If it isn’t, a ValueError is raised. So if Mojo mimics Python in this way then the above bug cannot happen. Unfortunately though, this will incur runtime overhead on every use of len(), not to mention the risk of crashing/raising. So this is still not a great approach.

With UInt as the return type of len(), that runtime crash simply can’t happen.

  1. This has performance implications on indexing. Currently the Indexer trait is used in normalize_index to provide negative indexing with zero runtime cost for unsigned integer types. This is a trade-off between compile-time and runtime performance.
     If Int and UInt were SIMD types (see [Feature Request] Make Int and UInt SIMD types · Issue #3995 · modular/modular · GitHub), we could replace the Indexer trait with a parameterized alias:

```mojo
alias Indexer[d: DType] = SIMD[d, 1] requires d.is_integral()
```

That would simplify signatures from

```mojo
fn __getitem__[I: Indexer](self, idx: I)
```

to

```mojo
fn __getitem__(self, idx: Indexer)
```

  2. This has ergonomic implications. Requiring a standard Int type forces users to cast to and from the type they are working with. An Indexer alias would let users work with their original types without casting.
4 Likes

```mojo
struct Int[signed: Bool = True]: ...
```

Here’s my 2 cents on implicit conversions between Int and UInt.

I can understand not allowing an implicit Int -> UInt conversion, because negative integers are incredibly common in production code, and we don’t want to underflow or crash the program in cases where a negative integer is implicitly converted to a UInt. That would be a massive footgun. Instead, the fact that the conversion can fail should be signposted with an explicit call to UInt(my_integer).

However, surely it is reasonable to have an implicit UInt -> Int conversion. The rationale being: ordinary integers are already capable of overflowing, but integer overflow is very uncommon in practice, because programmers rarely ever work with integers close to INT_MAX. We allow programmers to write x + 1 without ceremony, under the assumption that the operation is going to behave as the programmer expected.

By the same logic, I would expect us to allow an implicit UInt -> Int conversion under a similar assumption: the programmer is very unlikely to be working with a value close to UINT_MAX. And if they were, they would likely be checking their code very carefully anyway, because they are just as much at risk of UInt overflow as they are Int overflow.
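The asymmetry can be illustrated with a quick Python model of 64-bit two’s complement (masking stands in for a bit-cast; the helper names are invented):

```python
BITS = 64

def bitcast_int_to_uint(x):
    # What an unchecked Int -> UInt conversion would do: reinterpret the bits.
    return x % (1 << BITS)

def bitcast_uint_to_int(x):
    # UInt -> Int: values below 2**63 pass through unchanged; only the rare
    # huge top half of the range wraps to a negative number.
    return x - (1 << BITS) if x >= (1 << (BITS - 1)) else x

print(bitcast_int_to_uint(-1))     # 18446744073709551615: a small mistake becomes a huge index
print(bitcast_uint_to_int(12345))  # 12345: ordinary values survive unchanged
```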

In summary: Supporting implicit UInt -> Int conversions (but not the reverse) would likely get us the ergonomics we are looking for, and I can’t see any major downsides. If we support that conversion, then it seems reasonable for collection types to choose whether they want to support Int indexing (with negative indices), UInt indexing, or both.

Have I missed something?

1 Like

This issue is strongly related, especially this part:

```mojo
struct DType:
    alias int = DType(__mlir_attr.`#kgen.dtype.constant<si> : !kgen.dtype`)
    alias uint = DType(__mlir_attr.`#kgen.dtype.constant<ui> : !kgen.dtype`)

alias Int = Scalar[DType.int]
alias UInt = Scalar[DType.uint]
```

If we could extend Int and UInt to be generic aliases over width (defaulting to architecture width), this could simplify many things.

Does the Python programmer care about the hardware? They just want their big int back.
It’s great if you have an array and don’t need to allocate double the size.
I don’t think this is a good API pattern in a Python-like environment.
It seems arbitrary to a Python programmer to only encode types the hardware supports natively.

But a strong optimization idiom is almost like a guarantee, although it requires extra knowledge.

Even if you are fine with ignoring the topmost bit, and ignore that it is probably a security concern, you still cannot optimize the code.

BigInt as default is just inadequate for system-level and performance-oriented programming. Mojo’s possibilities go far beyond being compiled Python.

@Nick

I don’t want casts which narrow the range of values in any way to be implicit. On 64-bit systems I agree that having numbers close to SIZE_MAX (UInt.MAX) is generally a sign of a bug, but on 32-bit systems numbers near 2 billion are a lot more common, and for ML in particular the chance of “really big array that uses most of memory” is higher than in normal code.

Continuing on with related discussion with @clattner from Discord, I think that we can safely allow widening casts which don’t change the sign to be implicit. This means that UInt8 → UInt16 is implicit, but UInt16 → UInt8 must be explicit. UInt and Int change their bit width depending on the target, so I think the best behavior to encourage portable code is to require explicit casts for them. So long as the sign stays the same and the cast is strictly widening, the cast is safe to do since the new type can represent a superset of the values the old one could. fp8_e4m3 → fp32 casts may also be worth consideration, since they make it easier to handle things like CPUs which may have a very limited fp8 operation set but support far more operations (ex: trigonometric operations) at fp16 or fp32.
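A small Python model of the “widening, same-sign casts are lossless” rule (masking simulates fixed-width storage; the helper names are invented):

```python
U8_MASK, U16_MASK = 0xFF, 0xFFFF

def widen_u8_to_u16(x):
    # Widening with the same sign: every uint8 value is representable,
    # so nothing changes and the cast is safe to make implicit.
    return x & U16_MASK

def narrow_u16_to_u8(x):
    # Narrowing: the high byte is silently discarded, so this should
    # require an explicit cast.
    return x & U8_MASK

assert all(widen_u8_to_u16(v) == v for v in range(256))  # round trip is lossless
print(narrow_u16_to_u8(0x1FF))  # prints 255: 0x1FF truncated to its low byte
```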

Chris is concerned about an Int vs UInt schism in library APIs, especially given that many programmers do default to Int, and given opinions like those voiced by @leb-kuchen, where some programmers may want to disregard the hardware. However, I think this can be solved in part by a bit of education and by the way the standard library sets defaults. I’ve already seen enough confused Python programmers to know that Mojo probably needs to explain integer precision in the manual anyway. Given that, we can use that explanation as a segue into talking about using types to constrain what the program can do. For instance, Python’s len builtin already throws an error if the argument’s __len__ returns a negative number [1], and we can present that fact along with an explanation that, instead of a runtime error, Mojo gives users LSP feedback when they write an invalid __len__.

The default return type of len is also very important. With it, we can nudge people towards using a particular type. Right now, Mojo’s len returns an Int, the same as Python does, but it actually violates the expectations of Python developers by being able to return negative values. If we instead use UInt to constrain it, we are expressing the same thing Python does, except at compile time, where LSP feedback can help users who write incorrect __len__ implementations.

Both C++ and Rust have a UInt equivalent and an Int equivalent, but my personal experience and the data I gathered from a variety of projects show that the UInt equivalent is far more popular. In Rust, I think that is partially due to the len(arr) equivalent returning the UInt equivalent. In C++, with implicit casting everywhere, developers still use the UInt equivalent more, despite there being no extra effort needed to use the Int equivalent everywhere in their own code.

I think that some of the confusion in Mojo comes from how many of us learned to code, with loops like for (int i = 0; i < N; i++), or from environments where int is the default integral type. One bold option we have is to rename both Int and UInt to something else, such as Size (UInt) and Offset (Int), and not actually have an Int type. This may be a bit jarring to people at first, but it should help them consider what type they actually want. It helps sweep aside the ingrained “int by default” habit that many people have, and I suspect that we would find the choice between Size and Offset less controversial for the return type of len. I also think that, perhaps, lots of explicit casts may push people towards the unsigned version even if they would normally default to the signed one.

There is also the question of negative indexing. As I’ve said before, I think that having it everywhere by default is dangerous, since many people from languages without that convention may not think of it, and it will simply cause errors when used. By making negative indexing opt-in, we can encourage only those data structures which have actually handled negative indexes to let users provide them. I think that the Indexer trait is a great triumph of Mojo, in that it gives us the best of both worlds, zero-overhead unsigned indexing and negative indexing support in one package, even if it continues the venerable tradition of “writing good library code requires knowing the language better than writing application code”, which is seen in many, many languages.

I did look around, and most CPUs handle immediate (in the assembly sense) negative offsets from a pointer nicely, but I tried a few DMA accelerators (Intel DSA and AMD PTDMA), which both promptly blew up at me when I asked for a negative offset as part of a gather, and libverbs, which is the API that RDMA usually goes through on Linux, along with many other MPI implementations, does not believe in negative indexing at all. Meanwhile, MPI_Gatherv uses C int for indexing and for counts, and NCCL uses size_t. As a result, if we choose a single default, we need to pick whether to force ML programmers or HPC programmers to do tons of casts. If we use Indexer as the default, we can easily adapt to what external libraries expect with minimal performance loss, and then what matters is our internal default, meaning what len returns. Given that Python programmers already expect that len can never be negative, it seems silly to me not to encode that in the type system.

On another note, Chris made the comment

We don’t encode < 100 into types.

This could be something we want to explore at some point. Ada has seen a lot of benefits from it in safety-critical software and having an imperative language with that capability well integrated that both looks like something more programmers are familiar with and is older than C would likely be something that a lot of industries would be interested in. While a lot of work, exploring tracking value ranges in the type system might allow us to make most implicit casts safe, especially when combined with refinement types.

[1]

```python
>>> class Thing:
...     def __len__(self):
...         return -1
...
>>> len(Thing())
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    len(Thing())
    ~~~^^^^^^^^^
ValueError: __len__() should return >= 0
```
1 Like