Discussion: a comptime-signed int for indexing

There has been a long debate about which type should be used for indexing, sizes and the like. My understanding is that Mojo wants to be a type-safe language, which implies that some bounds checking is done. Bounds checking could be simplified if an unsigned type were used. This, in turn, could lead to underflow issues in some cases, though.

I think that a type whose sign is known at compile time could solve some of these issues. When we use such a type,

  • some operations clearly preserve the sign
  • some operations clearly flip the sign
  • some operations are ambiguous, requiring an extra check if the sign needs to be known.

I would argue that many operations fall into the first two categories in practice, making it efficient to work with such a type. In the other cases, an elaborate check would be needed anyway. If implemented, IntLiterals would materialize to comptime signed ints, and those would stay in the comptime signed realm as long as possible.
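To make the three categories concrete, here is a small Python sketch (hypothetical, not Mojo API) that models how a compiler could propagate the comptime sign through operations; None stands for "sign no longer statically known", and the zero edge case is ignored for brevity:

```python
def neg(sign):
    # Negation always flips a known sign (category 2).
    return {"pos": "neg", "neg": "pos", None: None}[sign]

def add(a, b):
    # Adding same-signed values preserves the sign (category 1);
    # mixing signs, or involving an unknown, loses the static
    # information (category 3).
    return a if a == b and a is not None else None

def mul(a, b):
    # Multiplying values of known sign yields a known sign (category 1).
    if a is None or b is None:
        return None
    return "pos" if a == b else "neg"
```

In this model, the first two categories stay in the comptime-signed realm, and only the ambiguous ones fall back to a plain, sign-agnostic Int.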

Example

Suppose the type in question was named ComptimeSignedInt[positive: Bool], where positive indicates whether the value is non-negative. In a collection it could be used as follows:

comptime PosInt = ComptimeSignedInt[True]
comptime NegInt = ComptimeSignedInt[False]

struct SArray[ElementType: Copyable & Movable, size: PosInt](SSizeable):
    var storage: InlineArray[ElementType, Int(size)]

    fn __init__(out self, *, fill: ElementType):
        self.storage = {fill = fill}
    
    fn __getitem__(ref self, var index: Int) raises -> ref[self.storage] ElementType:

        print("Warning: inefficient indexing used.")

        if index < 0:
            index += Int(size)
        
        if index < 0 or index >= Int(size):
            raise Error("Index out of bounds")

        return self.storage.unsafe_get(index)
    
    fn __getitem__(ref self, var index: PosInt) raises -> ref[self.storage] ElementType:
        
        if index >= size:
            raise Error("Index out of bounds")

        return self.storage.unsafe_get(Int(index))
    
    fn __getitem__(ref self, var index: NegInt) raises -> ref[self.storage] ElementType:

        if index < -size:
            raise Error("Index out of bounds")

        return self.storage.unsafe_get(size + index)

    fn __slen__(self) -> PosInt:
        return size


def main():
    arr = SArray[Int_, 10](fill=Int_(0))

    print(get_type_name[type_of(len(arr)+1)]())
    
    for i in srange(1, len(arr) + 1):
        arr[-i] = i * 10

    for i in srange(3, len(arr) - 3):
        arr[i + 1] = Int_(100)

    print("Array contents:")
    for i in srange(len(arr)):
        print(arr[i])

In the example above, all indexing operations are efficient. Only srange(3, len(arr) - 3) is a potentially raising operation, since len(arr) - 3 yields a value whose sign is not known at comptime. This is intended if we decide in favour of bounds checking.

Note that I prefix some builtin names with “S” here to avoid conflicts. Also, I had to introduce another Int_ type to make some overloads unambiguous, but I think this would not be needed if this concept were implemented as a first-class feature.

Question

I am sure smarter and more experienced people have thought about this, and that there are good reasons why this is not the established way to do things. But why?

Otherwise, I would like to discuss this as a potential solution to the indexing problem and turn it into an “actual” proposal.

Reference implementation

Here is a full reference implementation to play around with.

from compile.reflection import get_type_name


@register_passable("trivial")
struct Int_(Indexer, Writable):
    """A replacement for Int that is not the default materialization target of IntLiteral."""

    var value: Int

    fn __init__(out self, value: Int):
        self.value = value

    @implicit    
    fn __init__(out self, value: ComptimeSignedInt):
        self.value = Int(value)
    
    fn __init__(out self, value: UInt):
        self.value = Int(value)

    fn __add__(self, other: Int) -> Self:
        return Self(self.value + other)

    fn __add__(self, other: Self) -> Self:
        return Self(self.value + other.value)
    
    fn __sub__(self, other: Int) -> Self:
        return Self(self.value - other)

    fn __sub__(self, other: Self) -> Self:
        return Self(self.value - other.value)
    
    fn __mul__(self, other: Int) -> Self:
        return Self(self.value * other)

    fn __mul__(self, other: Self) -> Self:
        return Self(self.value * other.value)

    fn __lt__(self, other: Int) -> Bool:
        return self.value < other

    fn __lt__(self, other: Self) -> Bool:
        return self.value < other.value
    
    fn __gt__(self, other: Int) -> Bool:
        return self.value > other

    fn __gt__(self, other: Self) -> Bool:
        return self.value > other.value
    
    fn __le__(self, other: Int) -> Bool:
        return self.value <= other

    fn __le__(self, other: Self) -> Bool:
        return self.value <= other.value

    fn __ge__(self, other: Int) -> Bool:
        return self.value >= other

    fn __ge__(self, other: Self) -> Bool:
        return self.value >= other.value

    fn __neg__(self) -> Self:
        return Self(-self.value)

    fn __mlir_index__(self) -> __mlir_type.index:
        return self.value.__mlir_index__()

    fn write_to(self, mut writer: Some[Writer]):
        writer.write(self.value.__repr__())

@register_passable("trivial")
struct ComptimeSignedInt[positive: Bool](Intable, Stringable, Representable, Writable):
    var value: UInt

    @implicit
    @always_inline
    fn __init__[value: __mlir_type.`!pop.int_literal`](out self: ComptimeSignedInt[True], var v: IntLiteral[value]) where IntLiteral[value]() >= 0:
        self.value = v
    
    @implicit
    @always_inline
    fn __init__[value: __mlir_type.`!pop.int_literal`](out self: ComptimeSignedInt[False], var v: IntLiteral[value]) where IntLiteral[value]() < 0:
        self.value = -v
    
    @implicit
    @always_inline
    fn __init__(out self: ComptimeSignedInt[True], var v: UInt):
        self.value = v

    @implicit
    @always_inline
    fn __init__(out self, var v: Int_) raises:
        print("Init from int")
        @parameter
        if positive:
            if v < 0:
                raise Error("Cannot initialize positive ComptimeSignedInt with negative value")
            self.value = UInt(v)
        else:
            if v >= 0:
                raise Error("Cannot initialize negative ComptimeSignedInt with non-negative value")
            self.value = UInt(-v)
    
    @always_inline
    fn __init__[have_checked: Bool](out self, var v: UInt) where have_checked:
        self.value = v

    @always_inline
    fn __int__(self) -> Int:
        @parameter
        if positive:
            return Int(self.value)
        else:
            return -Int(self.value)
    
    @always_inline
    fn __neg__(self) -> ComptimeSignedInt[not positive]:
        return ComptimeSignedInt[not positive].__init__[True](self.value)

    @always_inline
    fn __add__(self, other: Self, out result: Self):
        result = self
        result.value += other.value

    @always_inline
    fn __sub__(self, other: ComptimeSignedInt[not positive], out result: Self):
        result = self
        result.value += other.value
    
    @always_inline
    fn __iadd__(mut self, other: Self):
        self.value += other.value

    @always_inline
    fn __isub__(mut self, other: ComptimeSignedInt[not positive]):
        self.value += other.value
    
    @always_inline
    fn __add__(self, other: ComptimeSignedInt[not positive]) -> Int_:
        @parameter
        if positive:
            return Int_(self.value) - Int_(other.value)
        else:
            return Int_(other.value) - Int_(self.value) 

    @always_inline
    fn __sub__(self, other: Self) -> Int_:
        @parameter
        if positive:
            return Int_(self.value) - Int_(other.value)
        else:
            return Int_(other.value) - Int_(self.value) 
    
    @always_inline
    fn __add__(self, other: Int_) -> Int_:
        @parameter
        if positive:
            return other + Int_(self.value)
        else:
            return other - Int_(self.value) 
    
    @always_inline
    fn __sub__(self, other: Int_) -> Int_:
        @parameter
        if positive:
            return Int_(self.value) - other
        else:
            return other + Int_(self.value) 
       
    @always_inline
    fn add_bounded(self, other: ComptimeSignedInt[not positive]) -> Self:
        @parameter
        if positive:
            if other.value >= self.value:
                return Self.__init__[True](0)
        else:
            if other.value >= self.value:
                return Self.__init__[True](1)
        
        return Self.__init__[True](self.value - other.value)
    
    @always_inline
    fn sub_bounded(self, other: Self) -> Self:
        @parameter
        if positive:
            if other.value >= self.value:
                return Self.__init__[True](0)
        else:
            if other.value >= self.value:
                return Self.__init__[True](1)

        return Self.__init__[True](self.value - other.value)

    @always_inline
    fn __mul__(self, other: Self) -> ComptimeSignedInt[True]:
        return other.value * self.value
    
    @always_inline
    fn __imul__(mut self, other: ComptimeSignedInt[positive]):
        self.value *= other.value
    
    @always_inline
    fn __mul__(self, other: Int_) -> Int_:
        @parameter
        if positive:
            return other * Int_(self.value)
        else:
            return -other * Int_(self.value)

    @always_inline
    fn __mul__(self, other: ComptimeSignedInt[not positive]) -> ComptimeSignedInt[False]:
        return ComptimeSignedInt[False].__init__[True](other.value * self.value)

    @always_inline
    fn __eq__[other_positive: Bool](self, other: ComptimeSignedInt[other_positive]) -> Bool:
        @parameter
        if positive != other_positive:
            return False
        return self.value == other.value

    @always_inline
    fn __lt__[other_positive: Bool](self, other: ComptimeSignedInt[other_positive]) -> Bool:
        @parameter
        if positive:
            @parameter
            if other_positive:
                return self.value < other.value
            else:
                return False
        else:
            @parameter
            if other_positive:
                return True
            else:
                return self.value > other.value

    @always_inline
    fn __le__[other_positive: Bool](self, other: ComptimeSignedInt[other_positive]) -> Bool:
        return self < other or self == other

    @always_inline
    fn __gt__[other_positive: Bool](self, other: ComptimeSignedInt[other_positive]) -> Bool:
        return not (self <= other)
    
    @always_inline
    fn __ge__[other_positive: Bool](self, other: ComptimeSignedInt[other_positive]) -> Bool:
        return not (self < other)

    @always_inline
    fn __str__(self) -> String:
        @parameter
        if positive:
            return self.value.__str__()
        else:
            return "-" + self.value.__str__()

    @always_inline
    fn __repr__(self) -> String:
        return "ComptimeSignedInt(" + self.__str__() + ")"

    fn write_to(self, mut writer: Some[Writer]):
        writer.write(self.__repr__())


comptime PosInt = ComptimeSignedInt[True]
comptime NegInt = ComptimeSignedInt[False]

trait SSizeable:
    fn __slen__(self) -> PosInt:
        ...

fn len[T: SSizeable](value: T) -> PosInt:
    return value.__slen__()


@register_passable("trivial")
struct srange[positive: Bool = True](Iterable, Iterator, Movable):
    comptime IteratorType[
        iterable_mut: Bool, //, iterable_origin: Origin[iterable_mut]
    ]: Iterator = Self
    comptime Element = ComptimeSignedInt[positive]
    var curr: Self.Element
    var end: Self.Element
    comptime step = Self.Element.__init__[True](1)

    @always_inline
    fn __init__(out self: srange[True], end: Self.Element):
        self.curr = 0
        self.end = rebind[PosInt](end)
    
    @always_inline
    fn __init__(out self, start: ComptimeSignedInt[positive], end: ComptimeSignedInt[positive]):
        self.curr = start
        self.end = end

    @always_inline
    fn __iter__(ref self) -> Self.IteratorType[origin_of(self)]:
        return self

    @always_inline
    fn __next__(mut self, out curr: Self.Element):
        curr = self.curr
        self.curr += self.step

    @always_inline
    fn __has_next__(self) -> Bool:
        return self.curr.value < self.end.value



struct SArray[ElementType: Copyable & Movable, size: PosInt](SSizeable):
    var storage: InlineArray[ElementType, Int(size)]

    fn __init__(out self, *, fill: ElementType):
        self.storage = {fill = fill}
    
    fn __getitem__(ref self, var index: Int) raises -> ref[self.storage] ElementType:

        print("Warning: inefficient indexing used.")

        if index < 0:
            index += Int(size)
        
        if index < 0 or index >= Int(size):
            raise Error("Index out of bounds")

        return self.storage.unsafe_get(index)
    
    fn __getitem__(ref self, var index: PosInt) raises -> ref[self.storage] ElementType:
        
        if index >= size:
            raise Error("Index out of bounds")

        return self.storage.unsafe_get(Int(index))
    
    fn __getitem__(ref self, var index: NegInt) raises -> ref[self.storage] ElementType:

        if index < -size:
            raise Error("Index out of bounds")

        return self.storage.unsafe_get(size + index)

    fn __slen__(self) -> PosInt:
        return size



def main():
    arr = SArray[Int_, 10](fill=Int_(0))

    print(get_type_name[type_of(len(arr)+1)]())
    
    for i in srange(1, len(arr) + 1):
        arr[-i] = i * 10

    for i in srange(3, len(arr) - 3):
        arr[i + 1] = Int_(100)

    print("Array contents:")
    for i in srange(len(arr)):
        print(arr[i])

This is a cool idea, but it would be great to explore it a bit more. How do operations like add and subtract work? 4-5 is negative, but 5-4 is positive. Add has the same issue when one of the operands is negative.

Thank you. Yes, maybe I should have been a little more elaborate.

There are clearly operations where the sign cannot be known at comptime, such as

  • subtraction of similarly-signed values
  • addition of oppositely-signed values
  • operations involving a “common” Int.

My argument is that if I put a value into a place where I want to enforce a specific sign (e.g. a size attribute), then I need to implement a check anyway. Hence, casting from the unknown-sign domain into the comptime-signed domain should be either a raising operation or an unsafe operation that needs explicit handling.

In other words, when I do something that could change the sign, then either (1) I need to check the result again when I put it back into a sign-enforcing place, or (2) I am certain enough about the sign to use an unsafe operation for performance. Note that the check would only occur at the moment I store a value into a comptime-signed variable. Consider the following code:

var a: PosInt = 123
var b: PosInt = 456
var c: Int = 10
var size: PosInt = a * c - b

Here, a * c already yields an Int, so a * c - b remains an Int. Only the final result is cast (via a raising constructor) when it is stored into size. If I want to avoid the raising behaviour, I need to use an explicit unsafe cast.
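As a sketch of that boundary check, here is a plain Python stand-in (the name to_pos_int is hypothetical, standing in for the raising PosInt constructor):

```python
def to_pos_int(value: int) -> int:
    # Hypothetical stand-in for the raising PosInt constructor:
    # the check happens only here, at the store into the signed domain.
    if value < 0:
        raise ValueError("Cannot initialize positive value from negative Int")
    return value

a, b, c = 123, 456, 10
size = to_pos_int(a * c - b)  # 123 * 10 - 456 = 774, so the check passes
```

The intermediate arithmetic runs on plain ints; only the final assignment pays for the check.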

It is true that this comes with some ergonomic issues. For example, PosInt will either not implement a -= operator, or the operator will be raising. If comptime-signed Ints became a first-class feature, we could think of introducing an unsafe decorator for enabling the operator without raising behaviour.

An example

Let’s explore this a little further in a bigger example. Suppose we wanted to implement an iterator that counts down.

The following implementation might be the straightforward one:

struct CountDown(Iterable, Iterator, Movable):
    comptime IteratorType[
        iterable_mut: Bool, //, iterable_origin: Origin[iterable_mut]
    ]: Iterator = Self
    comptime Element = PosInt
    var curr: Self.Element

    @always_inline
    fn __init__(out self, start: Self.Element):
        self.curr = start
    
    @always_inline
    fn __iter__(ref self) -> Self.IteratorType[origin_of(self)]:
        return self

    @always_inline
    fn __next__(mut self, out curr: Self.Element):
        curr = self.curr
        self.curr = self.curr - 1

    @always_inline
    fn __has_next__(self) -> Bool:
        return self.curr >= 0

This does not work, since self.curr - 1 is not a comptime-signedness-preserving operation. If we allowed an extra decorator, we could express this as

    @always_inline
    fn __next__(mut self, out curr: Self.Element):
        curr = self.curr
        @unsafe
        self.curr = self.curr - 1

(Internally, the decorator could set an unsafe parameter to True for the function calls on the following line.) This would have the advantage that “dangerous” operations are clearly visible. Of course, we could also introduce a different syntax for this that does not rely on a decorator.

I would argue that we could also implement this without a comptime-signed Int internally, i.e., leave it to the library author to take care of the safety:

struct CountDown(Iterable, Iterator, Movable):
    comptime IteratorType[
        iterable_mut: Bool, //, iterable_origin: Origin[iterable_mut]
    ]: Iterator = Self
    comptime Element = PosInt
    
    # use an "unsafe" type internally
    var curr: Int

    @always_inline
    fn __init__(out self, start: Self.Element):
        self.curr = start
    
    @always_inline
    fn __iter__(ref self) -> Self.IteratorType[origin_of(self)]:
        return self

    @always_inline
    fn __next__(mut self, out curr: Self.Element):
        # The syntax below would not work with the reference implementation above, but
        # something along these lines could be made up.
        curr = PosInt[unsafe=True](self.curr)

        # This line is unproblematic now
        self.curr = self.curr - 1

    @always_inline
    fn __has_next__(self) -> Bool:
        return self.curr >= 0

Conclusion

So, I do see some disadvantages in the ergonomics, but nothing that could not be solved. It is a safety-first approach, but this could be of benefit in the long run.

Interesting, this might give us a nice way to implicitly cast to UInt from a PosInt. But it only solves part of the problem and adds two more types for people to keep track of. I would personally prefer people to just start using UInt where they expect a value >= 0.

On another topic:

This might still have some of the issues that stem from IntLiteral being lowered through a signed value, even if the internal storage is a UInt. We still don’t even have a way to have a UInt256 bigger than Int256.MAX because of how we’re casting it (see #4049). I think we would need a way to know if the literal is signed or not before defining what it lowers to. I opened #4342 a while ago but Chris answered that it doesn’t seem possible to implement.

Could you specify which problems are not solved? (I’d find it helpful to have these laid out explicitly, so that it is possible to think about whether they can be addressed somehow or are inherent.)

What I could think of is int overflow, which indeed I believe is not possible to solve with a simple type, and another one is platform-dependent precision, which is orthogonal to the proposed solution, since we could use the proposed wrapper with any precision internally. Anything else?

I don’t think that two additional types would be a problem per se, because implicit casting to Int is always possible. So no one would be forced to use this. If this were to be implemented, people would be encouraged to use either PosInt (for sizes, positive indices) or NegInt (for negative indexing), so UInt would not need to be supported or could be handled with an implicit cast. So only one additional type, and it would remain an opt-in solution.

> What I could think of is int overflow, which indeed I believe is not possible to solve with a simple type, and another one is platform-dependent precision, which is orthogonal to the proposed solution, since we could use the proposed wrapper with any precision internally. Anything else?

So these types would only be used for indexing? I thought you meant them as a replacement for IntLiteral but with signedness information.

> I don’t think that two additional types would be a problem per se, because implicit casting to Int is always possible. So no one would be forced to use this. If this were to be implemented, people would be encouraged to use either PosInt (for sizes, positive indices) or NegInt (for negative indexing), so UInt would not need to be supported or could be handled with an implicit cast. So only one additional type, and it would remain an opt-in solution.

Once Int and UInt are aliases for SIMD scalars, I think adding these types would pose an issue for generics. And I think using PosInt is just a roundabout way of using UInt for an API where a value >= 0 is expected.

I would personally prefer if we made an implicit constructor UInt ↔ Int that raises on overflow/underflow, and used SIMD.cast() (or e.g. a UInt(unchecked=Int(0)) constructor) to provide a performant escape hatch.


Indexing and other places where values of a specific sign are expected would be the primary motivation. (And yes, Int literals should materialize to the types with explicit signedness information first.) The advantage over literals would be that it is also possible to use the values in dynamic computations where the resulting sign is clear. Think e.g. of a range iterator with positive start and end.

Sure, just with the additional feature that reversing the sign is possible (e.g. for negative indexing), and that int underflow issues are caught.

Maybe, but not necessarily. If this were to become a first-class feature, we could potentially add a corresponding DType for it. Since some operators on ComptimeSignedInt would yield a different DType, this route may be blocked, though. Of course, we could make the type wrap any UInt SIMD and cast implicitly to UInt and Int.

Why not lean into dependent types even more and do value-range-based things, similar to Ada’s much-loved range integers?

struct DependentInt[min: IntLiteral, max: IntLiteral]:
    ...

You can use the smallest data type that can represent the range, including safely using UInts, and you can have underflow/overflow only be checked when it could possibly happen.

Ex: DependentInt[0, 5](...) - DependentInt[0, 6](...) -> DependentInt[-6, 5]

It’s more stress on the compiler, but does also allow fully safe and checked math with no runtime penalty vs the best a human could do.
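The core of such a checker is ordinary interval arithmetic. Here is a minimal Python sketch (a stand-in, not the proposed compiler machinery) of the subtraction rule behind the example above:

```python
def sub_range(a, b):
    # [a_min, a_max] - [b_min, b_max] = [a_min - b_max, a_max - b_min]
    (a_min, a_max), (b_min, b_max) = a, b
    return (a_min - b_max, a_max - b_min)

# DependentInt[0, 5] - DependentInt[0, 6] -> DependentInt[-6, 5]
print(sub_range((0, 5), (0, 6)))  # (-6, 5)
```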

Keep in mind that LLVM supports arbitrary fixed-width integer types with very large widths, so if we want an arbitrary-precision type, this might be a good option for a balance between usability and performance. The LLVM backend can decide how it wants to deal with the various-width integer types. This also nicely encapsulates things like bounds checking for statically sized collections. It also means that the user will run out of memory before the implementation really runs into correctness issues.

This also lets a user “prove” that there is no need for a >= 0 bounds check on architectures where an Int → UInt cast can have a performance cost (e.g. WASM), and it ensures that any bounds checking which can be done at compile time actually is done at compile time. It also allows safe implicit casting to any integer type that can safely contain the value range.


I love it; we could even express literals as DependentInt[n, n]. I have no idea how well the compiler could handle all this, though.

What would be important to me is general ergonomics. My main concern is that these types are generally not closed under addition / subtraction / multiplication. Here are two practical questions:

(1) Would we always have to specify the value range when defining a variable? (Below I abbreviate DependentInt as DInt)

var n: DInt[0, 1000] = 123

What is the appropriate choice of the upper bound? I guess in practice, we would always use some kind of standard type like

comptime PosInt = DInt[0, Int.MAX]

but this could not protect us from int overflows anymore, unless we always check for overflow if an overflow could potentially happen. The latter, in turn, would render PosInt as defined above very inefficient to work with.

(2) How well does this work with changing variables?

var a: DInt[0, 1000] = 123

a += 1    # won't work without an extra check; operator might not even be well defined
a = a + 1 # this would require a cast from DInt[1, 1001] to DInt[0, 1000], which requires a check or is unsafe

Wouldn’t all kinds of daily operations either require a ton of extra checks or be unsafe? In either case, this would not speak in favour of using these types…
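To make the concern concrete, here is a Python sketch (the helper names are hypothetical) of the range inference for a + 1 and the narrowing check needed to store the result back:

```python
def add_range(a, b):
    # [a_min, a_max] + [b_min, b_max] = [a_min + b_min, a_max + b_max]
    (a_min, a_max), (b_min, b_max) = a, b
    return (a_min + b_min, a_max + b_max)

def needs_runtime_check(value_range, target_range):
    # Narrowing is statically safe only if the inferred range is
    # fully contained in the target range.
    (v_min, v_max), (t_min, t_max) = value_range, target_range
    return not (t_min <= v_min and v_max <= t_max)

# a: DInt[0, 1000]; a + 1 is inferred as DInt[1, 1001]
inferred = add_range((0, 1000), (1, 1))
print(inferred)                                   # (1, 1001)
print(needs_runtime_check(inferred, (0, 1000)))   # True
```

So storing a + 1 back into a DInt[0, 1000] variable always pays for either a runtime check or an unsafe cast.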

I’d be curious how people use this in Ada. Could you provide some use cases with examples? I hypothesize we might always need some kind of “unsafe” intermediate type, so that the bounds checks are only done when they matter.

We would need to implement interval arithmetic in the parser to make DInt genuinely useful and reasonably safe, and doing that well would typically require a built-in SMT solver. As soon as we use anything even slightly complex, such as n * n - 1, the inferred type becomes dreadful to read. In short, unless we are prepared to do far more work in the parser than we currently do, DInt is unlikely to live up to expectations.
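A small Python illustration of why plain parser-level arithmetic falls short without a solver: naive interval multiplication treats the two factors of n * n as independent intervals, so the correlation between them is lost:

```python
def mul_range(a, b):
    # Naive interval product: min/max over all four corner products.
    (a_min, a_max), (b_min, b_max) = a, b
    corners = [a_min * b_min, a_min * b_max, a_max * b_min, a_max * b_max]
    return (min(corners), max(corners))

n = (-3, 5)
# The true range of n * n is [0, 25], but treating the factors as
# independent yields a much looser bound:
print(mul_range(n, n))  # (-15, 25)
```

Recovering the tight [0, 25] requires knowing that both operands are the same n, which is exactly the kind of reasoning an SMT solver (or dedicated special-casing) would have to supply.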

Similarly, one can look at the more familiar problem of dynamic tensor shapes: canonicalisation there can only cope with very simple arithmetic expressions, and the same limitations appear here.

A built-in SMT solver might not be the worst idea. Preconditions and postconditions are pretty great, and it would fix some people’s issues with the recent decision to use Int as the default integer type even for things like len. If people want to use Int for everything and get compile-time guarantees that something is non-negative, we can just specify that in a condition.

I’d agree that having a built-in SMT solver is a good idea, if the language we’re talking about is F* or Dafny. The point is that they target a very different user base. If, for example, someone isn’t comfortable writing a proof of a + b = b + a using induction (twice, even), their experience with automatically checked pre- and post-conditions is not going to be very good. Whenever you have an unbounded loop, the SMT solver itself doesn’t help much: finding a suitable invariant and writing (at least part of) the proof manually is unavoidable.

Personally, I’m all for a refinement type based system, and I’m very familiar with Coq, Agda, etc. Even so, I don’t have a very positive outlook on having a built-in SMT solver in the type checker (for Mojo) and I think that says something.

I understand that a general solution (SMT solver or range integers) would be nice. It seems, though, that those are not easy to implement (at least short-term). Furthermore, at least the range integer solution faces the problem of these types not being closed under typical arithmetic operations, which causes significant follow-up issues.

I would like to refocus the discussion on the comptime-signed int: why not harvest the low-hanging fruit that is (almost) within reach with what is there already? I argue that this type could already cover a huge chunk of use cases and provide an easy opt-in solution.

There is no doubt there are also downsides, but to assess those and look for potential solutions, they need to be laid out.

TL;DR of the original proposal

I propose a type SInt

  • whose sign is known at comptime,
  • whose sign is kept explicit in all operations where this is possible,
  • that turns into a sign-agnostic type in operations that may change the sign,
  • that can be implicitly cast to Int / UInt (the latter only for PosInt), and
  • that can be implicitly constructed from UInt and Int, the latter as a raising operation.