`Indexer` and `Int` vs `UInt`

UInt8 to Int16 is also safe, as it preserves the values; I don’t understand why the requirement of preserving the sign is needed.

Great write-up. As I’ve said on Discord, I completely agree with defaulting to UInt, and I think we should nudge users into using it more than they are accustomed to.

One bold option we have is to rename both Int and UInt to something else, such as Size (UInt) and Offset (Int)

This is something that I think is worth really considering. We can bikeshed on what exact words best describe the ideas*. But I think it is important to visually and mentally acknowledge the fact that they are using a potentially signed value, and not give the option to do that unknowingly.

*: I think I would go for something less extreme like keeping UInt and renaming Int to SInt. For me the historical error that we are inheriting is naming signed integers just integers.

If we can’t give the Python programmer their BigInt back, can we at least give them a consistent Int type which is good enough for most use cases? Imagine what the Python programmer must feel when they see a dozen int types. Representing numbers like the hardware does in APIs is just inadequate for a superset of Python.

Chris Lattner, do you think Scalar[start, end] is doable with reasonable compile time impact and can be lowered to the llvm range attribute?
There are types and functions which would benefit from this type, like pop_count and Codepoint.

Ok, I’ll admit that I can see scenarios where a library was written under the assumption that a UInt → Int conversion is safe because “nobody uses the upper bit anyway”, but then the library ends up getting used on a 32-bit platform and the library’s users are suddenly passing in indices close to 2^32. It’s a niche scenario, but I can see it happening.

But I still think it’s fine to offer implicit UInt64 → Int64.

It’s important to remember that len and __len__ are different functions. We don’t need to force len to return a UInt in order to guarantee that its return value is ≥ 0, nor must we force len to perform a runtime check on the return value of __len__. There are a lot of other options, depending on how we define the Sized trait.

For example, maybe implementers of Sized could be allowed (or forced) to return an IntNonNegative from __len__, which already has a built-in “proof” that the integer is non-negative. len would then extract the underlying Int and return it. (The aforementioned type is different from UInt in that it maxes out at 2^63 or 2^31.)

Of course, the simplest solution would be for len to perform a runtime check, as in Python. That whole IntNonNegative trick is just a performance optimization, and a small one at that.

Again, I don’t think this is relevant. The fact that other languages use UInt heavily doesn’t mean that they’re doing it for good reasons. It’s potentially just a consequence of those languages being an evolution of C.

This would only be viable if we make x -= 1 checked/raising when x is a Size. Is that something you’re willing to concede? Otherwise, if we were to encourage Mojo users to use Size, there would be massive footguns when users inevitably write loops such as:

while x >= threshold:
    ....
    x -= amount

If amount > threshold and the subtraction is unchecked, then x will underflow. We shouldn’t trust that all Mojo programmers will notice this edge case, so if we’re encouraging people to use Size/UInt, the only option we have available is for subtraction to be checked/raising. (And then we’d offer a Size.__unsafe_subtract function for users who need extra performance.)
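To make the footgun concrete, here is a minimal Python sketch (not Mojo; Python ints are arbitrary precision, so 64-bit unsigned wrapping is emulated with a mask) of the loop above:

```python
# Emulate an unchecked 64-bit unsigned ("Size"-like) type: values wrap mod 2^64.
MASK = (1 << 64) - 1

def unchecked_sub(x: int, y: int) -> int:
    """Unchecked unsigned subtraction: wraps instead of raising."""
    return (x - y) & MASK

x, threshold, amount = 10, 1, 4
# 10 -> 6 -> 2; since 2 >= threshold, the loop subtracts once more and wraps.
while x >= threshold and x <= 10:  # the `x <= 10` guard just makes the demo terminate
    x = unchecked_sub(x, amount)

print(x)  # 18446744073709551614, i.e. 2 - 4 wrapped modulo 2^64
```

With a checked/raising subtraction, the last iteration would raise instead of silently continuing with a huge value.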

I should clarify: I agree that it could be worth exploring this Size & Offset naming scheme. Those names certainly make more sense than UInt and Int, especially for Python programmers, who are used to the idea that “Int” means “an arbitrary-length integer”, and that integers can never underflow or overflow.

We just need to ensure that the average programmer is able to use Size safely and correctly.

Agreed. If collections that support negative indexing have the same signature for __getitem__ as those that don’t, that’s going to be a source of bugs. Although, I’m not sure how commonly such bugs would occur in practice.

I would be okay with __getitem__ having two distinct signatures, depending on whether it supports negative indexing or not.

Also - FWIW, there seems to be a strong sentiment (at least among the people here and on Discord) in favour of using UInt for sizes to help with “correctness” and meaning. However, I do believe it is important that everyone also take a look at the opposite view of using Int as the standard type - or rather, at the pitfalls of unsigned integers. I’m not arguing one is better than the other, but it’s worth looking at other use cases and opposite views.

Some references:


That’s a good list @nate!

@owenhilyard you’re not going to like Bjarne’s paper linked above! He goes into detail about how using unsigned integers for subscripts and sizes in C++ was a mistake in his eyes. :face_with_open_eyes_and_hand_over_mouth:


What if we just… banished all of Mojo’s unsigned integer types to some corner of the stdlib, and gave them ugly names like UncheckedUInt64?

This would stop programmers from inappropriately reaching for them because “Oh for my use case, the number is never negative”, and then stumbling into tons of underflow bugs when they start subtracting. As Bjarne points out, programmers often reach for unsigned integers under the belief that because they don’t need to store any negative values, they should pick a type for which negative values are unrepresentable. But in doing so, they aren’t picking “natural numbers”, they’re picking “modular arithmetic”, and suddenly innocent subtractions become a massive footgun.

Let’s be clear: how often does a programmer really need an unsigned integer with unchecked underflow? Such a data type is only needed for very low-level use cases, e.g. bitstrings and cryptography.

Crazy proposal, let’s rename:

  • UInt → UncheckedUInt, or IntModuloPtrWidth
  • UInt64 → UncheckedUInt64, or IntModuloExp64, or just Mod64

This is a clear signal to programmers that these types exhibit modular arithmetic, and more importantly, that these are not the correct types to reach for when you want to store a non-negative integer.

If we don’t signpost the pitfalls of UInt through the name of the type, then I would alternatively advocate that subtraction is checked/raising. This forces the user to handle underflow.

In fact, maybe we should split UInt64 etc into two distinct types:

  • Nat64, where subtraction is checked/raises.
  • Mod64, where subtraction and other arithmetic ops are officially modulo 2^64.

This seems like a reasonable middle ground! This naming scheme signals to programmers that they probably shouldn’t choose Mod64 for the purpose of storing non-negative integers. If they really need a special type for that, they should be using Nat64.

(Maybe the type should be Nat63, so that we can justify an implicit conversion to Int64.)
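As a sketch of how the proposed split could behave (the Nat64/Mod64 names come from the proposal above; the Python implementation below is purely illustrative, not Mojo):

```python
# Hypothetical sketch of the Nat64/Mod64 split:
#   Nat64 - non-negative integer; subtraction raises on underflow
#   Mod64 - modular arithmetic; wraps modulo 2^64 by design
class Nat64:
    def __init__(self, value: int):
        if not (0 <= value < 2**64):
            raise OverflowError("out of range for Nat64")
        self.value = value

    def __sub__(self, other: "Nat64") -> "Nat64":
        if other.value > self.value:
            raise OverflowError("Nat64 subtraction would underflow")
        return Nat64(self.value - other.value)

class Mod64:
    def __init__(self, value: int):
        self.value = value % 2**64  # wrapping is the documented behavior

    def __sub__(self, other: "Mod64") -> "Mod64":
        return Mod64(self.value - other.value)

print((Mod64(2) - Mod64(4)).value)  # 18446744073709551614: wraps, as intended
# Nat64(2) - Nat64(4) raises OverflowError instead of silently wrapping
```

The conversion Nat64 → Mod64 is always value-preserving, which is the point made below about implicit conversions between the two being painless.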

Summary

  • Mojo should consider offering a set of “non-negative integer types” in the standard library, but the UIntXX family of types is NOT suitable for that purpose, because they underflow below 0, which is something that almost nobody wants, except when doing bit-twiddling or cryptography. I would suggest we rename this family of types to ModXX so that people only reach for it when they actually want modular arithmetic. This will prevent programmers from falling into the common trap of assuming that they should reach for an “unsigned integer” whenever they are storing a value that cannot be negative.
  • If we end up with a non-negative integer type, it should probably be called NatXX, and subtraction should be an operation that raises. However, before we consider adding such a type to the standard library, we would need to identify a compelling set of use cases. It’s not clear whether function calls that return “sizes” (such as len) should return NatXX rather than IntXX. We would need to see what the ergonomics are like.
  • It’s safe to implicitly convert between NatXX and ModXX (both ways), and this would have no runtime cost, so it would be relatively painless to offer both of these types in the standard library.
  • Alternatively, if we offer Nat63 instead of Nat64 (and so on) we could support implicit conversions from NatXX → IntXX. (Whether Nat overflow should be checked or not is a separate question.) Keeping that upper bit free is also useful for some memory layout optimizations.

I don’t think that narrative about the Python superset is still valid.

We can simply make subtraction of UInt raising, as you suggested. Using UInt in a subtraction context is far less frequent than in addition.

I strongly disagree with banishing unsigned integers. They are yet another tool that makes a lot of sense in many optimizations.

Hello, I’m not familiar with other languages and so on,
so I’ll try to start from some of the Mojo vision and goals.
(At least, what I think it is from the podcast and over time; go check the website for it!)

An unlock for both people and the hardware.
This involves a gradual path, with variable levels of complexity.

Bring people in, both experts and those who want to grow their talent.
The industry becomes more productive and consumers are better served.

The expert part is easier (hardware goes brr), so how can talents grow?

Meet people where they are, and start from there.
Easy to learn, simple, progressively more.
From broad to narrow, general to specialized.
(from int to uint :p)

 

Why Int? (I don’t know all the whys)

Python has int but not uint.
So programmers are already familiar with Int.

I think that for progressive complexity, UInt would feel like too huge a step.
(indexing a list -> no, learn to cast first)

There is also the __int__ dunder.

Now, considering interop between Mojo and Python, int seems to be the natural choice.

I personally love UInt for some things, but it comes with casting, and checking too.

I think that Int is more versatile, and obviously UInt is one-sided.

We could possibly jettison UInt in favour of Float.
(because there is Scalar[dtype.*] for many numericals)
(but not sure if uint is needed or not)

 

The casting:
UInt means casting, and it cannot be done automatically:
UInt8(128) → Int8 == ?
Int8(-1) → UInt8 == ?
Casting manually means the user has to suffer
(both from early complexity, clutter, ...).

Another thing is that UInt is basically for >= 0.
This means that the API takes the burden of bounds checking.
And if it does not, the user needs to check the upper bound anyway!
The best way there, in my opinion, is to favour a separation of concerns instead.
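For what it’s worth, here is an illustrative Python sketch of what the two casts above produce under plain two’s-complement bit reinterpretation (one plausible convention; Mojo’s actual cast semantics may differ):

```python
# Illustrative only: 8-bit two's-complement reinterpretation in Python.
def to_int8(bits: int) -> int:
    """Reinterpret the low 8 bits as a signed Int8."""
    bits &= 0xFF
    return bits - 256 if bits >= 128 else bits

def to_uint8(value: int) -> int:
    """Reinterpret a value's low 8 bits as an unsigned UInt8."""
    return value & 0xFF

print(to_int8(128))   # -128: UInt8(128) does not fit in Int8
print(to_uint8(-1))   # 255:  Int8(-1) becomes the maximum UInt8
```

Both results are surprising if the user expected the value to be preserved, which is exactly why making these casts explicit matters.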

 

Int with bounds checking is a smart default for at least these reasons, and for the Mojo vision that brought us all together.

Personally I don’t care if I have to use int or uint, cast or not;
it is more about the people and the language.
There is no point if nobody can learn and become an expert.

Since you are referring to C++ in your post, I would like to note that:

In Section 4.4 on page 73 of “The C++ Programming Language”,
Stroustrup writes

“The unsigned integer types are ideal for uses that treat storage as a
bit array. Using an unsigned instead of an int to gain one more bit
to represent positive integers is almost never a good idea. Attempts
to ensure that some values are positive by declaring variables
unsigned will typically be defeated by the implicit conversion rules.”

In Section 4.10, advice item [18] is:

“Avoid unsigned arithmetic”

[https://groups.google.com/g/comp.lang.c++.moderated/c/hA83Norv2k4?pli=1]

There are other similar voices in C++ community.

I think making distinct types for those use cases does make sense. Maybe Nat64 is UInt (as most developers expect) and Mod64 is unchecked.

FWIW, I also prefer types that remove a whole class of defects. And I support that we need explicit casts where there is a chance of over/underflow. Although it might be inconvenient for developers, I think it is more of a question of education. Correctness > Performance > Ergonomics…

We might be able to make do with making sign conversions “lower priority”. The reason that safe sign conversions are difficult is that they can create ambiguities if you have multiple “paths” to a valid type, such as if you wanted to pass some integral type to a function with Int64 and UInt64 overloads.

I wasn’t aware of that range metadata, but that is a very interesting option that could unlock some interesting capabilities.

But I still think it’s fine to offer implicit UInt64 → Int64.

2 decades ago people often said that nobody used all 32 bits, so it was fine to do UInt32 → Int32 implicitly. While 64 bits will be harder to run out of, I also don’t think there are many places where you’d need that conversion, since UInt → Int or the inverse is far more likely. Possibly people dealing with very, very sparse data might run out of bits, but I think that consistent behavior is more valuable until we can find a good motivation to allow that cast to be implicit.

Right now, Mojo’s len returns an Int, the same as Python does, but actually violates the expectations of Python developers by being able to return negative values. If we instead use UInt to constrain that […]

It’s important to remember that len and __len__ are different functions. We don’t need to force len to return a UInt in order to guarantee that its return value is ≥ 0, nor must we force len to perform a runtime check on the return value of __len__. There are a lot of other options, depending on how we define the Sized trait.

There are a lot of processors out there which really don’t like branches, especially in ML where it seems to be fashionable to spam in-order cores with big vector units in new hardware. I’d personally rather make it a compile-time check since len is the kind of thing which is very much a hot loop function.

For example, maybe implementers of Sized could be allowed (or forced) to return an IntNonNegative from __len__, which already has a built-in “proof” that the integer is non-negative. len would then extract the underlying Int and return it. (The aforementioned type is different from UInt in that it maxes out at 2^63 or 2^31.)

I think that just moves the branch, since IntNonNegative is going to need to make sure the value passed in fits in Int even in the UInt constructor, and check for negatives in an Int constructor. I can also see many people looking at that and asking “Why not UInt?”

Of course, the simplest solution would be for len to perform a runtime check, as in Python. That whole IntNonNegative trick is just a performance optimization, and a small one at that.

I don’t think it’s a good idea to put a runtime check in such a common function, especially one which is going to end up being an alias for collection._length much of the time.

Both C++ and Rust have both a UInt equivalent and an Int equivalent, but my personal experience and the data I gathered from a variety of projects showed that the UInt equivalent is far more popular.

Again, I don’t think this is relevant. The fact that other languages use UInt heavily doesn’t mean that they’re doing it for good reasons. It’s potentially just a consequence of those languages being an evolution of C.

Then we should throw out both that data and any ideas of negative indexing from Python and start from a clean sheet of paper.

One bold option we have is to rename both Int and UInt to something else, such as Size (UInt) and Offset (Int), and not actually have an Int type, which may be a bit jarring to people to begin with, but should help them consider what type they actually want.

This would only be viable if we make x -= 1 checked/raising when x is a Size. Is that something you’re willing to concede? Otherwise, if we were to encourage Mojo users to use Size, there would be massive footguns when users inevitably write loops such as: … if amount > threshold and the subtraction is unchecked, then x will underflow. We shouldn’t trust that all Mojo programmers will notice this edge case, so if we’re encouraging people to use Size/UInt, the only option we have available is for subtraction to be checked/raising. (And then we’d offer a Size.__unsafe_subtract function for users who need extra performance.)

My personal preference would be to have the default be defined by the compiler, and then offer unchecked/checked/wrapping/saturating variants that are explicit functions. Then, Mojo gets a compiler flag for setting the default. While I like the idea of Rust’s debug/release builds, I think Zig’s Debug/ReleaseSafe/ReleaseFast/ReleaseSmall is a good set of default “profiles” to expose. In Debug/ReleaseSafe, we can make any overflow or underflow abort with a stack trace and the values involved in the operation (possibly the same for NaN/Inf in floats). In fast/small, 2’s complement wrapping can be the default.

I should clarify: I agree that it could be worth exploring this Size & Offset naming scheme. Those names certainly make more sense than UInt and Int, especially for Python programmers, who are used to the idea that “Int” means “an arbitrary-length integer”, and that integers can never underflow or overflow.

That is part of the reason I want to change the name, but the other reason is that there are multiple generations of programmers trained on “use int by default”. It may introduce some cognitive overhead to start with, but I think that not having an int in Rust is part of why we see the “correct” (as far as the hardware is concerned) types used more heavily. It’s possible that a name change may bring about a behavior change in developers.

We just need to ensure that the average programmer is able to use Size safely and correctly.

I think that making checked the default will probably do that just fine. I have some ideas for how to reduce the overhead of this (ex: Integer[min: IntLiteral, max: IntLiteral]) but that may have a fairly substantial compile-time impact.

There is also the question of negative indexing. As I’ve said before, I think that having it everywhere by default is dangerous since people from languages without that convention may not think of it, and simply cause errors when it is used.

Agreed. If collections that support negative indexing have the same signature for __getitem__ as those that don’t, that’s going to be a source of bugs. Although, I’m not sure how commonly such bugs would occur in practice.

Right now, there are a bunch of public methods in LayoutTensor which segfault if you pass a negative number to them. I think that it’s not unreasonable to think that there will be a substantial overlap between people coming from C/C++/Rust, which use unsigned indexing, and people who are implementing their own data structures in Mojo. In my opinion, preventing the bugs by construction is valuable even if we don’t think they’ll happen often.

Better than assumes everywhere

Swift

Use UInt only when you specifically need an unsigned integer type with the same size as the platform’s native word size. If this isn’t the case, Int is preferred, even when the values to be stored are known to be nonnegative. A consistent use of Int for integer values aids code interoperability, avoids the need to convert between different number types, and matches integer type inference, as described in Type Safety and Type Inference.

To me, this is based primarily on the path Swift took when making the decisions we are now debating. To me, indexing is something where I would like to be able to index the whole address space (native word size is a very messy concept) of the processor, which means that the one piece of language-agnostic guidance in there, to me, says to use UInt for indexing.

On C++ Signedness

Section names are taken from the paper.

The problem

Signed values are the result of most integer calculations

This is because both C and C++ have trained multiple generations of programmers to use int for everything. Signed values are the output because signed values are the input, and I’ve personally encountered a lot of int parameters in C/C++ libraries with a comment to the effect of “must be >= 0” or where the only valid negative value is -1, used in the way Mojo would use Optional.None.

In C and C++, unsigned does not model natural numbers

I find this highly unmotivating because signed types don’t model natural numbers either. C++ doesn’t have a native type to express 1 << 256 correctly, but that is a natural number. However, unsigned integral types do represent a range of values from zero to some value, which is typically enough for practical purposes. Even if you could often express all of the numbers you need, there is value, from a performance standpoint, in expressing things in terms of the narrowest type that works. This is why the first piece of performance advice a Haskell developer will give you is to make types which constrain Int to Int64 or UInt64 ranges, where a compiler pass will then convert from the arbitrary-precision Int format to fixed-size integers.

Why we have unsigned subscripts in the STL

As far as I remember (the STL is 25 years old so my memory may not be completely accurate) three
reasons were given for the STL using unsigned types for subscripts
• (As opposed to pointer subscripts) vector subscripts can’t be negative, so unsigned is obviously
the right type.
• We get one more bit to play with so we can get larger vectors; this is important on machines
with 16-bit address spaces.
• Range checking needs only one check (no need to check for less than 0).

I think these are very good reasons, and we now have people trying to do things with NPUs where a vector which needs most of a 32-bit address space is not out of the question (especially for loading models, where the user may need to ask for >2GB of space to copy the model into). Range checking is also a very good point; doubling the number of checks, especially on weaker processors, is going to cause problems.
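The “one check” point can be illustrated with a small Python sketch (64-bit width assumed; the mask emulates the unsigned reinterpretation a compiler gets for free):

```python
# Bounds checking a signed index needs two comparisons; reinterpreting it as
# unsigned needs one, because a negative index wraps to a huge value.
MASK = (1 << 64) - 1

def in_bounds_signed(i: int, n: int) -> bool:
    return 0 <= i and i < n           # two comparisons

def in_bounds_unsigned(i: int, n: int) -> bool:
    return (i & MASK) < n             # one comparison: negative i wraps huge

# The two checks agree for any index and any size n < 2^63.
for i in (-1, 0, 3, 7):
    assert in_bounds_signed(i, 4) == in_bounds_unsigned(i, 4)
```

In Python both versions obviously cost the same; the point is the instruction count a native compiler emits for each form.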

C/C++’s unsigned is a very odd set of types. They do not model natural numbers. In particular,
they have modular arithmetic and conversions to/from signed ints that can be very surprising.
Beware of any argument using the word “obvious”.

I agree that the implicit conversions to/from signed ints is a problem, which is part of why I advocate getting rid of them.

Even the first version of the STL that I have tracked down specified that even though vector’s
max_size() was of the unsigned X::size_type it was specified as the largest positive value of the
vector’s signed difference_type or we could construct vectors with elements beyond the
accessible range. So much for that extra bit (extra range).

Without knowing why this decision was made, I’m not sure this is more than trivia. It’s also important to note that the allocation machinery borrowed from C (ex: malloc) still makes use of size_t, so even if std::vector had a limitation to make overflow checks easier (which I think the reason for the range restriction might be), you could still make your own without that limitation. We are discussing making a change which would reach much deeper than this limit and either make it much harder to work around limitations like this or would force everything touching memory to have casts to UInt in order to interact with memory allocation facilities.

Checking signed 0<=x && x<max (often evaluated as (0<=x) & (x<max) to avoid branching) is not
slower than unsigned x<max on a modern computer. The number of loads in the two cases are
identical and unless the instruction pipeline and the arithmetic units are saturated, the speed
will be identical. Please check; I couldn’t check all hardware/compiler combinations.

Here are two modern cores where there are fairly large throughput benefits to using size_t instead of ssize_t per llvm-mca.

If only there was some workload which did a lot of indexing, and needed to sustain calculating indices for a long period of time while fast vector units did an operation they can do in only a few cycles (FMA), and where the algorithm’s authors cared enough to use prefetching so it was usually pulling data from L1/L2 cache. /s

In all seriousness, vector units have gotten fast enough that SIMD loops are fairly tight for many algorithms, so having tons of bounds checks in that loop may actually cause you to bottleneck on the scalar math. It’s also important to consider that unrolled loops with inlined indexing will burn branch predictor slots very quickly, and doubling that rate is a bad idea since branch predictor slots are a very limited resource. In some cases the compiler may be able to inductively remove the 0 <= x check, but Mojo has a stated goal of not having “oops, the compiler couldn’t figure it out and your performance drops massively” problems. For numeric code, like Mojo specializes in, a lot of code is going to hit the described condition where there is a performance difference. While it matters much less for application code where users are more likely to have lots of memory stalls, having separate indexing schemes for everything ML and for “control path” code seems like a poor idea.

Problems with unsigned

Mixing signed and unsigned numbers is a common source of confusion and bugs.

I strongly agree with Bjarne here, especially in C++ with implicit conversions.

However, there are two main sources of unsigned values getting mixed with signed ones:

  • People trying to ensure that only nonnegative numbers are passed where negative values don’t make sense

Yes, which makes sense since it’s using the type system to express the valid value range instead of

// int idx: The offset into the array. This function segfaults if you give it a negative number, I hope you don't forget.
  • Loop variables and sizes

Which is mainly an issue of people doing for (int i … when they mean for (size_t i …, and why I think not having a type named Int might be a good idea.

Consider

unsigned area(unsigned x, unsigned y) // calculate area
{
    return x*y;
}

This appears to make sense; after all, lengths and areas can’t be negative. However, this definition doesn’t prevent area(-2,3).

This is a problem with implicit conversions, not with unsigned values. Once we get rid of unsafe implicit conversions, Mojo won’t have this issue.

Problems with unsigned sizes

Consider

vector<int> v(-2);

This is (of course) a range error, but why? The reason is that the vector constructor takes an unsigned, so -2 is interpreted as a very large (positive) number. That error is caught by a run-time check. The use of unsigned did not eliminate the need for that check.

Once again, an implicit conversion issue. In Mojo, once implicit Int → UInt is gone, this will be a type error.

The major problem is that unsigned sizes yield mixed signed/unsigned expressions. Consider a simple naive loop:

for (int i = 0; i<v.size(); ++i) v[i]=7

Some (but not all) compilers warn that the i<v.size() comparison mixes signed and unsigned and is
therefore suspect. There is hardly ever a real problem, so those warnings are annoying and confuse
novices.

This is another “int defaultism” issue.

“Obviously,” I should have written:

for (vector<int>::size_type i = 0; i<v.size(); ++i) v[i]=7

Had v.size() been signed, the loop as written would have been perfectly fine.

I’m open to examples, but I haven’t seen someone have .size() return anything other than a size_t in C++, except in cases where I would return an Optional[UInt] in Mojo since -1 was used to communicate an error or an uninitialized state. Mojo also has iterators which aren’t dangerous to use (thanks origin-based iterator invalidation protection), which should replace using integers in many loops.

We sometimes (often?) do arithmetic with sizes; notably we subtract sizes to find differences. For
example:

unsigned u1 = -2;
unsigned u2 = -4;
cout << is_signed<decltype(u1-u2)>::value << " " << u1-u2 << "\n";
cout << is_signed<decltype(u2-u1)>::value << " " << u2-u1 << "\n";

Now, we are all experts here and immediately spot the problem (right?), but the output of the second line could cause confusion.

This is why having checked arithmetic by default and an explicit opt-in to wrapping arithmetic is probably a good idea: it would catch this, and the error message would explain the problem. If I remember correctly, something similar was recommended for adding checked arithmetic to the profile options for C++ (a very specific language/compiler feature that I don’t want to bloat this post by expanding on).

I’m going to skip the next few since I see them as C++ issues (ex: needing vector<decltype(v[0])>::size_type), and not applicable to Mojo.

for (size_t i = n-1; i >= 0; --i)

This is one of the big footguns with unsigned integers, which Mojo completely defeats by having range and always using iterators for loops. If we feel the need to add this style of for loop in the future, a compiler warning for UInt >= 0 makes sense alongside “unnecessary check” warnings like if True/if False.
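For readers who haven’t hit this one: a Python sketch of why the loop above never terminates under unsigned arithmetic (wrapping emulated with a mask; a guard is added so the demo exits):

```python
# With an unsigned (wrapping) i, the condition `i >= 0` is always true:
# decrementing past zero wraps to 2^64 - 1 instead of going negative.
MASK = (1 << 64) - 1
n = 3
i = (n - 1) & MASK
visited = []
while i >= 0:                 # always true for an unsigned i
    visited.append(i)
    i = (i - 1) & MASK        # 0 - 1 wraps to 2^64 - 1
    if i == MASK:             # detect the wrap so the demo terminates
        break
print(visited)  # [2, 1, 0] -- then i wrapped instead of reaching -1
```

The `range(n - 1, -1, -1)` idiom (or an iterator) sidesteps the problem entirely, which is the point being made about Mojo’s range.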

After this there are a few more “checked arithmetic is good” examples, where being signed just means you can use - and check it yourself instead of a checked_subtract function.

Span

Why does std::span have signed subscripts and sizes? The designers of span (originally gsl::span) had several aims:
• The primary intended use for span was as a replacement for {pointer,offset} pairs and pointer
arithmetic (incl. subscripting) is signed. It seemed unwise to introduce potential conversion
problems related to signed/unsigned differences.

While this is an interesting idea, if I were reviewing a PR with a span which starts before the pointer or ends before it, I would want a very, very good explanation of why that’s necessary. Every CPU I’ve been able to look at has 2’s complement pointer offsets and appears to do the conversion in hardware, and if there were a CPU which couldn’t do that, I would expect hardware support for the much more common positive offsets natively, which means that somebody would need to do the signed → unsigned conversion in software by moving the pointer backwards and then adjusting the offsets. I’m open to scenarios where this might not be what happens, but I can’t think of any.

• Modular arithmetic can cause surprising behavior.

It can, but it’s also what every ALU I know of implements (I’d need to check LISP machines to be sure), so this is more an argument for arbitrary precision integers, which have a long list of well-known downsides.

• Modular arithmetic makes “overflow” well-defined behavior and thus inhibits error detection
and handling.

Mojo already does this by defaulting to 2’s complement wrapping. Adding a “check everything” mode to the compiler will be helpful in mitigating this, as will educating people on their options if they want specific things.

• span is not a container, so the analogy to vector is not compelling; on the other hand, span is
closely related to pointers (some versions of the idea have been called “fat pointers”). In other
words, a span is at a different (lower) level of abstraction than vector and other containers.

I think that for Mojo’s purposes, it’s more accurate to say that a span is a container which borrows its elements. C++ is looking to replace void foo(T* arr1, size_t arr1_len) with just passing in a span, but I don’t see many places where someone would want to pass in a negative arr1_len, or what that would even mean.

I think that the vast majority of this is related to implicit conversions and C++-specific problems.

A “unicorn type”

At the evening session, Chandler Carruth suggested we could define a “unicorn type” that simply did the right thing in combination with both signed and unsigned types. I like that idea. I wrote a primitive (uni) type to try out the idea, but I’m not proposing it because I consider my solution ugly and incomplete, I cannot be sure that it really is a drop-in alternative to the current uses of unsigned, and I would prefer not to use subtle types for really basic and frequent operations: int is good enough. Even if we had a really good uni, we shouldn’t start its introduction by breaking span so that we could fix it later.

Let’s “repeat” that

At the evening session, Chandler Carruth suggested we could define a “unicorn type” that simply did the right thing in combination with both signed and unsigned types. I like that idea.

Once we have UInt and Int in SIMD, that is exactly what we have: a uni type, and we are in a place where major breakages are permissible. We may not even need to break that much as Int moves over to alias Int = Scalar[DType.int] and UInt moves over as well. Right now we have the Indexer trait, which is a bit messy, but in the future making Indexer essentially Scalar and using compile-time information to minimize the overhead seems like the best path forward. C++ can’t have it due to the weight of legacy, the needs of backwards compatibility, and the way C++ evolves, but Mojo can.

Futhark PL

Another reason is that whenever we use explicit indexing instead of just map, it is invariably because we want to perform index arithmetic, and unsigned integers are just not very good at arithmetic. For example, the subtraction x-y might not be representable as an unsigned integer, even if x and y are quite small integers. Overflow can of course also happen for signed numbers, but it tends only to happen for very large numbers (which are rare), while unsigned overflow can happen for numbers close to zero (which are common).

Being able to represent x-y where y > x doesn’t matter if you don’t have a reasonable behavior for that case. If you do, the CPU’s underflow flag lets you branch only when underflow actually happens, as if you had done the full check up front. And even on architectures that lack such a flag, checking x < x - y detects the wraparound with a single comparison, the same cost as checking whether a signed result is negative, so signed arithmetic holds no real advantage here.

Another annoyance is that unsigned numbers have no value that is guaranteed to never be an invalid index, such as -1. These sentinel values are useful for constructs such as scatter, where they represent results to be discarded. One can of course argue that it would be even better to use a sum type with a distinguished constructor to avoid silent errors where mistaken out-of-bounds indexes are ignored.

I don’t find the arguments about using -1 as an “invalid” value in scatter compelling, given that no hardware I know of works that way; you typically use a mask register. There is a very long history of “-1 means something special” causing problems for C and C++ programs. This is especially true for loads on an architecture where -1 could actually be a valid offset from the start pointer, in which case your API is stuck unable to represent what the hardware can do.

While we eventually want to do more advanced verification of Futhark programs based on sizes and indexes, this will inevitably require specialised machinery capable of understanding integer ranges directly, rather than encoding it in the structure of recursive types. When such machinery is available, using u64 for sizes has no real advantage compared to using i64 along a guarantee that the actual size is never negative. In fact, sticking to only i64 is probably helpful, as such analyses tend not to be great at handling type conversions.

This is another argument for Ada-style Integer[min: IntLiteral, max: IntLiteral].

C#

The main reason appears to be “Visual Basic can’t handle unsigned”, and you may have linked the wrong thing, since that’s Stack Overflow.

Conference talk

I skimmed the talk and I generally agree with the arguments made.


See my above comment: he essentially proposes having Indexer at the end, and many of the issues are actually related to implicit conversions and various “C++-isms”.

I don’t think this is necessary.
Mojo should settle on one integer type for most APIs and just offer a performant BigInt with predictable optimizations for those who want to be on the safe side. Calling Int WrappingInt would be a bit verbose, but Int’s behavior is perfectly defined either way. Of course, Mojo could also have SaturatingInt, WideningInt, and SafeInt or CrashingInt.