`Indexer` and `Int` vs `UInt`

On negative indexing, I probably should not be listened to because I’m not a typical Python programmer. I didn’t learn Python as my first language, and I still write a lot of C/C++ code. So for me negative indexing doesn’t seem “natural” and I rarely use it in my own code.

Mojo probably attracts people who are more aware of low-level and performance issues, but even so I think it would be better to approach this from the perspective of what a Python programmer expects to just work.

One possible test: do numpy Python programmers use negative indexing? If so, Mojo should also.

Can we separate unsigned ints into two types, based on use case?

First case is where the programmer wants an int, but it should never be negative. For maths code this is reasonably common. These are Nat32, Nat64, and maybe Nat? Following Owen’s “make invalid states unrepresentable” guideline this tells the compiler to add checks for negative values. In my experience, the theoretical loss of the one extra bit of range isn’t a problem.

Advantage of Nat compared to Unsigned is that implicit conversion from Nat to Int is always safe, and conversion from Int to Nat just needs a sign bit check.

Second case is unsigned as memory address/offset, where the extra bit is necessary. This should be the type with an ugly name to discourage people from using it unless necessary, and requiring explicit conversions to/from Int to remind everyone that this can go wrong.

1 Like

I’m actually very concerned about programmers coming from Python. In Python int is arbitrary length and programmers never need to think about “what if my integer overflows”. But in Mojo, Int will be 32-bit on some platforms, and suddenly we have a very real problem that the Mojo community (coming from Python) habitually selects Int for every integral field in their program/library, without thinking about how big the number might get. Suddenly, a few years down the line, people are finding mysterious crashes/strange behaviours in production when the Ints that they’ve deployed on a 32-bit platform suddenly overflow.

For that reason, I think we need to consciously decide whether Int is the right name for a platform-dependent integer type in Mojo, given we know that in Python, int is an arbitrary length integer.

IMO, there would be a lot of value in being explicit that the size of Int is platform-dependent. For example, perhaps the type should be Int[ws], where ws is a constant in the prelude that means “word size”. (Or your preferred terminology.)

Adding such a parameter would ensure that Mojo users (coming from Python) never choose a platform-dependent integer size by accident/habit, merely because it’s the shortest thing to type. Instead, they actually need to think about whether they want Int[ws], or Int[64], or BigInt, etc!

+1. As we discussed earlier, this would provide Mojo users with option that is safe (no unchecked underflow) and ergonomic (implicit casts to Int) when they believe they need to store an integer that is provably non-negative.

The UInt family of types should be signposted as being for low-level purposes. The name suggests that they are “non-negative integers”, which is incredibly deceptive. 99% of users who want to store non-negative integers do not want UInt.

That is an super smart observation.
This could be solved by phase 3: classes i think!
For example, the int class, which could be an bigint :+1:

I agree with you that unsigned is generally a better idea, and I think that a .rindex() function could probably replace many usecases of negative indexing. I think I may go explore the idea of stealing range types from Ada more, since that offers what I see as the best “everyone wins” solution, except for the person who has to write the type of course, since that will be painful.

We can define an alias:

alias Index = Some[Indexer]

fn __getitem__(self, idx: Index):
    var normalized_idx = normalize_index(idx, len(self))

# same as Int:
fn __getitem__(self, idx: Int):
    var normalized_idx = normalize_index(idx, len(self))

Simplifying the usage make it more consice and readable.

1 Like

I agree, I think that breaking peoples “int by default” habits may have a lot of benefits, and I think that may be how Rust avoided a lot of the issues that Swift saw, because in Swift many people defaulted to Int but Rust has no familiar Int it has isize, which is weird and unfamiliar, so people had to actually think about their integral types and what they needed. Even if we just call them Size and Offset, I think that may break the curse of fragmenting APIs. I’ve spent a bit more time talking to other devs about this, and it seems like “int-defaultism” is very strong. Most developers would almost always use int, but when told the language has no int and given the various fixed-width types and then usize and isize, most would choose usize, especially if they were told that len returns it. I don’t thing we need to do much beyond changing the name to something that isn’t Int,

I don’t think “using the right type” should be for “low-level purposes” any more than List vs LinkedList is a “low-level distinction”.

len is never going to return a type that underflows at 0. I’m sorry, I just can’t see it ever happening! :face_exhaling: The “welcome to Mojo” tutorials shouldn’t require a disclaimer that says “by the way, len(x) - 1 is officially 18446744073709551615 when x is empty, instead of -1.” Even if this behaviour is restricted to release builds, it’s still unacceptable.

2 Likes

Why not introduce a BigInt type? Then it is clear that Int is not „big“. Every integral type can be cast to BigInt implicitly but BigInt must be cast to the other integral types explicitly.

No one will use BigInt as an indexer because he would need to cast it all over the place. But using it as a value type would just be fine (e.g List[BigInt]).

I think there is already a BigInt in the community packages that might be used.

I think as long as var x = 0 is an Int, functions that take or return an index/size should return an Int by default too if not parametrized. This keeps the language consistent.

I‘m not sure if Indexer in function signature is worth it because it hides casts that are done internally. If the default is Int you can be sure that the least/none casts are done internally.

We still need an ssize_t equivalent.

len(x) - 1 is 18446744073709551615 in C, C++, Rust, Zig, possibly Fortran (you get to choose). If we want to avoid that, there is one popular systems programming language which avoids the problem, Ada, which does it with range tracking.

And, detecting that underflow is easy to do and can be a compiler switch. Detecting “this went negative when it shouldn’t” requires user intervention, and you still have what is likely an invalid value. When you combine this with negative indexing, it’s a source for very, very annoying to track down bugs. To my mind, we should have len(x) be signed XOR negative indexing because this introduces a much more insidious category of bugs. If you try to index with len(x) - 1 and len(x) is zero, you will get a range error immediately which leads you directly to the source of the problem. If we take the (fairly reasonable) assumption that needing 64 bits to address things is rare (outside of sparse numerics where you’re probably actually traversing a tree map of some sort), then this will always cause a crash or at least make it to some error handler. This means that we can still have underflow bugs with Int, and they are impossible for the compiler to automatically check because that same type is also used when the user intentionally wants negative numbers.

normalize_index(-1, len(self)) == len(self)-1

So it is actually the user intention to get the last element, they are semantically equivalent.
If there aren’t any elements it will index out of bounds in both cases.

This was part of a chain where len(x) returns a UInt and where the collection may be zero-length.

When the len is 0 it will out of bounds regardless of which index you give it.

You have a good point when the length is not zero:

len(lst) == 5

len(lst) - 6 == -1 # giving you last element instead of index out of bounds 
# and with UInt it will out of bounds

# But if you use the recommended way it does out of bounds, but a bit slower due to normalization.
lst[-6]

I don’t want to start going in circles; I believe I’ve made my opinions on integral types clear over the course of this thread.

I will summarize my beliefs:

  1. I agree with Chris that Int should be the default type for indices and sizes in the stdlib.
  2. At some point we should investigate having a Nat type that denotes the upper half of the Int type (i.e. 0…INT_MAX). This type wouldn’t underflow at 0, it would raise instead. The purpose of this type is to give Mojo programmers “non-negative integers” in cases where they desperately want or need it. This is low priority.
  3. The UInt family of types are a massive footgun (due to underflow at 0) and we should heavily discourage people from using these types to store non-negative integers. That said, they are needed for certain low-level applications (including GPU programming). Given that UInt64 operations are officially “operations modulo 2^64”, I think we should be calling these numbers “modulo numbers” and name them accordingly, i.e. IntMod64 or Mod64. When a Mojo user sees this type in the source code, this will make it abundantly clear that the right mental model is to think of subtraction as “subtraction modulo 2^64”, which explains why Mod64(0) - 1 == 18446744073709551615.
  4. The name Int is also deceptive, because in Python, the corresponding int is arbitrary length. I would expect Python programmers who adopt Mojo to habitually use Int for everything, without thinking about the possibility of overflow. (Even though Int overflows at ~4 billion on some platforms.) Therefore, in analogy with Int64 etc, I think we should seriously consider putting a suffix on Int, so that Mojo users are continually reminded that it occupies a specific number of bits. I would suggest something like:
    • IntA: “Architecture dependent” or “address-sized” integer.
    • IntM: “Machine-sized” or “memory-sized” integer.
    • IntX: We can interpret X as meaning “unspecified”, but as a bonus, RISC-V registers are named x1-x31, and IntX would have the same size as these registers.
    • If we want to be more explicit we could consider something like MemoryInt (in contrast to BigInt), and explain that a MemoryInt is used to count things in the computer’s memory. Therefore, if you need to store a number bigger than the computer’s memory, you probably want BigInt, or at least Int64.

The following example demonstrates why MemoryInt is an accurate name:

var n: MemoryInt
n = len(some_list)    # correct usage (safe); quantity is associated with the device's memory
n = EARTH_POPULATION  # incorrect usage (overflows on 32-bit devices); quantity is not associated with the device's memory

If we at least agree that this conceptual model “feels right” (vs. plain Int), that would be a good starting point. We can defer the bikeshedding.

Summary

I agree with Chris that Int is a decent default for counting and indexing in Mojo, but we should adjust its name so that people don’t misuse it for storing numbers that aren’t tied to the size of the device’s memory. Furthermore, the average Mojo programmer shouldn’t be using UInt for storing non-negative integers, so I suggest we rename UInt to imply “modular arithmetic”. That said, I agree with Owen that it’s useful to have compile-time “proof” that a number is ≥0, so at some point we should investigate adding a well-behaved non-negative integer type to the standard library, and perhaps the __getitem__ of List etc can be overloaded to accept that type. Collections that don’t support negative indexing would only accept the non-negative integer type.

1 Like

My take on naming: IntP (Int, pointer sized)

  • Prior art: That is the type name in NumPy (intp)
  • Consistent: Int32, Int64, IntP
  • Short: Good if there are explicit casts
  • Distinct: It is clear from the name, that this is not the Python int

Oh, sorry for the bikeshedding.

I agree with @Nick to have Int as the default, but maybe renaming it.

It is not really true that Java does not have unsigned integers, they have compareUnsigned, divideUnsigned, remainderUnsigned, toUnsignedString parseUnsignedInt and >>>.

I have a different idea. Why not just a negative indexing parameter.
fn __getitem__(Int, *, negative_indexing=True)
Thus no weird traits, dtypes and more clarity. You can just perform an unsigned compare for the the bounds checks.

This would be separate proposal. I am just saying that it is required to represent integer ranges logically. Even then, my opinion is this type should mainly be used for return types and internal logic.

There are libraries and runtimes supporting arbitrarily integers. What they have done is to invent like 20 different algorithms for multiplication. But they have not really focussed on range detecting and making the common case fast.

There is no pointer dereference , because the Int of a BigInt is stored on the stack.

But this does not mean, mojo can not have types which only work on CPUs. String is also not limited to 23 bytes, just because it’s faster.

I’m actually very concerned about programmers coming from Python. In Python int is arbitrary length and programmers never need to think about “what if my integer overflows”.

Please don’t worry about this, at least in this way. In the Mojo roadmap we have have been very clear that we’re building into the python community, but our goal is not to “be python”.

Our goals are to be a great thing for systems programming, build into application programming, then intersect the python world. I expect us to have a very nice “bigint type” by the the time we get to Python, and we will also support dynamic/object oriented features like “int” at that point as well.

We’re in a different epoch now. We’re aiming to provide ergonomics and “control” for systems programmers and make sure we can build a scalable standard library. It’s clear that we need to define generic algorithms, and it is clear that we don’t want implicit conversion of staticly typed data structures.

None of these concerns apply to Python programmers coming to Mojo in Phase 3, so please don’t use that as the reason to guide judgement about APIs today.

-Chris

1 Like

Strongly agree. Too many programmers just assume that Int can represent big enough numbers that they don’t have to worry about the bounds. In addition to Int32, Int64, etc., I would allow usage of Int as long as you explicitly assume responsibility for your choice by prefacing your code with assert(Int >= Int32) or similar statement to this effect.

This is a bit of a straw man argument :confused: .

I’m not a huge fan of Python, and I certainly don’t want Mojo to “be Python”. What I want is for programmers coming to Mojo to be able to build cool apps that are fast, secure, and reliable.

My concern is that Int is an incredibly subtle data type in Mojo, and yet the name doesn’t reflect this. When programmers come to Mojo (from anywhere!), they are going to bring their misconceptions about what an Int should be used for. The name implies that it’s the “simplest” integer type, whereas in practice, it’s the most complicated. If we pretend this complexity doesn’t exist, we are going to cause our users pain, and that’s what I don’t want to see happen.

As a simple example, let’s say I’m writing a library containing utilities for managing a website. My library defines a struct that stores a tally of site_visits. Great, site visits are an integer, so I should use Int right?

No… that’s a bad idea! If one of the users of my library wants to serve website requests on a 32-bit ARM chip, when they load the current site_visits tally from the database and store it in the Int, it’s at high risk of overflowing the integer, if the website becomes popular. (YouTube gets almost 4 billion site visits every single day.)

How did we mess this up? The root issue is that the size of Int is proportionate to the amount of addressable memory on the device upon which the library is running. If you are using Int to store a value that has no relationship to the device’s addressable memory (such as site_visits), you are misusing it.

I’m sure that a certain subset of Mojo users (e.g. C++ veterans) will spot this mistake, and a few of them will even say “duh, this is obvious”. But as language designers, we are setting ourselves up for failure if we are expecting our users to already be experts on such topics!

We can avoid this quagmire by signposting what an Int is actually for. This is very easy to achieve. We just need to choose a better name; one that reminds Mojo users that “this data type is for quantities related to the size of addressable memory”. The cost of making this change is very small, and the upsides are potentially significant. Even just putting a single-letter suffix at the end of Int would suffice. Consider two alternative universes:

  1. Mojo has Int, Int32, Int64, BigInt, etc. In this universe, people pick Int when they are trying to “move fast” because it’s the simplest-looking choice, and it seems to do the job.
  2. Mojo has IntMem, Int32, Int64, BigInt etc. In this universe, there is no “default” integer option. Therefore, the user has to consciously decide which integer type is the correct choice for their use case. If in doubt, they will always pick BigInt, because it is guaranteed to behave correctly, no matter how large its value gets.

As I mentioned earlier, I want Mojo users to build fast and reliable software. I don’t want them to be shipping a bug-riddled app because some library that they are using has chosen the wrong integer type. I acknowledge that there are infinitely many ways to ship a buggy app, but this is one class of bugs that we can mostly eradicate from the language just by naming the integer types carefully. It’s such an easy fix, and there are so few downsides.

TL;DR: What I care about is helping Mojo users succeed. The more papercuts we can eliminate with minimal effort, the better. The name of Int is such a low-hanging fruit, where the benefits far outweigh the drawbacks, that I think it would be a mistake to not do something about this.