Neither Rust nor Zig suffer from this problem. Nor does the inference algorithm that I described. The purpose of the algorithm is to pick the right bit-width based on the surrounding context.
For AI accelerators where each core has a 32-bit address space, every Mojo user who is building an app for that architecture will be deploying for both 64 bits (CPU deployment) and 32 bits (accelerator deployment).
I don’t know if any AI accelerators are going to use 32-bit addressing, but we shouldn’t assume that they are all going to be 64 bits. Especially given that low bit sizes (e.g. 4-bit floats) are a trend within the AI community.
The algorithm I sketched doesn’t care about control flow. It just gathers all of the function calls where x appears, and uses the argument types to filter down x. After doing that, if x isn’t narrowed to a specific integer type it chooses a sane default. That would be a rare situation—it’s very unlikely that you’re using an integer without ever passing it to a function call that requires a certain size! (printing is not a realistic example; very few programs create integers solely for the purpose of printing them.)
Here is a second proposal, which I think many people would prefer to proposal A.
Proposal B: Keep Int as the default type, but take steps to ensure that people don’t misuse it.
Readers who are fans of Mojo’s current design will prefer this proposal, because only a few small changes need to be made.
As mentioned earlier, Int is misused whenever it stores a quantity unrelated to the number and/or position of values stored in memory. For example, you should never use an Int to store site_visits or world_population etc. By doing so, you are writing a program that is likely to break when deployed to certain targets.
Mitigation 1: Rename Int, so that people understand its purpose
In today’s Mojo, programmers are likely to misuse Int, because:
Its name is so simple (simpler than Int64 etc.) that it seems like the type you’re supposed to use whenever you want a “normal” integer. But in reality, you are only supposed to use it to count or locate values in memory. If you use it for any other purpose, you will end up with code that is not portable.
The name “int” has a different meaning in other language communities. In C++ it’s de facto 32 bits, and in Python it’s a BigInt. Overloading “int” with a third meaning is a recipe for confusion, especially for programmers who are adding Mojo into their existing Python and/or C++ projects.
I propose renaming Int to Len. Rationale:
This name implies that Len is for storing lengths, sizes, and offsets into memory, especially given the implied connection to the len function.
The name is reminiscent of ssize_t in C++ and isize in Rust.
Intuitively, var site_visits: Len and var button_clicks: Len now feel like they have the wrong type, which is good, because they do!
As a bonus, Len ties in nicely with 0-based indexing. The array element x[0] is located at “length 0” from the start of the array.
Mitigation 2: Generate a compiler warning whenever an integer literal ≥ 2^31 is converted to a Len.
Large Len literals usually indicate a programming error. The now-famous world_population statement is one such example:
var world_population = 8142000000  # MISTAKE: Has nothing to do with the size of memory.
Beyond being a programming mistake, this statement is also non-portable. On 32-bit platforms, it would either fail to compile, or worse—it would overflow.
The sensible thing to do here would be to emit a compiler warning saying “Warning: This statement will not compile on all targets. Large Len literals are usually a mistake. Did you mean to declare an Int64?”
Mitigation 3: Prevent adding Int64 to Len variables, and suggest that the Len variable should be an Int64.
This ties into the recent discussions about implicit casting. We should disallow code snippets like the following:
var nums: List[Int64] = ...
var total = 0
for num in nums:
    if <condition>:
        total += num
The problem with this program is that it is prone to overflow on 32-bit targets. As long as Len.__iadd__ doesn’t accept an Int64 on the RHS, and we don’t allow Int64 to be implicitly converted to Len, we should be good here. However, if we want to really help our users out, the error message should suggest declaring total to be an Int64, since that is likely what the programmer meant to do!
A decent compromise?
This is a modest proposal. Its goal is to ensure that Mojo users work with integers in a manner that is both correct and portable, while retaining the “Int by default” behaviour of today’s Mojo.
Defaulting to 64 bits means using 64-bit arithmetic on 32-bit systems, resulting in worse performance than native 32-bit operations (see example).
Defaulting to 32 bits means 64-bit users suffer and must explicitly specify 64-bit.
Inferring from context has several drawbacks:
It can make the type harder to predict: you have to scan the surrounding code carefully to understand the intended type. This is bad for code review and for systems programming, which values predictability.
It creates spooky action at a distance, where changing code in one place can change the type in another place far away.
Ideally, if an API expects an Int8, you don’t want a variable to be automatically inferred as Int8, because that can lead to overflow bugs. You’d want to specify Int8 explicitly and have the compiler warn if you pass a variable of a different type. The API usage may be far from the variable declaration, with intervening logic that doesn’t account for Int8.
I don’t care much about the name change. But I think Int can be used for any purpose that fits in the range of the type; it doesn’t have to be only for lengths. For example, anything that fits in 32 bits can be stored in Int on a 32-bit system, so I’m not sure Len is a better name.
The Int type is special: it is backed under the hood by the MLIR index type with the index opset, which the compiler can optimize better than explicit Int32 or Int64 types (backed by the pop opset). At least that’s what the Modular team mentioned before here, though that may change. So unless you need the memory savings from using Int32, there is an advantage to using Int.
I think that, regardless of compiler capabilities, there will always be some advantage to using the native word size of the architecture, as it can lead to better performance (e.g. not requiring sign extension). So even if you know your value fits in 32 bits, using Int can still be beneficial.
I think that’s a good idea. But maybe do that only when compiling for 32-bit targets and for shared libraries. There’s no need to warn when compiling explicitly for 64-bit targets.
Defaulting to 32 bits means 64-bit users suffer and must explicitly specify 64-bit.
I think that programs which know that they are going to need to handle values larger than 2 billion are going to have to use 64-bit on 32-bit architectures anyway. Additionally, 32-bit on 64-bit is generally fairly cheap thanks to C’s defaults. I think that having 32-bit be the informal “default value” is fine, since many programs are erroneous or operating very far outside of their expected usage if a number goes above that. This is how we’ve gotten away with continuing to use int for so many things in C/C++.
I don’t like inferring, which is why I would make it default to the smallest type which can hold the value (easy to figure out in your head), and then use dependent types to add safe implicit casts.
Mitigation 2: Generate a compiler warning whenever an integer literal ≥ 2^31 is converted to a Len.
I think that’s a good idea. But maybe do that only when compiling for 32-bit targets and for shared libraries. There’s no need to warn when compiling explicitly for 64-bit targets.
The whole reason we want this is because many people won’t compile for 32-bit when building libraries, so a bit of a poke that says “disable 32-bit support in your library or fix this” may help make more stuff work on slightly weirder AI accelerators and older platforms.
I’m sorry for not being precise. I’m saying that newer, scaled, languages have done this (and yes, I’d include both Go and Swift as “newer” languages - Rust and Swift are of the same vintage) validating the design point. I’m not saying that it is the only design point, only that they are very large communities trying lots of things and it has worked out well for them. I didn’t mean to claim that this was “the standard” or something, I’m sorry for the confusion.
While I appreciate your enthusiasm about wanting to debate the idea of Int and how it should work, this isn’t the topic of this thread and Int’s behavior and name is pretty well set, so let’s not spend cycles on that, at least for Mojo. I agree with you that other languages could choose different design points!
Makes sense. The team needs to continue discussing this, and I don’t think they’ve made a decision here. That said, I can give you my opinion: List is the canonical replacement for something like std::vector / std::span and rust array/slices. These types need to be consistent, and signed integers are a thing. The reason that people are reaching to Mojo and static types in the first place is that they care about performance, and supporting negative indexing is a significant performance burden, so I don’t believe it should be the default.
Note that Mojo (and Swift FWIW) can support keyword arguments for subscripts, so we can support explicit opt-in overloads that work with negative indexes if we want. The only question is “what is the default”.
As you know, I don’t think that keying off signedness of an integer type is a good way to go here, such a thing penalizes the default integer type.
If such a use-case becomes important in practice, then we can add subscripting with Indexer then. Proactive design without concrete evidence of utility isn’t a good thing to go for in my experience.
Specifically in this example, why would you be using BigInt for something that you want to index into an array? The use-case isn’t obvious at all to me. Such a type shouldn’t implicitly convert to smaller integer types, so people would use it when they actually need the large size.
Note that I don’t expect BigInt to be used as a Python int replacement, because (as mentioned above) Python ints have reference semantics and dynamic typing and other things going on.
+1 agree. Merging all these things together will be great, but I’ll still advocate for the default integer type for general APIs to be Int, not SIMD[_, 1] or similar
Indeed, that is a good point, Mojo does already support multiple address spaces and supports address spaces with different pointer sizes. This is a great argument to support Indexer for __getitem__/__setitem__ specifically without generalizing to other things.
At the machine level, an unsigned comparison against N will implicitly exclude negative numbers. There’s no UB. That’s all I’m saying.
Yes absolutely. It isn’t about heavy-weight loop transformations like vectorization, it is about basic codegen that Mojo users expect from writing normal code using range(n) etc. Consider this C code:
uint32_t n = ...
float *A = ...
for (uint32_t i = 0; i <= n; ++i)
    A[i] += 1.0;
This is a perfectly reasonable thing for someone to write, and we would like to compile this into something like this on ARM64 (pseudo code):
char *a_ptr = (char*)A; // machines work on registers, which are not typed
for (uint32_t i = 0; i <= n; ++i) {
    char *a_i = a_ptr + zext(i)*4;
    tmp = f32_load(a_i);
    tmp2 = f32_add(tmp, 1.0);
    f32_store(tmp2 -> a_i);
}
But that’s horribly inefficient - it is doing a zero extension and shift inside the loop! The compiler instead produces:
char *a_ptr = (char*)A; // machines work on registers, which are not typed
for (uint32_t i = 0; i <= n; ++i) {
    tmp = f32_load(a_ptr);
    tmp2 = f32_add(tmp, 1.0);
    f32_store(tmp2 -> a_ptr);
    a_ptr += 4;
}
This is all good, but then the compiler next wants to reduce register use, because this code has to keep all of i, n, and a_ptr live across the loop. We’d like to turn this into:
char *a_ptr = (char*)A; // machines work on registers, which are not typed
char *end_ptr = a_ptr + n*4; // calculate the stop point
while (a_ptr <= end_ptr) {
    tmp = f32_load(a_ptr);
    tmp2 = f32_add(tmp, 1.0);
    f32_store(tmp2 -> a_ptr);
    a_ptr += 4;
}
This is great: now only a_ptr and end_ptr need to be live across the loop instead of i, n, and a_ptr, a big reduction in register use! Such a thing is a natural transformation, and something that happens all the time. This is not the level of optimization that we want Mojo programmers to think about - it is invariant to SIMD and memory hierarchy and other concerns we DO want them to think about.
The problem is that this transformation is invalid for C, because n in the original program might be UINTMAX, and therefore, the original loop might be infinite, whereas the resultant loop is not - this breaks the “as if” rule.
This is one of the many reasons why C makes int be undefined on overflow. It allows the compiler to make this transformation happen even when people use int as an induction variable.
This entire category of problems is defined away by making Int - the default type everywhere - the size of the machine register. You’re right that people working on 32-bit address spaces on GPUs are writing very machine specific code and probably want to use int32 for indexing there and we should support that, but this is the exception, not the rule.
Per the above, this is yet-another way that Mojo is better for performance than Rust.
Thank you for clarifying . I understand your point. I agree that Swift and Go haven’t suffered major issues from having their default Int type be address-space sized.
I am just not convinced that this will translate to Mojo, because Mojo is a lower-level language that will target a much wider variety of devices, including bizarre, quirky accelerators. I expect in certain corners of the Mojo community it will be very common for programs and libraries to be deployed across both 32-bit and 64-bit targets.
IMO this is at least partially on topic. This thread is about how/whether to use Int for indexing, and I have been discussing the risks of using Int for something other than indexing. By combining these discussions, we are examining Int holistically. I suppose I could have spun this discussion out into a side-thread, but that ship sailed ~50 posts ago.
This is the root of our differences in opinion. From what I can tell, your opinions about Int are driven by your (valid!) concerns about the performance and correctness of indexing. As a C++ veteran you expect there to be an Int type, and you’ve found a definition for Int that makes indexing simpler and more efficient than C, so that’s the definition we’re using.
I think that’s fantastic. I am on board 100% with Mojo having simple and efficient indexing. I appreciate you sharing some concrete examples of how Mojo achieves that. You are a compiler expert, and I am not, so I will defer to your expertise on this topic.
I assume that when you read my posts suggesting that we change Int, you see the performance of indexing as being at risk. But I really am not interested in changing the design for indexing. Mojo’s indexing can be exactly like Swift. I am very happy with that.
My grievances with Int are for all of the use cases that have nothing to do with indexing. I am attempting to speak on behalf of all of the future Mojo users who are going to use Int to store arbitrary domain-specific quantities.
Int is the wrong data type to store customer IDs, or site visits, or the total number of files in your file system. Those quantities are independent of the processor’s address space, and can easily exceed 2^31, so when you store those quantities in an Int, you are at risk of writing code that breaks when it runs on a different processor.
The name Int screams “use me for everything”. There is nothing in its name that suggests that it should be used to store quantities related to the size of the processor’s address space. For this reason, I can very confidently predict that if we look at Mojo in 10 years time, Int is going to be heavily misused.
I find this fairly obvious. That’s the problem I’ve been trying to solve.
Rust, Zig, and even C++ don’t suffer from this problem. I have proposed small adjustments to Mojo that would allow Mojo to solve this problem as well, and we don’t need to change how indexing works.
At bare minimum, there needs to be a massive cultural campaign in the Mojo community to teach new users that Int should only be used to store indices and lengths. We would need to hammer this into people’s brains when they read Mojo tutorials, and especially the official docs.
I will now let Modular decide whether they actually care about this. I have said all I can say at this point.
As you know, I don’t think that keying off signedness of an integer type is a good way to go here, such a thing penalizes the default integer type.
This is why I’m questioning if ssize_t is a good default integer type, and whether it’s worthwhile to leave values as IntLiteral until they need to be cast into something else so that we can safely cast to UInt when the value isn’t negative. If I remember correctly, part of the reason why we don’t have IntLiteral as the default Int type is because of a lack of implicit constructors and an inability to constrain constructors nicely (since we only had constrained), but I might be getting my timelines mixed up. If we assume that requires exists, and we have implicit constructors, I’m wondering if we can remove the ergonomics issues that made Mojo step back from IntLiteral in the first place.
This might introduce a tradeoff where code inside of MAX, or other code that deeply cares about compile speed, may need to be explicit about datatypes to avoid a bunch of extra MLIR floating around and the associated compile-time overhead.
Rationale: __getitem__ isn’t special, it is just like any other function that needs to take an integer value, so it should follow the same standard.
I think this is where our disagreement stems from. I see size_t and ssize_t as tied very directly to the address space of the machine, and thus to indexing/pointer offsets, and Mojo’s Int/UInt as an extension of that.
Indeed, that is a good point, Mojo does already support multiple address spaces and supports address spaces with different pointer sizes. This is a great argument to support Indexer for __getitem__/__setitem__ specifically without generalizing to other things.
Now that I think about this more, this may be an issue for developers trying to write portable code that uses multiple devices (ex: using CPU + accelerator in a serving pipeline). If Int is a different datatype on different targets, then there are a lot of portability hazards. Does Mojo have a good way to ‘say’ “The index type for this address space on that target”?
I could easily see someone writing code that has a struct with an Int in it and then trying to send that to a kernel on a 32-bit device (like any major NPU), only to have the struct mysteriously lose a few bytes. My hope was that at some point Mojo would be able to use the various “trivials” to automatically determine types which are safe to move between devices, but any type containing an index won’t always be safe to move between devices. Given that we want to use index in “length of collection”, this could present an issue.
We could parameterize Int and UInt on target and address space to get around this, but that doesn’t seem like a great idea.
This still can cause UB if size < 0. It probably isn’t, but if we want to do this trick I’d prefer to store size as a UInt so that the compiler can help (via checked arithmetic) catch problems.
At the machine level, an unsigned comparison against N will implicitly exclude negative numbers. There’s no UB. That’s all I’m saying.
My understanding was that ARMv8 wants you to use different instructions (example from ARMv8 ISA manual):
Which means comparing signed and unsigned requires you to cast one of them: if you bitcast a negative two’s complement signed value to unsigned you get a set high bit, and conversely an unsigned value with the high bit set becomes negative if you bitcast it back. You can do 2 comparisons, or do UInt(a) < b and Bool(a >> 63).
Is that particular issue of loop transformations relevant to Mojo?
The problem is that this transformation is invalid for C, because n in the original program might be UINTMAX, and therefore, the original loop might be infinite, whereas the resultant loop is not - this breaks the “as if” rule.
This is one of the many reasons why C makes int be undefined on overflow. It allows the compiler to make this transformation happen even when people use int as an induction variable.
For Mojo, given that range is not inclusive, if it were to use UInt, for i in range(0, UINTMAX) is not an infinite loop, because range loops use a less-than comparison. Users would need to manually write a while loop (or create a range_inclusive iterator) in order to get i <= UINTMAX. If I write while i <= i.MAX: for any integral type, it’s an infinite loop.
Since range prevents that issue by construction, anyone who ends up with their own range type which uses Int32 as the index variable and stores the maximum as a some wider type has written buggy code. Even a range type which is generic over integral SIMD types would avoid implicit casts causing an issue here, even for UInt8.
I agree with you that using (U)Int32 for indexing is incorrect and a source of bugs for most cases, and the people who should be using it know why they want it. However, there are things which aren’t 32-bit address spaces where I can say with very high confidence that 32 bits or even 16 bits is enough, such as “number of nodes in a distributed database”, or an array of peer session information in a web server (16 bit port limit for TCP and UDP per IP address).
Rust encourages users to use fixed-bit-width types for as much as possible. In my experience, most developers end up using u32 or i32 because they are large enough that overflowing them often indicates a bug.
Per the above, this is yet-another way that Mojo is better for performance than Rust.
Cache occupancy vs the compute cost of upcasts is going to be an application by application decision. Given that modern cores typically offer more compute than most applications can saturate, I would actually place my bets on using less cache being better.
The best data I have is from DPDK’s ring_perf_autotest. This test shows throughput for moving “pointers” to data through a set of producer-consumer queues. In this case, “zero-copy” means a load from one queue and a store directly to another, instead of having the batch dequeue into an intermediate array. Evaluated on an isolated core on my Zen 4 laptop (so noisy data) with 1G hugepages, RCU and timer ticks enabled:
### Testing compression gain ###
### Testing zero copy ###
elem APIs (size: 8B) - burst zero copy (n:8 ) - cycles per elem: 14.203
elem APIs (size: 8B) - burst zero copy (n:32 ) - cycles per elem: 4.564
elem APIs (size: 8B) - burst zero copy (n:64 ) - cycles per elem: 2.984
elem APIs (size: 8B) - burst zero copy (n:128) - cycles per elem: 2.904
elem APIs (size: 8B) - burst zero copy (n:256) - cycles per elem: 1.949
### Testing zero copy with compression (16b) ###
elem APIs (size: 8B) - burst zero copy (n:8 ) - cycles per elem: 15.441
elem APIs (size: 8B) - burst zero copy (n:32 ) - cycles per elem: 4.256
elem APIs (size: 8B) - burst zero copy (n:64 ) - cycles per elem: 2.618
elem APIs (size: 8B) - burst zero copy (n:128) - cycles per elem: 1.662
elem APIs (size: 8B) - burst zero copy (n:256) - cycles per elem: 1.024
### Testing zero copy with compression (32b) ###
elem APIs (size: 8B) - burst zero copy (n:8 ) - cycles per elem: 13.289
elem APIs (size: 8B) - burst zero copy (n:32 ) - cycles per elem: 4.188
elem APIs (size: 8B) - burst zero copy (n:64 ) - cycles per elem: 2.585
elem APIs (size: 8B) - burst zero copy (n:128) - cycles per elem: 1.749
elem APIs (size: 8B) - burst zero copy (n:256) - cycles per elem: 0.991
### Potential gain from compression (16-bit offsets) ###
Gain of -8.7% for burst of 8 elems
Gain of 6.8% for burst of 32 elems
Gain of 12.2% for burst of 64 elems
Gain of 42.8% for burst of 128 elems
Gain of 47.4% for burst of 256 elems
### Potential gain from compression (32-bit offsets) ###
Gain of 6.4% for burst of 8 elems
Gain of 8.2% for burst of 32 elems
Gain of 13.3% for burst of 64 elems
Gain of 39.8% for burst of 128 elems
Gain of 49.2% for burst of 256 elems
These values need a pointer to the arena allocator they come from to convert into proper pointers, but for producer-consumer systems which try to maintain ordering guarantees there is often a single core which needs to touch all of the work items in order to keep track of ordering, so reducing the cache usage for that single core can be a large performance boost, because otherwise data can spill to main memory. In my Rust bindings to DPDK, I have indexing using UInt16Ptr and Int32Ptr, which are both “newtypes” for pointer compression, for this reason. Making, say, InlineArray unable to accept non-Index types means that Mojo ends up with a similar ecosystem split to Rust and C++, where people doing high performance things have to go build their own datatypes and you get ecosystem fragmentation. This is part of why my IO proposal is so complex: I want to stop that from happening across as much of Mojo as possible, so that people who need performance can use normal libraries and normal libraries get to benefit from performance-oriented tuning.
I agree, software development as a field has decades of conditioning to use int as the default type, so I think that we need to be very careful what a type called Int does. In the early days of C, int was always machine-word-sized, and then people got used to it being 32-bit, so now we’re stuck with that. I think that Mojo is going to inherit what I’ll call “cultural debt” from people who assume that it is 32 bits. One other unfortunate consideration is that LLMs used to assist in porting libraries often make that assumption, and generate bindings or signatures assuming that Mojo Int is the same as C int. Those are going to be even harder to change than people’s habits, especially if people aren’t aware of the difference themselves. I think the best way to do this is to not have a type named Int.
Given that Int can be different sizes inside of the same program due to using an accelerator, I think we need to take a step back and reconsider how to handle mixed address spaces and address space sizes in a way that isn’t going to cause a lot of problems for users. Making a type which can be different widths in different functions in the same file the default seems very, very hazardous to me and I think many people will find it confusing. Eventually, people will want to move types other than LayoutTensor to a GPU and I think that having an ecosystem split of “libraries that need to send stuff to the GPU don’t use Int” would be bad for mojo.
There are always cursed ways to be backwards compatible. These always have different tradeoffs.
For example, x86 chose a slightly tweaked architecture, while arm64 opted for more drastic changes. However, 32-bit support was more costly for Arm processors, so they dropped it. Thus I think Mojo should learn from x86.
One way to be compatible with python would be to change the semantics of indexing in def and fn.
def py(l: List[T]): l[-1] # translates to `l.get_relative(-1)`
fn mj(l: List[T], i: Int): l[i] # translates to `l.get_absolute(i)`
I think this would be a great compromise between compatibility and performance. What do you think?
What do you think about overloading IntLiteral? I assume most uses of negative indexing are constant values.
Do you think keyword subscripts are more ergonomic than freestanding methods?
Compare .at(-1) vs [get_relative=-1]
One additional thing I’ll toss on that I just learned is that WASM was 32-bit-only until about 8 hours ago, meaning that a LOT of edge-worker runtimes, like Cloudflare Workers, are 32-bit. It also means that we’re likely stuck with 32-bit WASM in browsers for a long time, since Safari doesn’t support the proposal for 64-bit address spaces, nor does Wasmer (one of the more popular server runtimes). We’re stuck with an LTS version of NodeJS that doesn’t support it until 2027 as well.
If Mojo wants to support WASM for “AI in the browser” stuff, then 32-bit is going to be around for a while.
Also Espressif’s ESP32 (32-bit RISC-V) line of microcontrollers, where some support vector extensions, is likely to be used for edge AI workloads as well. At least it’s my dream to be able to use stuff like that with Mojo sometime in the future.
I’ll use this opportunity to say that I still see no compelling arguments against just doing:
Instead of forcing the whole ecosystem to use Int for “unification”'s sake. We have an advanced type system that can let people choose the datatypes for their use-case, why would we force people to use just a single type?
Also, again, I’m strongly against letting people pass Ints around assuming they are bigger than zero due to this. We have a great type system and can provide some safety guarantees with it, why throw that away?
Out of curiosity I looked up the history of isize and usize in Rust, which correspond to size_t in C++.
It turns out these types were originally named int and uint. The core team initially decided that int was a reasonable name, but then changed their minds after community feedback and after discovering incorrect uses of int in the stdlib, libraries, and even the Rust docs. The name was changed to isize to more clearly signpost the type’s purpose.
Folks here may find those old Rust threads insightful.
I wasn’t aware of this history, but this feels like a warning for Mojo. Rust started with exactly the same thing, int/uint and wrapping math, and ended up moving to where Rust is now, and I think it would be foolish of us to not study why they did that. A large portion of Mojo’s audience currently comes from C++ developers, the exact same audience Rust primarily pulled from, and I don’t think it’s unreasonable to expect the same things to happen.
I see a lot of the arguments made in that thread echoed by Chris, as well as seeing a lot of my arguments. Once again, I think it’s a good idea to take a look at how not having int has worked out for Rust.
I’m sorry for the delay, just catching up. Just to confirm:
Yes, that is the intention, Int should be used for almost everything, unless there is a good reason to use a more specific type. For example, you might want to use a UInt8 when you want to decode a UTF8 string or something.
I agree - If you do actually need to work with quantities that can be greater than 2^31, then you should use a more specific type (like Int64), just like you should use UInt8 when doing specialized byte-level work.
As I mentioned above, this is how languages like Swift and Go have worked for many years, and it has worked out fine. You’re right that there is a possibility that someone writes non-portable code (i.e., code with a bug in it), but the alternative approach doesn’t address that either. The alternative is to use explicit sizes everywhere, forcing developers to think about size all the time - accidentally using u32 for such applications will have the same bug.
Further, you’re neglecting the benefit of a single unified integer type - it makes literally all APIs everywhere simpler, it makes the language easier to teach and use, and it makes components written by disparate developers interoperate more seamlessly by default.
There is no perfect solution of course, but in my experience, this is clearly the right tradeoff.
If you do actually need to work with quantities that can be greater than 2^31, then you should use a more specific type (like Int64), just like you should use UInt8 when doing specialized byte-level work.
Does this mean you want to close the door to Mojo supporting 8-bit or 16-bit processors? If not, then that number needs to go down to 2^15 or 2^7. I don’t think that is a door that can be re-opened once developers start making the assumption that Int is 32 or more bits.
You’re right that there is a possibility that someone writes non-portable code (i.e., code with a bug in it), but the alternative approach doesn’t address that either. The alternative is to use explicit sizes everywhere, forcing developers to think about size all the time - accidentally using u32 for such applications will have the same bug.
Even if that door is closed, developers are still stuck asking “can it be larger than 2^31?” for every quantity to determine whether Int is safe to use, the opposite of using u32 improperly. Many things involving quantities and sizes are not categorically limited to less than 2^31. You can’t even safely ask huggingface or other HTTP servers for file sizes in Int, meaning that HTTP libraries will be forced to use Int64 or UInt64. You can’t even safely ask a GPU for its vram capacity in Int, since I can take a 4090 and use 32-bit CUDA drivers with it on x86.
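To make the overflow concrete, here is a small sketch. Python stands in for the hypothetical 32-bit target, and `ctypes.c_int32` models what a 32-bit signed Int would do with an HTTP Content-Length larger than 2^31 (the 3 GiB file size is an illustrative assumption):

```python
import ctypes

# A 3 GiB file, e.g. a model checkpoint advertised via HTTP Content-Length.
content_length = 3 * 1024**3  # 3_221_225_472, which exceeds 2**31 - 1

# Truncating it to a 32-bit signed integer, as a 32-bit Int would,
# wraps it into a negative "size".
as_int32 = ctypes.c_int32(content_length).value
print(as_int32)  # → -1073741824

assert as_int32 < 0
```

Any library that stores such a size in a 32-bit Int silently corrupts it, which is why HTTP libraries would be forced onto Int64/UInt64 regardless of the default.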
This brings me directly to another issue: which target’s address space is Int defined by? Do values change ABI when passed from a 64-bit to a 32-bit target or vice versa? I think it’s a very poor idea to have a default integer type which isn’t safe to pass to an accelerator in the accelerator programming language. NVIDIA and AMD spoil us by having GPUs be 64-bit alongside mostly 64-bit CPUs, but I have 2 ML accelerators in my server room which have 32-bit cores driving the compute and run circles around DC GPUs for scale-out bandwidth (3.2 Tbps per device, avoiding PCIe bottlenecks by merging the NICs and the compute). I have no data to back this up, but I think that parameterizing Int on the target it came from might be even worse than value range tracking as far as the compiler is concerned.
it makes literally all APIs everywhere simpler
I’m not sure it does. It makes the type signatures of APIs simpler, but so did the original form of def, which just made everything object. I have seen far too many “must be greater than or equal to zero” comments in various languages, which often signal either that UB lies ahead if the comment is ignored, or that there’s an extraneous compare-and-branch which I must hope the compiler can eliminate.
```mojo
fn foo(i: Int):
    """`i` must be >= 0."""
    ...

fn foo(i: UInt):
    ...
```
These two functions are equivalent, yet in one case the compiler can automate the check and it is assisted in proving that the invariant is correct, and in the other the programmer must read the documentation. If I had a dozen functions with various restrictions on integer ranges, would the API be simpler if I used Integer[min=..., max=...] or doc comments to express that invariant? I’d argue that the API contracts are equivalent, but the effort required for the programmer to uphold the API is lower when the compiler understands the API contract. Casts which may be required are a replacement for raises ValueError or an opportunity for UB, with the benefit that if you have already proved the value to be valid you may avoid the check.
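To illustrate the `Integer[min=..., max=...]` idea, here is a rough Python approximation (the `BoundedInt` name and its fields are hypothetical, invented for this sketch): the range is checked once at construction, so every function taking the type can drop both the doc comment and its own defensive check.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoundedInt:
    """A value constrained to [lo, hi], validated once at construction."""
    value: int
    lo: int = 0
    hi: int = 2**31 - 1

    def __post_init__(self):
        if not (self.lo <= self.value <= self.hi):
            raise ValueError(f"{self.value} not in [{self.lo}, {self.hi}]")

def foo(i: BoundedInt) -> int:
    # No "i must be >= 0" doc comment needed: the type carries the contract,
    # and no bounds check is repeated here.
    return i.value * 2

print(foo(BoundedInt(21)))   # → 42
# BoundedInt(-1) raises ValueError at the construction site, not deep in foo.
```

A real compiler-backed version could additionally elide the constructor check when the value range is already proven, which is exactly the benefit argued for above.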
it makes the language easier to teach
It will also require us to constantly explain why List[Int].sum() has half the throughput of std::reduce(vec_int.begin(), vec_int.end()) (or equivalent for other operations). I don’t think that, in a language centered around SIMD, you can prevent developers from having to learn about fixed-width types. Even if you could, I think that origins, dependent/linear types, comptime evaluation of functions, move/copy ctors, and a myriad of other things makes the bar for learning Mojo high enough that “Use the smallest type which fits your data” is going to be a tough lesson.
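The factor-of-two claim falls out of vector-width arithmetic. A back-of-the-envelope sketch, assuming a loop bound by SIMD width rather than memory bandwidth (the 256-bit register width is just an example, e.g. AVX2):

```python
VECTOR_BITS = 256  # e.g. an AVX2 register

# Elements processed per vector operation at each element width.
lanes_int64 = VECTOR_BITS // 64  # what a 64-bit Int gets you
lanes_int32 = VECTOR_BITS // 32  # what a 32-bit int gets you

print(lanes_int64, lanes_int32)  # → 4 8
assert lanes_int32 == 2 * lanes_int64
```

Half the lanes per operation is where the roughly half throughput for `List[Int].sum()` versus a 32-bit reduction comes from.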
and use
I’d argue the opposite: since it can’t change API contracts, it instead forces users to read doc comments explaining invariants the compiler is perfectly capable of enforcing, and then forces library authors to choose between UB and bounds-checking overhead via a raises ValueError or abort. The inability to have the compiler help me find underflows in places where a negative number signals a bug is also unpleasant. It does make writing quick scripts easier, but my personal overlap between “quick script where I don’t care about long-term maintainability and debuggability” and “I don’t really care about performance or can just use numpy/pandas” is fairly high.
it makes components written by disparate developers interoperate more seamlessly by default
Could you point me to languages which have had this ecosystem fragmentation around integer types? In my subjective experience, none of the languages I am familiar with have really had that problem, and I’d like to gather more data.
As I’ve said before, Rust should have the worst version of this problem of all major languages. To me, the low frequency of casts in the Rust data I’ve gathered compared to the Mojo stdlib + MAX make this less of a concern. I think that casts represent API contract mismatches, which I don’t think we can get rid of without leaning heavily on the compiler to look at value ranges and determine when implicit casts are safe or when bounds checks are redundant. I’d personally much rather have a cast than have the possibility of my hot loop growing a branch due to a compiler update because functions are forced to defensively bounds-check values.
There is no perfect solution of course, but in my experience, this is clearly the right tradeoff.
I agree there is no perfect solution, but I can’t help but see a wide variety of issues that result from what feels to me like an odd reversal of course: away from “we have a fantastic type system and are using it to get both correctness and performance” and towards a world where the compiler can’t help us as much and we are forced to handle things by hand.
Just to give an anecdote from literally the beginning of this week: I was parametrizing the number of threads a kernel had to spin up using gpu_info.threads_per_sm, and it took me a couple of minutes to realize that the struct field itself had the value -1. It is a perfect example of a concept that is wrongly expressed in code, and a very dirty habit inherited from previous decades. If a value can be unknown, it should be modeled that way. If a value cannot be negative, it should be modeled that way.
I do not think we should go down this path of chain passing Ints assuming they are bigger than 0. I do not think we should rely on light “read the documentation” contracts; I know a lot of developers that don’t ever read documentation (which is a reality we have to deal with). Safety isn’t only about memory safety, it’s about not letting developers be lazy where that laziness can cause UB.
Hey everyone — this thread has brought up a lot of thoughtful points, but it’s become a bit sprawling. There are several topics being discussed simultaneously, so it’s difficult to keep track; I’ll try to summarize the main talking points so we know the actual scope of what’s being discussed.
- What type should be returned from `len`: `var x: ??? = len(collection)`
- What type (or a type bounded on a trait) should index operations take: `fn __getitem__(idx: ???)`
- Collection APIs vs. low-level pointer types.
  - Also pointer arithmetic.
- What is the default type when a value is assigned from a literal: `var x: ??? = 123`
  - What happens when the value is larger than the representable type?
- What is the “correct” name for these types: `Int`/`UInt`, `Size`, `Offset`, etc.
- (other talking points I didn’t capture here)
It would be helpful if we could condense people’s opinions on these ~4ish points so we can document them and reach actionable decisions from this discussion. There’s obviously some disagreement between people, which is perfectly fine - but we should strive to come to a mutual understanding even if we are unable to agree on every individual opinion.