Unions vs sum types

Continuing discussion: @owenhilyard

  1. Unions and sum types are separate concepts. A sum type is checked, which is why we expect there to be control flow involved in accessing it, whereas a union is an unsafe construct with no expectation of safety and thus no need to check things.

  2. What you’re proposing eliminates one of the biggest benefits of sum types, exhaustiveness. When using sum types to specify a communication protocol, the ability to get compiler errors everywhere you need to update to deal with the new variant is valuable.

  3. It’s also very useful to do “deeper” pattern matching, such as Some(Ok(3)).
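To illustrate point 3, here is a small Rust sketch of nested pattern matching (Rust is used since it appears in snippets later in the thread; `describe` and its result strings are invented for the example):

```rust
// Nested ("deep") pattern matching: check two levels of sum type at once.
fn describe(x: Option<Result<i32, String>>) -> &'static str {
    match x {
        Some(Ok(3)) => "exactly Ok(3)",
        Some(Ok(_)) => "some other success",
        Some(Err(_)) => "a failure",
        None => "nothing",
    }
}

fn main() {
    assert_eq!(describe(Some(Ok(3))), "exactly Ok(3)");
    assert_eq!(describe(Some(Err("boom".into()))), "a failure");
    assert_eq!(describe(None), "nothing");
    println!("ok");
}
```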

Sum type is the wrong name for the right concept.

struct Indirect:
   var pointer: OpaquePointer
   var type_id: Int

struct Large:
    var age: Int
    var name: Int
    var gender: Int
    var tag: Int
    
struct Small:
    # This only works in safe code because the fields are the same type.
    # Here, the name is stored in a string table; storing it as a string does not work.
    var age_name_or_gender: Int
    var tag: Int

enum Small:
   var age: Int
   var name: Int
   var gender: Int
    
# 32 bytes vs 16 bytes

union Small:
   var age: Int
   var name: Int
   var gender: Int
# 9 bytes in struct-of-arrays or packed layout
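The same layout effect can be observed in Rust, where an enum overlaps its payloads the way a union does (the `i64` fields are stand-ins for the `Int` fields above; sizes assume a typical 64-bit target):

```rust
use std::mem::size_of;

// Every field gets its own offset: 4 * 8 = 32 bytes.
#[allow(dead_code)]
struct Large { age: i64, name: i64, gender: i64, tag: i64 }

// Payloads share one 8-byte slot, plus an 8-byte-aligned tag: 16 bytes.
#[allow(dead_code)]
enum Small { Age(i64), Name(i64), Gender(i64) }

fn main() {
    // On x86-64 this prints "32 bytes vs 16 bytes".
    println!("{} bytes vs {} bytes", size_of::<Large>(), size_of::<Small>());
}
```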

I don’t know why there are so many different terms for the concept of a safe union.

safe union
tagged union
union (enum)
enum
Variant
Sum Type
Algebraic data type
Any other names?

Union: same offset in memory, size of largest field + padding #.access
Tag: int; can at least hold the number of fields.
fn tag(union: SomeTaggable) -> Int # builtin
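Rust exposes a comparable builtin as `std::mem::discriminant`, which returns an opaque tag value rather than a raw `Int`. A quick sketch (the `Small` enum mirrors the example above):

```rust
use std::mem::discriminant;

#[allow(dead_code)]
enum Small { Age(i64), Name(i64), Gender(i64) }

fn main() {
    let a = Small::Age(1);
    let b = Small::Age(2);
    let c = Small::Name(3);
    // Same variant => equal discriminants, regardless of payload.
    assert_eq!(discriminant(&a), discriminant(&b));
    // Different variant => different discriminants.
    assert_ne!(discriminant(&a), discriminant(&c));
    println!("ok");
}
```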

The first example is simpler because it does not require references. Even if you don’t know anything about references, you can still mutate the fields of a union without having to argue with the borrow checker.

if name is .success:
    name.success += 1 

Alternative using references

if name is .success(ref name_success):
   name_success += 1 

This code would not compile.

if name is .failure:
    name.success += 1 

This code would also not compile, even though log_name_result does not modify the tag:

if name is .success(ref name_success):
   log_name_result(name)
   name_success += 1
   
fn log_name_result(read name_result: Result[Name]):    ... 

You could try unwrapping first, or modifying log_name_result, but that may not be possible when it is an external function.

I hope you see why separating union access checking and borrow checking is a major simplification.

Saying that in one branch, you can get a copy, a mutable reference, or an immutable reference is too limiting.

I think safe unions are the way to go: expressive, no nonsense, natural code that is easily explainable.
They’re just a struct with fields at the same offset and a tag.

They avoid so many complexities, like the match ergonomics in Rust.

Why does Swift have so much sugar for the optional type, and why does Rust have so many methods on Option?

Swift: language complexity
Rust: library complexity

I think the model is flexible enough that all methods on Optional other than value and unsafe_value should be removed if that is implemented.
We don’t need map or flatMap; they make it hard to reason about control flow.

This proposal is simpler to implement because it doesn’t require fancy pattern matching, just union access checking, which is more of a control flow thing.

Python has done a great job with the match statement. I am not proposing that pattern matching go unimplemented.

Exhaustive vs Non-Exhaustive:

Python’s match is not exhaustive.
Mojo could have something like this:

@exhaustive
match x:
    case .success(s):    pass
    case .something_went_wrong(w):   pass

Types could be annotated with something like @require_exhaustive.
This would be the best of both worlds: compatibility with Python and low boilerplate.
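For comparison, this is roughly what exhaustive matching buys in Rust, where match over an enum has no implicit else (`NameResult` and its variants are invented here to mirror the .success/.something_went_wrong example above):

```rust
enum NameResult {
    Success(i64),
    SomethingWentWrong(String),
}

// No `_` arm: if a third variant is added later, every such match
// becomes a compile error pointing at the code that needs updating.
fn describe(r: &NameResult) -> String {
    match r {
        NameResult::Success(s) => format!("success: {s}"),
        NameResult::SomethingWentWrong(w) => format!("failed: {w}"),
    }
}

fn main() {
    println!("{}", describe(&NameResult::Success(7)));
    println!("{}", describe(&NameResult::SomethingWentWrong("oops".into())));
}
```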

I don’t know why there are so many different terms for the concept of a safe union.

Sum Type is the name for it from type theory and category theory, and the oldest name, which is why many people use it. A struct would be a product type. Sum types have a number of states equal to the sum of the number of states of each variant; structs/product types have a number of states equal to the product of the number of states of each field. This is a really annoying thing to try to communicate to people, especially new programmers, so different names were coined as attempts to make it more approachable. Variant in Mojo and std::variant in C++ are a poor version of sum types because they lack the language-level mechanisms to do much aside from crash when the user does the wrong thing, and compile-time error detection is harder.

Algebraic Data Types, ADTs, is the category for both sum and product types. So, you can say that Rust has ADTs via its structs and its enums.

The reason it needs to have language support is first for match, which makes a lot of things much nicer, and second because you can do things like niche optimizations. For instance, a NonNullPointer[T] means that Optional[NonNullPointer[T]].None can be a null pointer, which means you get the same perf as you would in C but with mandatory error checking. All of the benefits, none of the correctness drawbacks. For many structs, it is just a tag, but niche optimizations help a lot in many cases involving numbers.
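The niche optimization mentioned here can be observed directly in Rust with `std::ptr::NonNull`; this sketch only demonstrates the layout claim, not Mojo's design:

```rust
use std::mem::size_of;
use std::ptr::NonNull;

fn main() {
    // NonNull<T> can never be null, so Option can reuse the null bit
    // pattern to represent None: no separate tag is needed, and the
    // Option is the same size as a raw pointer.
    assert_eq!(size_of::<Option<NonNull<u8>>>(), size_of::<*mut u8>());
    // With no niche available, the tag costs real space:
    assert!(size_of::<Option<u64>>() > size_of::<u64>());
    println!("ok");
}
```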

The first example is simpler because it does not require references. Even if you don’t know anything about references, you can still mutate the fields of a union without having to argue with the borrow checker.

A pure, C-style union usually doesn’t have a tag, and as such you have a hard time knowing which variant is “active” at a given time without information the compiler doesn’t have. This is what makes unions such a compiler hazard, but it is also why they are one of the safer ways to do type punning in C. You have to ask what the expected behavior of something like the following should be:

if name is .failure:
    print(name.success)

This invokes a lot of nasty questions, and as a result this kind of thing is often relegated to UB or IB and placed behind “unsafe” interfaces.

A match statement prevents this from happening by forcing you to do the check and grab the data in one operation. This avoids the problem that you mentioned with checking the wrong variant with much, much less compiler complexity. Nothing stops you from doing a read, mut or var version of match to get the behavior you want, and Rust actually does this. I don’t think I’ve ever really fought with the borrow checker around Rust’s way of doing sum types more than I have for a normal struct.

What you seem to be talking about is an equivalent to Rust’s

if let Success(name_success) = &mut name {
    log_name_result(&name);
    *name_success += 1;
}

This uses pattern matching for only checking one variant, and you can have an attached “else” for the failure case.

Why does Swift have so much sugar for the optional type, and why does Rust have so many methods on Option?

The reason those methods on optional exist is because treating options as monads means that you can remove a lot of visual complexity and make it so that people can easily see the happy path, while the functions will just pass along a None. Nothing stops you from using a more match-based approach to doing the same thing, but for FP people it makes a lot of sense to provide those features and it’s not really a lot of complexity in the stdlib to add all of that. Something like an or_default(default_value) can easily be implemented by hand with match, but it’s a reusable bit of functionality a lot of people want. FP people will also say that a lack of map or flatMap makes it hard to reason about control flow because that’s what they’re used to. I don’t see any reason not to have both the imperative and functional styles since it’s trivial to implement at the library level.
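To make the comparison concrete, here is a Rust sketch of a hand-written `or_default` next to the equivalent Option combinators (`or_default` is the hypothetical helper named above):

```rust
// The reusable bit of functionality, written by hand with match.
fn or_default(x: Option<i32>, default: i32) -> i32 {
    match x {
        Some(v) => v,
        None => default,
    }
}

fn main() {
    assert_eq!(or_default(Some(5), 0), 5);
    assert_eq!(or_default(None, 0), 0);
    // The monadic style: `map` silently passes a None along, so the
    // happy path reads linearly.
    assert_eq!(Some(21).map(|v| v * 2).unwrap_or(0), 42);
    assert_eq!(None::<i32>.map(|v| v * 2).unwrap_or(0), 0);
    println!("ok");
}
```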

Python’s match is not exhaustive.

imo, this is a mistake. Exhaustive match provides many, many benefits for refactoring and helps downstream users when code gets updated. The ability to have an “else” case handles a lot of the situations where a non-exhaustive match makes sense. From what I’ve seen, exhaustive is what people want in most cases, so we should provide that by default. Python can’t do that because of limitations of Python, but we can in Mojo.


OP was talking about a safe version of unions. He wrote an example that was very similar to yours and said “This code would not compile.”

He’s envisioning flow sensitive type narrowing, in the style of TypeScript or Kotlin.

He’s envisioning flow sensitive type narrowing, in the style of TypeScript or Kotlin.

While that’s nice to have, I’ve seen that break far too many times. I’d rather just have if let and match and have it work consistently.


It only breaks in languages with uncontrolled aliasing. That doesn’t apply to Mojo.

What I mean is that I’ve seen compilers fail to properly refine the type, especially past complex control flow.

Pattern matching is the canonical form of structural, flow-sensitive type narrowing (in a typed language with sums).

Yes, pattern matching fixes the issues that I see. I meant in terms of languages which don’t have pattern matching and thus are forced to do control-flow-sensitive type refinement for sum types, potentially past an arbitrary amount of control flow.

Right, and I agree with you. I meant to reply to Nick.

Example: when guard kills exhaustiveness checking.
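A Rust sketch of that guard problem (`classify` and its labels are invented): even when the guards logically cover every value, the compiler does not reason about guard expressions, so an unguarded arm is still required.

```rust
fn classify(x: Option<i32>) -> &'static str {
    match x {
        Some(n) if n > 0 => "positive",
        Some(n) if n <= 0 => "non-positive",
        // The two guards above cover every i32, but the compiler
        // cannot prove that, so this arm (or a wildcard) is still
        // required for the match to count as exhaustive.
        Some(_) => unreachable!(),
        None => "none",
    }
}

fn main() {
    assert_eq!(classify(Some(3)), "positive");
    assert_eq!(classify(Some(-1)), "non-positive");
    assert_eq!(classify(None), "none");
    println!("ok");
}
```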

@unsafe_union_property_descriptor
@checked_union_property_descriptor
enum Optional[T: AnyType]:
    var none
    var some: T
    @always_inline("union_check")
    fn __is__(self, _none: NoneType) -> Bool:
        return tag(self) == 0
    # constructors are checked based on functions with signature (enum, any) -> Bool
    @always_inline("union_check")
    @implicit
    fn __init__(out self, _none: NoneType):
        self = .none
    @always_inline("union_check")
    @implicit
    fn __init__(out self, var value: T):
        self = .some(value^)
    # generated by property descriptors (overloads omitted for brevity)
    @property
    fn checked_union(self) -> CheckedUnionPropertyDescriptor[Self]: ...
    @property
    fn unsafe_union(self) -> UnsafeUnionPropertyDescriptor[Self]: ...
    # generated by the compiler for enums
    fn __tag__(self) -> Int: ...
    @always_inline("union_check")
    fn __is__(self, enum_literal) -> Bool: ...
    @always_inline("union_check")
    @implicit
    fn __init__(out self, enum_literal): ...
    # calls union_access_is_safe 
    @property
    fn none(self) -> None: ...
    @property
    fn some(var self) -> T: 
        constrained[union_access_is_safe(self, enum_literal), "illegal access of union field, the compiler cannot prove that field .some is initialized"]()
        return self._union.some^

This API is more flexible and powerful than pattern matching. For instance, pattern matching cannot express lazy intersections (and). I consider pattern matching to be a less capable syntax that is often even unergonomic.
Relying solely on pattern matching leads to more branches, generating many dead basic blocks. This prevents optimizations, bloats compile times, and takes up memory.
Safe unions are a turbo optimizer - the user writes efficient code by default.

b = .success(c)
if a:
    b.success += 10

if opta is .some and optb is .some:
    opta.some += optb.some

Safety Levels

So why call them checked unions? They are called unions because they are not considered compile-time safe.
Checked provides runtime memory safety, and Unsafe provides no safety guarantees. Opting out of a safety guarantee requires an explicit call to a union descriptor.

  1. safe
  2. checked
  3. unsafe

Note that in critical software such as the Linux kernel, a runtime panic is considered a security vulnerability.

I wrote a rough compiler desugaring of union_access_is_safe. Below is a short summary. Note that the type checker is completely independent of the union access checker; union_access_is_safe is only called if an enum field is accessed, which does not include @always_inline("union_check"). The union access checker usually only traverses a few basic blocks and is lazy by default.

This means that optional.some is required to access the value of an optional; there are no control-flow-dependent smart casts, unlike in Kotlin.


# TRAVERSE UP
# a. *this can be either an assignment `optional = .none` or a comparison `if optional is .none`
if the last statically known tag* of "union_check" in branch != tag of union access:
    return False
# TRAVERSE DOWN
# a. looking for `result = function_call()` or `function_call_ref(result)` where ref is mut
if the enum tag of child_node is reassigned or modified:
    return False

We are talking about adding a function scoped and easily parallelizable compiler pass.
This can eliminate the need for many other, more complicated compiler passes and language features, such as a more complicated lifetime checker and a complex pattern-matching code generation pass.

It’s a short-term investment that simplifies the compiler in the long term.

For instance, pattern matching cannot express lazy intersections (and).

match (a, b) {
    (Some(a_val), Some(b_val)) => ...,
    _ => ...
}

Yes it can

So why call them checked unions

This is why I don’t use that term; I use “sum types”. When I say “union”, I am referring to C-style constructs.

Note that in critical software such as the Linux kernel, a runtime panic is considered a security vulnerability.

Pattern matching usually encourages you to handle all possibilities. This makes it more likely that you will avoid a situation where you accidentally access data that you shouldn’t have or accidentally bitcast data.

I wrote a rough compiler desugaring of union_access_is_safe.

Due to the existence of niche optimizations, that isn’t possible to do in a general case. What pattern matching does is it works more closely with the compiler to prove that a particular variant of the union is active, which then eliminates further checks in that scope.

We are talking about adding a function scoped and easily parallelizable compiler pass.
This can eliminate the need for many other, more complicated compiler passes and language features.

Figuring out whether a user has checked some property through arbitrary control flow is hard. For instance, how do you handle an await point between the check and the use? With pattern matching, the borrow checker protects the sum type’s variant and value. With arbitrary control flow, you’d have to have some mechanism to borrow values after they are checked, lasting until after the last use.


Pattern matching is not merely syntax (even if it tends to come with its own sugary surface form), but the elimination form of sums. The whole point is that, in each branch, the compiler knows statically which variant you are in and what the payload type is, so the destructuring is safe by construction.

if let (Some(a), Some(b)) = ...?

Citation needed. Pattern matches are normally lowered to the same tag tests and branches you would write by hand, and often to jump tables, which are even better optimized.

A few other things:

  • We already (kinda) have irrefutable patterns in the language, so some form of pattern-matching support is already baked in.
  • If enum field access is entirely flow-sensitive, how does that interact with loops, higher-order functions, early returns, try ... except ..., and so on?
  • For a type like type T = A(Int, String) | B(Float32), where a single constructor carries multiple payloads, you either return tuples/structs and immediately unpack them or you provide some equivalent of a deconstructing pattern.
  • When I called pattern matching “the canonical form of structural, flow-sensitive type narrowing”, I meant it in the boring, type-theoretic sense: for sums, it is the minimal, well-understood core operation. Any fancier refinement discipline still has to subsume the rigid cases that a plain match already handles.
  • The “implemented in the library” story is kinda a lie anyway once you introduce conventions like union_check and a magic __tag__. That is de facto language support. And something like __tag__ : Self -> Fin 2 (or Fin n in general) is exactly the sort of invariant you want the language to enforce, since Int is not really precise.

The compiler, at the MLIR level, has a lot more targeted information (“union_check”) than LLVM, which optimizes branches at a much lower level. I don’t expect the union access checker to catch every valid union access, but it would reject every invalid one: false positives yes, false negatives no. Where the checker produces false positives, I wouldn’t expect you to have a good time with pattern matching either, since it is fundamentally inflexible. I have many more examples. Pattern matching rejects too many valid programs, so focusing entirely on patterns would not be wise.

Niche optimizations don’t change any logic, they just store the tag in a layout which is invalid for a particular type.

this is not lazy

LLVM can lower chains of branches to switch instructions, so I don’t really see a difference. The problem is not that the branches are inefficient, but rather that you need more branches.

What do you mean by “lazy”?

The trick is to choose a CFG representation that is low-level enough to abstract these details, yet high-level enough to avoid losing important data.

We could debate object-oriented versus functional versus data-oriented programming, and why the concept of sum types makes no sense. However, I would rather focus on the technical discussion.