Reviewing the Iterator/Iterable Abstraction

I would be fine with static methods if there is no solution other than using objects.

Generator syntax

Introduce a decorator to mark accessible fields. Otherwise, code for every variable would have to be
generated, which could lead to unnecessarily large structs and less optimized code.
The coroutine_name decorator can be used to explicitly name the coroutine.

extension[T: Copyable & Movable] List[T](Iterable):
    alias Element = T

    @coroutine_name("ListIterGenerator")
    fn __iter__(ref self) yields Self.Element:
        # kept in the coroutine frame; can't be optimized out
        @coroutine_state
        count = 0
        while count < len(self):
            yield self[count]
            count += 1


struct ListIterGenerator[T](Iterator):  # fields generated by the compiler
    var _list_: UnsafePointer[List[T]]
    var _goto_state_: UniqueGotoState
    var _count: Int

    fn _get_count(ref self) -> Int:
        # internal logic
        ...

    # `iterated_so_far` extension
    fn iterated_so_far(ref self) -> Int:
        return self._get_count()

You know, lambdas are limited to one expression, which makes it likely that your example will have to be refactored at some point.
I don’t think using 4 iterator adapters is more readable than any of the alternatives.

def return_generator():
    for x in X:
        if cond_x(x):
            for y in Y_fn(x):
                if cond_y(x, y):
                    yield expr(x, y)

def iterate_in_place():
    for x in X:
        if cond_x(x):
            for y in Y_fn(x):
                if cond_y(x, y):
                    use_in_place(x, y)

Also your example does not show why iterators need to be methods:

return X.iter()
    .filter(lambda x: cond_x(x))
    .map(lambda x: (x, Y_fn(x)))
    .filter(lambda (x, y): cond_y(x, y))
    .map(lambda (x, y): expr(x, y))

# can be expressed as:

unnamed_temporary1 = filter(lambda x: cond_x(x), X)
unnamed_temporary2 = map(lambda x: (x, Y_fn(x)), unnamed_temporary1)
unnamed_temporary3 = filter(lambda (x, y): cond_y(x, y), unnamed_temporary2)
unnamed_temporary4 = map(lambda (x, y): expr(x, y), unnamed_temporary3)
return unnamed_temporary4
 

I like take, zip and group_by. Generator expressions return an iterator, which means they can be combined.

I don’t think this is a good idea, since not all iterables are suitable for iterating. With functions, there is also no downside to using Iterable.

The downside to Iterable is that there’s currently no way to express, in that trait, the ability to return owned copies of the underlying thing being iterated. That’s why I’m arguing for it to be removed, so that we can have owning iterators used in for loops/comprehensions.

I also think it’s reasonable to investigate removing the Iterable trait, or at least replacing it with something more comprehensive.

In addition to the issues Seth raised regarding Iterable, it’s quite confusing to have two distinct traits (Iterable/Iterator) for a single language feature. It will likely raise endless questions about the difference between the two traits, and how to use them properly. (“Should my generic function ask for an iterator or an iterable?” “Is a list an iterator or an iterable?”)

We could rename Iterable to Sequence as a starting point, and then offer multiple ways to iterate over a sequence. (By ref, by copy, by “consuming iterator”, etc.)

This is definitely a good idea. I believe the inspiration partially came from Rust, which has Iterator and IntoIterator. Iterable is synonymous w/ Rust’s IntoIterator; however, its spelling is so close to that of Iterator that it can definitely cause confusion.

One thing to consider here - I don’t know if generators/yield is on the immediate timeline for the mojo compiler team, so if we wanted to go w/ the yield route, iterators would be blocked until then.

Also, as a wild idea, it could perhaps be fun if iterators acted a bit more like functional languages or like C++ views/ranges. The syntax could look like:

x.iter() 
    | filter(lambda x: cond(x))
    | map(lambda x: foo(x))
    | filter(...)
    | map(...)

Granted, this would require overloading the | (or) operator, but it’s always fun to think of interesting ideas. We don’t have to go w/ Rust’s approach, as other languages have also taken a stab at this. (Not saying one is better than the other, just worth looking at our options.)


I see Iterable and Iterator as separate but closely related concepts. Iterable is a type with an associated Iterator. Iterator is a type which can be used to walk along some sequence of elements.
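For reference, Python’s dunder protocol encodes exactly this split. A minimal sketch (the Countdown types are made up for illustration):

class Countdown:
    # Iterable: a type with an associated iterator.
    def __init__(self, start: int):
        self.start = start

    def __iter__(self):
        # Each call hands out a fresh iterator with its own state.
        return CountdownIterator(self.start)

class CountdownIterator:
    # Iterator: walks along some sequence of elements.
    def __init__(self, current: int):
        self.current = current

    def __iter__(self):
        # Iterators are themselves iterable, so they work in for loops.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

assert list(Countdown(3)) == [3, 2, 1]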

Strong thumbs down on renaming Iterable to Sequence: Python uses Sequence to mean Iterator, and this will just cause confusion.


Generator Syntax

One thing to consider here - I don’t know if generators/yield is on the immediate timeline for the mojo compiler team, so if we wanted to go w/ the yield route, iterators would be blocked until then.

I think it’s a worthwhile point of discussion. If we know we want to use generators to define most iterators, we need to make long-term decisions that are compatible with that. For example, we will need the ability to extend a generator to add things like bounds. I agree that this may mean iterators, beyond what we need to make a for loop work, would be blocked until then, but that may be acceptable if it means we end up in a better place.


While the pipe operator is interesting, I think it adds a lot of annoying overhead to writing iterators which I don’t see a reason to have. Dot accesses allow for a similar style of chaining and I’ve mostly seen a pipe operator introduced to help avoid sum filter map filter reduce chains in languages that have “inside-out” iterators to make them behave like the chained dot accesses Rust has.


To bikeshed:

This isn’t factually correct. A Python Sequence is just an Iterable with an extra affordance, namely the ability to ask for the nth item in the sequence.

Regardless, we are not beholden to the definition of Sequence in Python. It’s an abstract base class that 95% of Python users have never heard of. The Mojo stdlib can easily define its own version of Sequence just as we define our own version of the list and string types.
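To make the layering concrete, here is a small sketch using Python’s collections.abc (the Evens class is invented for illustration):

from collections.abc import Iterable, Sequence

# Sequence is an Iterable with the extra affordance of __getitem__/__len__.
assert issubclass(Sequence, Iterable)

class Evens(Sequence):
    # The first n even numbers, exposed as a Sequence.
    def __init__(self, n: int):
        self.n = n

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, i: int) -> int:
        if not 0 <= i < self.n:
            raise IndexError(i)
        return 2 * i

# Iteration comes for free as a mixin built on __getitem__.
assert list(Evens(4)) == [0, 2, 4, 6]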

The main reasons to use Sequence would be:

  • The term “sequence” conveys intuition to learners. Every adult human knows what a sequence is. Every data structure that can be iterated over is fundamentally a sequence, as a consequence of having an iteration order. This is a clear and simple mental model.
  • The term “sequence” avoids the visual/verbal similarity between the words “iterable” and “iterator”, which would be constantly appearing alongside one another in code and the docs. This is a huge learning aid, but also, having visually distinct words for these two concepts will improve code readability even for veterans.

Anyway, maybe Iterable is fine. I am not going to push this idea any further.

I believe that the filter-map approach is much more readable, because the function names clearly indicate to the reader what operations are performed on the data.

For example, it is conceptually much easier to chain standard iterator adapters than to read the specific implementation of a for-loop and realise that its condition acts as a filter.

I realise there is a small difference between reading standard iterator adapters vs a condition in a for-loop, but the use of iterators improves:

  1. Code Readability
    • The iterator’s name clearly indicates the operations being performed
  2. Code composability
    • Because the iterator adapters take lambdas, the specific lambda being applied can be determined by complex code elsewhere in the code-base and passed to the adapter in a clear and ergonomic way, as sketched below.
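
A minimal Python sketch of that composability point, where the predicate is chosen by unrelated logic and handed to the adapter (select_predicate and process are hypothetical names):

def select_predicate(mode: str):
    # Hypothetical: the predicate is decided by logic elsewhere in the
    # code base, not at the site of the iteration itself.
    if mode == "evens":
        return lambda x: x % 2 == 0
    return lambda x: x > 0

def process(items, mode: str) -> list:
    pred = select_predicate(mode)
    # The chosen lambda plugs straight into the adapter chain.
    return list(map(lambda x: x * x, filter(pred, items)))

assert process(range(6), "evens") == [0, 4, 16]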

With your example (from what I understand), each operation produces a new copy of the collection.
This is very inefficient for large collections of data.

By using methods, operations on large collections can be lazily evaluated.
This means each item is passed through the entire chain of operations before the next item is processed. This avoids the creation of intermediate collections and reduces memory usage.

Using the Kotlin language as an example, there is a performance distinction between eager evaluation (List) and lazy evaluation (Sequence).

With eager evaluation, an intermediate collection is created for each operation (at the cost of increased memory usage). This is good for operations on relatively small collections.

With lazy evaluation, no intermediate collections are created and elements are only processed when needed (at the cost of function-call overhead). This is good for large collections.
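
In Python terms, the same split looks roughly like this, with a list comprehension playing the eager role and a generator expression the lazy one:

data = range(1_000_000)

# Eager: each step materializes a full intermediate list
# (more memory, fine for small collections).
evens = [x for x in data if x % 2 == 0]
eager = [x * x for x in evens]

# Lazy: each element flows through the whole chain one at a time,
# so no intermediate collection is ever built.
lazy = (x * x for x in (x for x in data if x % 2 == 0))
first = next(lazy)  # work happens only on demand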

Of course, the behaviour for both use-cases could be achieved by hand-writing for-loops and conditions, but I’d wager that this would lead to more cumbersome and error-prone code.

Neither an operator nor chaining is the solution to complex control flow. Python has solved it: just use loops or generators.

I don’t think it is simpler. How are you supposed to know when a side effect is executed? Iterator adapters are useless for any slightly demanding task.

Do you actually write this code:

for x in map(mapper, iter(iterable)):
    pass
for x in filter(filterer, iter(iterable)):
    pass

Instead of just

for y in iterable:
    x = mapper(y)
for y in iterable:
    if not filterer(y):
        continue

Map and filter are particularly useless because they don’t require an induction variable. They are also harder to read because of the closure.

Technically you need to use the transfer operator (^). Even considering this, I don’t think the ergonomics are significantly worse.

`$1` = filter(lambda x: cond_x(x), X)  # Using backticks here because `$` isn't an identifier.
`$2` = map(lambda x: (x, Y_fn(x)), `$1`^)
`$3` = filter(lambda (x, y): cond_y(x, y), `$2`^)
`$4` = map(lambda (x, y): expr(x, y), `$3`^)
return `$4`^

With this convention, methods and functions are guaranteed to produce the same code.

Don’t compare a Java VM with C++, Rust, or Mojo.

Loops are fine. If you want to flatten the logic, you can always use a generator.

Generator expressions don’t require coroutine support and are doable short-term. This should alleviate the need for iterator methods.

Methods also have the disadvantage of accepting an iterator rather than an iterable. Duplicating them for these use cases is not desirable.

Readability over compactness and cleverness.

If AI assists you in writing the code, you still need to audit it. Thus you spend most of your time reading code, and writing speed isn’t an issue.

While the pipe operator is interesting, I think it adds a lot of annoying overhead to writing iterators which I don’t see a reason to have. Dot accesses allow for a similar style of chaining and I’ve mostly seen a pipe operator introduced to help avoid sum filter map filter reduce chains in languages that have “inside-out” iterators to make them behave like the chained dot accesses Rust has.

Neither an operator nor chaining is the solution to complex control flow. Python has solved it: just use loops or generators.

In my opinion, Python’s generator expressions (as used in comprehensions, not the yield-based generators) tend to get unreadable as you stack up longer chains. It’s usually not a simple map and filter where problems happen, but cases where you want to grab a window of a given size, map it, filter that result, make another window, filter that, grab only the first value of the window, and then convert the result to a list.
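
A contrived Python sketch of the kind of stacked chain being described (windows is a hypothetical sliding-window helper, not a stdlib function):

def windows(seq, n):
    # Hypothetical helper: sliding windows of size n over a sequence.
    return (seq[i:i + n] for i in range(len(seq) - n + 1))

data = list(range(20))

# Window, map, filter, window again, filter, take the head, collect:
# each individual step is simple, but the stacked expression is not.
result = list(
    w2[0]
    for w2 in windows([sum(w) for w in windows(data, 3) if sum(w) % 2 == 0], 2)
    if w2[0] > 10
)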

How are you supposed to know when a side effect is executed?

You’re assuming that side effects exist. Iterator adapters, coming from an FP school of thinking, tend to represent pure computations that are mostly going to be a chain of operations that eventually drop stuff into a collection or results in a single value after some reductions.

Do you actually write this code:

for x in map(mapper, iter(iterable)):
    pass
for x in filter(filterer, iter(iterable)):
    pass

You’ll more often see iterator adapters used to transform code like this:

for i in foo:
    if i % 2 == 0:
        if is_prime(i):
            bar.push(i ** 2)

into this

var bar = foo
    .filter(lambda i: i % 2 == 0 and is_prime(i))
    .map(lambda i: i ** 2)
    .collect()

As you add more levels of nesting and conditions, the readability benefits of iterator adapters go up.


According to your definition, which syscalls have side effects and which do not?
I suppose malloc/mmap is not a side effect, because functional programming needs a lot of memory, until you run out of it.

Don’t compare a Java VM with C++, Rust, or Mojo.

This is a valid point of comparison. Lazy evaluation is used in Rust because it often preserves cache locality much better than eager evaluation, which means it will have higher performance, and you don’t have to figure out where to store an unknown quantity of data in the meantime; you just use some stack space.

One thing to consider here - I don’t know if generators/yield is on the immediate timeline for the mojo compiler team, so if we wanted to go w/ the yield route, iterators would be blocked until then.

Generator expressions don’t require coroutine support and are doable short-term. This should alleviate the need for iterator methods.

In Python, generator expressions and generators are wholly separate concepts. Generators use yield and are usually the return value of a function. Generator expressions act as a form of iterator adapter.
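
A quick Python illustration of that separation (evens and evens_expr are just example names):

# Generator function: needs yield, and hence coroutine-style state
# machinery from the compiler.
def evens(limit):
    for i in range(limit):
        if i % 2 == 0:
            yield i

# Generator expression: a one-shot iterator adapter, no yield in user code.
evens_expr = (i for i in range(10) if i % 2 == 0)

assert list(evens(10)) == list(evens_expr) == [0, 2, 4, 6, 8]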

Methods also have the disadvantage of accepting an iterator rather than an iterable. Duplicating them for these use cases is not desirable.

I don’t think doing .iter() or .into_iter() is a substantial overhead, especially since Mojo needs somewhere to allow the user to specify what kind of operation is being done. Until Mojo can be generic over owned and borrowed values, we’d have to overload all function-based iterator adapters on ownership anyway.


You’re assuming that side effects exist. Iterator adapters, coming from an FP school of thinking, tend to represent pure computations that are mostly going to be a chain of operations that eventually drop stuff into a collection or results in a single value after some reductions.

According to your definition, which syscalls have side effects and which do not?
I suppose malloc/mmap is not a side effect, because functional programming needs a lot of memory, until you run out of it.

var collected = List[Float32](capacity=32)
var input_items = List[Int](capacity=64)

# get the input ready
for i in range(64):
    input_items.append(i)

# actual iterator stuff

input_items.into_iter()
    .filter(lambda i: i % 2 == 0)
    .map(Float32)
    .map(math.sin)
    .collect_into(collected)

I see no extra allocations here. I can reuse collected by passing it into the function that does the computation for as long as I want. It could be built on top of a static piece of memory I allocated in the binary if we had custom allocators and this would still work just fine. The iterator code is exactly the same as:

for i in input_items:
    if i % 2 == 0:
        collected.append(math.sin(Float32(i)))

But it’s easy to add more stuff to, and if input_items were an InlineArray, this could be vectorized without any compiler magic.