A more powerful `@parameter for`

As seen in https://github.com/modular/modular/issues/5787, some of the behavior around as used for performance reasons is starting to clash with Mojo’s goal of good compile times. I think that this signals that there may be a need for something a little less heavy handed than “unroll this whole loop the entire way”. Additionally, in order to handle dynamic values, it would be useful to have the ability to make an @parameter decorator only perform unrolling to some compile-time known degree, with a drain loop to handle the tail. Finally, @parameter to perform loop unrolling is somewhat confusing to some users, and the explanation about allowing full compile-time unrolling of loops to help satisfy the type system can further confuse users not used to powerful type systems.

I think it’s best to take a stab at tackling these pre-1.0 since it’s going to be very annoying to tell everyone to do a find and replace later on if a name change is desired.

To fix this, I propose a new decorator, @unroll(*, comptime: Bool = True, unroll_limit: Optional[Int] = None). When unroll_limit is None, fully unroll the loop, otherwise, only unroll up to N iterations at a time. Since unrolling a loop of infinite size doesn’t work, this forces it to be compile time, and we can provide the user a message about not being able to unroll an infinite loop since the loop size can’t be determined. If the user provides a limit, and comptime is True, then the number of iterations in the loop must be less than or equal to unroll_limit. This can help guard against oversized loops. When comptime is False and a limit is provided, the macro will produce a loop which does limit iterations, and the equivalent of a switch ladder that can be jumped into which will process the remainder of the iterations.

This “enhanced unroll” will allow for more easy metaprogramming around values that may or may not be known at compile time, enabling kernels to more easily adapt to dynamic values without massive code duplication, and allowing the propagation of MaybeComptime-type values cleanly around a kernel. This is my first stab at the problem, so I’m sure there are better options from a UX standpoint but this at least lays out what capabilities I think are desirable to start the discussion.

cc @christoph_schlumpf @clattner @nate @martinvuyk @CreativeNothing

5 Likes

I like the idea of better control over comptime unrolling. I was searching for prior art and found the following:

[BUG] unrolled loops in stdlib killing compilation (LayoutTensor copy_from()) · Issue #5787 · modular/modular · GitHub shows the following warning:

parameter for unrolling loop more than 65536 times may cause long compilation time and large code size. (use ‘–loop-unrolling-warn-threshold’ to increase the threshold or set to 0 to disable this warning)

Maybe a more pragmatic short term solution would be to have a fixed maximum unroll limit of 65536 in Mojo that can be overriden at build time. So the above warning could be changed:

parameter for unrolling is limited to 65536 times to avoid long compilation time and large code size. (use ‘–loop-unrolling-limit’ to change the limit or set ‘–loop-unrolling-warn’ to ‘false’ to disable this warning)

This might avoid the long compilation time mentioned in the issue and just compile fine.

The advantage of this solution over a powerful @unroll is that the best unroll limit might be more dependent on the machine the code is built and deployed on than on the developer who wrote a specific part of the code. If the unroll_limit is set too high by the developer it might never compile on the target machine. If the unroll_limit is set too conservative by the developer it might unnecessary limit performance.

Concerning run-time unrolling: I don’t understand exactly how this would be done.

Thank you for starting this thread Owen. Before brain dumping some thoughts, let me take stock of the current situation.

What actually is ‘@parameter for`?

The current @parameter for structure. Isn’t a “compile time for loop”, in fact, it isn’t a loop at all! While this construct syntactically looks like a loop, it actually fully unrolls the iteration space given to it. This has a number of benefits (including reducing loop overhead) and a number of drawbacks (spamming out a huge amount of codesize as you point out), but the most material thing it does is that it makes the induction variable a parameter:

@parameter
for i in range(4):
    # Can use i as a parameter.
    ... SIMD[dt, 1 << i] ...
    # parameters materialize to runtime values of course:
    print(i)

This behavior is only achievable if you fully unroll a loop, which is what this construct does. If you don’t fully unroll, the induction variable would have to be a runtime construct.

What should this actually be named?

Given that this isn’t actually a loop at all, calling this thing a “for loop” seems pretty dangerous and misleading, which is one of the reasons this gets abused. IMO, you could go so far as to say that the current name is abhorent! We need to rename this, and I’d propose something explicit and clear, e.g. a strawman would be:

comptime_fully_unroll i in range(4):
    # Can use i as a parameter.
    ... SIMD[dt, 1 << i] ...

Doing a rename like this would make it far more clear what the behavior is, and would reduce a certain amount of abuse where people use it for general “unroll for performance” - which isn’t its intended use-case. People use it for this purpose because it is misleadingly named, but also because we don’t have a collection of better alternatives they could choose from to solve their problem.

How does this fit in with library-based design?

It turns out that Mojo is a language that wants to push things from the language into the standard library. Indeed we used to have a function named unroll (as a peer to vectorize) that did partial unrolling of iteration spaces. Even the existing @parameter for approach is also expressible in the standard library as well, today, e.g:

    fn body[i: Int]():
       # Can use i as a parameter.
       ... SIMD[dt, 1 << i] ...
    comptime_fully_unroll[body](range(4))

The primary downside is syntactic sugar: the existing statement supports destructuring, break statements, and supports putting the body “where it is supposed to go” without the burden of defining a nested function. The current syntax is substantially more ergonomic than a library-based solution. We could remove this from the langauge and go with a library-only solution today, but doing so would be a pretty bad regression for commonly-used functionality.

That said, at the same time, I don’t support adding a ton of bells and whistles to this thing: It is not at all about runtime partial unrolling - such a thing would have to get its own new name and design (eg the induction variables would be runtime).

Further, even though we already hard coded this case into the language, I’m also not excited about expanding the language with other special cases. There are a LOT of loop transformations beyond unrolling that are useful (e.g. vectorize, parallel for loops, unrolled loops, etc etc etc) and putting these all into the language isn’t appealing to me.

A possible future direction

The right direction IMO is to expand the ability to model these capabilities in the library, and aspire for comptime_fully_unroll to become a library feature some day (which is perhaps auto imported). I can imagine a few steps towards this:

  1. Once the closure work settles, we will introduce lambda syntax making things slightly more ergonomic.
  2. Beyond control flow statements, higher order functions are a thing in general, and while we need to support python-style lambda syntax, it is quite onerous for simple things. I think it would make sense to explore Scala-style => syntax sugar at some point.
  3. Similarly, we can explore allowing higher order functions to have their body written as a nested block, but treat such a nested block as an implicit closure argument. Swift did something conceptually similar with it’s “trailing closure” syntax: Trailing closure syntax - a free Hacking with Swift tutorial
  4. One problem with control flow statements as functions is that break/continue/return don’t work the right way. Solving this is work, but not conceptually difficult, it would be a new closure effect.
  5. Getting pattern binding to work will require something more fancy. I think we will eventually want to explore a macro system or some other way to enable this.

This is a long path and not all of it is likely to happen for Mojo 1.0. As such, I’d advocate for this approach:

  1. resyntax the existing @parameter if and @parameter for structures into something that could be subsumed in the future by a macro. This is why I think a new name with clear semantics that looks like a statement(eg comptime_fully_unroll ) is the right way to go.

  2. Hold back on adding more stuff, instead, implementing these things as algorithms in libraries.

  3. Continue to build out the core language features that enable these sorts of things to be defined in libraries and used ergonomicly despite that. Ideally eventually allowing this to revert back to being a standard library feature. Perhaps if we’re good enough, we can also get with and for and maybe even if and try to also become library features. :slight_smile:

In any case, these are all just my personal thoughts, not speaking on behalf of the Mojo team. The Mojo team definitely has consensus that we need to rename @parameter if and @parameter for before Mojo 1.0 though.

-Chris

5 Likes

Thanks a lot for this clarification and insight into future directions.

What is not clear to me:
If using @parameter for just to unroll a loop for performance reasons is an anti pattern, how to do it the right way? Will a normal for loop be unrolled internally by some heuristics if the size is known at compile time?

@parameter for requires that range is known at compile time. And it guarantees that the loop index will be a comptime value. So to me, referencing the ‘comptime’ word in the name would make sense (this produces a comptime value for index).

fully_unroll is ambiguous: I can write this loop: fully_unroll i in range(N, N+16): and then ask “why this does not work, the loop is fully unrollable”.

So I’d rather have some more explicit name for the former @parameter for, and reserve the “unroll” for the future unroll decorator (or even a more general, more powerful decorator that can do partial and complete loop unrolling, loop switching, unroll-and-jam etc, similar to this)

2 Likes

I think Mojo does not have this capacity now.

We definitely should add one. Chris mentioned that we used to have a library function that would take a lambda and a range, and call the lambda N times, thus fully unrolling the loop. It was removed some time ago, and it would be easy to resurrect.

It’d be usable as

fn body(iter: Int):
  out[iter] = in[iter+1]
unroll[body](range(1,10))

Imo, that would not be a perfect solution, but an ok workaround until the proper unroll pragma or decorator is added.

1 Like

Concerning the potential abuse of @parameter for: Would it be possible to add a compile-time warning if the loop variable is not used as a parameter inside the loop and then point to using a resurrected unroll instead?

BTW: How about replacing @paramenter for i in range(x): with for[i in range(x)]:? This would clearly indicate that both i and x must be known at compile-time and therefore be fully unrolled. I would call this a “parametrized for” :wink: .

I agree that library-based would be better, but as you said we can’t expect macros in Mojo for a while, and until we have those library-based solutions are going to require some fairly ugly workarounds. I don’t have any attachment to the syntax so making something that can be replaced with a macro in the future is fine by me.

On closures, my main concern is where we’ll put captures. I can’t think of a good place to put them on scala-style closures.

Until we have macros or closure effects, losing access to loop-based control flow is really not great. I think having break, continue, etc work is a pre-req. The reason I’m pushing on unroll is that it is causing some problems, especially with larger than expected inputs, and results in a lot of people writing unrolls by hand (see also the InlineArray init for the fill constructor). Additionally, I can’t think of any modern hardware that actually likes when you fully unroll a several thousand iteration loop, most have icache-related complaints about that.

I also think that many people, myself included to a degree, want something which is “for loop shaped”, regardless of the equivalence of the output assembly. An unroll as Denis proposes should be reasonably equivalent if a bit annoying to write given the current verbosity of Mojo closures and inline functions. I can go write the variants I discussed as desirable if we want to go that direction, but I can’t test most of the kernel code so I probably shouldn’t be the one to do the refactoring.

1 Like

… using @parameter for just to unroll a loop …

Here’s the thing: I think most of this confusion (in this thread) stems from a slight misuse of natural language. As Chris already pointed out, @parameter for doesn’t unroll a for loop it decorates, and it doesn’t run a loop at compile time either, because there isn’t a loop to begin with. I’m mostly reiterating that, but with a slightly different emphasis.

To even talk about unrolling in the ordinary sense, we need an actual loop construct. @parameter for is not that. It is a different statement form that resembles a for loop, and it happens to produce code that looks exactly like a fully unrolled loop. It stamps out the body many times like a macro, so-called “guaranteed unrolling”.

This also flips the usual story people tell. It is not “the induction variable is known at compile time, so we can unroll, and we do unroll it”. It is “the statement is macro-like stamping, so the variable must be foldable at compile time, and it is folded” (“macro expansion” can only happen at compile time). Note that it is not a real induction variable either. We just name it that way because it sits in the “induction variable position”.

More crucially, the stamping is a “meta level” operation, which is why continue and break work. If @unrollwere just a heuristic, we could semantically make it a no-op and move everything back to runtime. A bad loop in this setting could compile fine and only the generated program diverges; here however compilation itself will necessarily diverge.

I think once this “primitive” view is in place, most downstream implications become fairly direct. Code size and performance implications follow immediately. Questions like whether we should have additional knobs to tune unrolling become questions about adding a different primitive, or adding a wrapper (I think the answer is no, and from what I gather, Chris hold this view as well). For instance, we could very well have a syntax-based wrapper in the future, a real decorator that simply lowers to today’s unrolled_for (I’m not arguing for or against it).

As a thought experiment, we could even have an unrolled_while statement. It would do the same kind of compile time stamping. It is usually senseless to say we want to fully unroll a while loop, and that is precisely the point: in this conversation, it’s not even the ordinary “unrolling as a program transformation” that we are talking about.

@denis I think unrolled_for is a good enough name for this statement. It simply requires the range to be a comptime value, which is an arbitrary restriction of the construct itself, not a restriction on what would be suitable in an ordinary for loop. Mentioning comptime tends to draw attention to a comp time vs run time distinction that is derived, not fundamental.

1 Like

Hey Sora, I agree with you, you reasoning is sound and explanation is loud and clear!
One thing though:

think unrolled_for is a good enough name for this statement.

I am in doubt about the name. The name should hint at the fact that “this is not a simply unrolled loop”. Otherwise we will have people continuously asking

What is the difference between unrolled_for and “@unrollunroll \n for … “

You mentioned in some other thread that names should be “cryptic enough” or “subtle enough” and this is where I think your reasoning would apply as well.

comptime_unroll or

@comptime for or even

comptime for would all send that signal that “this is not a for loop”

I think comptime for is the most reasonable option to replace what @parameter currently does, since it expresses that it is doing the loop at compile time fairly clear.

Ideally, I would like to separate out “unroll for performance reasons” from the version that needs to be fully run at compile time since the two start to clash as loops get larger and you blow past your icache. Right now, I have a lot of code that looks like this:

for i in range(0, align_down(n, UNROLL_FACTOR)):
    @parameter
    for j in range(0, UNROLL_FACTOR):
        ...

@parameter # only sometimes if n is compile-time known
for i in range(align_down(n, UNROLL_FACTOR), n):
   ...

And then UNROLL_FACTOR is a parameter to the function that could be autotuned on (if we ever get that functionality back), or hand-tuned. I think that the general annoyance of writing that construct over and over may be part of why so much code uses @parameter and blindly unrolls the whole loop. Library-based design is nice, but only if it’s also easy to use and I think the lack of use of the old unroll is a sign that it’s not easy enough to use yet.

Great summary Sora. FWIW, I’m +1 Denis’ position that including “comptime” in the name is a good thing to do.