Reposted from the Discord.
It seems like there isn’t any room in that design to support arena-allocating the frames, nor any place to handle a coroutine frame allocation failing.
This is somewhat concerning to me because while being able to move to stack allocations is nice, being able to grab a right-sized allocation from an arena allocator is nicer, especially in the context of ensuring you have enough memory for the coroutine. For frequently allocated coroutines (consider the handle_request top-level function of an HTTP server), this means that instead of going through all of the machinery in tcmalloc, you may be performing a dequeue operation on a ring buffer of free frames, which is substantially faster.
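To sketch what I mean by a ring buffer of free frames (in Rust rather than Mojo, and all names hypothetical, not any real API):

```rust
// Hypothetical sketch: a fixed-capacity pool of right-sized coroutine
// frame slots, with a ring buffer of free slot indices. Allocation is a
// dequeue and deallocation an enqueue -- no general-purpose malloc on
// the hot path, and exhaustion is observable rather than fatal.
use std::collections::VecDeque;

struct FramePool {
    frames: Vec<Box<[u8]>>, // pre-allocated, fixed-size frame storage
    free: VecDeque<usize>,  // ring buffer of free slot indices
}

impl FramePool {
    fn new(slots: usize, frame_size: usize) -> Self {
        FramePool {
            frames: (0..slots)
                .map(|_| vec![0u8; frame_size].into_boxed_slice())
                .collect(),
            free: (0..slots).collect(),
        }
    }

    /// Grab a free frame slot; `None` means the pool is exhausted,
    /// which the caller can handle instead of crashing.
    fn alloc(&mut self) -> Option<usize> {
        self.free.pop_front()
    }

    /// Return a slot to the pool.
    fn dealloc(&mut self, slot: usize) {
        self.free.push_back(slot);
    }
}
```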
Would it be possible to have the coroutine take an alloc: Allocator[CoroutineFrameType] = DefaultMojoAllocator parameter in some way, or otherwise inject an allocator into the coroutine? I’m still thinking over how I would want custom allocators to behave, but I know that this is a feature I and others will want.
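Roughly the API shape I’m imagining, sketched in Rust (everything here is hypothetical, just to illustrate the parameter-with-a-default idea):

```rust
// Hypothetical sketch: the frame allocator is an ordinary parameter
// with a default implementation, so callers can inject an arena or
// pool allocator without changing the coroutine itself.
trait FrameAllocator {
    /// Allocate storage for one coroutine frame; `None` on failure.
    fn alloc_frame(&mut self, size: usize) -> Option<Vec<u8>>;
}

/// Stand-in for a default allocator that falls back to the global heap,
/// analogous to the DefaultMojoAllocator in my suggested signature.
struct DefaultAllocator;

impl FrameAllocator for DefaultAllocator {
    fn alloc_frame(&mut self, size: usize) -> Option<Vec<u8>> {
        Some(vec![0u8; size])
    }
}

/// A coroutine-launching function that takes the allocator as a
/// parameter, analogous to
/// `alloc: Allocator[CoroutineFrameType] = DefaultMojoAllocator`.
fn launch_with<A: FrameAllocator>(alloc: &mut A, frame_size: usize) -> Option<Vec<u8>> {
    alloc.alloc_frame(frame_size)
}
```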
For my specialty, databases, not being able to handle allocation failures means you can’t use the feature in production code, because it could lead to unnecessary crashes. The database is likely the largest memory consumer on any system it runs on, and it typically has a lot of caching, so it can actually do something about an allocation failure.
One other question: the frame didn’t look like the tagged union I would expect. Is it represented that way at the level we would see in MLIR reflection?
What is the roadmap for the async implementation?
Your last slide touched on function coloring. How does Modular plan to address that problem, if at all?
Have you had time to look at Rust’s implementation of coroutines? Can we get a comparison b/w Mojo’s coroutine implementation and Rust’s?
Are there docs or examples of how one could use the current async implementation in community libraries? Or is it too early for that?
Hi @owenhilyard, thanks for the question!
There is room to support custom allocation techniques. The coroutine lowering does depend on an allocator, but we have the power to specify which allocator we use; that specification is not yet exposed in the language. The plan was to migrate from using a malloc call to a bump-pointer allocator. The allocator would be created when invoking a coroutine from a synchronous context and passed down. I don’t see why we couldn’t expose this in Mojo to allow for user customization. This would only apply to memory coroutines and is separate from memory promotion, which can be skipped using a flag.
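A rough sketch of the bump-pointer idea described above (in Rust, purely illustrative, not the actual lowering): an arena is created at the sync/async boundary, and each child frame is carved out by advancing an offset.

```rust
// Hypothetical sketch: a bump-pointer arena created when entering the
// async region and passed down. Each child coroutine frame is carved
// out by advancing an offset; freeing is a no-op until the whole arena
// is dropped. `align` must be a power of two.
struct BumpArena {
    buf: Vec<u8>,
    offset: usize,
}

impl BumpArena {
    fn new(capacity: usize) -> Self {
        BumpArena { buf: vec![0u8; capacity], offset: 0 }
    }

    /// Carve `size` bytes (aligned to `align`) out of the arena;
    /// `None` on exhaustion, so the caller can handle the failure.
    fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        let start = (self.offset + align - 1) & !(align - 1); // round up
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None; // arena exhausted
        }
        self.offset = end;
        Some(&mut self.buf[start..end])
    }
}
```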
Re: handling the allocation of a coroutine frame failing, Mojo async is under early development and that case is not currently handled. However, we do have a place to handle it: async functions can be throwing functions, and coroutines already have error slots, so a failure to create a child coroutine can surface in the resume and propagate up the chain via the error slot.
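To make the error-slot idea concrete, here is a toy sketch in Rust (names and shapes are illustrative only, not Mojo’s representation): a resume either completes or records an error that the parent can propagate.

```rust
// Hypothetical sketch: a resume result carrying an "error slot", so a
// failed child-frame allocation surfaces as a value the parent can
// propagate up the chain rather than aborting the process.
enum Resume {
    Done(i32),
    Failed(String), // stand-in for the coroutine's error slot
}

/// Toy resume step: if allocating the child coroutine frame failed,
/// fill the error slot instead of crashing.
fn resume_child(frame_alloc_failed: bool) -> Resume {
    if frame_alloc_failed {
        Resume::Failed("coroutine frame allocation failed".to_string())
    } else {
        Resume::Done(7)
    }
}
```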
Hey @ivellapillil! Mojo Async is under early development and paused while we address some higher priorities, but I expect that sometime early to mid next year we will have a production-quality version to release.
Hi @taylorpool! Thanks for asking. At this time, we don’t have plans to address function coloring.
Hi @Brian-M-J! I have not yet examined Rust’s implementation, but I would like to, and upon release we can include a side-by-side comparison.
Hi @a2svior, great question. We don’t have docs or examples yet, but we’re hoping to do this in the new year.
A few more questions since it’s been a bit and I’ve spent some time talking about async with @Nick.
How is the async scheduler designed right now? While work stealing is great for some workloads, it causes a lot of headaches for others, either through cache misses or by requiring thread-safety bounds (like Send + Sync + 'static in Rust), due to the inability to determine whether a task will end up on another thread. This causes issues for things like io_uring, which is designed to either have one thread do all of the IO, have an “IO lock”, or create an io ring per core. Thread per core isn’t as great at work sharing, but it tends to make these issues go away and align better with shared-nothing designs.
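For anyone unfamiliar with the Rust bounds I mentioned, a minimal sketch (the function names are made up, not any real executor’s API): a work-stealing executor may run a task on any thread, so it must demand the same Send + 'static bound that std::thread::spawn does, while a thread-per-core design can accept non-Send state.

```rust
use std::thread;

// A work-stealing executor may move the task to another thread, so it
// must require Send + 'static -- the same bound thread::spawn imposes.
// (Illustrative name; not a real executor API.)
fn spawn_work_stealing<F>(task: F) -> i32
where
    F: FnOnce() -> i32 + Send + 'static, // the bound work stealing forces
{
    thread::spawn(task).join().unwrap()
}

// A thread-per-core design keeps the task on the calling thread, so
// non-Send state (an Rc, or a per-core io_uring handle) is allowed:
// no Send or 'static bound needed.
fn run_thread_local<F: FnOnce() -> i32>(task: F) -> i32 {
    task()
}
```

Handing `run_thread_local` a closure that captures an `Rc` compiles fine; handing the same closure to `spawn_work_stealing` would be rejected at compile time because `Rc` is not `Send`.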
Has any attention been paid to being generic over “asyncness”? Rust ran into issues with this, which essentially forced everyone who talks to a database or does things with HTTP into using async.
Now that’s interesting. Is this side-by-side comparison only going to be between the stackless coroutines of C++, Rust, and Mojo, or are you considering other concurrency models as well (like Go’s and Hylo’s stackful coroutines, futures/promises, C++26’s senders and receivers, etc.)?