Ask Joe anything about the Mojo standard library! 🔥

Today’s the day to share any questions you have about the Mojo standard library! @joe will be answering questions throughout the day in this thread :rocket:

5 Likes

When will we get early network code?

I think one good question for community discussion is: how do we organize the standard library? This question arises from a simple problem: most NPUs and other fixed-function accelerators can’t spawn threads, do disk I/O, or print. This means we need to decide which parts of the standard library are eligible to be used in code intended to run on them. Because capability levels differ so widely, ranging from NPUs all the way up to a DPU (a fancy network card) that runs a different version of Linux than the host and has its own CPU cores and an onboard GPU/FPGA, I think we may want to express things in terms of device capabilities.

First, core. This is taken from Rust and used in much the same way: anything in core can be expected to exist on all targets. It will mostly consist of primitive type definitions (like SIMD), primitive traits (like Iterator), and pure functions (functions which are mathematical transformations, or which can otherwise be expressed with a sufficient number of NAND gates in a single clock cycle). This level is what we consider the “base level”, and I intend it to target anything that is a Turing machine. Anything which can’t meet this capability level is a fixed-function accelerator and needs to be dealt with by exposing driver/device capabilities individually (I have thoughts on how to do this as well). If something doesn’t meet the requirements to run code from core, it is wholly owned by MAX as a fixed-function driver.
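To make that concrete, here’s a minimal sketch of the sort of function that would qualify for core. The function name is invented for illustration; SIMD and its reduce_max method are existing Mojo:

```mojo
# A pure function over a primitive type: no threads, no allocation, no I/O,
# so any target that is a Turing machine could run it.
fn horizontal_max(v: SIMD[DType.float32, 8]) -> Float32:
    # reduce_max is a pure reduction across SIMD lanes, with no side effects.
    return v.reduce_max()
```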

Next is the “general purpose compute” level. There will be devices that are core plus a few extra bits, but the intent of this level is to cover devices which could reasonably support OpenCL, meaning recursion, heap allocation, etc. We can defer to the Khronos Group on the capabilities inside this level. I think this will help device manufacturers figure out where their device belongs in Mojo, since anything that can use OpenCL is at least here, and this is the base level for “useful to most programmers”. Devices at this level may be able to perform I/O by having a CPU core take messages from them and perform the I/O on their behalf. It is also fairly reasonable to expect devices at this level to have DMA capabilities.
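As a rough sketch of how code might target this level, consider compile-time capability gating. The @parameter if construct is real Mojo, but the has_heap flag is invented here to stand in for a capability query that doesn’t exist yet:

```mojo
from collections import List

# Hypothetical: `has_heap` stands in for a real compile-time capability query.
fn collect_or_reduce[has_heap: Bool](data: SIMD[DType.uint8, 16]) -> Int:
    @parameter
    if has_heap:
        # “General purpose compute” path: heap allocation is allowed.
        var items = List[Int]()
        for i in range(16):
            items.append(Int(data[i]))
        return len(items)
    else:
        # core-only fallback: a pure, allocation-free reduction.
        return Int(data.reduce_add())
```

The branch not taken is eliminated at compile time, so the core-only version never references the allocator.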

After this is the “CPU peer” level, where fully featured GPUs like an H100, FPGAs, or devices like Xeon Phi live. At this level we start needing to fall back to enumerated capabilities, since things become too varied. The general unifying factor is that this kind of device only needs a CPU to kickstart the process; it can then go do its own thing after that. Targeting devices at this level means you are willing to deal with capability discovery, but you are likely looking to maximize the compute you get out of a few particular devices.
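Capability discovery at this level might look something like the following. Every name here is hypothetical; it only illustrates the shape of the idea:

```mojo
# Purely hypothetical capability record: nothing here exists in the stdlib
# or in MAX today.
@value
struct DeviceCaps:
    var has_tensor_cores: Bool
    var max_dma_bytes: Int

fn pick_kernel(caps: DeviceCaps):
    # Dispatch on enumerated capabilities rather than on a device model name.
    if caps.has_tensor_cores and caps.max_dma_bytes >= 4096:
        print("dispatching tensor-core kernel with bulk DMA")
    else:
        print("dispatching generic SIMD kernel")
```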

Next, the “OS Host” level, which consists primarily of devices capable of running their own operating system. This is mostly going to be the host CPU, plus devices with their own onboard CPU cores like FPGAs and DPUs. It is made up of devices that you could SSH/Telnet into, and it primarily exists because MAX needs to be aware of custom linkers, libcs, etc. if it uses the CPU cores on those devices for compute.

Finally, a “hardware design” level. This is more or less exclusive to simulators and FPGAs, and means you are using Mojo as a hardware design language. At this level you have far more control over what is happening than you would on a CPU: if you want a vector register that’s 3 bytes wide, you can do that; if you want to bake an entire model into the FPGA fabric, go ahead. We can build a bridge from this level that tells a MAX driver how to initialize the device and what capabilities the resulting device exposes, meaning this is an area used in the specification of MAX drivers.

These are general “levels”, since I think specific capabilities are better in the end, but we don’t want people to have to think about whether the device their kernel is running on supports multi-byte DMA, or whether it has an SRAM or DRAM cache.

3 Likes

I’ll take this one since I volunteered to implement it.

At a minimum, we need either trait objects, unions, or sum types first, and ideally parametric traits and parametric aliases as well. Right now, doing it in a way that would work for people not running recent (as in not technically supported by Mojo right now) versions of Linux would involve forcing you to be generic over quite a few parameters in every function that does networking. I want to avoid a situation where the stdlib gets stuck on some older network API that doesn’t work for the state of the art.
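To illustrate the parameter explosion: with traits but no trait objects, every function that touches the network must be generic over each pluggable piece. The traits below are invented for the example; nothing like them exists in the stdlib yet:

```mojo
# Invented traits standing in for pluggable networking pieces.
trait Socket:
    fn send(self, data: String) raises:
        ...

trait Resolver:
    fn resolve(self, host: String) raises -> String:
        ...

# Callers must thread S and R (and, realistically, several more parameters)
# through every call site.
fn fetch[S: Socket, R: Resolver](sock: S, dns: R, host: String) raises:
    var addr = dns.resolve(host)
    print("connecting to", addr)
    sock.send("GET / HTTP/1.1\r\nHost: " + host + "\r\n\r\n")
```

With trait objects, that could collapse to a single non-generic signature that any conforming implementation can satisfy at runtime.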

Async also needs to be hammered out in more detail, and more information about the async executor needs to be made public, since networking cares quite a bit about async.

We also have the question of how much networking we should try to support inside of MAX kernels. AMD and Nvidia GPUs can technically do Ethernet networking if they bring their own network stack and the right kind of network card is installed in the server. That raises the question of whether to just implement a full network stack for the important protocols (like TCP/IP) and use kernel bypass to sprint past everyone, even on CPUs.

1 Like

I’d expect some early foundations of networking code (the basic building blocks) to come in 2025, perhaps even from the community. There’s nothing immediate internally prioritizing us to lean into this area right now, which is why I suspect the community will “beat us to it” :slight_smile:

If you’d like to get involved with this effort, I believe @owenhilyard has explored this area a bit already and certainly has some thoughts too. From my perspective, in order to do this well, we would benefit from some core language features such as trait objects, along with parametric traits and aliases, for example.

5 Likes

Yes, parametric traits can help a lot for iterators too :+1:
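For anyone curious what that might look like, here is a sketch in invented syntax; Mojo traits cannot take parameters today, so none of this compiles:

```mojo
# Hypothetical future syntax for a parametric trait: not valid Mojo today.
trait Iterator[T: Copyable]:
    fn __has_next__(self) -> Bool:
        ...
    fn __next__(mut self) -> T:
        ...

# A signature could then require “any iterator of Int” directly:
fn sum_all[I: Iterator[Int]](mut it: I) -> Int:
    var total = 0
    while it.__has_next__():
        total += it.__next__()
    return total
```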

1 Like

Hey @joe, will Mojo be ported out to different orgs, like say PyPI or Conda, etc.? When do you see Mojo being implemented in academia and industrial-grade workflows?

Do you mind elaborating on the first part of your question? Note that Mojo ships within the max package, which integrates nicely with the Conda side of things now. We’ve talked about what it would look like within PyPI, but we have no concrete plans we’re ready to share at this time. There are additional complications of that vs. Conda when it comes to packaging (both for Mojo and MAX).

As for the second part of the question, do you mean it from a “when will Mojo have a lot of the features required to be successful in those domains?” perspective, or more from a “when will it be a good time to teach Mojo in an academic setting?” perspective?

For example, Python is globally distributed and maintained. C is as well. A lot of programming languages are used in academic settings and in industrial-grade settings: gaming, embedded devices, streaming, and the list goes on and on.

This is a great question. As you know, we’re primarily focused on GPUs right now (both NVIDIA and AMD), which is helping define some of these abstractions around device capabilities and what makes sense/is valid even on GPUs. Some things in the stdlib only make sense in the context of CPUs.

As we discover these things from building out internally, it will inform a bit of shuffling around of things in the stdlib. Abstracting out the “core” fundamentals you expect to work anywhere/everywhere makes sense. (Aside: I’m so glad we have a prelude module now, so we can at least re-export a fair bit of the core module goodies.)
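As a tiny example of the prelude at work, everyday code needs no imports for the basics:

```mojo
# No imports needed: Int, range, and print all arrive via the prelude.
fn main():
    var total: Int = 0
    for i in range(5):
        total += i
    print("sum:", total)
```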

Some of the pieces continue to change as Mojo and MAX have grown up, particularly when it comes to answering the question of “does something belong in the stdlib or in MAX?”; in other words, those lines get blurred even more now. We look forward to iterating on answering your question together as we build out GPU (and other accelerator) support :rocket:

4 Likes