Summary:
The Mojo standard library currently supports exporting CUDA/HIP stream handles, but does not expose an API to import them. Mojo should provide a minimal public API to import stream handles and launch device kernels on them.
Use case:
Importing stream handles is important for interop with large existing codebases written in CUDA/HIP. Large-scale HPC applications often consist of multiple, nearly independent modules that are written in different languages but must integrate tightly with each other to run efficiently.
As a concrete example, AMReX (AMReX-Codes/amrex on GitHub) is a widely used large-scale simulation framework for U.S. Department of Energy and other mission-critical HPC applications. It expects to own device pointers and to provide a stream on which applications should launch their kernels. In exchange, it provides domain decomposition/partitioning and MPI communication to applications. To move data between GPUs across the network, it launches buffer-packing kernels on the device and then makes non-blocking GPU-aware MPI calls. For this kind of tight integration to work in a performant and reliable way, AMReX needs to own both the device buffers and the device streams.
Mojo today can use device buffers provided by external code, but can’t launch kernels on device streams provided by external code.
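To make the ownership split concrete, here is a minimal CUDA sketch of the pattern described above, as it looks when both sides are written in CUDA C++. The names `scale`, `app_launch`, and `run_demo` are hypothetical and chosen for illustration; this is not AMReX's actual interface. `run_demo` stands in for the framework, which owns the buffer and the stream, and `app_launch` stands in for the application module, which only borrows them. The application role is the part that cannot currently be written in Mojo.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Application-side kernel (hypothetical): scales each element in place.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Application-side entry point (hypothetical). It borrows the buffer and the
// stream from the caller and never creates, synchronizes, or destroys them.
cudaError_t app_launch(float* d_data, int n, float factor, cudaStream_t stream) {
    const int block = 256;
    const int grid = (n + block - 1) / block;
    scale<<<grid, block, 0, stream>>>(d_data, n, factor);
    return cudaGetLastError();  // reports launch-configuration errors only
}

// Framework-side driver (hypothetical stand-in for a library that owns
// device memory and streams). Returns true if the round trip succeeded.
bool run_demo() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);

    cudaStream_t stream = nullptr;
    float* d_data = nullptr;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return false;
    }

    // All work is ordered on the framework's stream, including the
    // application's kernel, so no extra synchronization is needed between steps.
    bool ok = cudaMemcpyAsync(d_data, host.data(), bytes,
                              cudaMemcpyHostToDevice, stream) == cudaSuccess;
    ok = ok && app_launch(d_data, n, 2.0f, stream) == cudaSuccess;
    ok = ok && cudaMemcpyAsync(host.data(), d_data, bytes,
                               cudaMemcpyDeviceToHost, stream) == cudaSuccess;
    ok = ok && cudaStreamSynchronize(stream) == cudaSuccess;

    cudaFree(d_data);
    cudaStreamDestroy(stream);

    // Every element started at 1.0f and was multiplied by 2.0f.
    for (float v : host) ok = ok && (v == 2.0f);
    return ok;
}
```

Calling `run_demo()` from a `main` function exercises the whole flow. The point of the sketch is that `app_launch` receives the stream as an opaque handle, which is the same thing a Mojo module would need to accept in order to take over the application role.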
Architectural details:
It appears that the machinery needed to implement this is currently in the closed-source libAsyncRTMojoBindings library, so I can’t provide a working proof-of-concept at the moment.
At the API level, I think this could be a simple mirror of the existing stream handle export functions, with a public DeviceContext.import_stream(...) function that accepts a CUDA or HIP stream handle as an argument.
I have proposed a proof-of-concept design with API stubs in commit modular/modular@e1620a2 on GitHub ("add proof-of-concept API stubs for DeviceContext.import_stream()").