Proposal: Allow importing CUDA/HIP stream handles in DeviceContext

Summary:
The Mojo standard library currently supports exporting CUDA/HIP stream handles, but does not expose an API to import them. Mojo should provide a minimal public API to import stream handles and launch device kernels on them.

Use case:
Importing stream handles is important for interop with large existing codebases written in CUDA/HIP. Large-scale HPC applications often consist of multiple, nearly independent modules that are written in different languages but need to integrate tightly with each other for performant execution.

As a concrete example, AMReX (AMReX-Codes/amrex on GitHub, "AMReX: Software Framework for Block Structured AMR") is a widely used large-scale simulation framework for U.S. Department of Energy and other mission-critical HPC applications. It expects to own device pointers and to provide the stream on which applications launch their kernels. In exchange, it provides domain decomposition/partitioning and MPI communication to applications. To move data between GPUs across the network, it launches buffer-packing kernels on the device and then makes non-blocking GPU-aware MPI calls. For this kind of tight integration to work in a performant and reliable way, AMReX needs to own the device buffers and device streams.

Mojo today can use device buffers provided by external code, but it can't launch kernels on device streams provided by external code.

Architectural details:
It appears that the machinery needed to implement this is currently in the closed-source libAsyncRTMojoBindings library, so I can’t provide a working proof-of-concept at the moment.

At the API level, I think this could be a simple mirror of the existing stream handle export functions, with a public DeviceContext.import_stream(...) function that accepts a CUDA or HIP stream handle as an argument.

I have proposed a proof-of-concept design with API stubs in commit modular/modular@e1620a2 on GitHub ("add proof-of-concept API stubs for DeviceContext.import_stream()").

Hello,

Thank you for the proposal!

We actually already have something similar internally, so I've just added some bindings for it instead. It should be available soon (hopefully in tomorrow's nightly).

It will be a little different from your proposal: it adds a new create_external_stream method on DeviceContext that you can pass your handle to, and it gives you back a DeviceStream object wrapping that handle.

You can then enqueue your kernels on that DeviceStream. Note that DeviceStream doesn't have an enqueue_function overload that handles compilation, so the kernels need to be compiled first with ctx.compile_function. Otherwise it should work the same.
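To make the intended flow concrete, here is a toy model in plain Python of the contract described above. It is not Mojo and not the real API: HostFramework, BorrowedStream, CompiledKernel, and compile_kernel are hypothetical names invented for illustration. It also assumes the wrapper is non-owning, meaning the external code keeps the right to create and destroy the stream. It only shows the ordering and ownership rules: wrap an externally owned handle, compile first, then enqueue onto the shared in-order queue.

```python
from typing import Callable, Dict, List


class HostFramework:
    """Hypothetical stand-in for external code (AMReX-style) that owns streams."""

    def __init__(self) -> None:
        self._queues: Dict[int, List[Callable[[], None]]] = {}
        self._next_handle = 1

    def create_stream(self) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self._queues[handle] = []
        return handle

    def is_alive(self, handle: int) -> bool:
        return handle in self._queues

    def submit(self, handle: int, work: Callable[[], None]) -> None:
        # Work is queued, not run: streams are asynchronous and in-order.
        self._queues[handle].append(work)

    def synchronize(self, handle: int) -> None:
        queue = self._queues[handle]
        while queue:
            queue.pop(0)()

    def destroy_stream(self, handle: int) -> None:
        del self._queues[handle]


class CompiledKernel:
    """Hypothetical result of an explicit compile step."""

    def __init__(self, fn: Callable[..., None]) -> None:
        self._fn = fn

    def __call__(self, *args) -> None:
        self._fn(*args)


def compile_kernel(fn: Callable[..., None]) -> CompiledKernel:
    return CompiledKernel(fn)


class BorrowedStream:
    """Non-owning wrapper (an assumption): it never destroys the handle."""

    def __init__(self, framework: HostFramework, handle: int) -> None:
        if not framework.is_alive(handle):
            raise ValueError("handle does not refer to a live stream")
        self._framework = framework
        self.handle = handle

    def enqueue(self, kernel: CompiledKernel, *args) -> None:
        # Mirrors the rule in the reply: no compile-on-enqueue overload.
        if not isinstance(kernel, CompiledKernel):
            raise TypeError("compile the kernel before enqueueing it")
        self._framework.submit(self.handle, lambda: kernel(*args))


def scale(buffer: List[float], factor: float, log: List[str]) -> None:
    for i in range(len(buffer)):
        buffer[i] *= factor
    log.append("module: scale")


def run_demo():
    framework = HostFramework()
    handle = framework.create_stream()  # the external code owns the stream
    log: List[str] = []
    data = [1.0, 2.0, 3.0]

    framework.submit(handle, lambda: log.append("framework: pack"))
    stream = BorrowedStream(framework, handle)  # import the handle
    stream.enqueue(compile_kernel(scale), data, 2.0, log)
    framework.submit(handle, lambda: log.append("framework: send"))

    before_sync = list(data)  # nothing has executed yet
    framework.synchronize(handle)
    del stream  # dropping the wrapper must not destroy the stream
    still_alive = framework.is_alive(handle)
    framework.destroy_stream(handle)  # only the owner tears it down
    return data, log, before_sync, still_alive


if __name__ == "__main__":
    print(run_demo())
```

In this model the module's kernel runs between the framework's own pack and send steps because all three share one in-order queue. That ordering is the property the AMReX use case depends on.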

Let us know if that works for you!

Thanks, that will probably work. I’ll try it out once it lands in the nightly compiler.

Finally got to testing this. It works great! Thank you!