Proposal: Allow importing CUDA/HIP stream handles in DeviceContext

bwibking · March 12, 2026, 5:35pm

Summary:
The Mojo standard library currently supports exporting CUDA/HIP stream handles, but does not expose an API to import them. Mojo should provide a minimal public API to import stream handles and launch device kernels on them.

Use case:
Importing stream handles is important for interop with large existing codebases written in CUDA/HIP. Large-scale HPC applications often consist of several nearly independent modules, written in different languages, that must integrate tightly to run efficiently. As a concrete example, AMReX (AMReX-Codes/amrex on GitHub, a software framework for block-structured AMR) is a widely used simulation framework for U.S. Department of Energy and other mission-critical HPC applications. It expects to own device pointers and to provide the stream on which applications launch their kernels; in exchange, it provides domain decomposition/partitioning and MPI communication. To move data between GPUs across the network, it launches buffer-packing kernels on the device and then makes non-blocking GPU-aware MPI calls. For this kind of tight integration to be performant and reliable, AMReX needs to own both the device buffers and the device streams. Mojo today can use device buffers provided by external code, but it can't launch kernels on device streams provided by external code.
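To make the ownership contract concrete, here is a minimal host-only sketch in C++. It is a toy model, not CUDA, HIP, AMReX, or Mojo code: ToyStream, ToyFramework, BorrowedStream, and run_demo are all hypothetical names invented for illustration, and the "stream" is just an in-order queue of callbacks. It shows the two properties the use case depends on, assuming the imported wrapper is non-owning: work enqueued through an imported handle is ordered with the framework's own work on the same stream, and dropping the wrapper does not destroy the framework's stream.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a device stream: an in-order queue of work items.
// Real CUDA/HIP streams are opaque driver objects; this is only a model.
struct ToyStream {
    std::vector<std::function<void()>> pending;

    void synchronize() {
        for (auto& work : pending) work();  // run in submission order
        pending.clear();
    }
};

// Hypothetical framework that owns its stream and hands out an opaque handle.
class ToyFramework {
public:
    void* stream_handle() { return &stream_; }

    // Stands in for the framework's own buffer-packing work.
    void pack_buffers(std::vector<std::string>& log) {
        stream_.pending.push_back([&log] { log.push_back("pack"); });
    }

    void synchronize() { stream_.synchronize(); }

private:
    ToyStream stream_;  // owned by the framework for its whole lifetime
};

// Hypothetical non-owning wrapper around an imported handle.
class BorrowedStream {
public:
    explicit BorrowedStream(void* handle)
        : stream_(static_cast<ToyStream*>(handle)) {
        assert(handle != nullptr);
    }

    void enqueue(std::function<void()> kernel) {
        stream_->pending.push_back(std::move(kernel));
    }
    // No destructor logic: the handle is borrowed, never destroyed here.

private:
    ToyStream* stream_;
};

std::vector<std::string> run_demo() {
    std::vector<std::string> log;
    ToyFramework framework;
    {
        BorrowedStream imported(framework.stream_handle());
        imported.enqueue([&log] { log.push_back("compute"); });
    }  // wrapper goes away; the framework's stream is untouched
    framework.pack_buffers(log);  // queued after the application's work
    framework.synchronize();
    return log;
}
```

Calling run_demo() runs the application's work before the framework's packing step, because both were submitted to the same in-order queue. With a separate, application-owned stream, that ordering would need explicit synchronization.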

Architectural details:
It appears that the machinery needed to implement this is currently in the closed-source libAsyncRTMojoBindings library, so I can’t provide a working proof-of-concept at the moment.

At the API level, I think this could be a simple mirror of the existing stream handle export functions, with a public DeviceContext.import_stream(...) function that accepts a CUDA or HIP stream handle as an argument.

I have proposed a proof-of-concept design with API stubs here: "add proof-of-concept API stubs for DeviceContext.import_stream()" (modular/modular@e1620a2 on GitHub)

npmiller · March 16, 2026, 12:23pm

Hello,

Thank you for the proposal!

We actually already have something similar internally, so I’ve just added some bindings for it instead. It should be available soon (hopefully in tomorrow’s nightly).

It will be a little different: it adds a new create_external_stream method on the DeviceContext that you can pass your handle to, and it gives you a DeviceStream object around that handle.

You can then enqueue your kernels on that DeviceStream. Note that DeviceStream doesn’t have an enqueue_function overload that handles compilation, so the kernels need to be compiled first with ctx.compile_function, but otherwise it should work the same.

Let us know if that works for you!

bwibking · March 16, 2026, 8:39pm

Thanks, that will probably work. I’ll try it out once it lands in the nightly compiler.

bwibking · April 11, 2026, 9:42pm

Finally got to testing this. It works great! Thank you!
