DevicePassable closures: a requirement for HPC programmers

I’ve created a feature request here for making closures DevicePassable: [Feature Request] Make closures DevicePassable · Issue #6507 · modular/modular · GitHub.

I think this feature is going to be a requirement for HPC developers to take Mojo seriously. The vast majority of large-scale HPC codes have moved to using CUDA/HIP extended lambda-based programming models, since they enable compile-time portability (and debuggability) between CPU and GPU execution.

As an example, the parallel (8,000+ GPU production runs), large-scale (60k SLOC CUDA/HIP C++) physics code that I work on has 89 unique numerical kernels. Without a programming model that enables this kind of abstraction, it would not be possible to create or maintain it. This is a typical situation for comparable HPC codes.

Do you need value captures? Those are difficult to do since you need any backing data for the things you capture to be visible to the GPU.

For example:

from std.gpu import thread_idx, block_idx, block_dim
from std.gpu.host import DeviceContext, DeviceBuffer
from std.memory import UnsafePointer
from std.sys.info import is_gpu


def do_elementwise[dtype: DType, func: def(Scalar[dtype]) -> Scalar[dtype], on_gpu: Bool](ctx: DeviceContext, buffer: DeviceBuffer[dtype], size: Int) raises:
    """
    Invariant: `buffer` must have at least `size` elements.
    """

    comptime if on_gpu:
        def elementwise_kernel(ptr: UnsafePointer[Scalar[dtype], MutExternalOrigin], n: Int):
            idx = thread_idx.x + block_idx.x * block_dim.x
            if idx < n:
                ptr[idx] = func(ptr[idx])

        # Launch the kernel with appropriate grid and block dimensions
        block_size = 256
        grid_size = (size + block_size - 1) // block_size

        var kernel = ctx.compile_function[elementwise_kernel, elementwise_kernel]()
        
        ctx.enqueue_function(
            kernel,
            buffer,
            size,
            grid_dim=grid_size,
            block_dim=block_size,
        )
    else:
        with buffer.map_to_host() as mapped_buffer:
            var ptr = mapped_buffer.unsafe_ptr()
            for i in range(size):
                ptr[i] = func(ptr[i])

def square_and_add_constant[dtype: DType, constant: Scalar[dtype]](x: Scalar[dtype]) -> Scalar[dtype]:
    return x.fma(x, constant)

def main() raises:
    ctx = DeviceContext()

    var num_elements: Int = 16

    # Create a buffer in host (CPU) memory to store our input data
    host_buffer = ctx.enqueue_create_host_buffer[DType.float32](
        num_elements
    )

    ctx.synchronize()

    for i in range(num_elements):
        host_buffer[i] = Float32(i)

    device_buffer = ctx.enqueue_create_buffer[DType.float32](num_elements)

    ctx.enqueue_copy(src_buf=host_buffer, dst_buf=device_buffer)

    def double[dtype: DType](x: Scalar[dtype]) -> Scalar[dtype]:
        return x + x
    
    do_elementwise[DType.float32, double[DType.float32], True](ctx, device_buffer, num_elements)

    ctx.enqueue_copy(src_buf=device_buffer, dst_buf=host_buffer)

    ctx.synchronize()

    for i in range(num_elements):
        print(host_buffer[i])

    do_elementwise[DType.float32, square_and_add_constant[DType.float32, 3.14159], True](ctx, device_buffer, num_elements)

    ctx.enqueue_copy(src_buf=device_buffer, dst_buf=host_buffer)

    ctx.synchronize()

    for i in range(num_elements):
        print(host_buffer[i])

    def zero[dtype: DType](x: Scalar[dtype]) -> Scalar[dtype]:
        return Scalar[dtype](0)

    do_elementwise[DType.float32, zero[DType.float32], False](ctx, device_buffer, num_elements)

    ctx.enqueue_copy(src_buf=device_buffer, dst_buf=host_buffer)

    ctx.synchronize()
    
    for i in range(num_elements):
        print(host_buffer[i])

This code shows something lambda-like that doesn’t use value captures, and it compiles and works today. Mojo doesn’t have a lambda keyword since we’re trying to to figure out how to include some of the powers C++ lambdas give you without making the syntax messy for people who want simple “pure function” lambdas. Here, I do portability by having an on_gpu selector that holds kernel launches, but std.sys.info.is_gpu can be used in a similar way in the middle of kernels. Mojo’s type system lets you pass around functions at compile time, similar to how you might pass around a function with C++ templates. The combination of these two capabilities has given me what I’ve need for portability between GPU and CPU. If you have specific examples of things you want to do, I can try to help you out with translating them to Mojo.

If you want to have captures, that gets a bit tricky because we need to make sure that everything is allocated in memory the GPU has access to, which rapidly turns into a big pile of headaches. This is why closures aren’t device passable, because we don’t really have a way to guarantee that the List[Float32] that you captured is somewhere the GPU can see. If you have examples of places where you want to do this, I can probably provide a better direction for how to do what you want to do.

The code I want to enable all follows this basic pattern:

        # **********************************
        # MAIN TIME EVOLUTION LOOP
        # **********************************

        for step in range(1, nsteps + 1):
            phi_old.fill_boundary(geometry)

            var update_mfi = phi_old.mfiter()
            while update_mfi.is_valid():
                var bx = update_mfi.validbox()
                var phi_old_array = phi_old.array(update_mfi)
                var phi_new_array = phi_new.array(update_mfi)
                var tile_dx = dx.copy()

                def advance_cell(
                    i: Int, j: Int, k: Int
                ) raises {var phi_new_array^, var phi_old_array^, var tile_dx^, var dt,}:
                    phi_new_array[i, j, k] = phi_old_array[i, j, k] + dt * (
                        (phi_old_array[i + 1, j, k] - 2.0 * phi_old_array[i, j, k] + phi_old_array[i - 1, j, k])
                        / (tile_dx.x * tile_dx.x)
                        + (phi_old_array[i, j + 1, k] - 2.0 * phi_old_array[i, j, k] + phi_old_array[i, j - 1, k])
                        / (tile_dx.y * tile_dx.y)
                        + (phi_old_array[i, j, k + 1] - 2.0 * phi_old_array[i, j, k] + phi_old_array[i, j, k - 1])
                        / (tile_dx.z * tile_dx.z)
                    )

                ParallelFor(advance_cell, bx)
                update_mfi.next()

            time = time + dt
            phi_old.copy_from(phi_new, 0, 0, 1)

In this example, advance_cell requires a value capture for dt. That is a common pattern, and yes, in my code (and many others), value captures are absolutely necessary. My understanding of the way this analogous functionality works with CUDA is that dt and other value captures (for CUDA extended lambdas) are automatically translated into device function arguments by the compiler and copied to global memory by the runtime.

In Kokkos (and other CUDA C++ abstractions), switching between CPU/GPU is hidden inside ParallelFor. There is a CPU version of ParallelFor and GPU version of ParallelFor, and which one is used is selected based on ifdef macros that are set by the build system. The comptime if selector would be perfectly fine in Mojo for my purposes.

The issue is really that codes are written such that value captures are really necessary, since control flow is done in host code and sets variables accordingly that need to be copied implicitly for device kernels to use.

The other reason why the above pattern won’t work is that almost every kernel has a different set of work arrays it uses. For instance, if they were rewritten as free functions (that is, without capturing anything), all 89 kernels in my code would not conform to a single function signature (there are not quite 89 unique function signatures, but there are at least 20-30 unique function signatures). It is not feasible to have separate iterators/kernel launch helpers for every possible set of input arguments that all of the kernels need. So, as far as I understand how Mojo works, DevicePassable closures are the only way to handle this.

Precisely this kind of value capture is possible today in Rust using cuda-oxide to compile Rust code directly to nvptx kernels: GitHub - NVlabs/cuda-oxide: cuda-oxide is an experimental Rust-to-CUDA compiler that lets you write (SIMT) GPU kernels in safe(ish), idiomatic Rust. It compiles standard Rust code directly to PTX — no DSLs, no foreign language bindings, just Rust. · GitHub.

If anyone in Modular leadership is reading this: as you know, Mojo has the chance to make a lasting impression with the 1.0 final release. I cannot emphasize enough the fact that the HPC community will look past it without further thought if closures with value captures are not possible. I don’t know if Modular wants the HPC community to consider Mojo seriously, but if, as a company, you do, I cannot emphasize the fact that developers will expect extended-lambda-style value captures to work out of the box, in the same way shown in NVIDIA’s cuda-oxide README example.

For instance, if they were rewritten as free functions (that is, without capturing anything), all 89 kernels in my code would not conform to a single function signature

This problem can be fixed with a proper function trait family, and definitely should be.

I think the best solution to your problem right now might be to run the whole thing on the GPU, which I know can be annoying.

Let me go poke the feature request you added, because I don’t think this should be difficult.

I think it is kind of wrong to say that the CUDA/Kokkos way has to be the only way or optimum way to approach it for HPC.

I can provide an example of a highly performant and portable runtime and programming model for HPC that disallows both DevicePassable closures, and extended-lambda style captures: AdaptiveCpp. It is explicitly built around C++ 17, and to be compatible with C++ standard parallelism without needing language extensions like CUDA/Kokkos do.

In CUDA, the compiler does extra work to make sure a host-defined lambda’s metadata is available on the device. In AdaptiveCpp, because it’s standard C++17, the closure must be a value-capture [=] lambda. If the lambda captures anything that isn’t trivially copyable (like a std::string or a complex host pointer), it will fail; not because of a device-passable restriction, but because it violates the “trivially copyable” requirement for device data movement.
AdaptiveCpp’s SSCP (single-source, single compiler pass) parses the code exactly once, produces a backend-independent LLVM IR representation, and then lowers to PTX, SPIR-V, or amdgcn at JIT-compile time rather than at source-compile time. This is very different from CUDA’s multipass approach.
In explicit multipass flows, kernel images are embedded in the host binary and extracted at runtime, with the kernel invoked using runtime functions rather than language extensions. The key design goal of the generic SSCP path is that you never need _device_ annotations. The compiler infers what is device code from the call graph rooted at parallel_for or single_task.

But closures are still allowed — they’re just constrained. It resolves the “what can the GPU see?” problem through two interlocking mechanisms:

  1. Mandatory capture-by-value. The kernel lambda must use copy for all of its captures (i.e. [=]), and the lambda must not use the mutable specifier. This means the closure is serialized as a value, not as a reference into host stack memory. The entire captured lambda object becomes kernel arguments.
  2. The is_device_copyable concept. This is the SYCL answer to your CUDA/Kokkos concern. Non-trivial types may only be passed as kernel arguments if they adhere to the device copyable concept, which requires specializing the is_device_copyable type trait for user-defined types. This means that a type like List[Float32] which internally holds a heap pointer to host memory — cannot be captured in a kernel lambda unless you explicitly tell the type system it is device-safe. The AdaptiveCpp compiler checks and enforces this.

The coupling between host control flow and device execution is architecturally baked in to the Kokkos/CUDA model, while it is not in AdaptiveCpp. It effectively makes the constraint a type-system property rather than a programmer convention. The HPC programmer has to either use a memory-safe wrapper (accessor or USM pointer) or explicitly opt their type into device copyability, accepting the responsibility. The EuroHPC stack has been built on this model and it completely decouples from NVIDIA/CUDA dependencies.

I mentioned CUDA’s language features as an example of a language/runtime where it is possible to capture by value and ship the closure to the device, not as the only possible implementation. It serves purely as a point of contrast to Mojo today, where that is just not possible at all.

I think your example illustrates the broader point – HPC applications today expect to have some way to do this, regardless of the underlying compiler or runtime.