DevicePassable closures: a requirement for HPC programmers

I’ve created a feature request here for making closures DevicePassable: [Feature Request] Make closures DevicePassable · Issue #6507 · modular/modular · GitHub.

I think this feature is going to be a requirement for HPC developers to take Mojo seriously. The vast majority of large-scale HPC codes have moved to using CUDA/HIP extended lambda-based programming models, since they enable compile-time portability (and debuggability) between CPU and GPU execution.
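
To make that concrete, here is a minimal CUDA sketch of the extended-lambda pattern I mean (the names and the for_each helper are illustrative, not from my code; compile with nvcc --extended-lambda). The launch machinery is written once, and each numerical kernel is just a lambda handed to it:

// Minimal illustration of the CUDA extended-lambda pattern.
// Compile with: nvcc --extended-lambda for_each.cu
#include <cstdio>
#include <cuda_runtime.h>

// Generic kernel, written once for the whole code base.
template <typename F>
__global__ void for_each_kernel(int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i);
}

template <typename F>
void for_each(int n, F f) {
    int block = 256;
    for_each_kernel<<<(n + block - 1) / block, block>>>(n, f);
}

int main() {
    int n = 16;
    float* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = float(i);

    float dt = 0.5f;  // host-side control flow decides this value
    // The numerical kernel is just a lambda; x and dt are captured by value.
    for_each(n, [=] __device__ (int i) { x[i] = x[i] + dt * x[i]; });

    cudaDeviceSynchronize();
    for (int i = 0; i < n; ++i) printf("%g\n", x[i]);
    cudaFree(x);
    return 0;
}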

As an example, the parallel (8,000+ GPU production runs), large-scale (60k SLOC CUDA/HIP C++) physics code that I work on has 89 unique numerical kernels. Without a programming model that enables this kind of abstraction, it would not be possible to create or maintain it. This is a typical situation for comparable HPC codes.

Do you need value captures? Those are difficult to do, since any backing data for the things you capture has to be visible to the GPU.

For example:

from std.gpu import thread_idx, block_idx, block_dim
from std.gpu.host import DeviceContext, DeviceBuffer
from std.memory import UnsafePointer
from std.sys.info import is_gpu


def do_elementwise[
    dtype: DType,
    func: def(Scalar[dtype]) -> Scalar[dtype],
    on_gpu: Bool
](ctx: DeviceContext, buffer: DeviceBuffer[dtype], size: Int) raises:
    """
    Invariant: `buffer` must have at least `size` elements.
    """

    comptime if on_gpu:
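        # GPU path: wrap `func` in a small elementwise kernel and launch it on `ctx`.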
        def elementwise_kernel(ptr: UnsafePointer[Scalar[dtype], MutExternalOrigin], n: Int):
            idx = thread_idx.x + block_idx.x * block_dim.x
            if idx < n:
                ptr[idx] = func(ptr[idx])

        # Launch the kernel with appropriate grid and block dimensions
        block_size = 256
        grid_size = (size + block_size - 1) // block_size

        var kernel = ctx.compile_function[elementwise_kernel, elementwise_kernel]()
        
        ctx.enqueue_function(
            kernel,
            buffer,
            size,
            grid_dim=grid_size,
            block_dim=block_size,
        )
    else:
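        # CPU path: map the buffer into host-accessible memory and apply `func` serially.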
        with buffer.map_to_host() as mapped_buffer:
            var ptr = mapped_buffer.unsafe_ptr()
            for i in range(size):
                ptr[i] = func(ptr[i])

def square_and_add_constant[dtype: DType, constant: Scalar[dtype]](x: Scalar[dtype]) -> Scalar[dtype]:
    return x.fma(x, constant)

def main() raises:
    ctx = DeviceContext()

    var num_elements: Int = 16

    # Create a buffer in host (CPU) memory to store our input data
    host_buffer = ctx.enqueue_create_host_buffer[DType.float32](
        num_elements
    )

    ctx.synchronize()

    for i in range(num_elements):
        host_buffer[i] = Float32(i)

    device_buffer = ctx.enqueue_create_buffer[DType.float32](num_elements)

    ctx.enqueue_copy(src_buf=host_buffer, dst_buf=device_buffer)

    def double[dtype: DType](x: Scalar[dtype]) -> Scalar[dtype]:
        return x + x
    
    do_elementwise[DType.float32, double[DType.float32], True](ctx, device_buffer, num_elements)

    ctx.enqueue_copy(src_buf=device_buffer, dst_buf=host_buffer)

    ctx.synchronize()

    for i in range(num_elements):
        print(host_buffer[i])

    do_elementwise[DType.float32, square_and_add_constant[DType.float32, 3.14159], True](ctx, device_buffer, num_elements)

    ctx.enqueue_copy(src_buf=device_buffer, dst_buf=host_buffer)

    ctx.synchronize()

    for i in range(num_elements):
        print(host_buffer[i])

    def zero[dtype: DType](x: Scalar[dtype]) -> Scalar[dtype]:
        return Scalar[dtype](0)

    do_elementwise[DType.float32, zero[DType.float32], False](ctx, device_buffer, num_elements)

    ctx.enqueue_copy(src_buf=device_buffer, dst_buf=host_buffer)

    ctx.synchronize()
    
    for i in range(num_elements):
        print(host_buffer[i])

This code shows something lambda-like that doesn’t use value captures, and it compiles and works today. Mojo doesn’t have a lambda keyword yet because we’re trying to figure out how to include some of the power C++ lambdas give you without making the syntax messy for people who want simple “pure function” lambdas. Here I handle portability with an on_gpu selector that gates the kernel launch, but std.sys.info.is_gpu can be used the same way in the middle of kernels. Mojo’s type system lets you pass functions around at compile time, much as you would with C++ templates. The combination of these two capabilities has given me what I’ve needed for portability between GPU and CPU. If you have specific examples of things you want to do, I can try to help you translate them to Mojo.

If you want to have captures, that gets a bit tricky, because we need to make sure that everything captured is allocated in memory the GPU has access to, which rapidly turns into a big pile of headaches. This is why closures aren’t DevicePassable: we don’t really have a way to guarantee that the List[Float32] you captured is somewhere the GPU can see. If you have examples of places where you want to do this, I can probably give you a better direction for how to do what you want.
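For comparison, the way CUDA code usually deals with this is to never capture the host container at all: you capture a raw device (or managed-memory) pointer by value, so the only thing crossing the host/device boundary is something the GPU can already see. A rough sketch, with made-up names:

// Sketch: capture a device pointer by value instead of the host container.
#include <vector>
#include <cuda_runtime.h>

template <typename F>
__global__ void for_each_kernel(int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i);
}

void scale_on_gpu(std::vector<float>& host, float s) {
    int n = static_cast<int>(host.size());
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Capturing `host` itself would be a bug: its backing storage is CPU-only.
    // Capturing the device pointer `d` and the scalar `s` by value is fine.
    for_each_kernel<<<(n + 255) / 256, 256>>>(n, [=] __device__ (int i) { d[i] *= s; });

    cudaMemcpy(host.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
}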

The code I want to enable all follows this basic pattern:

        # **********************************
        # MAIN TIME EVOLUTION LOOP
        # **********************************

        for step in range(1, nsteps + 1):
            phi_old.fill_boundary(geometry)

            var update_mfi = phi_old.mfiter()
            while update_mfi.is_valid():
                var bx = update_mfi.validbox()
                var phi_old_array = phi_old.array(update_mfi)
                var phi_new_array = phi_new.array(update_mfi)
                var tile_dx = dx.copy()

                def advance_cell(
                    i: Int, j: Int, k: Int
                ) raises {var phi_new_array^, var phi_old_array^, var tile_dx^, var dt,}:
                    phi_new_array[i, j, k] = phi_old_array[i, j, k] + dt * (
                        (phi_old_array[i + 1, j, k] - 2.0 * phi_old_array[i, j, k] + phi_old_array[i - 1, j, k])
                        / (tile_dx.x * tile_dx.x)
                        + (phi_old_array[i, j + 1, k] - 2.0 * phi_old_array[i, j, k] + phi_old_array[i, j - 1, k])
                        / (tile_dx.y * tile_dx.y)
                        + (phi_old_array[i, j, k + 1] - 2.0 * phi_old_array[i, j, k] + phi_old_array[i, j, k - 1])
                        / (tile_dx.z * tile_dx.z)
                    )

                ParallelFor(advance_cell, bx)
                update_mfi.next()

            time = time + dt
            phi_old.copy_from(phi_new, 0, 0, 1)

In this example, advance_cell requires a value capture for dt. That is a common pattern, and yes, in my code (and many others), value captures are absolutely necessary. My understanding is that the analogous functionality in CUDA works by having the compiler automatically translate dt and the other value captures of an extended lambda into device function arguments, which the runtime copies over to the device when the kernel launches.
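Roughly, the compiler treats a capturing extended lambda as if you had written the equivalent functor by hand, with the captured values as data members, and that functor object travels to the device as an ordinary kernel argument. A hand-written sketch of what that desugaring looks like (the names are mine, not generated code):

// Conceptual desugaring of `[=] __device__ (int i) { out[i] += dt * in[i]; }`.
#include <cuda_runtime.h>

struct AdvanceCell {
    float* out;        // captured pointer; must already be device-visible
    const float* in;   // captured pointer
    float dt;          // captured scalar, copied by value

    __device__ void operator()(int i) const { out[i] += dt * in[i]; }
};

template <typename F>
__global__ void for_each_kernel(int n, F f) {  // f arrives as a kernel argument
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i);
}

void launch(float* out, const float* in, float dt, int n) {
    // The whole closure object (two pointers plus dt) is copied to the device
    // as part of the kernel's argument buffer at launch time.
    for_each_kernel<<<(n + 255) / 256, 256>>>(n, AdvanceCell{out, in, dt});
}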

In Kokkos (and other CUDA C++ abstractions), switching between CPU and GPU is hidden inside ParallelFor. There is a CPU version and a GPU version of ParallelFor, and which one is used is selected by ifdef macros set by the build system. The comptime if selector would be perfectly fine for my purposes in Mojo.
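The dispatch that Kokkos hides can be sketched like this (just the shape of it, not the real Kokkos implementation; the lambda has to be __host__ __device__ so the same body compiles for both backends):

// Sketch of build-time CPU/GPU selection inside a ParallelFor-style wrapper.
// Build the device path with e.g. -DUSE_GPU; otherwise the host loop is used.
#ifdef USE_GPU
#include <cuda_runtime.h>

template <typename F>
__global__ void parallel_for_kernel(int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i);
}
#endif

template <typename F>
void ParallelFor(int n, F f) {
#ifdef USE_GPU
    int block = 256;
    parallel_for_kernel<<<(n + block - 1) / block, block>>>(n, f);
    cudaDeviceSynchronize();
#else
    for (int i = 0; i < n; ++i) f(i);  // plain serial (or OpenMP-able) host loop
#endif
}

// Usage: ParallelFor(n, [=] __host__ __device__ (int i) { y[i] = a * x[i] + y[i]; });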

The issue is really that these codes are written in a way that makes value captures necessary: control flow happens in host code, which sets variables that then have to be copied implicitly for the device kernels to use.

The other reason the capture-free pattern above won’t work is that almost every kernel uses a different set of work arrays. For instance, if they were rewritten as free functions (that is, without capturing anything), the 89 kernels in my code would not conform to a single function signature (there are not quite 89 unique signatures, but there are at least 20-30). It is not feasible to maintain separate iterators/kernel-launch helpers for every possible set of input arguments the kernels need. So, as far as I understand how Mojo works, DevicePassable closures are the only way to handle this.
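To illustrate the signature problem: written as capture-free free functions, two kernels already need two different launch helpers, while as capturing closures they both present the launcher with the same (i) shape. A toy example (the physics is made up):

// Two kernels as capture-free free functions: different argument lists, so no
// single generic launcher covers both.
__host__ __device__
void advance_cell(int i, float* phi_new, const float* phi_old, float dt, float inv_dx2) {
    // interior cells only, for the sake of the example
    phi_new[i] = phi_old[i] + dt * inv_dx2 * (phi_old[i + 1] - 2.0f * phi_old[i] + phi_old[i - 1]);
}

__host__ __device__
void apply_eos(int i, float* p, const float* rho, const float* e, float gamma) {
    p[i] = (gamma - 1.0f) * rho[i] * e[i];
}

// As capturing closures, both collapse to the one shape ParallelFor expects:
//   ParallelFor(n, [=] __device__ (int i) { advance_cell(i, phi_new, phi_old, dt, inv_dx2); });
//   ParallelFor(n, [=] __device__ (int i) { apply_eos(i, p, rho, e, gamma); });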

Precisely this kind of value capture is possible today in Rust, using cuda-oxide to compile Rust code directly to nvptx kernels: NVlabs/cuda-oxide on GitHub (“an experimental Rust-to-CUDA compiler that lets you write (SIMT) GPU kernels in safe(ish), idiomatic Rust. It compiles standard Rust code directly to PTX — no DSLs, no foreign language bindings, just Rust.”).

If anyone in Modular leadership is reading this: as you know, Mojo has the chance to make a lasting impression with the 1.0 release. I cannot emphasize enough that the HPC community will pass over it without a second thought if closures with value captures are not possible. I don’t know whether Modular wants the HPC community to take Mojo seriously, but if, as a company, you do, developers will expect extended-lambda-style value captures to work out of the box, in the same way shown in NVIDIA’s cuda-oxide README example.

For instance, if they were rewritten as free functions (that is, without capturing anything), all 89 kernels in my code would not conform to a single function signature

This problem can be fixed with a proper function trait family, and definitely should be.

I think the best solution to your problem right now might be to run the whole thing on the GPU, which I know can be annoying.

Let me go poke the feature request you added, because I don’t think this should be difficult.