How to package/interface with a GPU kernel with dynamically sized tensors (dynamic LayoutTensor)

Dynamic shapes are supported, but at the moment the API is (as you’ve found) a little counter-intuitive. A LayoutTensor must have a compile-time layout parameter. Because it’s a parameter, that layout is static: it’s part of the tensor’s type, in the same way that a List[Int] is not interchangeable with a List[String]. You can mark one or more dimensions as unknown by using UNKNOWN_VALUE in place of an actual dimension. (Or, as @TilliFe showed, you can use LayoutTensorBuild, but I’ll stick with the manual process for the sake of explanation.)

To supply the actual shape at run time, you create a RuntimeLayout object. RuntimeLayout also takes the static layout as a parameter. The runtime layout can have different dimensions from the static layout, but it must have the same number of dimensions (the same rank).
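As a minimal sketch of that relationship, the snippet below builds two runtime layouts with different shapes from one static layout. It only uses calls that appear in the full example further down (Layout.row_major, RuntimeLayout[...].row_major, Index, and dim); the variable names small and large are made up for illustration, and I'm assuming the same Mojo version as the rest of this post.

```mojo
from layout import Layout, RuntimeLayout, UNKNOWN_VALUE
from utils import Index


def main():
    # One static layout: rank 2, both dimensions unknown at compile time.
    alias layout = Layout.row_major(UNKNOWN_VALUE, UNKNOWN_VALUE)

    # Two runtime layouts with different shapes but the same static layout
    # parameter, so tensors built from them share a single type.
    # (`small` and `large` are hypothetical names for this sketch.)
    var small = RuntimeLayout[layout].row_major(Index(4, 8))
    var large = RuntimeLayout[layout].row_major(Index(128, 64))

    # The shape is read back at run time, dimension by dimension.
    print(small.dim(0), small.dim(1))
    print(large.dim(0), large.dim(1))
```

Both layouts have rank 2, matching the static layout; only the extents differ.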

It looks like some elementwise operations (like tensor_a + tensor_b) don’t work with dynamic tensor shapes at the moment. (I can’t shed any light on that, but certainly file an issue.) But you can do a lot of basic operations, like accessing individual elements. These dynamic layout tensors are used in some of the MAX AI kernels.

(For example: max/kernels/src/nn/mla.mojo in the modular/modular repository on GitHub, at commit b06993f719c128f0ef528cafdb2b11102186893e.)

Here’s a simple example of invoking a kernel with a dynamically-sized layout tensor.

from layout import Layout, LayoutTensor, UNKNOWN_VALUE, RuntimeLayout
from sys import has_accelerator
from gpu.host import DeviceContext
from gpu import thread_idx, block_idx, global_idx, grid_dim, block_dim, barrier
from utils import Index


fn dynamic_layout_example():
    alias dtype = DType.int32
    alias in_size = 128
    alias block_size = 16
    var num_blocks = in_size // block_size
    alias input_layout = Layout.row_major(UNKNOWN_VALUE, UNKNOWN_VALUE)

    fn kernel(tensor: LayoutTensor[dtype, input_layout, MutableAnyOrigin]):
        # The runtime shape travels with the tensor value, not with its type.
        var width: Int = tensor.runtime_layout.dim(0)
        var height = tensor.runtime_layout.dim(1)
        # Extract this block's tile from the input tensor.
        var tile = tensor.tile[block_size, block_size](block_idx.x, block_idx.y)
        # Skip threads that fall outside the runtime shape.
        if global_idx.x < width and global_idx.y < height:
            tile[thread_idx.x, thread_idx.y] = global_idx.y + global_idx.x * grid_dim.x * block_dim.x

    try:
        var ctx = DeviceContext()
        var width: Int = 128
        var height: Int = 128
        var host_buf = ctx.enqueue_create_host_buffer[dtype](width * height)
        var dev_buf = ctx.enqueue_create_buffer[dtype](width * height)
        ctx.synchronize()
        for i in range(width * height):
            host_buf[i] = 1
        ctx.enqueue_copy(dev_buf, host_buf)
        var runtime_layout = RuntimeLayout[input_layout].row_major(Index(width, height))
        var tensor = LayoutTensor[dtype, input_layout](dev_buf, runtime_layout)

        ctx.enqueue_function[kernel](
            tensor,
            grid_dim=(num_blocks, num_blocks),
            block_dim=(block_size, block_size),
        )
        ctx.synchronize()
        ctx.enqueue_copy(host_buf, dev_buf)
        ctx.synchronize()
        print(host_buf)
    except error:
        print(error)


def main():
    if has_accelerator():
        dynamic_layout_example()
    else:
        print("No accelerator")

I think the compute graphs are much happier with tensors that have static dimensions, but if you need dynamic dimensions, there are things you can do. (You can also just pass a DeviceBuffer[dtype] to enqueue_function, which is translated to an UnsafePointer[Scalar[dtype]] in the kernel signature. But as @owenhilyard said, you may be sacrificing portability.)
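For comparison, here is a rough sketch of the DeviceBuffer route mentioned above. It assumes the same DeviceContext API as the main example and the Mojo version current when this was written (newer releases may require extra parameters on UnsafePointer). The kernel name fill_kernel and the explicit width and height arguments are invented for illustration; with a raw pointer, the kernel has to be given the shape separately and compute its own offsets.

```mojo
from gpu import global_idx
from gpu.host import DeviceContext
from math import ceildiv
from memory import UnsafePointer
from sys import has_accelerator

alias dtype = DType.int32


# Hypothetical kernel: the DeviceBuffer passed from the host arrives here
# as a raw pointer, so the shape is passed as plain integers.
fn fill_kernel(data: UnsafePointer[Scalar[dtype]], width: Int, height: Int):
    var row = Int(global_idx.x)
    var col = Int(global_idx.y)
    # Guard against threads that fall outside the runtime shape.
    if row < width and col < height:
        # Manual row-major indexing: no layout information is available.
        data[row * height + col] = Int32(row * height + col)


def main():
    if not has_accelerator():
        print("No accelerator")
        return
    var width = 128
    var height = 128
    var block_size = 16
    var ctx = DeviceContext()
    var host_buf = ctx.enqueue_create_host_buffer[dtype](width * height)
    var dev_buf = ctx.enqueue_create_buffer[dtype](width * height)
    ctx.enqueue_function[fill_kernel](
        dev_buf,
        width,
        height,
        grid_dim=(ceildiv(width, block_size), ceildiv(height, block_size)),
        block_dim=(block_size, block_size),
    )
    ctx.enqueue_copy(host_buf, dev_buf)
    ctx.synchronize()
    print(host_buf[0], host_buf[width * height - 1])
```

The trade-off is that all indexing is manual here, whereas the LayoutTensor version keeps tiling and bounds information attached to the tensor itself.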