"No active MLIR context" with new `CustomOpLibrary` torch integration

Hi, I know it is very new, but I was trying out the new CustomOpLibrary torch interface. However, I cannot get it to work with any example; I always get the same error:

File "/.../.venv/lib/python3.12/site-packages/max/graph/type.py", line 805, in to_mlir
    self.shape.to_mlir(), self.dtype, self.device.to_mlir()
    ^^^^^^^^^^^^^^^^^^^^
  File "/.../.venv/lib/python3.12/site-packages/max/graph/type.py", line 494, in to_mlir
    shape_type = mosh.ShapeType()
                 ^^^^^^^^^^^^^^^^
RuntimeError: No active MLIR context

As suggested on Discord, I made a simple reproducible example to post here. Here is my setup. The folder structure is:

pyproject.toml
example.py
kernels/
    __init__.mojo (empty)
    kernel.mojo

pyproject.toml:

[project]
name = "example"
version = "0.0.0"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.6.0",
    "pillow>=11.2.1, <12",
    "modular>=25.4.0.dev2025052105",
]

[tool.uv]
[[tool.uv.index]]
url = "https://dl.modular.com/public/nightly/python/simple/"

example.py

from pathlib import Path
import torch
from max.torch import CustomOpLibrary

TILE_SIZE = 16
# Register Mojo kernels in Torch
mojo_kernels = Path(__file__).parent / "kernels"
op_library = CustomOpLibrary(mojo_kernels)
add_const_kernel = op_library.add_const[
    {
        "const": 10
    }
]

def add_const(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    add_const_kernel(result, x)
    return result

if __name__ == "__main__":
    x = torch.randn(10).cuda()

    print(add_const(x))

kernel.mojo

import compiler
from gpu import thread_idx, block_idx, barrier
from layout import Layout, LayoutTensor, UNKNOWN_VALUE
from runtime.asyncrt import DeviceContextPtr
from math import ceildiv
from tensor import InputTensor, OutputTensor

alias BLOCK_SIZE = 32
alias Dyn1DLayout = Layout.row_major(UNKNOWN_VALUE)
alias dtype = DType.float32

@compiler.register("add_const")
struct AddConst:
    @staticmethod
    fn execute[
        const: Int,
        target: StaticString,
    ](
        # Outputs
        result: OutputTensor[type = DType.float32, rank=1],
        # Inputs
        x: InputTensor[type = DType.float32, rank=1],
        # Context
        ctx: DeviceContextPtr,
    ) raises:
        x_tensor = x.to_layout_tensor()
        result_tensor = result.to_layout_tensor()

        @parameter
        if target == "cpu":
            raise Error("Rasterize3DGS CPU target not implemented yet.")
        elif target == "gpu":
            # Get GPU context
            var gpu_ctx = ctx.get_device_context()

            # Define grid and block dimensions for the kernel launch
            var grid = (ceildiv(x.dim_size(0), BLOCK_SIZE))
            var block = (BLOCK_SIZE)

            gpu_ctx.enqueue_function[add_const_kernel[const]](
                x_tensor,
                result_tensor,
                grid_dim=grid,
                block_dim=block,
            )

        else:
            raise Error("Unsupported target:", target)


fn add_const_kernel[
    const: Int
](
    x: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
    result: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
):
    i = block_idx.x * BLOCK_SIZE + thread_idx.x
    result[i] = x[i] + const

I also tried without using UNKNOWN_VALUE, but I always get the same issue. Anything I might be doing wrong here?


So this worked for me in a modified version of your example:

operations/
    __init__.mojo
    add_one.mojo
example.py
mojoproject.toml

example.py

from pathlib import Path
import torch
from max.torch import CustomOpLibrary

# Register Mojo kernels in Torch
mojo_kernels = Path(__file__).parent / "operations"
op_library = CustomOpLibrary(mojo_kernels)
add_const_kernel = op_library.add_constant_custom[
    {
        "value": 10
    }
]

def add_const(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x)
    add_const_kernel(result, x)
    return result

if __name__ == "__main__":
    x = torch.randn(10).cuda()

    print(add_const(x))

add_one.mojo

import compiler
from runtime.asyncrt import DeviceContextPtr
from tensor_internal import (
    InputTensor,
    ManagedTensorSlice,
    OutputTensor,
    foreach,
)

from utils.index import IndexList


@compiler.register("add_constant_custom")
struct AddConstantCustom[value: Int]:
    @staticmethod
    fn execute[
        target: StaticString,
    ](
        out: OutputTensor,
        x: InputTensor[type = out.type, rank = out.rank],
        ctx: DeviceContextPtr,
    ) raises:
        @parameter
        @always_inline
        fn add_constant[
            width: Int
        ](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
            return x.load[width](idx) + value

        foreach[add_constant, target=target](out, ctx)

    @staticmethod
    fn shape(
        x: InputTensor,
    ) raises -> IndexList[x.rank]:
        raise "NotImplemented"

mojoproject.toml

[project]
authors = ["Modular, Inc. <hello@modular.com>"]
channels = ["https://conda.modular.com/max-nightly", "https://conda.modular.com/max", "https://repo.prefix.dev/modular-community", "conda-forge", "pytorch"]
name = "pytorch-test"
platforms = ["linux-64"]
version = "0.1.0"

[tasks]
example = "python example.py"

[dependencies]
max = "*"
pytorch = {version = ">=2.5.0,<=2.7.0", channel = "pytorch"}

Now, that uses foreach rather than direct GPU kernel launches, but it does seem to compile and run correctly. We’re looking into what’s going on in your specific functions, though.

I’ve tried it with uv, and a simplified version of it failed with CUDA call failed: CUDA_ERROR_ILLEGAL_ADDRESS (an illegal memory access was encountered).

Repro: uv init example && cd example. Then, using this pyproject.toml:

[project]
name = "example"
version = "0.0.0"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.6.0",
    "pillow>=11.2.1, <12",
    "modular>=25.4.0.dev2025052105",
]

[tool.uv]
[[tool.uv.index]]
url = "https://dl.modular.com/public/nightly/python/simple/"

and kernel.mojo

import compiler
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor, UNKNOWN_VALUE
from runtime.asyncrt import DeviceContextPtr
from math import ceildiv
from gpu.host import DeviceBuffer
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer

alias BLOCK_SIZE = 32
alias Dyn1DLayout = Layout.row_major(32)
alias dtype = DType.float32

@compiler.register("add_const")
struct AddConst:
    @staticmethod
    fn execute[
        target: StaticString,
    ](
        # Outputs
        result: OutputTensor[type = DType.float32, rank=1],
        # Inputs
        x: InputTensor[type = DType.float32, rank=1],
        # Context
        ctx: DeviceContextPtr,
    ) raises:
        x_tensor = x.to_layout_tensor()
        result_tensor = result.to_layout_tensor()

        @parameter
        if target == "cpu":
            raise Error("Rasterize3DGS CPU target not implemented yet.")
        elif target == "gpu":
            # Get GPU context
            var gpu_ctx = ctx.get_device_context()

            # Define grid and block dimensions for the kernel launch
            var grid = (ceildiv(x.dim_size(0), BLOCK_SIZE))
            var block = (BLOCK_SIZE)

            gpu_ctx.enqueue_memset(
                DeviceBuffer[result.type](
                    gpu_ctx,
                    rebind[UnsafePointer[Scalar[result.type]]](result_tensor.ptr),
                    x.dim_size(0),
                    owning=False,
                ),
                0,
            )

            gpu_ctx.enqueue_function[add_const_kernel](
                x_tensor,
                result_tensor,
                grid_dim=grid,
                block_dim=block,
            )

        else:
            raise Error("Unsupported target:", target)


fn add_const_kernel(
    x: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
    result: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
):
    i = block_idx.x * block_dim.x + thread_idx.x
    if i < x.dim[0]():
        result[i] = x[i] + 10

and main.py

from pathlib import Path
import torch
from max.torch import CustomOpLibrary

mojo_kernels = Path(__file__).parent / "kernels"
op_library = CustomOpLibrary(mojo_kernels)
add_const_kernel = op_library.add_const

def add_const(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x, dtype=x.dtype, device=x.device)
    add_const_kernel(result, x)
    return result

if __name__ == "__main__":
    x = torch.randn(10).cuda()
    print(add_const(x))

then run uv run python main.py.


Hi Bernardo, thanks so much for the easy repro! I was able to reproduce and fix the issue; it might not make it into today’s nightly, but if it doesn’t it will be in tomorrow’s.

In the meantime you can work around by adding

from max import mlir
mlir.Context().__enter__()

to your script anywhere before the add_const_kernel = op_library.add_const[... line.
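
For clarity, here is a minimal sketch of where those two lines fit in the example.py from the first post (assuming the rest of the script stays unchanged):

from pathlib import Path
import torch
from max import mlir
from max.torch import CustomOpLibrary

# Workaround for "No active MLIR context": enter a context before building the op library.
mlir.Context().__enter__()

# Everything below is the original example, unchanged.
mojo_kernels = Path(__file__).parent / "kernels"
op_library = CustomOpLibrary(mojo_kernels)
add_const_kernel = op_library.add_const[{"const": 10}]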


Combining Mojo & MAX with PyTorch is fascinating. I tried a simple custom pattern match for Inductor, and it worked. By the way, can Mojo kernels support AOT (ahead-of-time) compilation?

import compiler
from max.tensor import InputTensor, OutputTensor, foreach
from runtime.asyncrt import DeviceContextPtr

from utils.index import IndexList


@compiler.register("custom_pow2_add")
struct CustomPow2Add:
    @staticmethod
    def execute[
        target: StaticString
    ](
        output: OutputTensor,
        x: InputTensor[type = output.type, rank = output.rank],
        y: InputTensor[type = output.type, rank = output.rank],
        ctx: DeviceContextPtr,
    ):
        @parameter
        @always_inline
        fn run[width: Int](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
            return x.load[width](idx) ** 2 + y.load[width](idx)

        foreach[run, target=target](output, ctx)

import torch
from torch._inductor.pattern_matcher import (
    fwd_only,
    PatternMatcherPass,
    register_replacement,
)
from typing import Callable, Iterable

from pathlib import Path
from max.torch import CustomOpLibrary


mojo_kernels = Path(__file__).parent / "mojo_kernels"
op_library = CustomOpLibrary(mojo_kernels)
custom_pow2_add = op_library.custom_pow2_add


def custom_op(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    print("custum_op")
    result = torch.zeros_like(a)
    custom_pow2_add(result, a, b)
    return result


def pattern(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    c = a**2
    return c + b


patterns = PatternMatcherPass()
inputs = (torch.randn(10, 10), torch.randn(10, 10))
register_replacement(pattern, custom_op, inputs, fwd_only, patterns)

count = 0


def custom_pass(graph: torch.fx.graph):
    global count
    count = patterns.apply(graph)


def custom_backend(
    graph: torch.fx.GraphModule, example_inputs: Iterable[torch.Tensor]
) -> Callable:
    from torch._inductor import config

    current_config = config.get_config_copy()
    from torch._inductor.compile_fx import compile_fx

    current_config["post_grad_custom_post_pass"] = custom_pass
    return compile_fx(graph, example_inputs, config_patches=current_config)


@torch.compile(backend=custom_backend)
def f_mojo(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x**2 + y


def f_torch(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x**2 + y


if __name__ == "__main__":
    inp1 = torch.rand(3, 5)
    inp2 = torch.rand(3, 5)
    print(f_mojo(inp1, inp2))
    print(f_torch(inp1, inp2))
    print(count)
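
As a quick sanity check (a minimal sketch reusing f_mojo, f_torch, and count from the script above), you can confirm that the replacement actually fired and that both paths agree numerically:

x1 = torch.rand(3, 5)
x2 = torch.rand(3, 5)
out_mojo = f_mojo(x1, x2)
out_torch = f_torch(x1, x2)
# count > 0 means the Inductor pattern matcher rewrote x**2 + y into the Mojo op
assert count > 0, "pattern was not replaced"
assert torch.allclose(out_mojo, out_torch)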

Thank you all for the help. The nightly seems to have fixed the No MLIR issue, but then, as pointed out by @Ehsan, there was an illegal CUDA access issue. I tried running your fixed example with uv and now I get a new issue:

File "/.../.venv/lib/python3.12/site-packages/max/engine/api.py", line 526, in load
    _model = self._impl.compile_from_object(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Failed to run the MOToMGP pass manager:
open-source/max/max/kernels/src/Mogg/MOGGKernelAPI:1:1: error: failed to run the pass manager for offload functions
/.../kernels/kernel.mojo:67:20: error: call expansion failed
open-source/max/max/kernels/src/layout/layout_tensor.mojo:2337:8: note: function instantiation failed
open-source/max/max/kernels/src/layout/layout_tensor.mojo:2378:10: note: call expansion failed
note: constraint failed: This method only works with tensors that have depth-1 layouts (no nested shapes).
open-source/max/max/kernels/src/Mogg/MOGGKernelAPI:1:1: error: Could not elaborate the provided code: failed to run the pass manager
error: The graph compiler tried to JIT compile the provided kernels but failed during elaboration

The code is an exact replica of the one posted by @Ehsan, except that I changed to the latest nightly to fix the No MLIR issue: modular>=25.4.0.dev2025052116

Seems like the issue is now with this line:

if i < x.dim[0]():

Feeding this size into the function as an argument fixes it, but I wonder why this is a problem.

So, as mentioned before, I could not check the LayoutTensor dimension inside the kernel function for some reason. Nevertheless, I was now experimenting with adding one more dimension, and I cannot make it work without getting:

CUDA call failed: CUDA_ERROR_ILLEGAL_ADDRESS (an illegal memory access was encountered)

I am wondering if I have bumped into a new bug or if I am doing something wrong. I modified this simple example and verified that the same thing happens on this very simple example. In the other examples I was experimenting with, this seems to happen any time I try to index a LayoutTensor with more than one dimension.

kernel.mojo

import compiler
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor, UNKNOWN_VALUE
from runtime.asyncrt import DeviceContextPtr
from math import ceildiv
from gpu.host import DeviceBuffer
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer

alias BLOCK_SIZE = 32
alias Dyn2DLayout = Layout.row_major(UNKNOWN_VALUE, UNKNOWN_VALUE)
alias dtype = DType.float32

@compiler.register("add_const")
struct AddConst:
    @staticmethod
    fn execute[
        target: StaticString,
    ](
        # Outputs
        result: OutputTensor[type = DType.float32, rank=2],
        # Inputs
        x: InputTensor[type = DType.float32, rank=2],
        # Context
        ctx: DeviceContextPtr,
    ) raises:
        x_tensor = x.to_layout_tensor()
        result_tensor = result.to_layout_tensor()

        @parameter
        if target == "cpu":
            raise Error("Rasterize3DGS CPU target not implemented yet.")
        elif target == "gpu":
            # Get GPU context
            var gpu_ctx = ctx.get_device_context()

            # Define grid and block dimensions for the kernel launch
            var grid = (ceildiv(x.dim_size(0), BLOCK_SIZE), ceildiv(x.dim_size(1), BLOCK_SIZE))
            var block = (BLOCK_SIZE, BLOCK_SIZE)

            gpu_ctx.enqueue_function[add_const_kernel](
                x_tensor,
                result_tensor,
                x.dim_size(0),
                grid_dim=grid,
                block_dim=block,
            )

        else:
            raise Error("Unsupported target:", target)


fn add_const_kernel(
    x: LayoutTensor[dtype, Dyn2DLayout, MutableAnyOrigin],
    result: LayoutTensor[dtype, Dyn2DLayout, MutableAnyOrigin],
    size: Int,
):
    i = block_idx.x * block_dim.x + thread_idx.x
    j = block_idx.y * block_dim.y + thread_idx.y
    if i < size and j < size:
        result[i, j] = x[i, j] + 10

example.py

from pathlib import Path
import torch
from max.torch import CustomOpLibrary

mojo_kernels = Path(__file__).parent / "kernels"
op_library = CustomOpLibrary(mojo_kernels)
add_const_kernel = op_library.add_const

def add_const(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x, dtype=x.dtype, device=x.device)
    add_const_kernel(result, x)
    return result

if __name__ == "__main__":
    x = torch.randn(10,10).cuda()
    print(add_const(x))

pyproject.toml

[project]
name = "example"
version = "0.0.0"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "torch>=2.6.0",
    "pillow>=11.2.1, <12",
    "modular>=25.4.0.dev2025052116",
]

[tool.uv]
[[tool.uv.index]]
url = "https://dl.modular.com/public/nightly/python/simple/"

Good news: with the new nightly (rm uv.lock just in case) it works as expected!

import compiler
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor, UNKNOWN_VALUE
from runtime.asyncrt import DeviceContextPtr
from math import ceildiv
from gpu.host import DeviceBuffer
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer

alias BLOCK_SIZE = 32
alias Dyn1DLayout = Layout.row_major(UNKNOWN_VALUE)
alias dtype = DType.float32

@compiler.register("add_const")
struct AddConst:
    @staticmethod
    fn execute[
        target: StaticString,
    ](
        # Outputs
        result: OutputTensor[type = DType.float32, rank=1],
        # Inputs
        x: InputTensor[type = DType.float32, rank=1],
        # Context
        ctx: DeviceContextPtr,
    ) raises:
        x_tensor = x.to_layout_tensor()
        result_tensor = result.to_layout_tensor()

        @parameter
        if target == "cpu":
            raise Error("Rasterize3DGS CPU target not implemented yet.")
        elif target == "gpu":
            # Get GPU context
            var gpu_ctx = ctx.get_device_context()

            # Define grid and block dimensions for the kernel launch
            var grid = (ceildiv(x.dim_size(0), BLOCK_SIZE))
            var block = (BLOCK_SIZE)

            gpu_ctx.enqueue_memset(
                DeviceBuffer[result.type](
                    gpu_ctx,
                    rebind[UnsafePointer[Scalar[result.type]]](result_tensor.ptr),
                    x.dim_size(0),
                    owning=False,
                ),
                0,
            )

            gpu_ctx.enqueue_function[add_const_kernel](
                x_tensor,
                result_tensor,
                grid_dim=grid,
                block_dim=block,
            )

        else:
            raise Error("Unsupported target:", target)


fn add_const_kernel(
    x: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
    result: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
):
    i = block_idx.x * block_dim.x + thread_idx.x
    if i < x.dim[0]():
        result[i] = x[i] + 10

However, I get:

ValueError: Failed to run the MOToMGP pass manager:
open-source/max/max/kernels/src/Mogg/MOGGKernelAPI:1:1: error: failed to run the pass manager for offload functions
/home/ubuntu/workspace/tmp/example/kernels/kernel.mojo:63:4: error: call expansion failed
/home/ubuntu/workspace/tmp/example/kernels/kernel.mojo:63:4: note: function instantiation failed
/home/ubuntu/workspace/tmp/example/kernels/kernel.mojo:68:20: note: call expansion failed
open-source/max/max/kernels/src/layout/layout_tensor.mojo:2337:8: note: function instantiation failed
open-source/max/max/kernels/src/layout/layout_tensor.mojo:2378:10: note: call expansion failed
note: constraint failed: This method only works with tensors that have depth-1 layouts (no nested shapes).
open-source/max/max/kernels/src/Mogg/MOGGKernelAPI:1:1: error: Could not elaborate the provided code: failed to run the pass manager
error: The graph compiler tried to JIT compile the provided kernels but failed during elaboration

for the original case with the const parameter.

@bertaveira it turns out the main issue is in the dynamic layout: since UNKNOWN_VALUE is represented as -1, ceildiv(x.dim_size(0), BLOCK_SIZE) makes the grid_dim zero, which is an invalid value. We’ll be working on showing much better error messages. In the meantime, dynamic tensors can’t work like that; you need to specify the grid_dim directly.
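
To make the failure mode described here concrete, here is the arithmetic in plain Python (a minimal sketch; -1 stands in for the UNKNOWN_VALUE sentinel, and Mojo’s ceildiv gives the same result for this input):

import math

BLOCK_SIZE = 32
dim = -1  # what a static shape query reports for an UNKNOWN_VALUE dimension
grid_dim = math.ceil(dim / BLOCK_SIZE)
print(grid_dim)  # 0: launching a kernel with a zero-sized grid is invalid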

But then there is no way to get the size of a dynamic tensor at runtime? We always have to pass the sizes as arguments? Doesn’t that somewhat defeat the niceness of using layout tensors to begin with? Layout tensors must know their sizes at runtime, so why can’t we access that information?

Any developments or ideas on how to solve this, i.e. getting the shape of an InputTensor with an UNKNOWN_VALUE shape?

I was waiting to see if it was added in a nightly, and I also tried to feed it as an int argument. But apparently with custom ops one can only have InputTensors and no other arguments. The only way I see is to make it a parameter, but that defeats the whole point of using UNKNOWN_VALUE.

Am I missing something here? It seems like a pretty huge gap that makes it impossible to truly have runtime-sized tensors with custom ops, since it even prevents us from choosing the grid size: there is no way to know the dimensions of the UNKNOWN_VALUE inputs.

These are great questions! I’m delegating to @stef, who has dug into this before.

@bertaveira thanks for following up! I think some wires got crossed in our response here.

As you point out, using a dynamic shape value at runtime is a really common use case, and is the entire point of a dynamic layout!

  • The dim functions on LayoutTensor provide access to dynamic dimension values
  • Be careful to note that x.shape[dim]() is static and does not know the dynamic layout value! This is what @Ehsan was referring to in his previous post; it will return -1 for an UNKNOWN_VALUE.
  • ManagedTensorSlice works a bit differently, where x.shape() returns an IndexList, x.dim_size[dim]() returns statically-known dimension sizes, and x.dim_size(dim) returns dynamically-known dimension sizes.

In the code you posted this is not a problem, but it was something I happened to stumble on during reproduction. As long as you use one of the dimension functions that reports the dynamic size (which you were!), you won’t run into this problem. We’re also merging better error checking for this, so it will fail with a good error message rather than reporting a generic GPU memory error :slight_smile:

The specific issue you were encountering in this example was a subtle bug in our compiler stack related to unsafely capturing values which had no GPU representation and trying to send them across the device boundary, which then caused a GPU memory error when running the function. This has been fixed and your example should work in the latest nightly. Please update if you’re still having trouble!
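
Once you’re on a nightly with the fix, a quick way to confirm the 2D example from earlier behaves as intended (a minimal sketch reusing that example’s add_const wrapper; the kernel adds 10 elementwise, so the output should match x + 10):

x = torch.randn(10, 10).cuda()
y = add_const(x)
# Every element should have been incremented by the constant 10.
assert torch.allclose(y, x + 10)
print("ok")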

FYI, we also recently open sourced our entire kernel library with thousands of high-performance examples :smiley: You can see plenty of usages of many of these features on GitHub; for instance, here are usages of grid_dim, here’s an example of a multi-head attention kernel with a dynamic batch size as part of the grid_dim, and here’s an example of a fused attention implementation which uses a parameterized kernel to get “dynamic” static shape info for grid_dim; it will compile a specialized implementation per input layout.


(Also, keep an eye out for scalar inputs to custom ops coming to a nightly near you soon! :wink: though you definitely don’t need it for passing dynamic shape info!)