"No active MLIR context" with new `CustomOpLibrary` torch integration

Hi @bertaveira,

Sorry if I’ve added to the confusion here. There is more than one issue mentioned in this thread, so it’s a little confusing to keep track.

I guess I will keep coming back every few weeks to see whether any changes have been made to Mojo to make this possible at all. Let me know if there is a way forward at some point. Alternative solutions, like parameterized layouts or fixed sizes, are just workarounds that don't really tackle the issue exposed here.

The way Mojo handles the example add_const function is not great at the moment. We’re working on improving ctx.enqueue_function so that it produces a type error rather than failing silently at runtime. It is unlikely we will make any changes in the near term that allow add_const to be used like that with ctx.enqueue_function.

That was my reasoning for closing the associated GitHub ticket. I’m happy to re-open if you feel this is a bug on our side.

Also, I just saw that the related issue ticket on GitHub was closed, but this is not fixed at all, right? I don’t think this workaround, which has drastically different implications, merits closing the issue.

I did not realize the use of UNKNOWN_VALUE was a must-have for your example. It is definitely possible to keep that structure and have the kernel use an unknown layout. The best way to do that is to encode that constraint on the kernel registration itself, rather than only on the function passed to ctx.enqueue_function.

from compiler_internal import StaticTensorSpec

@compiler.register("add_const")
struct AddConst:
    @staticmethod
    fn execute[
        target: StaticString,
    ](
        # Outputs
        result: OutputTensor[
            static_spec=StaticTensorSpec[DType.float32, 1].create_unknown(),
        ],
        # Inputs
        x: InputTensor[
            static_spec=StaticTensorSpec[DType.float32, 1].create_unknown(),
        ],
        # Context
        ctx: DeviceContextPtr,
    ) raises:
        # Rest of the code stays the same

Fair warning, you may need to update your modular dependency. The code above was tripping a different bug in the compiler that has since been fixed.

This could be made easier if we had some convenience functions to ‘erase’ the static layout of LayoutTensor or ManagedTensorSlice, but we don’t have those at the moment. Erasing the layout on the InputTensor and OutputTensor arguments will ensure that no code gets specialized on layouts, rather than just the GPU kernel.


Thank you for the reply. That did make the code run, but it compiles a separate kernel on every size change. It is very obvious when running this code: it compiles for size 10, does the two runs, and then compiles again, very slowly, for each new size.

I had to run `rm -rf ~/.modular` between tries to ensure it was not caching kernels.

example.py:

from pathlib import Path
import torch
from max.torch import CustomOpLibrary

mojo_kernels = Path(__file__).parent / "kernels"
op_library = CustomOpLibrary(mojo_kernels)
add_const_kernel = op_library.add_const

def add_const_1d(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x, dtype=x.dtype, device=x.device)
    add_const_kernel(result, x)
    return result

if __name__ == "__main__":
    x = torch.randn(10).cuda()
    print(add_const_1d(x))

    x = torch.randn(10).cuda()
    print(add_const_1d(x))

    x = torch.randn(2).cuda()
    print(add_const_1d(x))

    x = torch.randn(3).cuda()
    print(add_const_1d(x))

    x = torch.randn(4).cuda()
    print(add_const_1d(x))

    x = torch.randn(5).cuda()
    print(add_const_1d(x))

    x = torch.randn(6).cuda()
    print(add_const_1d(x))

kernels/kernel.mojo:

import compiler
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor, UNKNOWN_VALUE
from runtime.asyncrt import DeviceContextPtr
from math import ceildiv
from gpu.host import DeviceBuffer
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer
from compiler_internal import StaticTensorSpec

alias BLOCK_SIZE = 32
alias Dyn1DLayout = Layout.row_major(UNKNOWN_VALUE)
alias dtype = DType.float32

@compiler.register("add_const")
struct AddConst:
    @staticmethod
    fn execute[
        target: StaticString,
    ](
        # Outputs
        result: OutputTensor[static_spec=StaticTensorSpec[DType.float32, 1].create_unknown()],
        # Inputs
        x: InputTensor[static_spec=StaticTensorSpec[DType.float32, 1].create_unknown()],
        # Context
        ctx: DeviceContextPtr,
    ) raises:
        x_tensor = x.to_layout_tensor()
        result_tensor = result.to_layout_tensor()

        @parameter
        if target == "cpu":
            raise Error("Rasterize3DGS CPU target not implemented yet.")
        elif target == "gpu":
            # Get GPU context
            var gpu_ctx = ctx.get_device_context()

            # Define grid and block dimensions for the kernel launch
            var grid = (ceildiv(x.dim_size(0), BLOCK_SIZE))
            var block = (BLOCK_SIZE)

            gpu_ctx.enqueue_memset(
                DeviceBuffer[result.dtype](
                    gpu_ctx,
                    rebind[UnsafePointer[Scalar[result.dtype]]](result_tensor.ptr),
                    x.dim_size(0),
                    owning=False,
                ),
                0,
            )

            gpu_ctx.enqueue_function[add_const_kernel](
                x_tensor,
                result_tensor,
                x.dim_size(0),
                grid_dim=grid,
                block_dim=block,
            )
        else:
            raise Error("Unsupported target:", target)

fn add_const_kernel(
    x: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
    result: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
    size: Int,
):
    i = block_idx.x * block_dim.x + thread_idx.x
    if i < size:
        result[i] = x[i] + 10

When running (with modular==25.5.0.dev2025071605), it is very noticeable that it recompiles for every new size. Am I missing something?

You’re not missing anything at the moment. The size specialization + caching is how the system is currently intended to work. The behavior is a consequence of how we build a MAX graph based on the runtime shape of each input tensor.

We’ve discussed ways to give users control over this behavior, but this is the first concrete use case we’ve had for such a feature.

It would be nice if we could pick up on kernel arguments like OutputTensor[static_spec=StaticTensorSpec[DType.float32, 1].create_unknown()] and realize there is no point in doing size specialization, but we don’t currently have the tooling to inspect kernels like that at the Python level.

In the short-term, we could expose a config option to disable size specialization, but there is no immediate workaround for your use case.
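In the meantime, one mitigation pattern borrowed from other frameworks (my own suggestion, not a supported MAX feature) is to bucket input sizes: pad every 1-D input up to, say, the next power of two, so only a logarithmic number of distinct shapes ever reach the compiler. A sketch of the bucketing math:

```python
# Hypothetical shape-bucketing helper (not part of MAX): round the 1-D
# input length up to the next power of two so the number of distinct
# compiled shapes grows logarithmically with the maximum size. The
# kernel must tolerate trailing padding, and the caller slices the
# result back to the original length.

def bucket_size(n: int) -> int:
    # Smallest power of two >= n, for n >= 1.
    return 1 << (n - 1).bit_length()

# The sizes from the example script collapse into four buckets:
sizes = [10, 10, 2, 3, 4, 5, 6]
buckets = sorted({bucket_size(n) for n in sizes})
# buckets == [2, 4, 8, 16]
```

In the PyTorch wrapper, a bucketed add_const_1d would pad x out to bucket_size(x.shape[0]) before calling the kernel, then return result[: x.shape[0]].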

You can check out the more general `graph_op`, which allows specifying `input_types`, specifically for this case!


Sounds good! A config to disable size specialization would do the trick, if you think it is an acceptable addition from your side in the short term.