"No active MLIR context" with new `CustomOpLibrary` torch integration

Hi @bertaveira,

Sorry if I’ve added to the confusion here. There is more than one issue mentioned in this thread, so it’s a little confusing to keep track.

I guess I will keep coming back every few weeks to see whether any changes have been made to Mojo to make this possible at all. Let me know if there is a way forward at some point. Alternative solutions, like parameterized layouts or fixed sizes, are just workarounds that don't really tackle the issue exposed here.

The way Mojo handles the example add_const function is not great at the moment. We’re working on improving ctx.enqueue_function so that it produces a type error rather than failing silently at runtime. It is unlikely we will make any changes in the near term that allow add_const to be used like that with ctx.enqueue_function.

That was my reasoning for closing the associated GitHub ticket. I’m happy to re-open if you feel this is a bug on our side.

Also, I just saw that the related issue ticket on GitHub was closed, but this is not fixed at all, right? I don’t think this workaround, which has drastically different implications, merits closing the issue.

I did not realize the use of UNKNOWN_VALUE was a must-have for your example. It is definitely possible to keep that structure and have the kernel use an unknown layout. The best way to do that is to encode that constraint on the kernel registration itself, rather than only on the function passed to ctx.enqueue_function.

from compiler_internal import StaticTensorSpec

@compiler.register("add_const")
struct AddConst:
    @staticmethod
    fn execute[
        target: StaticString,
    ](
        # Outputs
        result: OutputTensor[
            static_spec=StaticTensorSpec[DType.float32, 1].create_unknown(),
        ],
        # Inputs
        x: InputTensor[
            static_spec=StaticTensorSpec[DType.float32, 1].create_unknown(),
        ],
        # Context
        ctx: DeviceContextPtr,
    ) raises:
        # Rest of the code stays the same

Fair warning, you may need to update your modular dependency. The code above was tripping a different bug in the compiler that has since been fixed.

This could be made easier if we had some convenience functions to ‘erase’ the static layout of LayoutTensor or ManagedTensorSlice, but we don’t have those at the moment. Erasing the layout on the InputTensor and OutputTensor arguments will ensure that no code gets specialized on layouts, rather than just the GPU kernel.


Thank you for the reply. That did make the code run, but it compiles a separate kernel on every size change. It is very obvious when running this code: it compiles for size 10, does the two runs, and then compiles again, very slowly, for each new size.

I had to run `rm -rf ~/.modular` between tries to ensure it was not caching kernels.

example.py:

from pathlib import Path
import torch
from max.torch import CustomOpLibrary

mojo_kernels = Path(__file__).parent / "kernels"
op_library = CustomOpLibrary(mojo_kernels)
add_const_kernel = op_library.add_const

def add_const_1d(x: torch.Tensor) -> torch.Tensor:
    result = torch.zeros_like(x, dtype=x.dtype, device=x.device)
    add_const_kernel(result, x)
    return result

if __name__ == "__main__":
    x = torch.randn(10).cuda()
    print(add_const_1d(x))

    x = torch.randn(10).cuda()
    print(add_const_1d(x))

    x = torch.randn(2).cuda()
    print(add_const_1d(x))

    x = torch.randn(3).cuda()
    print(add_const_1d(x))

    x = torch.randn(4).cuda()
    print(add_const_1d(x))

    x = torch.randn(5).cuda()
    print(add_const_1d(x))

    x = torch.randn(6).cuda()
    print(add_const_1d(x))

kernels/kernel.mojo:

import compiler
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor, UNKNOWN_VALUE
from runtime.asyncrt import DeviceContextPtr
from math import ceildiv
from gpu.host import DeviceBuffer
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer
from compiler_internal import StaticTensorSpec

alias BLOCK_SIZE = 32
alias Dyn1DLayout = Layout.row_major(UNKNOWN_VALUE)
alias dtype = DType.float32

@compiler.register("add_const")
struct AddConst:
    @staticmethod
    fn execute[
        target: StaticString,
    ](
        # Outputs
        result: OutputTensor[static_spec=StaticTensorSpec[DType.float32, 1].create_unknown()],
        # Inputs
        x: InputTensor[static_spec=StaticTensorSpec[DType.float32, 1].create_unknown()],
        # Context
        ctx: DeviceContextPtr,
    ) raises:
        x_tensor = x.to_layout_tensor()
        result_tensor = result.to_layout_tensor()

        @parameter
        if target == "cpu":
            raise Error("Rasterize3DGS CPU target not implemented yet.")
        elif target == "gpu":
            # Get GPU context
            var gpu_ctx = ctx.get_device_context()

            # Define grid and block dimensions for the kernel launch
            var grid = (ceildiv(x.dim_size(0), BLOCK_SIZE))
            var block = (BLOCK_SIZE)

            gpu_ctx.enqueue_memset(
                DeviceBuffer[result.dtype](
                    gpu_ctx,
                    rebind[UnsafePointer[Scalar[result.dtype]]](result_tensor.ptr),
                    x.dim_size(0),
                    owning=False,
                ),
                0,
            )

            gpu_ctx.enqueue_function[add_const_kernel](
                x_tensor,
                result_tensor,
                x.dim_size(0),
                grid_dim=grid,
                block_dim=block,
            )
        else:
            raise Error("Unsupported target:", target)

fn add_const_kernel(
    x: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
    result: LayoutTensor[dtype, Dyn1DLayout, MutableAnyOrigin],
    size: Int,
):
    i = block_idx.x * block_dim.x + thread_idx.x
    if i < size:
        result[i] = x[i] + 10

When running (with modular==25.5.0.dev2025071605), it is very noticeable that it recompiles for every new size. Am I missing something?

You’re not missing anything at the moment. The size specialization + caching is how the system is currently intended to work. The behavior is a consequence of how we build a MAX graph based on the runtime shape of each input tensor.

We’ve discussed ways to give users control over this behavior, but this is the first concrete use case we’ve had for such a feature.

It would be nice if we could pick up on kernel arguments like OutputTensor[static_spec=StaticTensorSpec[DType.float32, 1].create_unknown()] and realize there is no point in doing size specialization, but we don’t currently have the tooling to inspect kernels like that at the Python level.

In the short-term, we could expose a config option to disable size specialization, but there is no immediate workaround for your use case.
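In the meantime, one mitigation pattern borrowed from other frameworks (my own suggestion, not a supported MAX feature) is to bucket input sizes: pad every 1-D input up to, say, the next power of two, so only a logarithmic number of distinct shapes ever reach the compiler. A sketch of the bucketing math:

```python
# Hypothetical shape-bucketing helper (not part of MAX): round the 1-D
# input length up to the next power of two so the number of distinct
# compiled shapes grows logarithmically with the maximum size. The
# kernel must tolerate trailing padding, and the caller slices the
# result back to the original length.

def bucket_size(n: int) -> int:
    # Smallest power of two >= n, for n >= 1.
    return 1 << (n - 1).bit_length()

# The sizes from the example script collapse into four buckets:
sizes = [10, 10, 2, 3, 4, 5, 6]
buckets = sorted({bucket_size(n) for n in sizes})
# buckets == [2, 4, 8, 16]
```

In the PyTorch wrapper, a bucketed add_const_1d would pad x out to bucket_size(x.shape[0]) before calling the kernel, then return result[: x.shape[0]].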

You can check out the more general `graph_op`, which allows specifying `input_types`, specifically for this case!


Sounds good! A config to disable size specialization would do the trick, if you think it is an acceptable addition from your side in the short term.