Question on registering custom ops for MAX on different devices

Hi folks,

I am new to Modular and have been looking at some demos/documentation for the MAX compiler.

Ideally, I want to define an op that can be “overridden” with different implementations on different targets.

In the documented example, target switching is done with an if/else target check inside the node definition:

```mojo
@compiler.register("vector_addition")
struct VectorAddition:
    @staticmethod
    fn execute[
        target: StaticString,
    ](
        output: OutputTensor[rank=1],
        lhs: InputTensor[dtype = output.dtype, rank = output.rank],
        rhs: InputTensor[dtype = output.dtype, rank = output.rank],
        ctx: DeviceContextPtr,
    ) raises:
        @parameter
        if target == "cpu":
            _vector_addition_cpu(output, lhs, rhs, ctx)
        elif target == "gpu":
            _vector_addition_gpu(output, lhs, rhs, ctx)
        else:
            raise Error("unsupported target for vector_addition")
```

While this works in general, there may be a serious extensibility issue:

Assume I wrote the original “general-purpose op”, and my collaborators (target authors) want to provide a more advanced, target-specific implementation. The target authors would then have to go into my node definition and modify the code there.

Ideally, they should be able to register their own version without modifying the initial node definition I wrote. Is there a way to register multiple versions of a custom node, each for a different target, in different structs (and files)?

e.g., something like this

```mojo
@compiler.register("mo.matmul")
struct MatmulReference:
    @staticmethod
    fn execute[target: StaticString](...) raises:
        # Standard nested loops or simple SIMD code
        ...


@compiler.register("mo.matmul")
struct MatmulCUDA:
    @staticmethod
    fn execute[target: StaticString](...) raises:
        # Hand-written PTX or calls to cuBLAS
        ...
```

Thank you very much.

Ye

I’ve edited your post to add code block formatting for ease of discussion. In the future, please use code blocks for code because otherwise Markdown messes up the formatting.

In general, it’s expected that you go through a single custom op that performs the equivalent operation on multiple devices. Otherwise, you force the target dispatch into other people’s code, where they are less likely to have the most up-to-date version of the options.

There are a few ideas floating around for how to do this slightly more generically, but you very quickly run into questions of who should “win” if two people register a custom op for a target-specific matmul. To answer that question, you need to know a lot about where the strengths and weaknesses of both kernels lie, and run logic to determine whether the current input is better suited to either kernel. It might be possible to use autotuning to determine this, but autotuning massively increases compile times, and it’s often better to do that offline.

All of this circles back to “someone needs to write a tree of if statements”, so I’m not sure that we really solve anything by offering this capability.

Thanks for the quick response.

I think what you said makes sense if everyone is from the same team or organization.

However, in reality, this is often not the case from a product perspective.

Say I am from company A, which builds a product based on Modular, and I authored a generic algorithm for an FFT node.

Someone from Qualcomm looks at my product documentation and thinks their built-in algorithm is numerically equivalent and more efficient on a specific processor.
At this point, it doesn’t really make sense for them to read my node implementation and modify it (note: my node definition may contain thousands of lines of code, or the code may be closed-source).
In this case, it makes a lot of sense for the hardware author to override the node without digging into its implementation; they just need to read the documentation I provide.

As for the priority/conflict problem, I think it might be possible to provide a tracing mechanism that shows possible alternative implementations of a node and gives the end user an option to choose the one they want (e.g., via an external config file or something similar).

Thank you very much.

Ye

In this case, it makes a lot of sense for the hardware author to override the node without digging into its implementation; they just need to read the documentation I provide.

Then what happens when someone beats the hardware vendor? Or, even worse, when they only beat the vendor in a particular regime, say very large FFTs or very small FFTs? What if the new kernel runs a lot faster but has numerical issues that make it unsuitable for scientific simulations?

Handling closed-source kernels is still a work in progress, since right now you can nearly recover the source code from a .mojopkg file.

I don’t think there’s a great technical solution to this problem that doesn’t devolve into the library equivalent of z-index: 999999999999, so I think this needs to be solved by people having a discussion and putting things in a unification point. Qualcomm might reach out to you and request that you integrate some open-source code, or some closed code under a licensing agreement, to help your product use their hardware better. Providing mechanisms for someone to override parts of a product without the product vendor having some level of control would also likely lead to “your support agreement is void if you do this” clauses, which gets us back to square one.

As for the priority/conflict problem, I think it might be possible to provide a tracing mechanism that shows possible alternative implementations of a node and gives the end user an option to choose the one they want (e.g., via an external config file or something similar).

This should be trivially possible. You can provide a config file that is just a mapping from operations to op names in the MAX graph, and end users can change them as desired. Combined with an array field for extra kernel libraries to load, this provides end users the customization you want at a high level. My concerns mostly come from trying to address this at a much lower level.
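A minimal sketch of what such a config file and lookup could look like, assuming a JSON mapping from logical operation names to registered op names plus a list of extra kernel libraries. All of the names and file contents here are invented for illustration, not real MAX kernels:

```python
import json

# Hypothetical config: logical operation -> registered custom-op name,
# plus extra kernel packages to load. Everything here is illustrative.
config_text = """
{
  "ops": {"fft": "fft_reference", "matmul": "matmul_reference"},
  "kernel_libraries": ["vendor_kernels.mojopkg"]
}
"""

config = json.loads(config_text)

def resolve_op(logical_name, overrides=None):
    # End users can override individual entries (e.g. via CLI flags or a
    # local config file) without touching the product's source code.
    table = dict(config["ops"])
    table.update(overrides or {})
    return table[logical_name]

print(resolve_op("fft"))                          # fft_reference
print(resolve_op("fft", {"fft": "fft_qualcomm"}))  # fft_qualcomm
```

The resolved name would then be handed to whatever builds the graph, and `config["kernel_libraries"]` would be passed along so the replacement kernels are actually loaded.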

Hi Owen,

Thank you very much for the reply.

In industrial production, hardware-specific kernels are normally tested exhaustively against the original algorithm, so the concerns regarding numerical consistency or performance regressions are generally mitigated.

On the other hand, modifying the source code is often not realistic, especially in the case of closed-source software.

I understand the hesitation to provide a general solution allowing multiple hooks. Another possibility would be for advanced users of the MAX compiler to provide a custom hook mechanism, such as a mini transform on the graph. However, this would require the ability to understand and modify the source code of the MAX compiler.

Are there any plans to open-source the MAX compiler in the near future so that advanced users can add their own transforms/capabilities?

Thank you very much.

Best regards,

Ye

Yours Truly
Ye Yang
Department of MEMS
Duke University

I understand the hesitation to provide a general solution allowing multiple hooks. Another possibility would be for advanced users of the MAX compiler to provide a custom hook mechanism, such as a mini transform on the graph. However, this would require the ability to understand and modify the source code of the MAX compiler.

Not necessarily; consider this example: modular/max/examples/custom_ops/addition.py at main · modular/modular · GitHub

In that example, you can swap “add_one” for “add_one_qualcomm_super_kernel” via something as simple as a Python CLI argument from argparse, so long as it provides a compatible interface, and you can pass a user-defined path to the custom_extensions argument of the Graph constructor, which allows users to provide arbitrary kernels. There are less “messy” ways to define the graph so it looks a bit more like normal code, and I think that swapping the value of a string provided as an argument should be within the capabilities of most programmers.
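A sketch of the argparse side of this, under the assumptions above: the op name “add_one_qualcomm_super_kernel” is a made-up example, and the parsed values would be fed to `ops.custom` and the `custom_extensions` argument of the `Graph` constructor (not shown here, since that part needs the `max` package):

```python
import argparse

def select_op(argv=None):
    # Let the end user pick which registered kernel backs the "add one"
    # step, and where any extra kernel packages live.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--op-name",
        default="add_one",
        help='custom op to place in the graph, e.g. "add_one_qualcomm_super_kernel"',
    )
    parser.add_argument(
        "--custom-extensions",
        action="append",
        default=None,  # avoid the mutable-default pitfall with action="append"
        help="path to an extra kernel library (.mojopkg); may be repeated",
    )
    args = parser.parse_args(argv)
    if args.custom_extensions is None:
        args.custom_extensions = []
    return args

args = select_op(["--op-name", "add_one_qualcomm_super_kernel"])
# args.op_name would go to ops.custom(name=args.op_name, ...), and
# args.custom_extensions to Graph(..., custom_extensions=args.custom_extensions).
print(args.op_name)  # add_one_qualcomm_super_kernel
```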

As far as I am aware, the current plan for open-sourcing MAX is “probably eventually”, with most of the focus being on Mojo. However, given that you don’t really need to modify the compiler for this, I don’t see that as a downside.

Hi Owen,

Thank you for the information.

This workflow sounds much more promising and realistic to me now. It seems I can simply add a wrapper layer around the node to allow for the hook as you described.

I appreciate the help!

Best regards,

Ye
