MAX 26.1: eager-to-compiled contract, lowering pipeline, kernel selection across GPUs, and extension points for custom ops

Hi Modular team :fire:

I missed the chance to ask this live in the internal meeting, but I think the answer would help a lot of people building with MAX and also contributors working on kernels and lowering.

Context
In MAX 26.1, the Python API supports a PyTorch-like eager workflow for iteration and debugging, and then a single model.compile() step to produce an ahead-of-time compiled artifact intended to run across heterogeneous accelerators.

Questions (deep dive welcome)

  1. Eager to compiled contract
  • What is the semantic contract between eager execution and compiled execution?

  • Which behaviors are guaranteed to match (numerics, determinism, error behavior)?

  • How do you handle dynamic shapes, control flow, and stateful ops when moving from eager to compiled?

  2. Lowering pipeline details
  • What are the concrete stages between Python and the final compiled artifact?

  • How are ops represented internally (graph IR), and where do canonicalization, fusion, scheduling, and codegen happen?

  • Where is the boundary between “frontend graph capture” vs “backend lowering”?

  3. Kernel selection across NVIDIA, AMD, Apple
  • How does kernel selection work across different backends?

  • What inputs drive selection (device properties, dtype/layout, shape ranges, cost model, autotuning)?

  • Is there a unified dispatch path, or backend-specific selection logic?

  • How are fallbacks handled if an op has no kernel on a target device?

  4. Stable extension points for external contributors
    As an active contributor, I’m especially interested in the most stable hooks:
  • How can contributors add custom ops end to end (Python API surface, shape inference, lowering, and runtime dispatch)?

  • Can contributors swap in alternative Mojo kernels (or provide multiple variants) while preserving the same op semantics and good debug ergonomics?

  • Which extension surfaces are intended to be stable vs internal and subject to change?

  5. Debuggability and attribution
  • When something fails (compile-time error, runtime error, perf regression), what is the best way to map it back to the originating Python/module code and the specific selected kernel?

  • Are there recommended tools or flags for dumping the graph/IR, selected kernels, and compilation decisions?

If there are docs or code pointers you recommend (specific directories, passes, or runtime components), I would love to follow along and potentially contribute improvements or documentation.

References

https://www.modular.com/blog/modular-26-1-a-big-step-towards-more-programmable-and-portable-ai-infrastructure

These are great questions! I’ll try to hit the ones I think I know the answer to, but I may need to pull in other Modular folks for the rest.

  1. Eager to compiled contract
    1. What is the semantic contract between eager execution and compiled execution?
    2. Which behaviors are guaranteed to match (numerics, determinism, error behavior)?

There’s a system that we’re experimenting with that may change some of this answer, but the general idea is that the overall computational behavior of eager-style execution vs. compiled should be roughly the same. The eager-style execution uses a lazy graph construction under the hood, so the same Mojo kernels are used. Fusion behavior may differ slightly, but even that can be adjusted by annotating with @F.functional in the new eager API.

We have some high-level documentation in the new MAX developer guide, but are trying to surface more detailed descriptions of the process from our internal docs.

How do you handle dynamic shapes, control flow, and stateful ops when moving from eager to compiled?

We handle dynamic shapes through symbolic dimensions. I need to look up all the conventions around symbolic dimensions and things like F.lazy(), but I do know that you can define a Module that takes in symbolic dimensions and use that with eager-style semantics. For control flow, we have some support for this in graph ops, but you’ll need to specify those operations yourself if you don’t want graph breaks. We should also support mutable tensors for in-place modification of state in operations. I want to say that our open source KV cache kernels would be a good place to see this in action, but I can’t point you to specific code without looking a little longer.

Lowering pipeline details

  • What are the concrete stages between Python and the final compiled artifact?
  • How are ops represented internally (graph IR), and where do canonicalization, fusion, scheduling, and codegen happen?
  • Where is the boundary between “frontend graph capture” vs “backend lowering”?

For some of these, I might refer you to Feras’ recent talk on the graph compiler internals, where he goes over this better than I could. I think it’s worth calling out that we’re much more explicit about the construction of the graph: we’re not really doing graph capture like torch.compile or other graph-tracing approaches. Yes, the eager-style semantics do a little bit of lazy graph creation under the hood, but that’s still more explicit about building a graph than some of these others.

Kernel selection across NVIDIA, AMD, Apple

  • How does kernel selection work across different backends?
  • What inputs drive selection (device properties, dtype/layout, shape ranges, cost model, autotuning)?
  • Is there a unified dispatch path, or backend-specific selection logic?
  • How are fallbacks handled if an op has no kernel on a target device?

If you want to see how a lot of this is handled today, you can read all of our kernel code in Mojo, which is completely available in open source. Specifically, in terms of dispatch, you can see a bunch of that logic in the main file where we register all of our top-level kernels.

We use compile-time metaprogramming in Mojo to do a lot of the hardware specialization. When a kernel is used in a graph, the graph compiler will provide information at compile time about the target hardware to the Mojo compiler. Using code like

@parameter
if target == "cpu":
    # CPU-specific kernel
    ...
else:
    # GPU kernel
    ...
and even further down into @parameter if is_nvidia_gpu() and _is_sm_100x_or_newer(): ..., you can very narrowly target code paths to the intrinsics and capabilities of exact hardware. This specialization can also include compile-time-known shapes that the graph compiler provides to Mojo for instances of a given kernel.

For fallbacks, those often have to be explicitly provided. A number of people have asked for this to be a little more gentle, for example by using a naive matmul by default for an unrecognized piece of hardware rather than failing with a compilation error that matmul is not yet implemented for this device. You can call out to vendor libraries if you’d like, for example by using cuBLAS kernels on NVIDIA systems if a Mojo kernel is missing or isn’t yet tuned for the target architecture.
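The explicit-fallback idea can be pictured with a toy dispatch table (Python, illustrative only; none of these names are MAX internals): look up a target-specific kernel first, and if none is registered, fall back to a portable baseline instead of failing compilation.

```python
# Hypothetical sketch (not MAX internals): kernel dispatch with an explicit
# naive fallback, so an op on unrecognized hardware still runs.
def naive_matmul(a, b):
    # Portable baseline: correct everywhere, fast nowhere.
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

# (op, target) -> kernel; real entries would be tuned per architecture.
KERNELS = {
    ("matmul", "nvidia_sm90"): naive_matmul,  # stand-in for a tuned kernel
}

def select_kernel(op, target):
    # Prefer a target-specific kernel; otherwise use the generic baseline.
    kernel = KERNELS.get((op, target))
    if kernel is None:
        kernel = {"matmul": naive_matmul}[op]
    return kernel

out = select_kernel("matmul", "unknown_dpu")([[1, 2]], [[3], [4]])
print(out)  # [[11]] -- baseline kernel used, no compile-time failure
```

Whether the default should be “fail loudly” or “degrade to a naive kernel” is exactly the trade-off people have been asking about.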

I’ll continue in another post for your remaining questions:


Krish, great breakdown of the ‘eager-to-compiled’ contract. Since you’re looking into the lowering pipeline and stable extension points, how does MAX 26.1 handle hardware-specific side effects during the transition?

Specifically, if I’m bypassing standard async/socket abstractions for direct DMA access on a DPU (treating the hardware almost like a P4-programmable target), does the model.compile() step provide hooks to preserve memory-mapped I/O constraints or custom memory layouts during the backend lowering phase, or is the compiler purely focused on tensor-op semantics?

(part 2)

Stable extension points for external contributors
As an active contributor, I’m especially interested in the most stable hooks:

  • How can contributors add custom ops end to end (Python API surface, shape inference, lowering, and runtime dispatch)?
  • Can contributors swap in alternative Mojo kernels (or provide multiple variants) while preserving the same op semantics and good debug ergonomics?
  • Which extension surfaces are intended to be stable vs internal and subject to change?

For standalone custom ops, we have a number of examples that demonstrate the general pattern used when writing these. The Mojo code in them is the same as is used for registering all of the ops in the standard MAX kernels library, so the distinction between custom ops and built-in ops is largely one of location (and which of the built-in ops are blessed as official operations in the MO dialect).

You can easily modify any of the built-in kernels (they are all in open source) or write your own and get the same semantics and debugging experience as you would with stock Mojo kernels that ship in the max package.

As far as the stability of these interfaces, they are relatively stable but we do sometimes make breaking API changes as we work on the graph compiler. We’ll typically message this and update all examples / documentation to match. One of our goals with Mojo 1.0 is to start stabilizing more of the language and APIs, and we’ll begin marking interfaces as to their relative stability. However, this may take a little bit to propagate to the kernels as well.

Debuggability and attribution

  • When something fails (compile-time error, runtime error, perf regression), what is the best way to map it back to the originating Python/module code and the specific selected kernel?
  • Are there recommended tools or flags for dumping the graph/IR, selected kernels, and compilation decisions?

When working with MAX graphs, you can set the MODULAR_MAX_DEBUG environment variable to True to get stack traces for graph issues. There’s an example of this in our basic graph programming guide. Also within that guide you’ll see how to print a graph, which will output the MLIR representation of the graph before any additional lowering. We have other internal tools for debugging kernel launches and observing other lowering steps in detail, but I’m not sure how many of those are currently exposed for general use.
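For reference, enabling that looks something like the following (the environment variable comes from the guide above; the script name is made up for illustration):

```shell
# Enable richer stack traces for graph construction/compilation issues.
# MODULAR_MAX_DEBUG is from the basic graph programming guide; the
# script name here is illustrative only.
MODULAR_MAX_DEBUG=True python my_model.py
```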

Trojan, interesting question. From what Brad described, the current model.compile() story sounds primarily centered on tensor-op semantics plus kernel-level hardware specialization via Mojo @parameter paths. Preserving memory-mapped I/O constraints or strict custom layouts across lowering likely needs explicit modeling in the IR (custom ops or explicit memory/layout attributes) rather than expecting the compiler to infer or preserve side-effectful DMA constraints automatically.

My guess is: if the effect is “hardware side effects”, it should be expressed as an op with explicit effects/aliasing semantics and possibly a defined memory space, so the compiler knows what it can and cannot reorder or fuse around.

@Brad, would you say the right path here is (1) define a custom op in the MO dialect or adjacent dialect with explicit side-effect semantics, then (2) provide backend lowering or runtime hooks that map to that DPU/DMA mechanism?

Thanks a lot Brad, this is super helpful. A few follow-ups to make this actionable for contributors and to reduce ambiguity around “roughly the same” semantics.

1) Eager vs compiled contract: what is “guaranteed”

You mentioned eager uses lazy graph construction under the hood and ends up using the same Mojo kernels, with fusion possibly differing (and controllable via @F.functional).

  • Is the intended guarantee “same kernel semantics given the same chosen kernel + same inputs”, with differences mainly coming from kernel choice and fusion decisions?

  • Does @F.functional mainly constrain fusion boundaries / aliasing assumptions, or does it also affect things like CSE, buffer reuse, or scheduling?

  • Are there any current known differences between eager and compiled that users should expect (for example different error surfaces, different shape specialization behavior, different numerics when multiple valid kernels exist)?

2) Symbolic dimensions and dynamic shapes: conventions and constraints

You mentioned symbolic dimensions and hinted at conventions around F.lazy().

  • Is there a canonical way to declare symbolic dims in Python today (example snippet preferred), and what is the “shape contract” that the compiler assumes?

    • Example: are symbolic dims treated as runtime values with guards, or do we require bounded ranges, or both?
  • How are specialization and caching keyed for symbolic shapes?

    • For example: does model.compile() generate a single artifact with runtime guards, or does it JIT/AOT multiple variants behind the scenes?
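To make the two alternatives in that question concrete, here is a toy sketch (plain Python, not MAX code) of shape-keyed specialization: one compiled variant is cached per concrete shape, and each variant carries a runtime guard.

```python
# Toy illustration (not MAX internals) of the two strategies asked about:
# (a) runtime shape guards inside an artifact, and (b) caching one
# specialized variant per concrete shape, JIT-style.
compiled_variants = {}

def compile_for_shape(shape):
    # Stand-in for real compilation: specialize a function to one shape.
    def kernel(xs):
        # (a) runtime guard: the specialized kernel checks its assumption.
        assert len(xs) == shape, "shape must match specialization"
        return [x * 2 for x in xs]
    return kernel

def run(xs):
    shape = len(xs)
    if shape not in compiled_variants:      # (b) specialize per shape seen
        compiled_variants[shape] = compile_for_shape(shape)
    return compiled_variants[shape](xs)

print(run([1, 2, 3]))   # [2, 4, 6] -- first call compiles a variant for len 3
```

The question is essentially which of these (or what hybrid) model.compile() uses for symbolic dims.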

3) Control flow and graph breaks

You said control flow is supported via graph ops if we do not want breaks.

  • Is the recommendation to model conditionals/loops using explicit graph ops (if/while) as part of a “graph programming” style?

  • Are there best practices or a guide for avoiding graph breaks in the eager API while keeping code readable?
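As a mental model for the explicit-graph-op style (a toy in plain Python, not the MAX ops API): a cond op takes both branches as values, so neither branch requires a break back to Python, and both stay inside the graph.

```python
# Toy contrast (not MAX code): a Python `if` on a traced value forces a
# graph break, while an explicit cond op keeps both branches in the graph
# and defers the choice to execution time.
def cond(pred, then_fn, else_fn, x):
    # Both branches are present as graph values; only the selection is
    # dynamic. A real graph op would record then_fn/else_fn as subgraphs.
    return then_fn(x) if pred else else_fn(x)

print(cond(True, lambda x: x + 1, lambda x: x - 1, 10))   # 11
print(cond(False, lambda x: x + 1, lambda x: x - 1, 10))  # 9
```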

4) Kernel dispatch and registration: pointers for contributors

This is the part I want to dig into in code.

  • Can you point to the “main file where we register top-level kernels” in the OSS repo?

  • When multiple kernel variants exist for an op (different tiling, dtype/layout, arch), how is the selection resolved?

    • Is it purely compile-time @parameter specialization, or is there also a higher-level selection layer (cost model, heuristics, autotune, per-device policy)?
  • On fallbacks: is the current expected pattern “explicitly provide fallback kernels” (even if naive), and would a “generic baseline fallback” PR be welcomed for common ops on new hardware targets?

5) Debugging: beyond MLIR print and MODULAR_MAX_DEBUG

Good to know about MODULAR_MAX_DEBUG=True and printing the pre-lowering MLIR graph.

  • Are there any currently supported flags to dump:

    • selected kernel names / variants

    • lowering stage snapshots (even if only a couple of key stages)

    • kernel launch traces (op to kernel mapping)

  • If not yet exposed, is there a preferred direction for community contributions here (for example “dump selected kernels” as a first step)?

6) KV cache example pointer

You mentioned the open source KV cache kernels as a good place to see mutable tensors / in-place state.

  • If you can share the directory path or a file name, I will go read it and report back with notes.

Also, if you can share a link to Feras’ graph compiler internals talk, that would help a lot. I will watch it and follow up with more targeted questions.

Perfectly phrased, dude. Yes, you’re right: if model.compile() is used, the graph compiler can fuse ops into one kernel, failing to isolate the hardware side effects.

The “Fusion” Risk:

In standard AI, if you have two back-to-back operations, the compiler “fuses” them into one kernel to save memory bandwidth. However, for PAPDE, if one of those operations is actually a trigger for your DPU’s DMA engine, fusing it might cause the hardware signal to be lost or delayed. The compiler sees “math,” but you see “hardware timing.”

You are implementing custom ops that speak specific languages to the different packet-analysis paths on the DPU.

What needs to be done on custom ops:

Explicit Semantics: You tell the compiler exactly what the operation does. Crucially, you mark it as having “side effects,” which prevents the compiler from deleting or reordering it just because it doesn’t see a direct mathematical output.

Memory Mapping: It allows you to define specific memory layouts that match your DPU’s requirements, ensuring the data is exactly where the hardware expects it during a wire-speed packet capture.

Lowering to MLIR: Mojo uses MLIR (Multi-Level Intermediate Representation). A Custom Op allows you to “lower” your high-level code into a specific Dialect (like a hardware-specific language) that speaks directly to the DPU.

But don’t you think :thinking: that custom ops still rely on fusion? And if they don’t, why not?

Another harsh reality is that fusion can still happen even with custom ops. The graph compiler treats them like standard ops if they’re not well defined, which might trigger a reorder in a standard packet-analysis path on a NIC.

How to prevent the fusion:

When you define a custom op in the Modular ecosystem (Mojo/MLIR), you aren’t just writing a function; you are defining its traits. To stop the compiler from fusing or reordering your DMA/Hardware logic, you must:

Specify Side-Effects: You mark the op with a trait (like MemoryEffects) that tells the compiler this operation interacts with the “outside world” (the DPU).

Define Memory Spaces: By assigning a specific defined memory space, you tell the compiler exactly where the data lives (e.g., a specific DPU buffer). This prevents the compiler from “fusing” it into a general-purpose register or cache.

Barrier Semantics: You can implement the op as a hardware barrier. This forces the compiler to finish all previous operations before this one starts and prevents any following operations from starting until the DPU signal is confirmed.
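The barrier idea above can be sketched with a toy fuser (plain Python, not the MAX graph compiler): merge adjacent pure ops, but never fuse or reorder across an op flagged as side-effectful.

```python
# Toy sketch (not the MAX graph compiler): a fuser that groups adjacent
# pure ops but treats any side-effectful op as a barrier, keeping it
# alone and unreordered.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    has_side_effects: bool = False

def fuse(ops):
    groups, current = [], []
    for op in ops:
        if op.has_side_effects:
            # Flush the in-progress fusion group, then keep the barrier
            # op in its own group so nothing merges across it.
            if current:
                groups.append(current)
            groups.append([op])
            current = []
        else:
            current.append(op)
    if current:
        groups.append(current)
    return groups

pipeline = [Op("add"), Op("mul"),
            Op("dma_trigger", has_side_effects=True),  # hypothetical DPU op
            Op("relu")]
print([[o.name for o in g] for g in fuse(pipeline)])
# [['add', 'mul'], ['dma_trigger'], ['relu']]
```

Without the side-effect flag, the DMA trigger would land in the same fusion group as the math and could be reordered or eliminated, which is exactly the risk being discussed.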

What do you think about this, Krish?

I’ll look for other examples of symbolic dimensions, but here’s a simple graph that uses them. In that case, vector_width is a symbolic dimension whose length can vary at runtime, but it must match between the two input tensors when the graph is used. We can do algebraic checks on these dimensions, and I know there are more complex examples of their use in other open source code. From your description, I take these as “runtime values with guards”.
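As a rough mental model (plain Python, not the actual max.graph API), a symbolic dimension is a named placeholder, and concrete sizes must agree everywhere the name appears:

```python
# Rough mental model (not the max.graph API): a symbolic dim is a named
# placeholder; at call time, every use of the name must bind consistently.
def make_add_graph():
    # Both inputs declare the same symbolic dim name, "vector_width".
    input_specs = [("vector_width",), ("vector_width",)]

    def run(a, b):
        bindings = {}
        for spec, arg in zip(input_specs, (a, b)):
            (dim_name,) = spec
            size = len(arg)
            # Guard: first use binds the name, later uses must match it.
            if bindings.setdefault(dim_name, size) != size:
                raise ValueError(
                    f"dim {dim_name!r}: got {size}, expected {bindings[dim_name]}"
                )
        return [x + y for x, y in zip(a, b)]

    return run

add = make_add_graph()
print(add([1, 2, 3], [4, 5, 6]))  # [5, 7, 9] -- vector_width binds to 3
```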

The file that I was referencing (and maybe forgot to link) is max/kernels/src/Mogg/MOGGKernelAPI/MOGGKernelAPI.mojo. That’s our primary registry for MAX kernels, and they all branch off from there.

For specialization, we largely lean on compile-time parameterization, with shapes and other compile-time values provided from the graph compiler. Unless I’m terribly mistaken or out of date with my information, we’re generally explicit with this specialization and don’t rely on a cost model / autotuning at the graph compiler or kernel level. We do autotuning ahead of time on hardware / shapes to provide optimized kernel parameters.
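The “autotune ahead of time, then bake parameters in” approach can be pictured as a lookup table of pretuned kernel parameters (illustrative only; the architectures, buckets, and tile sizes here are made up):

```python
# Illustrative only: parameters tuned offline per (arch, shape bucket) are
# looked up at compile time and baked into the specialized kernel, rather
# than searched at runtime by a cost model.
TUNED_PARAMS = {
    ("sm90", "large"): {"tile_m": 128, "tile_n": 128},
    ("sm90", "small"): {"tile_m": 32, "tile_n": 32},
}
DEFAULT_PARAMS = {"tile_m": 64, "tile_n": 64}  # untuned baseline

def params_for(arch, m):
    # Bucket the shape, then fetch pretuned parameters if any exist.
    bucket = "large" if m >= 1024 else "small"
    return TUNED_PARAMS.get((arch, bucket), DEFAULT_PARAMS)

print(params_for("sm90", 4096))  # pretuned large-shape tiles
print(params_for("gfx942", 64))  # no entry: falls back to defaults
```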

For general fallbacks, I think that’d be welcomed and there is at least one feature request for this. We just haven’t gotten to implementing the various paths where this would be used.
