Hi Modular team!
I missed the chance to ask this live in the internal meeting, but I think the answer would help a lot of people building with MAX and also contributors working on kernels and lowering.
**Context**
In MAX 26.1, the Python API supports a PyTorch-like eager workflow for iteration and debugging, followed by a single `model.compile()` step that produces an ahead-of-time compiled artifact intended to run across heterogeneous accelerators.
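To make sure we're talking about the same workflow, here is the rough shape I have in mind, sketched in plain Python with stand-in classes. None of these names are MAX's actual API (the real surface is exactly what I'm asking about); this only illustrates the "eager for iteration, one compile() for deployment" pattern:

```python
# Illustrative stand-ins only -- not MAX's real API. This sketches the
# "iterate eagerly, then compile once" pattern the questions below are about.

class Linear:
    """A toy eager layer: y = w * x + b, evaluated immediately on call."""

    def __init__(self, w: float, b: float) -> None:
        self.w = w
        self.b = b

    def __call__(self, x: float) -> float:
        # Eager mode: runs right away, easy to step through in a debugger.
        return self.w * x + self.b

    def compile(self):
        # Stand-in for model.compile(): freeze parameters into a closure,
        # playing the role of an ahead-of-time compiled artifact.
        w, b = self.w, self.b
        return lambda x: w * x + b


model = Linear(w=2.0, b=1.0)
eager_out = model(3.0)        # iterate/debug eagerly
compiled = model.compile()    # one-shot AOT step
compiled_out = compiled(3.0)  # run the "artifact"

# The contract question: under what conditions is eager_out == compiled_out
# guaranteed (numerics, determinism, error behavior)?
assert eager_out == compiled_out == 7.0
```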
**Questions (deep dive welcome)**
- Eager-to-compiled contract
  - What is the semantic contract between eager execution and compiled execution?
  - Which behaviors are guaranteed to match (numerics, determinism, error behavior)?
  - How do you handle dynamic shapes, control flow, and stateful ops when moving from eager to compiled?
- Lowering pipeline details
  - What are the concrete stages between Python and the final compiled artifact?
  - How are ops represented internally (graph IR), and where do canonicalization, fusion, scheduling, and codegen happen?
  - Where is the boundary between "frontend graph capture" and "backend lowering"?
- Kernel selection across NVIDIA, AMD, and Apple
  - How does kernel selection work across the different backends?
  - What inputs drive selection (device properties, dtype/layout, shape ranges, cost model, autotuning)?
  - Is there a unified dispatch path, or backend-specific selection logic?
  - How are fallbacks handled if an op has no kernel on a target device?
- Stable extension points for external contributors
  As an active contributor, I'm especially interested in the most stable hooks:
  - How can contributors add custom ops end to end (Python API surface, shape inference, lowering, and runtime dispatch)?
  - Can contributors swap in alternative Mojo kernels (or provide multiple variants) while preserving the same op semantics and good debug ergonomics?
  - Which extension surfaces are intended to be stable vs. internal and subject to change?
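To make the custom-op question concrete, here is the kind of end-to-end registration surface I have in mind, sketched with purely hypothetical names in plain Python: a registry mapping an op name to a shape-inference function plus per-device kernel variants with a generic fallback. Nothing here is MAX's actual API; it only shows the shape of hook I'm asking about:

```python
# Purely hypothetical sketch -- none of these names exist in MAX. It models
# an op registry with shape inference, kernel variants, and fallback dispatch.

OPS: dict[str, dict] = {}

def register_op(name: str, shape_fn) -> None:
    """Register an op together with its shape-inference function."""
    OPS[name] = {"shape_fn": shape_fn, "kernels": {}}

def register_kernel(name: str, device: str, fn) -> None:
    """Attach a device-specific kernel variant to an existing op."""
    OPS[name]["kernels"][device] = fn

def dispatch(name: str, device: str, *args):
    """Pick the kernel for `device`, falling back to the generic variant."""
    kernels = OPS[name]["kernels"]
    fn = kernels.get(device) or kernels["generic"]
    return fn(*args)

# Example: a "scale" op with a generic kernel and a "gpu" variant
# (in reality the variants would be Mojo kernels, not Python lambdas).
register_op("scale", shape_fn=lambda shape: shape)  # elementwise: shape preserved
register_kernel("scale", "generic", lambda x, s: [v * s for v in x])
register_kernel("scale", "gpu", lambda x, s: [v * s for v in x])

# No "apple-gpu" kernel registered, so dispatch falls back to "generic".
out = dispatch("scale", "apple-gpu", [1.0, 2.0], 3.0)
assert out == [3.0, 6.0]
```

The fallback question above is exactly the behavior of the last line: what happens in MAX when no kernel exists for the target device, and how visible is that decision?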
- Debuggability and attribution
  - When something fails (a compile-time error, a runtime error, a perf regression), what is the best way to map it back to the originating Python/module code and the specific kernel that was selected?
  - Are there recommended tools or flags for dumping the graph/IR, selected kernels, and compilation decisions?
If there are docs or code pointers you recommend (specific directories, passes, or runtime components), I would love to follow along and potentially contribute improvements or documentation.
**References**
https://www.modular.com/blog/modular-26-1-a-big-step-towards-more-programmable-and-portable-ai-infrastructure