[Hackathon] YOLOv8 Performance Benchmark: PyTorch vs. Modular MAX

Mojonized · June 29, 2025, 10:11pm

TL;DR:

Used MAX to port a minimal implementation of YOLOv8 nano extracted from the Ultralytics library.
Accuracy is lower than the PyTorch model. I suspect the interpolation mode: MAX only offers BICUBIC, while the PyTorch model could be using bilinear or nearest.
On the inference benchmark, MAX shows at least a 3x speedup over PyTorch.
On the accuracy benchmark, the MAX scores were consistently low, so I didn't include them.

Note: AI was used heavily in this write-up; if you find any inaccuracy, please feel free to correct it.

Appreciation: shoutout to kapa.ai for answers.

Github URL: june-hackathon

Benchmark Results

Detailed Differences Between the PyTorch and MAX Implementations

High-Level Answer: Is the Change Significant or “Meh”?

The change is highly significant. It's not just a minor syntax update; it represents a fundamental shift in the execution paradigm:

PyTorch: Uses an eager execution model. Networks are defined and run dynamically, operation by operation, which is ideal for flexibility and research.
Mojo/Modular MAX: Uses a graph-based, ahead-of-time (AOT) compilation model. The computation is described as a static graph using Python syntax, and the MAX Engine then compiles that graph into an optimized binary targeting specific hardware.

Think of it like this:

PyTorch (Eager): An interpreter reading and executing your code line-by-line.
Modular MAX (Graph): A compiler that analyzes and optimizes your program into a fast, standalone application.
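To make the analogy concrete, here is a toy sketch in plain Python. It is not the MAX API: `Node` and `compile_graph` are hypothetical names invented for this illustration. It only shows the difference between running each operation immediately and first describing the computation, then turning it into a callable in a separate step.

```python
# Toy illustration only: NOT the MAX API, just the two execution styles.

def eager_pipeline(x):
    """Eager style: each operation runs as soon as its line executes."""
    y = x * 2.0  # computed now
    z = y + 1.0  # computed now
    return z


class Node:
    """A node in a tiny expression graph; building it computes nothing."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def __mul__(self, k):
        return Node("mul", (self, Node("const", value=k)))

    def __add__(self, k):
        return Node("add", (self, Node("const", value=k)))


def compile_graph(root):
    """Walk the graph once and return a plain callable (the 'compile' step)."""

    def build(node):
        if node.op == "input":
            return lambda x: x
        if node.op == "const":
            v = node.value
            return lambda x: v
        left, right = (build(n) for n in node.inputs)
        if node.op == "mul":
            return lambda x: left(x) * right(x)
        if node.op == "add":
            return lambda x: left(x) + right(x)
        raise ValueError(f"unknown op: {node.op}")

    return build(root)


graph = Node("input") * 2.0 + 1.0  # describes the computation; nothing has run
run = compile_graph(graph)          # one-time compile step
eager_result = eager_pipeline(3.0)
graph_result = run(3.0)
```

Both paths give the same number. The graph version pays a one-time cost in `compile_graph`, which is where a real compiler gets the chance to optimize the whole computation at once.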

Detailed Breakdown of Key Changes

1. The Core Engine: `torch` vs. `max`

PyTorch:

Uses torch and torch.nn.
Executes ops immediately via the PyTorch runtime.

import torch.nn as nn

# This object IS the runnable layer
conv_layer = nn.Conv2d(3, 64, 3)
output = conv_layer(input_tensor)  # Execution happens here

Modular MAX:

Uses max.graph.ops to build a graph (no immediate execution).

from max.graph import ops

# This object DESCRIBES a convolution
conv_out = ops.conv2d(x, self.weight, ...)  # Adds a node to the graph
# No computation has happened yet.

2. Model Definition: `nn.Module` vs. Graph-Building Classes

PyTorch:

Uses nn.Module subclasses (Conv, C2f, SPPF) containing layers and nn.Parameters.
The forward method defines dynamic logic.

class Conv(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), ...)
        self.bn = nn.BatchNorm2d(c2)
        # ...

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

Modular MAX:

Defines custom classes like MaxConv, MaxSPPF that describe the computation.

class MaxConv:
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True, name_prefix=""):
        self.weight = Weight(name=f"{name_prefix}.conv.weight", ...)
        self.bias = Weight(name=f"{name_prefix}.conv.bias", ...)
        # ...

    def __call__(self, x):
        conv_out = ops.conv2d(x, self.weight, ...)
        biased_out = conv_out + self.bias.to(x.device).reshape(...)
        return self.act(biased_out)

3. Weight Handling: Direct Loading vs. Fusion

PyTorch:

Loads weights with load_state_dict; BatchNorm layers stay separate from the conv layers.

model.load_state_dict(state_dict, strict=True)

Modular MAX:

Explicitly fuses BatchNorm into Conv weights (for inference optimization).

# Fusing Conv + BN
bn_weight_key = f"{bn_prefix}.bn.weight"
if bn_weight_key in state_dict:
    # ... math to fuse bn_w, bn_b, bn_rm, bn_rv into conv_w ...
    fused_w = conv_w * scale.view(-1, 1, 1, 1)
    fused_b = bn_b - bn_rm * scale
    fused_weights_temp[target_key] = fused_w.numpy()
    fused_weights_temp[target_bias_key] = fused_b.numpy()
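The elided fusion math can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the code from the repo: `fuse_conv_bn` and `conv1x1` are hypothetical helpers, the optional `conv_b` argument covers convs that do have a bias, and `eps=1e-5` is only the `nn.BatchNorm2d` default (a model may override it, and the fused weights are only equivalent if the same value is used). The check uses a 1x1 convolution to keep the reference short; the fusion itself is per output channel and does not depend on kernel size.

```python
import numpy as np


def fuse_conv_bn(conv_w, bn_w, bn_b, bn_rm, bn_rv, eps=1e-5, conv_b=None):
    """Fold inference-mode BatchNorm into the preceding conv.

    conv_w has shape (out_ch, in_ch, kH, kW); the BN arrays have shape (out_ch,).
    """
    scale = bn_w / np.sqrt(bn_rv + eps)
    fused_w = conv_w * scale.reshape(-1, 1, 1, 1)
    if conv_b is None:
        conv_b = np.zeros_like(bn_rm)  # conv without bias
    fused_b = bn_b + (conv_b - bn_rm) * scale
    return fused_w, fused_b


def conv1x1(x, w, b=None):
    """Reference 1x1 convolution on NCHW input, used only for the check."""
    y = np.einsum("oc,nchw->nohw", w[:, :, 0, 0], x)
    return y if b is None else y + b.reshape(1, -1, 1, 1)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 4))
conv_w = rng.standard_normal((5, 3, 1, 1))
bn_w, bn_b = rng.standard_normal(5), rng.standard_normal(5)
bn_rm, bn_rv = rng.standard_normal(5), rng.uniform(0.5, 2.0, 5)


def per_channel(v):
    """Reshape a (C,) vector so it broadcasts over NCHW."""
    return v.reshape(1, -1, 1, 1)


# Unfused path: conv, then BatchNorm using the running statistics.
y = conv1x1(x, conv_w)
ref = (y - per_channel(bn_rm)) / np.sqrt(per_channel(bn_rv) + 1e-5)
ref = ref * per_channel(bn_w) + per_channel(bn_b)

# Fused path: a single conv with adjusted weight and bias.
fused_w, fused_b = fuse_conv_bn(conv_w, bn_w, bn_b, bn_rm, bn_rv)
out = conv1x1(x, fused_w, fused_b)
max_abs_diff = float(np.abs(ref - out).max())
```

With no conv bias, `fused_b` reduces to `bn_b - bn_rm * scale`, which matches the snippet above. Running a per-layer check like this is one way to rule fusion in or out when hunting for an accuracy gap.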

4. Inference Pipeline: Implicit vs. Explicit Compile

PyTorch:

Simple pipeline: eval() + inference.

model = load_yolo_model_from_pt(...)
model.eval()
with torch.no_grad():
    predictions = model(image_tensor)

Modular MAX:

Requires an explicit compile step before execution.

max_model = load_yolo_model_from_pt(...)  # Loads weights into placeholders
session = engine.InferenceSession()       # Creates a runtime session
max_model.compile(session)                # <-- THE CRITICAL COMPILE STEP
processed_output = max_model(max_tensor)  # Executes compiled graph
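Because compilation is a separate one-time step, it matters for benchmarking whether it is inside the timed region. A minimal timing harness, assuming nothing about either framework, might look like the sketch below; `benchmark` and `fake_model` are hypothetical names, and `fake_model` is only a stand-in workload.

```python
import statistics
import time


def benchmark(fn, *args, warmup=3, iters=20):
    """Return the median wall-clock seconds per call, excluding warm-up runs."""
    for _ in range(warmup):
        fn(*args)  # warm-up: caches, lazy initialization, etc.
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def fake_model(n):
    """Stand-in for a model call; replace with the real inference function."""
    return sum(i * i for i in range(n))


median_s = benchmark(fake_model, 10_000)
```

The same harness can wrap both the PyTorch call and the compiled MAX call, with the compile step kept outside `benchmark`. On a GPU, the timed function also needs to wait for the device to finish, otherwise only the launch time is measured.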

5. Data Handling: `torch.Tensor` vs. `max.driver.Tensor` and `NumPy`

PyTorch:

Entire pipeline uses torch.Tensor.

Modular MAX:

Uses max.driver.Tensor internally.
Requires conversion from NumPy → max Tensor → NumPy (for post-processing).
Post-processing (like dfl_numpy, dist2bbox_numpy) is written in pure NumPy.
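For readers who have not seen the YOLOv8 head, the two post-processing helpers named above can be sketched in NumPy like this. The function names come from the post, but the bodies are my assumption of what they do, modeled on the Ultralytics DFL layer and `dist2bbox` function; `reg_max=16` and the `(batch, 4 * reg_max, anchors)` input shape are also assumptions.

```python
import numpy as np


def dfl_numpy(x, reg_max=16):
    """Distribution Focal Loss decode.

    x: raw logits of shape (batch, 4 * reg_max, anchors).
    Returns expected distances of shape (batch, 4, anchors).
    """
    b, _, a = x.shape
    x = x.reshape(b, 4, reg_max, a)
    x = x - x.max(axis=2, keepdims=True)  # stabilize the softmax
    p = np.exp(x)
    p /= p.sum(axis=2, keepdims=True)
    bins = np.arange(reg_max, dtype=x.dtype).reshape(1, 1, reg_max, 1)
    return (p * bins).sum(axis=2)  # expectation over the bins


def dist2bbox_numpy(distance, anchor_points, xywh=True):
    """Turn (left, top, right, bottom) distances into boxes.

    distance: (batch, 4, anchors); anchor_points: (2, anchors) as x row, y row.
    """
    lt, rb = distance[:, :2], distance[:, 2:]
    x1y1 = anchor_points - lt
    x2y2 = anchor_points + rb
    if xywh:
        return np.concatenate([(x1y1 + x2y2) / 2, x2y2 - x1y1], axis=1)
    return np.concatenate([x1y1, x2y2], axis=1)


# Uniform logits give an expected distance of 7.5 on every side.
logits = np.zeros((1, 64, 2))
anchors = np.array([[10.0, 20.0], [10.0, 20.0]])  # anchors at (10, 10) and (20, 20)
dist = dfl_numpy(logits)
boxes_xywh = dist2bbox_numpy(dist, anchors)
boxes_xyxy = dist2bbox_numpy(dist, anchors, xywh=False)
```

In the Ultralytics head the decoded boxes are then multiplied by the per-level stride; that step is left out here.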

Conclusion

The Mojo/Modular MAX script is a re-architecture, not a rewrite.

| Feature | PyTorch Script (Eager Execution) | Mojo/Modular Script (Graph Compilation) | Significance |
| --- | --- | --- | --- |
| Paradigm | Dynamic, flexible, interpreter-like | Static, optimized, compiler-like | Massive |
| Core Lib | `torch.nn` | `max.engine`, `max.graph` | Massive |
| Model Code | Defines runnable `nn.Module`s | Defines graph-describing classes | Significant |
| Weights | Loads weights directly, BN is separate | BN is fused manually into Conv layers | Significant |
| Inference | `model(tensor)` | `max_model.compile(session)`, then `max_model(tensor)` | Significant |
| Post-Proc | Done with Torch tensors | Done with NumPy arrays | Minor |

Final Thoughts

This Mojo/Modular MAX implementation is an example of moving a model from a flexible research framework (PyTorch) to a production-oriented, high-performance system using AOT compilation. The changes aren't just about performance; they represent a full-stack shift from dynamic to optimized execution.

–

Outputs:

PyTorch

Max

BradLarson · June 29, 2025, 10:26pm

How are you measuring whether the MAX model is generating results that match the PyTorch version?

The weight-handling logic is interesting. How does it translate the PyTorch weights into a format that MAX can use?

Mojonized · June 29, 2025, 10:59pm

For accuracy, I am just comparing the outputs of both models, using:

if np.allclose(pytorch_np_transposed, max_np, rtol=1e-3, atol=1e-4):

where rtol is the relative tolerance and atol is the absolute tolerance.
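A single boolean hides how far apart the two outputs are. The sketch below, with a hypothetical helper `compare_outputs`, reproduces the test that `np.allclose(a, b)` applies element-wise, `|a - b| <= atol + rtol * |b|`, and also reports the largest difference and the share of elements outside the tolerance. The sample arrays are made up for illustration and are not results from the models.

```python
import numpy as np


def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-4):
    """Summarize how far `candidate` is from `reference`."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    abs_diff = np.abs(candidate - reference)
    # Same per-element bound that np.allclose(candidate, reference) uses.
    tolerance = atol + rtol * np.abs(reference)
    return {
        "allclose": bool(np.allclose(candidate, reference, rtol=rtol, atol=atol)),
        "max_abs_diff": float(abs_diff.max()),
        "fraction_outside": float((abs_diff > tolerance).mean()),
    }


# Made-up values: the last element decides whether the check passes.
reference = np.array([1.0, 100.0, 0.0])
close = compare_outputs(reference, reference + np.array([5e-4, 5e-2, 5e-5]))
far = compare_outputs(reference, reference + np.array([5e-4, 5e-2, 5e-3]))
```

Note that values near zero are governed almost entirely by `atol`, so a small absolute error on a near-zero output can fail the check even when large values pass comfortably.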

Output of that comparison:

For translation, we read the PyTorch weights and fuse each BatchNorm into its conv, so no separate BatchNorm weights are kept. This happens in the function:

def load_and_fuse_state_dict(self, state_dict):

Its output is a dictionary whose string keys match the MAX graph's Weight placeholders.

graph TD
    A[PyTorch .pt file] --> B{torch.load};
    B --> C[PyTorch state_dict\n(dict of torch.Tensors, NCHW layout)];
    C --> D{load_and_fuse_state_dict method};

    subgraph D [Conversion Logic]
        E{For each weight in state_dict} --> F{Has BatchNorm?};
        F -- Yes --> G[Fuse Conv+BN Math\n(in PyTorch)];
        F -- No --> H[Use Original Conv Weight];
        G --> I{Convert Tensor to NumPy & Transpose Layout};
        H --> I;
        I --> J[Store in new dictionary];
    end

    D --> K[Final 'fused_weights' dictionary\n(dict of np.ndarrays, NHWC layout)];
    K --> L{max_model.compile()};
    L --> M[Compiled, Optimized MAX Engine Executable];
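The "Transpose Layout" step in the diagram can be sketched with `np.transpose`. The activation permutation (NCHW to NHWC) follows directly from the layouts named in the diagram. The filter permutation is an assumption on my part: PyTorch stores conv filters as (out_ch, in_ch, kH, kW), and I am assuming a channels-last target of (kH, kW, in_ch, out_ch), which should be checked against the MAX `ops.conv2d` documentation.

```python
import numpy as np

# Activations: PyTorch uses NCHW; the converted graph uses NHWC.
x_nchw = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))

# Conv filters: PyTorch stores (out_ch, in_ch, kH, kW).
# ASSUMPTION: the target layout is (kH, kW, in_ch, out_ch).
w_oihw = np.arange(8 * 3 * 3 * 3, dtype=np.float32).reshape(8, 3, 3, 3)
w_hwio = np.ascontiguousarray(np.transpose(w_oihw, (2, 3, 1, 0)))

# Going back (for example, before comparing against PyTorch output).
x_back = np.transpose(x_nhwc, (0, 3, 1, 2))
```

`np.transpose` returns a view, so `np.ascontiguousarray` is used before handing the weights to another runtime. The same permutation has to be applied consistently when comparing outputs, which is what the `pytorch_np_transposed` name in the check above suggests.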
