[Hackathon] Creating a YOLOv10 architecture using MAX graphs

1 Like

cool!

2 Likes

Thanks @alix! This looks fantastic. One thing that would help us out is an end-to-end example of loading weights and running basic inference on a few test cases. max serve doesn’t support image classification workflows, so the best path forward here would be a direct Python script.

It’s really exciting to see the work on vision classification models coming up. Thanks for sharing this work!

2 Likes

Great start! From the structure of the tests and documentation, it looks like you’ve been pretty successful in using an agentic coding tool to help generate the architecture here. How did you find that experience? How hard was it to get the agent to follow our design and documentation?

One suggestion: right now the tests are the bare minimum that Claude will generate (can it load a Python module?), and Claude will happily mark that as passing and functional even though the model doesn’t work. I’ve found that to get Claude and other agents to really dig into the meat of the model, you need to give them specific test targets and goals. For example: “produce a script that will take these images as input and return correct bounding boxes and classifications that match this example model’s output” or “run this model using max generate --custom-architecture XXX and verify it produces correct text”. Making the agent actually run the model forces it to find and correct runtime issues, and to verify that it builds out weight loading and everything else needed to get the model to function.
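
To make that concrete, a goal-driven test might look something like the sketch below. Everything here is illustrative: the run_inference entry point, the fixture paths, and the expected-output JSON are placeholders for whatever you end up building; the point is that the test asserts on real detection output, not on whether a module imports.

    # Illustrative goal-driven test. run_inference and the fixture paths are
    # hypothetical; the assertion is the part that matters: every expected box
    # must be matched by a prediction of the same class with IoU > 0.5.
    import json

    def iou(a, b):
        # Intersection-over-union of two [x1, y1, x2, y2] boxes.
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    def test_detections_match_reference():
        # Forces the agent to build real weight loading and inference:
        # run_inference must return [{"box": [...], "class": "..."}, ...].
        from run_inference import run_inference
        preds = run_inference("tests/fixtures/invoice_001.png")
        with open("tests/fixtures/invoice_001_expected.json") as f:
            expected = json.load(f)
        for exp in expected:
            assert any(
                p["class"] == exp["class"] and iou(p["box"], exp["box"]) > 0.5
                for p in preds
            ), f"no prediction matched expected {exp}"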

1 Like

Hello Everyone,

Thanks so much for the kind words — I really appreciate it!

I completely agree about the value of an end-to-end example. I’ve been working toward that, but ran into a few challenges along the way that I wanted to share — along with some broader context.

Background & Motivation

This project is something I’ve been trying to crack for a while now — building a reliable pipeline that can detect and classify layout elements (like table rows, fields, or line items) from real-world document images, especially invoices. I’ve explored multiple approaches over time, including using LLMs, object detection models, and layout-aware vision models — so this submission is the latest iteration of that journey.

Exploring Phi-3 Vision

Alongside working with MAX Graph, I’ve also been experimenting with Microsoft’s Phi-3 Vision model (context here) to extract layout elements from documents. While Phi-3 Vision is powerful for layout detection, it’s not optimized for YOLO-style object detection — so I’ve been considering a hybrid approach:

  • Use Phi-3 Vision for coarse document structure (e.g., header, footer, line-item block), and
  • Use MAX Graph + YOLOv10 for fine-grained bounding boxes and class predictions.

My goal is to bring these together in a Python script that loads models, processes real invoices, and returns predictions both as JSON output and visual annotations — suitable for real document automation workflows.
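
As a rough sketch of how I picture the two stages composing, here is a hypothetical skeleton. The two detect_* helpers are stubs standing in for the Phi-3 Vision and MAX Graph/YOLOv10 calls that are still being wired up:

    # Hypothetical skeleton of the hybrid pipeline; the detect_* helpers are
    # stubs for the real Phi-3 Vision and MAX Graph/YOLOv10 stages.
    import json

    def detect_regions(image_path):
        # Stub: Phi-3 Vision would propose coarse structural regions here.
        return [{"label": "line_item_block", "box": [0, 300, 1200, 900]}]

    def detect_elements(image_path, region):
        # Stub: the YOLOv10 MAX graph would run on the cropped region here.
        return [{"class": "table_row", "box": [10, 320, 1190, 360], "score": 0.91}]

    def process_invoice(image_path):
        results = []
        for region in detect_regions(image_path):            # coarse pass
            elements = detect_elements(image_path, region)   # fine pass
            results.append({"region": region["label"], "elements": elements})
        return json.dumps(results, indent=2)

    if __name__ == "__main__":
        print(process_invoice("invoice.png"))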

Experience Using Cursor

I used Cursor throughout the development process, especially for navigating the modular codebase, prototyping MAX Graph modules, and generating/refining initial test scripts. It was incredibly helpful for context-aware code suggestions and for juggling multiple design iterations quickly. That said, I still had to iterate manually quite a bit to get MAX Graph modules wired up correctly, especially when it came to shape mismatches and layer definitions.

Challenges I Encountered

Input Preprocessing & Output Postprocessing
Replicating YOLOv10’s preprocessing pipeline required careful alignment with the original implementation — including image resizing, channel reordering, and normalization. Postprocessing is still in progress and involves decoding raw tensor outputs into bounding boxes, mapping class labels, restoring coordinates to the original image size, and applying Non-Maximum Suppression (NMS). These steps are critical to ensure the outputs are usable in real-world scenarios.
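
For what it’s worth, the preprocessing I’m replicating looks roughly like the sketch below: a standard YOLO-style letterbox resize, channel reordering, and normalization in NumPy/Pillow. The input size and padding value here are the common defaults, not necessarily what the original implementation uses.

    # Sketch of YOLO-style preprocessing: letterbox to a square input,
    # HWC uint8 -> CHW float32 in [0, 1], with the scale/offsets kept so
    # boxes can later be mapped back to the original image.
    import numpy as np
    from PIL import Image

    def preprocess(path, size=640):
        img = Image.open(path).convert("RGB")
        w, h = img.size
        scale = size / max(w, h)
        new_w, new_h = int(round(w * scale)), int(round(h * scale))
        img = img.resize((new_w, new_h), Image.BILINEAR)

        # Letterbox: pad the short side with gray (114) to keep aspect ratio.
        canvas = np.full((size, size, 3), 114, dtype=np.uint8)
        top, left = (size - new_h) // 2, (size - new_w) // 2
        canvas[top:top + new_h, left:left + new_w] = np.asarray(img)

        x = np.transpose(canvas.astype(np.float32) / 255.0, (2, 0, 1))[None, ...]
        return x, (scale, left, top)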

I’m now working on:

  • A run_inference.py script that loads weights, processes a test image, and outputs both visual and structured predictions (rough skeleton after this list).
  • Wrapping both Phi-3 and YOLO logic into a hybrid pipeline.
  • Writing goal-driven tests to validate weight loading and output correctness, not just module compilation.
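
The skeleton I’m working from for run_inference.py is below. The max.engine calls reflect my current understanding of the MAX Python API and may need adjusting, the model path and input name are placeholders, and decode_predictions is exactly the postprocessing described above that is still in progress.

    # Planned skeleton for run_inference.py. The max.engine usage is my best
    # guess at the API and may need adjusting; paths and input names are
    # placeholders. preprocess is the letterbox sketch shown earlier.
    import json
    from max import engine  # assumed import path for the MAX inference API

    def decode_predictions(outputs, meta):
        # TODO: decode raw tensors into boxes/classes, undo the letterbox
        # using meta = (scale, left, top), and apply NMS if needed.
        raise NotImplementedError

    def run_inference(image_path):
        session = engine.InferenceSession()
        model = session.load("yolov10.maxgraph")   # placeholder artifact path
        x, meta = preprocess(image_path)           # letterbox sketch above
        outputs = model.execute(images=x)          # input name is a guess
        return decode_predictions(outputs, meta)

    if __name__ == "__main__":
        preds = run_inference("tests/fixtures/invoice_001.png")
        print(json.dumps(preds, indent=2))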

Thanks again for the thoughtful feedback and support — it’s been a very rewarding experience digging into MAX Graph, and I’m excited to keep refining this!

Best,

1 Like

Thank you, Chris. I am going to take a long screenshot of this comment, frame it, and look at it every morning!

Thank you for the suggestion, Chris. I will be working on it for sure.