[Hackathon] NMS (non-max suppression) kernel in Mojo + PyTorch integration with YOLOv10

pbanavara · June 29, 2025, 6:14pm

I tried this in a previous hackathon and couldn’t get it to compile. Now I have the kernel ready. I will try integrating it with PyTorch + YOLOv10 and see what improvements I can achieve.

This is largely experimental. The GPU speedups only show up at 1000 boxes and above per image. In practice this may not matter much, as you mostly see < 100 detections per image.

BradLarson · June 29, 2025, 7:19pm

The big wins for this may be simply the ability to keep the inputs and outputs on GPU, rather than incurring device->host copies, so I wouldn’t worry too much about the performance of the kernel itself. Just having it work reliably on GPU in a way that matches reference implementations is valuable.
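For anyone who wants a baseline to check a GPU kernel against, here is a minimal sketch of greedy NMS in pure Python. It is not the kernel from this thread. The names `iou` and `nms_reference` are made up for illustration, and it assumes boxes in `(x1, y1, x2, y2)` format and the convention that a box is dropped when its IoU with an already-kept box is strictly greater than the threshold (the same convention `torchvision.ops.nms` documents).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_reference(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first.

    Hypothetical helper for illustration; O(n^2) and meant only as a
    correctness baseline, not as a fast implementation.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms_reference(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU 81/119 (about 0.68), so it is suppressed at the default threshold of 0.5. Comparing the kept indices from a GPU kernel against a baseline like this on the same inputs is one simple way to check that the results match.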

pbanavara · July 1, 2025, 3:42pm

Thanks for your feedback, Brad. The kernel speedup is also only valuable for a large number of objects in an image, which isn’t very practical. How often do you see 4000 objects in an image :). On images with < 100 boxes, the CPU and GPU NMS ops have similar latencies. Nonetheless, I was happy I got the kernel implemented as a total noob.
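A comparison like this can be reproduced with a small timing harness. The sketch below is an assumption about how one might measure it, not the benchmark used in this thread. `random_boxes` and `median_latency_ms` are hypothetical helpers, and `nms_fn` stands for whatever NMS callable you want to time, taking `(boxes, scores)`.

```python
import random
import time


def random_boxes(n, image_size=640, max_side=64, seed=0):
    """Generate n random (x1, y1, x2, y2) boxes and scores inside a square image."""
    rng = random.Random(seed)
    boxes, scores = [], []
    for _ in range(n):
        x1 = rng.uniform(0, image_size - max_side)
        y1 = rng.uniform(0, image_size - max_side)
        w = rng.uniform(1, max_side)
        h = rng.uniform(1, max_side)
        boxes.append((x1, y1, x1 + w, y1 + h))
        scores.append(rng.random())
    return boxes, scores


def median_latency_ms(nms_fn, n_boxes, repeats=5):
    """Median wall-clock time in milliseconds of nms_fn on n_boxes random boxes."""
    boxes, scores = random_boxes(n_boxes)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        nms_fn(boxes, scores)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return timings[len(timings) // 2]


# Example usage with your own NMS callable (my_nms is a placeholder name):
# for n in (100, 1000, 4000):
#     print(n, median_latency_ms(my_nms, n))
```

When timing a GPU implementation this way, the callable should wait for the device to finish before returning, otherwise the timer only measures the kernel launch.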

system · December 28, 2025, 3:42pm

This topic was automatically closed 180 days after the last reply. New replies are no longer allowed.
