[Hackathon] NMS (non-maximum suppression) kernel in Mojo + PyTorch integration with YOLOv10

I tried this in a previous hackathon and couldn’t get it to compile. Now I have the kernel ready. I will try integrating it with PyTorch + YOLOv10 and see what improvements I can achieve.

This is largely experimental. The GPU speedups only show up at around 1000 boxes and above per image. In practice this may not matter much, as you mostly see < 100 detections per image.

The big wins for this may be simply the ability to keep the inputs and outputs on GPU, rather than incurring device->host copies, so I wouldn’t worry too much about the performance of the kernel itself. Just having it work reliably on GPU in a way that matches reference implementations is valuable.
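For the point about matching reference implementations, here is a minimal sketch of a plain greedy NMS in pure Python that a GPU kernel's output could be checked against. The names `iou` and `nms_reference`, the `(x1, y1, x2, y2)` box layout, and the rule "suppress when IoU is strictly greater than the threshold" are assumptions for illustration, not the kernel's actual interface.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_reference(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first.

    Assumption: a box is suppressed when its IoU with an already kept
    box is strictly greater than iou_threshold.
    """
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep the box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms_reference(boxes, scores, iou_threshold=0.5)
```

One way to use this would be to run the same random boxes and scores through both the kernel and this function and compare the kept indices. Boxes with exactly equal scores may come out in a different order, so comparing the two results as sets is probably the safer check.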

Thanks for your feedback, Brad. The speedup is also only valuable for a large number of objects in an image, which isn’t very practical. How often do you see 4000 objects in an image :). On images with < 100 boxes, the CPU and GPU NMS ops have similar latencies. Nonetheless, I was happy I got the kernel implemented as a total noob.