Modular weekend hack project - NMS (non-maximum suppression) kernel in Mojo + PyTorch integration with YOLOv10

I tried this in a previous hackathon and couldn’t get it to compile. Now I have the kernel ready. Next I will try integrating it with PyTorch + YOLOv10 and see what improvements I can achieve.

This is largely experimental. The GPU speedups show up at 1000 boxes and above. In practice this may not matter much, as you mostly see < 100 detections per image.

The big win here may simply be the ability to keep the inputs and outputs on the GPU, rather than incurring device->host copies, so I wouldn’t worry too much about the performance of the kernel itself. Just having it work reliably on GPU in a way that matches reference implementations is valuable.
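
For checking that a GPU kernel matches a reference, a tiny CPU version of greedy NMS can be handy. The following is only a sketch in plain Python, not the Mojo kernel itself. The names `iou` and `nms_reference`, the (x1, y1, x2, y2) corner box format, and the rule that a box is dropped when its IoU with a kept box is strictly greater than the threshold are all assumptions made for illustration.

```python
from typing import List, Sequence

# Assumed box layout: (x1, y1, x2, y2) corners with x1 <= x2 and y1 <= y2.
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_reference(
    boxes: List[Box], scores: List[float], iou_threshold: float = 0.5
) -> List[int]:
    """Greedy NMS. Returns indices of kept boxes, highest score first."""
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep box i only if it does not overlap any already-kept box
        # by more than the threshold (assumed convention: drop when IoU > t).
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
    scores = [0.9, 0.8, 0.7]
    print(nms_reference(boxes, scores))
```

This runs in O(n^2) time in the worst case, so it is only meant for comparing kept indices on small inputs. For a PyTorch-side check, `torchvision.ops.nms(boxes, scores, iou_threshold)` is another reference to compare against.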