I tried this in a previous hackathon and couldn’t get it to compile. Now I have the kernel ready. Next I'll try integrating it with PyTorch + YOLOv10 and see what improvements I can achieve.
This is largely experimental. The GPU speedups only appear once there are 1,000 or more boxes to process. In practice this may not matter much, since you mostly see fewer than 100 detections per image.
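One way to check where that crossover sits is to time the same operation at different box counts. Below is a minimal sketch of such a sweep using only the Python standard library. The post does not say what the kernel computes, so `pairwise_iou_cpu` is a hypothetical stand-in workload (an O(n²) pairwise IoU), and `random_boxes`, `time_fn`, and `sweep` are illustrative helper names, not part of the actual project.

```python
import random
import time


def random_boxes(n, size=640.0, seed=0):
    """Generate n random boxes as (x1, y1, x2, y2) tuples inside a size x size image."""
    rng = random.Random(seed)
    boxes = []
    for _ in range(n):
        x1 = rng.uniform(0, size - 2)
        y1 = rng.uniform(0, size - 2)
        w = rng.uniform(1, size - x1)
        h = rng.uniform(1, size - y1)
        boxes.append((x1, y1, x1 + w, y1 + h))
    return boxes


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pairwise_iou_cpu(boxes):
    """Hypothetical stand-in workload (assumption): O(n^2) pairwise IoU on the CPU."""
    return [[iou(a, b) for b in boxes] for a in boxes]


def time_fn(fn, boxes, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(boxes)
        best = min(best, time.perf_counter() - start)
    return best


def sweep(fn, counts=(10, 100, 300)):
    """Time fn at each box count and return {count: seconds}."""
    return {n: time_fn(fn, random_boxes(n)) for n in counts}


if __name__ == "__main__":
    for n, seconds in sweep(pairwise_iou_cpu).items():
        print(f"{n:>5} boxes: {seconds * 1e3:.2f} ms")
```

To compare against a GPU kernel, the same `sweep` could be run with a wrapper around the kernel call. If that wrapper goes through PyTorch, it should call `torch.cuda.synchronize()` before the timer stops, because CUDA launches are asynchronous and would otherwise look faster than they are.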