Device-wide scheduling
Is device-wide scheduling just any combination of grid dimensions that doesn't fit the previous descriptions?
Is that how scheduling works?
If not, how can I launch, e.g., several block clusters in tandem?
I’m trying to use this in my FFT implementation to schedule batches of independent workloads, with layers of fallbacks: shared memory where possible, and device-wide scheduling (using global memory) as a last resort.
Sorry about that, I did ping the team earlier, but the 25.7 release plus the U.S. Thanksgiving holiday were distractions. I’ll try again; thanks for the reminder.
Apologies for dropping the ball on responding here; I received some comments right before a company get-together and totally lost track.
Yongmei, our kernel team manager, had this to say:
From an NVIDIA GPU scheduling perspective, a user’s program is automatically scheduled onto SMs by the GPU’s hardware schedulers through multiple layers (CPU → grid → cluster → SM).
Workload partitioning and shared memory are managed by the user’s software. With the NVIDIA CUDA driver, kernel launches occur in order when issued on the same stream (or the default stream), and in parallel when launched on multiple streams. Once the work is launched, scheduling is handled by the GPU hardware. Unless the user introduces a synchronization barrier, work execution is out of order. The granularity of work completion is a warp.
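To illustrate the stream semantics described above, here's a minimal CUDA sketch. The kernel name, sizes, and stream setup are placeholders I chose for illustration; only the ordering rules (serialized within a stream, potentially concurrent across streams) come from the answer above.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: each launch stands in for one independent batch.
__global__ void batchKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Same (default) stream: these two launches execute in issue order.
    batchKernel<<<(n + 255) / 256, 256>>>(a, n);
    batchKernel<<<(n + 255) / 256, 256>>>(a, n);

    // Separate streams: these launches may run concurrently,
    // subject to SM availability; the hardware scheduler decides.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    batchKernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    batchKernel<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```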
Abdul adds:
Block scheduling is not something Mojo controls, and NVIDIA does not document how the scheduling works. It’s believed that blocks are scheduled in Z-order. Sometimes the ordering chosen by the hardware scheduler is not what you want; in those advanced cases you can define your own schedule (see https://www.modular.com/blog/matrix-multiplication-on-blackwell-part-4—breaking-sota ), but this is a very advanced technique.
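For context, the usual way to "define your own schedule" is to launch a flat 1-D grid and remap the linear block index inside the kernel yourself. Below is a hedged CUDA sketch of one such remapping (a simple grouped tile swizzle); all names, the `GROUP` value, and the swizzle policy are illustrative assumptions, not what the linked post uses.

```cuda
#include <cuda_runtime.h>

// Remap a linear block id into (row, col) tile coordinates so that
// blocks walk GROUP tile rows at a time before advancing columns,
// instead of relying on the hardware's default ordering.
// GROUP is an arbitrary illustrative value.
__device__ void swizzle(int linear, int tilesM, int tilesN,
                        int* row, int* col) {
    const int GROUP = 8;
    int group     = linear / (GROUP * tilesN);
    int inGroup   = linear % (GROUP * tilesN);
    // The last group may hold fewer than GROUP rows.
    int rowsInGrp = min(GROUP, tilesM - group * GROUP);
    *row = group * GROUP + inGroup % rowsInGrp;
    *col = inGroup / rowsInGrp;
}

__global__ void tiledKernel(float* out, int tilesM, int tilesN) {
    int row, col;
    swizzle(blockIdx.x, tilesM, tilesN, &row, &col);
    // ... each block now processes tile (row, col) in the chosen order ...
    if (threadIdx.x == 0) out[row * tilesN + col] = 1.0f;
}

// Host side: launch a flat 1-D grid covering all tiles, e.g.
//   tiledKernel<<<tilesM * tilesN, 128>>>(out, tilesM, tilesN);
```

The point of such a swizzle is usually cache locality: neighboring blocks in launch order end up working on nearby tiles, which matters in matmul-style kernels.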