Are warps, blocks, block clusters, and device-wide workloads automatically scheduled depending on the grid dimensions?

Apologies for dropping the ball on responding here; I got some comments right before a company get-together and totally lost track.

Yongmei, our kernel team manager, had this to say:

From an NVIDIA GPU scheduling perspective, a user's program is automatically scheduled onto SMs by the GPU's hardware schedulers through multiple layers (CPU → grid → cluster → SM).

Workload partitioning and shared memory are managed by the user's software. With the NVIDIA CUDA driver, kernel launches execute in order when issued on the same stream (or the default stream), and potentially in parallel when launched on multiple streams. Once the work is launched, scheduling is handled by the GPU hardware. Unless the user introduces a synchronization barrier, work executes out of order. The granularity of work completion is a warp.
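To make the stream-ordering rules above concrete, here is a minimal CUDA sketch (not from the thread; the kernel and buffer names are made up for illustration):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scales each element of a buffer.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    dim3 block(256), grid((n + 255) / 256);

    // Same stream: these two launches run in order, so the second
    // kernel is guaranteed to see the results of the first.
    scale<<<grid, block, 0, s1>>>(a, 2.0f, n);
    scale<<<grid, block, 0, s1>>>(a, 3.0f, n);

    // Different stream: the hardware is free to overlap this launch
    // with the work still queued on s1.
    scale<<<grid, block, 0, s2>>>(b, 4.0f, n);

    // Host-side synchronization barrier: waits for all streams to drain.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```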

Abdul adds:

Block scheduling is not something that Mojo controls, and NVIDIA does not document how the scheduling works. It's believed that blocks are scheduled in Z-order. Sometimes the ordering chosen by the hardware scheduler is not what you want; in those advanced cases you can define your own schedule (see https://www.modular.com/blog/matrix-multiplication-on-blackwell-part-4—breaking-sota ), but this is a very advanced technique.
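The linked post describes Modular's approach in detail; as a generic illustration of the idea of a software-defined schedule (not Modular's actual implementation), one common pattern is a persistent kernel that pulls work-item indices from a global atomic counter, so the program, rather than the hardware block scheduler, decides the order in which tiles are processed. All names here are hypothetical:

```cuda
#include <cuda_runtime.h>

// Hypothetical tile worker: processes one tile of a larger problem.
__device__ void process_tile(int tile) { /* ... per-tile work ... */ }

// Persistent kernel: launch only as many blocks as fit on the device at
// once. Each block loops, claiming the next tile from a global counter
// instead of deriving it from blockIdx, so the traversal order (here:
// first-come-first-served, but any mapping from counter value to tile
// coordinates works) is chosen in software.
__global__ void persistent_schedule(int *next_tile, int num_tiles) {
    __shared__ int tile;
    while (true) {
        // One thread per block claims a tile; broadcast via shared memory.
        if (threadIdx.x == 0) tile = atomicAdd(next_tile, 1);
        __syncthreads();
        if (tile >= num_tiles) return;  // no work left
        process_tile(tile);
        __syncthreads();  // protect shared `tile` before the next claim
    }
}
```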
