Are warps, blocks, block clusters, and device-wide workloads automatically scheduled depending on the grid dimensions?

Apologies for dropping the ball on responding here; I got some comments right before a company get-together and totally lost track.

Yongmei, our kernel team manager, had this to say:

From an NVIDIA GPU scheduling perspective, a user's program is automatically scheduled onto SMs by the GPU's hardware schedulers through multiple layers (CPU → grid → cluster → SM).

Workload partitioning and shared memory are managed by the user's software. With the NVIDIA CUDA driver, kernel launches execute in order when issued on the same stream (or the default stream), and potentially in parallel when launched on multiple streams. Once the work is launched, scheduling is handled by the GPU hardware. Unless the user introduces a synchronization barrier, work executes out of order. The granularity of work completion is a warp.
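To make the stream-ordering rules above concrete, here is a minimal CUDA sketch (not from the thread; the kernel and buffer names are made up for illustration):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scales each element of a buffer.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    dim3 block(256), grid((n + 255) / 256);

    // Same stream: these two launches run in order, so the second
    // kernel is guaranteed to see the results of the first.
    scale<<<grid, block, 0, s1>>>(a, 2.0f, n);
    scale<<<grid, block, 0, s1>>>(a, 3.0f, n);

    // Different stream: the hardware is free to overlap this launch
    // with the work still queued on s1.
    scale<<<grid, block, 0, s2>>>(b, 4.0f, n);

    // Host-side synchronization barrier: waits for all streams to drain.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```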

Abdul adds:

Block scheduling is not something that Mojo controls, and NVIDIA does not document how the scheduling works. It's believed that blocks are scheduled in Z-order. Sometimes the ordering chosen by the hardware scheduler is not what you want; in those advanced cases you can define your own schedule (see https://www.modular.com/blog/matrix-multiplication-on-blackwell-part-4—breaking-sota ), but this is a very advanced technique.
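The linked post describes Modular's approach in detail; as a generic illustration of the idea of a software-defined schedule (not Modular's actual implementation), one common pattern is a persistent kernel that pulls work-item indices from a global atomic counter, so the program, rather than the hardware block scheduler, decides the order in which tiles are processed. All names here are hypothetical:

```cuda
#include <cuda_runtime.h>

// Hypothetical tile worker: processes one tile of a larger problem.
__device__ void process_tile(int tile) { /* ... per-tile work ... */ }

// Persistent kernel: launch only as many blocks as fit on the device at
// once. Each block loops, claiming the next tile from a global counter
// instead of deriving it from blockIdx, so the traversal order (here:
// first-come-first-served, but any mapping from counter value to tile
// coordinates works) is chosen in software.
__global__ void persistent_schedule(int *next_tile, int num_tiles) {
    __shared__ int tile;
    while (true) {
        // One thread per block claims a tile; broadcast via shared memory.
        if (threadIdx.x == 0) tile = atomicAdd(next_tile, 1);
        __syncthreads();
        if (tile >= num_tiles) return;  // no work left
        process_tile(tile);
        __syncthreads();  // protect shared `tile` before the next claim
    }
}
```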
