Device-wide scheduling
Is device-wide scheduling just any combination of grid dimensions that doesn't fit the previous descriptions?
Is that how scheduling works?
If not, how can I launch, e.g., several block clusters in tandem?
I’m trying to use this in my FFT implementation to schedule batches of independent workloads, with layers of fallbacks: shared memory where possible, and device-wide scheduling (using global memory) as a last resort.
Sorry about that, I did ping the team earlier, but the 25.7 release plus the U.S. Thanksgiving holiday were distractions. I’ll try again; thanks for the reminder.
Apologies for dropping the ball on responding here; I received some comments right before a company get-together and totally lost track.
Yongmei, our kernel team manager, had this to say:
From an NVIDIA GPU scheduling perspective, a user’s program is automatically scheduled onto SMs by the GPU’s hardware schedulers through multiple layers (CPU → grid → cluster → SM).
Workload partitioning and shared memory are managed by the user’s software. With the NVIDIA CUDA driver, kernel launches occur in order when issued on the same stream (or the default stream), and in parallel when launched on multiple streams. Once the work is launched, scheduling is handled by the GPU hardware. Unless the user introduces a synchronization barrier, work execution is out of order. The granularity of work completion is a warp.
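To illustrate the stream semantics described above, here's a minimal CUDA sketch. The kernel name, sizes, and stream setup are placeholders I chose for illustration; only the ordering rules (serialized within a stream, potentially concurrent across streams) come from the answer above.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: each launch stands in for one independent batch.
__global__ void batchKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Same (default) stream: these two launches execute in issue order.
    batchKernel<<<(n + 255) / 256, 256>>>(a, n);
    batchKernel<<<(n + 255) / 256, 256>>>(a, n);

    // Separate streams: these launches may run concurrently,
    // subject to SM availability; the hardware scheduler decides.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    batchKernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    batchKernel<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```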
Abdul adds:
Block scheduling is not something Mojo controls, and NVIDIA does not document how the scheduling works. It’s believed that blocks are scheduled in Z-order. Sometimes the ordering chosen by the hardware scheduler is not what you want; in those advanced cases you can define your own schedule (see https://www.modular.com/blog/matrix-multiplication-on-blackwell-part-4—breaking-sota ), but this is a very advanced technique.
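For context, the usual way to "define your own schedule" is to launch a flat 1-D grid and remap the linear block index inside the kernel yourself. Below is a hedged CUDA sketch of one such remapping (a simple grouped tile swizzle); all names, the `GROUP` value, and the swizzle policy are illustrative assumptions, not what the linked post uses.

```cuda
#include <cuda_runtime.h>

// Remap a linear block id into (row, col) tile coordinates so that
// blocks walk GROUP tile rows at a time before advancing columns,
// instead of relying on the hardware's default ordering.
// GROUP is an arbitrary illustrative value.
__device__ void swizzle(int linear, int tilesM, int tilesN,
                        int* row, int* col) {
    const int GROUP = 8;
    int group     = linear / (GROUP * tilesN);
    int inGroup   = linear % (GROUP * tilesN);
    // The last group may hold fewer than GROUP rows.
    int rowsInGrp = min(GROUP, tilesM - group * GROUP);
    *row = group * GROUP + inGroup % rowsInGrp;
    *col = inGroup / rowsInGrp;
}

__global__ void tiledKernel(float* out, int tilesM, int tilesN) {
    int row, col;
    swizzle(blockIdx.x, tilesM, tilesN, &row, &col);
    // ... each block now processes tile (row, col) in the chosen order ...
    if (threadIdx.x == 0) out[row * tilesN + col] = 1.0f;
}

// Host side: launch a flat 1-D grid covering all tiles, e.g.
//   tiledKernel<<<tilesM * tilesN, 128>>>(out, tilesM, tilesN);
```

The point of such a swizzle is usually cache locality: neighboring blocks in launch order end up working on nearby tiles, which matters in matmul-style kernels.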