Hi, is there a way to set launch bounds when compiling CUDA kernels? When debugging is enabled, CUDA easily runs out of registers at large block dimensions (1024 threads). Setting launch bounds should force the compiler to cap per-thread register usage and spill the excess to local memory, which would solve this issue.
Quinn suggested looking at this file, which showcases how to set the block-size launch bound:
from sys import (
    has_amd_gpu_accelerator,
    simdwidthof,
)
from tensor_internal import (
InputTensor,
ManagedTensorSlice,
OutputTensor,
)
from utils import StaticTuple
from utils.index import Index
# The number of threads per block to use for the optimized kernels.
# Used only in llvm_metadata for MAX_THREADS_PER_BLOCK_METADATA.
# Not the most performant choice for all kernels, so it is used
# sparingly on NVIDIA accelerators.
alias OPTIMIZED_NUM_THREADS = 256 if has_amd_gpu_accelerator() else 1024
# The block size to use for the optimized kernels.
alias OPTIMIZED_BLOCK_SIZE = 16 if has_amd_gpu_accelerator() else 32
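For context, here is a hedged sketch of how the `OPTIMIZED_NUM_THREADS` alias is typically consumed: the kernel definition is annotated with `@__llvm_metadata`, which lowers the value to the LLVM-level launch-bound attribute (the same effect as CUDA's `__launch_bounds__`). The kernel name and body below are placeholders, and the import path for `MAX_THREADS_PER_BLOCK_METADATA` is my assumption — check the file Quinn pointed to for the exact symbol and signature.

```mojo
from gpu.host.info import MAX_THREADS_PER_BLOCK_METADATA  # assumed import path
from utils import StaticTuple


# Hypothetical kernel: the metadata caps threads-per-block at compile time,
# so the compiler budgets registers for at most OPTIMIZED_NUM_THREADS threads.
@__llvm_metadata(
    MAX_THREADS_PER_BLOCK_METADATA=StaticTuple[Int32, 1](OPTIMIZED_NUM_THREADS)
)
fn my_kernel():
    pass
```

With the bound in place, launching at the full 1024-thread block size should no longer exhaust the register file, since the compiler spills to local memory instead of assuming a smaller block.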
# ===-----------------------------------------------------------------------===#
# Naive matrix multiplication (CPU)
# ===-----------------------------------------------------------------------===#