Hi, is there a way to set launch bounds when compiling CUDA kernels? When debugging is enabled, CUDA easily runs out of registers at large block dimensions (1024 threads). Setting launch bounds should force the compiler to cap per-thread register usage and spill the excess to local memory, which would solve this issue.
Quinn suggested looking at this file, which showcases how to set the block-size launch bound:
from sys import (
    has_amd_gpu_accelerator,
    simdwidthof,
)
from tensor_internal import (
InputTensor,
ManagedTensorSlice,
OutputTensor,
)
from utils import StaticTuple
from utils.index import Index
# The number of threads per block to use for the optimized kernels.
# Used only in llvm_metadata for MAX_THREADS_PER_BLOCK_METADATA.
# Not the most performant choice for all kernels, so it is used
# sparingly on NVIDIA accelerators.
alias OPTIMIZED_NUM_THREADS = 256 if has_amd_gpu_accelerator() else 1024
# The block size to use for the optimized kernels.
alias OPTIMIZED_BLOCK_SIZE = 16 if has_amd_gpu_accelerator() else 32
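For context, here is a hedged sketch of how the `OPTIMIZED_NUM_THREADS` alias is typically consumed: the kernel definition is annotated with `@__llvm_metadata`, which lowers the value to the LLVM-level launch-bound attribute (the same effect as CUDA's `__launch_bounds__`). The kernel name and body below are placeholders, and the import path for `MAX_THREADS_PER_BLOCK_METADATA` is my assumption — check the file Quinn pointed to for the exact symbol and signature.

```mojo
from gpu.host.info import MAX_THREADS_PER_BLOCK_METADATA  # assumed import path
from utils import StaticTuple


# Hypothetical kernel: the metadata caps threads-per-block at compile time,
# so the compiler budgets registers for at most OPTIMIZED_NUM_THREADS threads.
@__llvm_metadata(
    MAX_THREADS_PER_BLOCK_METADATA=StaticTuple[Int32, 1](OPTIMIZED_NUM_THREADS)
)
fn my_kernel():
    pass
```

With the bound in place, launching at the full 1024-thread block size should no longer exhaust the register file, since the compiler spills to local memory instead of assuming a smaller block.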
# ===-----------------------------------------------------------------------===#
# Naive matrix multiplication (CPU)
# ===-----------------------------------------------------------------------===#