Launch bounds for CUDA kernels

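Hi, is there a way to set launch bounds when compiling CUDA kernels? When debugging is enabled, the compiler assigns many more registers per thread, so kernels launched with large block dimensions (1024 threads per block) easily run out of registers. Setting launch bounds should force the compiler to cap per-thread register usage and spill the excess to local memory, which would solve this issue.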
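
For reference, this is what I mean by launch bounds in plain CUDA C++. It is only a minimal sketch: the kernel `scale` and the helper `run_scale_1024` are hypothetical names made up for illustration, and I am assuming the kernels end up being compiled by nvcc or NVRTC, where the `__launch_bounds__` qualifier is available.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only.
// __launch_bounds__(1024) promises the compiler that this kernel is never
// launched with more than 1024 threads per block, so it has to keep
// per-thread register usage low enough for such a block to fit, spilling
// to local memory if needed.
__global__ void __launch_bounds__(1024) scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Hypothetical helper: launches the kernel with 1024 threads per block and
// returns the first CUDA error encountered (cudaSuccess if none).
cudaError_t run_scale_1024()
{
    const int n = 4096;
    float *d = nullptr;

    cudaError_t err = cudaMalloc(&d, n * sizeof(float));
    if (err != cudaSuccess) return err;

    err = cudaMemset(d, 0, n * sizeof(float));
    if (err == cudaSuccess) {
        scale<<<n / 1024, 1024>>>(d, 2.0f, n);
        // A launch that needs more registers than are available is
        // reported here ("too many resources requested for launch").
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }

    cudaFree(d);
    return err;
}
```

Calling `run_scale_1024()` from `main` and checking for `cudaSuccess` is enough to see whether a 1024-thread launch goes through, including in a debug build (`nvcc -G`). With nvcc, the `-maxrregcount=N` flag is a coarser alternative that caps registers for every kernel in the file rather than per kernel.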