Hi, is there a way to set launch bounds when compiling CUDA kernels? With debugging enabled, kernels easily run out of registers at large block sizes (1024 threads). Setting launch bounds should make the compiler cap register usage per thread and spill the rest to local memory, which would solve this issue.
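To make the register pressure concrete, here is a rough back-of-the-envelope sketch of the per-thread register budget at a given block size. The two hardware limits are assumptions typical of recent NVIDIA GPUs, not values taken from this post, and the helper names are hypothetical. The sketch also ignores register allocation granularity, so treat the result as an upper bound.

```python
# Assumed hardware limits; query your own device for the real values.
REGISTERS_PER_BLOCK = 65536      # assumption: 32-bit registers available to one block
MAX_REGISTERS_PER_THREAD = 255   # assumption: architectural per-thread cap


def register_budget(threads_per_block: int) -> int:
    """Upper bound on registers per thread that still lets one block launch."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    return min(MAX_REGISTERS_PER_THREAD, REGISTERS_PER_BLOCK // threads_per_block)


def block_fits(threads_per_block: int, registers_per_thread: int) -> bool:
    """True if a kernel using this many registers per thread can launch."""
    return registers_per_thread <= register_budget(threads_per_block)


if __name__ == "__main__":
    for threads in (256, 512, 1024):
        print(threads, "threads per block:", register_budget(threads), "registers per thread")
```

Under these assumptions a 1024-thread block leaves at most 64 registers per thread. A debug build that needs more than that cannot launch at this block size unless the compiler is told the block size up front, so that it can limit register use and spill to local memory instead.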