I agree that bounds checking by default is good, and that removing negative indexing in pursuit of that is sensible.
> On GPU it’s actively harmful: GPU architectures execute in lockstep warps/wavefronts, and divergent branches cause both paths to execute serially. An extra branch on every element access in a tight kernel is a measurable performance regression.
>
> More critically, on GPU we disable bounds checking for this exact reason: branching overhead is unacceptable on hot paths.
Here, I disagree. If you get divergence in a bounds check on a GPU, it means one thread is running off the end of a buffer and your code is incorrect: full stop, do not pass go, do not collect 200 tokens. At that point, you’re going to be tearing down the kernel and shipping whatever debug information you can to the user, so performance is no longer a concern.
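To make that concrete, here is a minimal Python sketch (hypothetical, not Mojo or real GPU code) of a warp executing a bounds-checked load. In a correct program every lane takes the same fast path, so the branch is uniform and cannot diverge; the divergent path exists only to tear the kernel down with useful debug information.

```python
# Hypothetical sketch: a "warp" of lanes performing a bounds-checked load.
# In a correct program every lane passes the check, so the branch is
# uniform and costs only a compare; the divergent path is reached only
# by incorrect code, at which point the kernel aborts anyway.

def warp_load(buf, indices):
    """Load buf[i] for each lane; trap if any lane is out of bounds."""
    out = []
    for lane, i in enumerate(indices):
        if not (0 <= i < len(buf)):
            # Slow, divergent path: only reachable from incorrect code.
            raise IndexError(
                f"lane {lane}: index {i} out of range for length {len(buf)}"
            )
        out.append(buf[i])
    return out

buf = [10, 20, 30, 40]
assert warp_load(buf, [0, 1, 2, 3]) == [10, 20, 30, 40]  # uniform fast path
```

Running `warp_load(buf, [1, 2, 3, 4])` trips the check on the last lane, which is exactly the "one thread off the end of the buffer" case above.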
Additionally, I think that part of democratizing GPU programming means bringing some of the “creature comforts” of CPUs over to GPUs, potentially at a performance cost, to make GPUs easier to program. There is a decent variety of mistakes a new user could easily make when first learning GPU programming that would lead the first or last threads of a kernel to run off the end of a buffer, and I think the somewhat unfamiliar programming model may actually make that more likely for new users. Worse, on a consumer GPU there’s a decent chance that the memory on the other side of the buffer is mapped, given the lack of VRAM and Mojo seeming to generally prefer not to use the virtual addressing APIs, so an out-of-bounds access may silently corrupt data rather than fault. That same memory pressure also means that classic mitigations, such as no-read, no-write guard pages, will be difficult to implement.
This leaves the question of how to implement the fast path, and I see two options. The first is the exact same thing as is done on CPU: `.unsafe_get(...)`. This makes the language more consistent, since each piece of hardware doesn’t need to be classified as “more CPU-like” or “more GPU-like” for bounds safety based on the likely subjective determination of whoever brings up the architecture. Here are a few examples of hardware that heavily blurs that line, to show why I think such a classification is subjective at best:
- The Bolt Graphics Zeus “GPU”
- This is a GPU that uses an out-of-order RISC-V core with up to 6- to 10-wide dispatch, a very substantial amount of raytracing acceleration, and a 2048-bit VLEN (512-bit DLEN). It also natively runs Linux. What I have been shown about the chip leads me to believe that the claims of beating a 5090 in raytracing are credible. So, does Mojo leave bounds checking on, since it’s RISC-V? Is it disabled, since the chip has a large (presented) vector width? Or is it back on, since it’s out of order?
- The AiNekko ET-SoC-1 (formerly from Esperanto)
- 1088 cores of in-order RISC-V, each with a 256-bit vector unit and a 512-bit tensor unit, plus GPU-like scratchpad memory and caches requiring GPU-like explicit synchronization. There’s work underway to make it act as an “llvmpipe offload,” turning it into an open-hardware GPU, since the RTL is also in the process of being open sourced.
- Intel’s Larrabee in the original “it has display out” config (Later turned into Xeon Phi)
- It’s a huge pile of Atom cores, now called E-cores. The lineage out to Xeon Phi definitely falls under the “accelerator” category, but it’s also just a big pile of x86 cores that can run Linux. These cores were in-order with 4-way SMT and AVX-512, so they were getting somewhat GPU-like.
- I can produce more examples if that is insufficient, but I think my point is made.
All of the listed examples are places where I’d want to be able to reuse GPU SIMT kernels, especially given the highly composable direction of newer kernels. This means I’d want the same unchecked access that GPUs want, so that indexing operations map more closely to scatter/gather. I also think that diverging here by architecture would harshly limit the ability to grab libraries designed for CPU and throw them at GPUs for GPGPU work, especially in cases where the iGPU would otherwise go unused.
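The first option can be sketched roughly as follows, in Python for illustration (the class and method names are hypothetical, modeled on the CPU-side `.unsafe_get(...)` mentioned above): checked access by default on every target, with one uniform, explicit opt-out for hot paths.

```python
# Hypothetical sketch of option one: bounds-checked access by default on
# every architecture, with `unsafe_get` as the explicit, uniform opt-out
# for hot paths where the index is already known to be in bounds.

class Buffer:
    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def get(self, i):
        # Default path: checked everywhere, CPU or GPU alike.
        if not (0 <= i < len(self._data)):
            raise IndexError(i)
        return self._data[i]

    def unsafe_get(self, i):
        # Fast path: the caller has proven i is in bounds (e.g. by loop
        # structure), so the access can map directly to scatter/gather.
        return self._data[i]

b = Buffer([1, 2, 3])
assert b.get(1) == 2
assert b.unsafe_get(2) == 3
```

The point is that the safe/unsafe split lives in the API, not in a per-architecture policy, so the same code means the same thing on every target.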
The other option is to lean hard into dependent types and do this the way Ada does, which provides safety and runtime performance at compile-time cost. This would mean that `comptime_range[0, 10]()` would produce the type `Integer[min=0, max=10]`, which can be shown at compile time to be a valid index into a `TileTensor[DType.bfloat16, Layout.row_major(32), ...]` without any need for bounds checking. I see this as the ideal way to handle this problem for performance-sensitive code, since it provides both performance and correctness benefits and unlocks a lot of other neat tricks.
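A rough runtime model of that Ada-style idea, again in Python with hypothetical names (`RangedInt`, `Tensor`): the bound is checked once, when the index value is constructed, and in Mojo the same proof would be discharged at compile time rather than at runtime as here.

```python
# Hypothetical sketch of option two: an integer that carries its bounds,
# validated once at construction. Any index whose upper bound is below
# the tensor length is then safe with no per-access check. Ada (and a
# dependently typed Mojo) would discharge these checks at compile time.

class RangedInt:
    def __init__(self, value, lo, hi):
        if not (lo <= value <= hi):
            raise ValueError(f"{value} not in [{lo}, {hi}]")
        self.value, self.lo, self.hi = value, lo, hi

class Tensor:
    def __init__(self, data):
        self._data = list(data)

    def get(self, i):
        # Accepts only indices whose static range fits the tensor; this
        # range comparison stands in for a compile-time proof, and the
        # element access itself needs no bounds check.
        if not (0 <= i.lo and i.hi < len(self._data)):
            raise TypeError("index range does not fit tensor")
        return self._data[i.value]

t = Tensor(range(32))
i = RangedInt(7, 0, 10)  # plays the role of Integer[min=0, max=10]
assert t.get(i) == 7
```

One element access per index, zero checks on the hot path, and an out-of-range index is a type error at the boundary rather than a trap inside the kernel.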