I’ve been thinking a bit about SIMT and a problem we have in Mojo, namely that we have “GPU functions” and “CPU functions”: GPU-only code silently misbehaves on CPU, with no real indication of why unless you already know it was meant to be GPU-only. This extends into kernels, where the kernel team currently has to do a near-complete rewrite of large parts of many kernels.
Similarly, on CPU I think we’re going to see problems from the lack of autovectorization in more complex scenarios, where many developers might not think to reach for SIMD.
I think one potential solution to this is making SIMT and SIMD more explicit. Intel’s ISPC has managed this with great success, and it does a very good job of generating code for both CPU vector units and GPUs. However, it has this rather annoying limitation that it wants you to specify the lane width and count at a program level. I think Mojo could fix that. Consider the following:
```mojo
def simt_elementwise_add[dtype: DType, size: Int](
    uniform read a: InlineArray[Scalar[dtype], size],
    uniform read b: InlineArray[Scalar[dtype], size],
    uniform mut c: InlineArray[Scalar[dtype], size],
):
    # Enter a 1d SIMT loop with `simd_width_of[dtype]()` lanes.
    simt_for[simd_width_of[dtype]()] i in range(size):
        # On an AVX512 CPU, fp32 would be 16 lanes here.
        # An NVIDIA GPU would be 32 lanes total, but distributed across the
        # native threads (codegen for Scalar).
        # A naive compiler impl uses scatter/gather on AVX512; we can provide
        # some intrinsics for contiguous loads here.
        c[i] = a[i] + b[i]
```
```mojo
def simd_elementwise_add[dtype: DType, size: Int](
    read a: InlineArray[Scalar[dtype], size],
    read b: InlineArray[Scalar[dtype], size],
    mut c: InlineArray[Scalar[dtype], size],
):
    comptime width = simd_width_of[dtype]()
    # Main loop over whole vectors of `width` elements.
    for i in range(align_down(size, width) // width):
        var offset = i * width
        c.store[width=width](offset, a.load[width=width](offset) + b.load[width=width](offset))
    # Scalar tail for the remaining elements.
    for i in range(size % width):
        var index = align_down(size, width) + i
        c[index] = a[index] + b[index]
```
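To make the pattern the SIMD version relies on concrete, here’s the same main-loop-plus-scalar-tail shape emulated in plain Python, with slices standing in for vector loads and stores (the width and helper names are stand-ins, not Mojo API):

```python
def align_down(n, w):
    """Round n down to a multiple of w."""
    return (n // w) * w

def simd_elementwise_add(a, b, width=4):
    """Emulate the SIMD main loop + scalar tail from the Mojo sketch."""
    n = len(a)
    c = [0] * n
    # Main loop: whole vectors of `width` lanes at a time.
    for i in range(align_down(n, width) // width):
        base = i * width
        # Stand-in for a.load[width=width](base) + b.load[width=width](base).
        c[base:base + width] = [x + y for x, y in zip(a[base:base + width],
                                                      b[base:base + width])]
    # Scalar tail for the remaining n % width elements.
    for i in range(n % width):
        idx = align_down(n, width) + i
        c[idx] = a[idx] + b[idx]
    return c
```

This is exactly the boilerplate that every hand-vectorized kernel repeats, and what the SIMT form would let the compiler generate.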
Here’s actual working code for Intel’s ISPC compiler:
```cpp
void simt_elementwise_add(const uniform float a[], const uniform float b[],
                          uniform float c[], uniform size_t n) {
    foreach (i = 0 ... n) {
        c[i] = a[i] + b[i];
    }
}
```
The hard part here is the codegen for SIMT → SIMD, since going the other way is a lot easier. ISPC provides a lot of prior art here.
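The hard part specifically is control flow: a branch taken by some lanes but not others has to become masked execution of both sides. Here’s a toy Python illustration of the lowering a compiler would perform (masks as bool lists; this mirrors ISPC’s general approach, not any existing Mojo codegen):

```python
def simt_branch_as_masks(a):
    """Each element of `a` is one SIMT lane.

    Per-lane source program:
        if a[i] > 0: r = a[i] * 2
        else:        r = -a[i]

    Lowered form: evaluate BOTH sides for all lanes, then blend on a mask.
    """
    mask = [x > 0 for x in a]           # per-lane predicate
    then_vals = [x * 2 for x in a]      # 'then' side, computed for all lanes
    else_vals = [-x for x in a]         # 'else' side, computed for all lanes
    # Select on the mask, like vblendvps on AVX or a predicated move on GPU.
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]
```

Loops with divergent trip counts and `break`/`continue` get the same treatment with a live-lane mask, which is where most of the codegen complexity lives.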
The other interesting thing this can do is help manage “SIMD in SIMT”:
```mojo
def simt_elementwise_add_pairwise[dtype: DType, size: Int](
    uniform read a: InlineArray[Scalar[dtype], size],
    uniform read b: InlineArray[Scalar[dtype], size],
    uniform mut c: InlineArray[Scalar[dtype], size // 2],
):
    simt_for[simd_width_of[dtype]() * 2] i in range(size // 2):
        comptime assert simd_width_of[dtype]() == 2, "Behaves exactly like packed fp8/fp16 on GPUs!"
        c[i] = (a.load[width=2](i * 2) + b.load[width=2](i * 2)).reduce_add()
```
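Emulated in Python, each “lane” of the pairwise version owns a 2-wide packed value, adds it elementwise, and then horizontally reduces it (the widths and names are stand-ins for the sketch above):

```python
def simt_elementwise_add_pairwise(a, b):
    """Each lane loads 2 adjacent elements from a and b, adds them
    elementwise, then horizontally sums the pair into one output."""
    assert len(a) == len(b) and len(a) % 2 == 0
    c = []
    for i in range(len(a) // 2):
        pa = a[2 * i:2 * i + 2]                 # a.load[width=2](i * 2)
        pb = b[2 * i:2 * i + 2]                 # b.load[width=2](i * 2)
        pair = [x + y for x, y in zip(pa, pb)]  # packed elementwise add
        c.append(sum(pair))                     # .reduce_add()
    return c
```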
The ideas of uniform and varying are taken from ISPC: uniform means a value is the same across all lanes, varying means it can differ per lane. `const uniform float a[]` is effectively a way to spell `const float* const a` in C parlance.
The next task is to fix the issue of function coloring. Since we may need parameters to calculate the lane width, this has to go next to `raises` (a position I think `async` may need to occupy eventually as well). Additionally, I think following Apple here and using function parameters makes more sense, since it means there’s no implicit state being passed around (we want to avoid closures again).
```mojo
def simt_elementwise_add[dtype: DType, size: Int](
    uniform read a: InlineArray[Scalar[dtype], size],
    uniform read b: InlineArray[Scalar[dtype], size],
    uniform mut c: InlineArray[Scalar[dtype], size],
    @thread_idx.x  # No idea what this syntax should be
    thread_idx_x: Int,
) simt[simd_width_of[dtype]()]:
    c[thread_idx_x] = a[thread_idx_x] + b[thread_idx_x]
```
In order to launch this, you’d do something like this:
```mojo
def elementwise_add[dtype: DType, size: Int](
    read a: InlineArray[Scalar[dtype], size],
    read b: InlineArray[Scalar[dtype], size],
    mut c: InlineArray[Scalar[dtype], size],
):  # No simt effect, so it's a normal function.
    # Like `vectorize`, but takes a `simt` function and handles
    # thread_idx-like bookkeeping.
    simt_vectorize[simt_elementwise_add[dtype, size]](a, b, c, size)
```
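One way to picture what a `simt_vectorize` driver would do on a CPU target: chunk the iteration space into gangs of `width` lanes and hand each lane its thread index. A Python emulation (the driver’s name and shape here are hypothetical, and a real implementation would execute each gang as one SIMD iteration rather than a scalar inner loop):

```python
def simt_vectorize(kernel, size, width=8):
    """Emulate launching a scalar-per-lane SIMT kernel on a CPU:
    run gangs of `width` lanes, passing each lane its thread_idx.x."""
    for gang_start in range(0, size, width):
        # Partial final gang handles the tail, like a masked last iteration.
        for lane in range(min(width, size - gang_start)):
            kernel(gang_start + lane)  # thread_idx.x for this lane

# Usage: the per-lane body of simt_elementwise_add.
a = list(range(10)); b = list(range(10)); c = [0] * 10

def kernel(thread_idx_x):
    c[thread_idx_x] = a[thread_idx_x] + b[thread_idx_x]

simt_vectorize(kernel, 10)
```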
Here’s an example of passing varying data:
```mojo
def simt_elementwise_add[dtype: DType, width: Int](
    varying read a: Scalar[dtype],
    varying read b: Scalar[dtype],
    uniform read alpha: Scalar[dtype],
    varying out c: Scalar[dtype],
) simt[width]:
    c = a + b + alpha
```
```mojo
def elementwise_add[dtype: DType, size: Int](
    read a: InlineArray[Scalar[dtype], size],
    read b: InlineArray[Scalar[dtype], size],
    read alpha: Scalar[dtype],
    mut c: InlineArray[Scalar[dtype], size],
):  # No simt effect, so it's a normal function.
    def kernel[width: Int](i: Int) unified {mut}:
        c.store[width=width](
            i, simt_elementwise_add[dtype, width](
                a.load[width=width](i), b.load[width=width](i), alpha))
    vectorize[kernel](size)
```
I’d ideally like to fix what I see as a potential major issue before 1.0, since I think this will continue to harm Mojo’s portability, and I’ve noticed in community code that people tend to either go “full SIMD” or fully scalar, with not a ton in between, which harms performance. I think some of that is related to the mental shift needed to use even portable SIMD, much less more direct intrinsics. Given that autovectorization is highly unreliable, I think “inline SIMT” is a good compromise for more portable ways to get good performance. This should also be useful for managing complex packed datatypes on GPUs, given that fp4 on NVIDIA GPUs is 8-wide per lane. The migration for existing GPU code should be simple: add `simt[32]` or `simt[64]` to every GPU-only function as appropriate, with a version that picks based on the target for “gpu-portable” functions.
The minimal fix would be to add the `simt` effect and, for now, disallow everything that changes the lane count. When you’re inside a SIMT lane, you can of course call `simt[1]` functions because that’s the same width as the current context, so existing GPU code would still work. This at least papers over the problem of “silently wrong” code that assumes it’s in a SIMT environment, until it can be addressed properly later. One thing I’m not sure about is how to manage something like calling into a single-lane SIMD kernel from a 32-lane function, so I think there might need to be a special case for pushing a “reset” onto the stack of widths.