[Informal Proposal] Portable SIMT for Mojo

I’ve been thinking a bit about SIMT and a problem we have in Mojo, namely that we have “GPU functions” and “CPU functions”: code that won’t work correctly on CPU, with no real indication of why unless you already know it’s supposed to be GPU-only. This extends into kernels, where the kernel team currently has to do a near-complete rewrite of large parts of many kernels.

Similarly, on CPU I think we’re going to see problems resulting from a lack of autovectorization in more complex scenarios, where many developers might not think to use SIMD.

I think one potential solution to this is making SIMT and SIMD more explicit. Intel’s ISPC has managed this with great success, and it does a very good job of generating code for both CPU vector units and GPUs. However, it has the rather annoying limitation that it wants you to specify the lane width and count at the program level. I think Mojo could fix that. Consider the following:

def simt_elementwise_add[dtype: DType, size: Int](
    uniform read a: InlineArray[Scalar[dtype], size], 
    uniform read b: InlineArray[Scalar[dtype], size], 
    uniform mut c: InlineArray[Scalar[dtype], size],
):
    simt_for[simd_width_of[dtype]()] i in range(size): # Enter 1D SIMT loop with `simd_width_of[dtype]()` lanes
        # On an AVX512 CPU, fp32 would be 16 lanes here.
        # An NVIDIA GPU would be 32 lanes total, distributed across the native threads (codegen for Scalar).
        c[i] = a[i] + b[i] # A naive compiler impl uses scatter/gather on AVX512; we can provide intrinsics for contiguous loads here.

def simd_elementwise_add[dtype: DType, size: Int](
    read a: InlineArray[Scalar[dtype], size], 
    read b: InlineArray[Scalar[dtype], size], 
    mut c: InlineArray[Scalar[dtype], size],
):
    comptime width = simd_width_of[dtype]()
    for i in range(align_down(size, width) // width):
        var offset = i * width
        c.store(offset, a.load[width=width](offset) + b.load[width=width](offset))
    for i in range(size % width):
        var index = align_down(size, width) + i
        c[index] = a[index] + b[index]

Here’s actual working code for Intel’s ISPC compiler:

void simt_elementwise_add(const uniform float a[], const uniform float b[], uniform float c[], uniform size_t n) {
    foreach (i = 0 ... n) {
        c[i] = a[i] + b[i];
    }
}

The hard part here is the codegen for SIMT → SIMD, since going the other way is a lot easier. ISPC provides a lot of prior art here.
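To make the hard part concrete, here’s a minimal C sketch (my own illustration, not actual ISPC or Mojo compiler output) of what SIMT-to-SIMD lowering has to do once control flow diverges: the compiler picks a lane count for the target, keeps a per-lane execution mask, turns branches into blends, and emits a scalar remainder loop for the tail. The lane count `W = 8` and the function name are assumptions for illustration.

```c
#include <stddef.h>

/* Sketch of lowering a SIMT body with divergent control flow:
 *   SIMT source:  if (a[i] > 0) c[i] = a[i] + b[i]; else c[i] = b[i];
 * Both sides of the branch run on all lanes; the mask selects the result. */
void lowered_masked_add(const float *a, const float *b, float *c, size_t n) {
    enum { W = 8 }; /* lane count, chosen by the compiler for the target */
    size_t i = 0;
    for (; i + W <= n; i += W) {
        for (size_t l = 0; l < W; ++l) {       /* maps to one vector op per line */
            float then_val = a[i + l] + b[i + l];
            float else_val = b[i + l];
            int lane_mask = a[i + l] > 0.0f;   /* per-lane execution mask */
            c[i + l] = lane_mask ? then_val : else_val; /* becomes a vector blend */
        }
    }
    for (; i < n; ++i)                         /* scalar remainder loop */
        c[i] = a[i] > 0.0f ? a[i] + b[i] : b[i];
}
```

The contiguous-load case in the example above is the easy path; it’s divergence and scatter/gather addressing where the codegen gets genuinely hard.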

The other interesting thing this can do is help manage “SIMD in SIMT”:

def simt_elementwise_add_pairwise[dtype: DType, size: Int](
    uniform read a: InlineArray[Scalar[dtype], size], 
    uniform read b: InlineArray[Scalar[dtype], size], 
    uniform mut c: InlineArray[Scalar[dtype], size // 2],
):
    simt_for[simd_width_of[dtype]() * 2] i in range(size // 2):
        comptime assert simd_width_of[dtype]() == 2, "Behaves exactly like packed fp8/fp16 on GPUs!"
        c[i] = (a.load[width=2](i * 2) + b.load[width=2](i * 2)).reduce_sum()

The ideas of uniform and varying are taken from ISPC: uniform means a value is the same across all lanes, while varying means it’s different on different lanes. const uniform float a[] is effectively a way to spell const float *const a in C parlance.
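In lowering terms (again an illustrative C sketch, with a hypothetical 8-lane `varying_f32` type and function name of my own invention): a varying value lowers to one slot per lane, while a uniform value stays a single scalar that is broadcast to every lane.

```c
/* Hypothetical lowering of varying/uniform at W = 8 lanes. */
typedef struct { float lane[8]; } varying_f32; /* one value per SIMT lane */

/* varying a, varying b, uniform alpha -> varying result */
varying_f32 add_alpha(varying_f32 a, varying_f32 b, float alpha /* uniform */) {
    varying_f32 c;
    for (int l = 0; l < 8; ++l)                 /* one vector add + broadcast */
        c.lane[l] = a.lane[l] + b.lane[l] + alpha; /* alpha broadcast to lanes */
    return c;
}
```

This is exactly why uniform values are cheap: they live in one scalar register (or one GPU uniform register) instead of occupying a full vector.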

The next task is to fix the issue of function coloring. Since we may need parameters to calculate the lane width, the simt effect has to go next to raises (a position I think async may need to occupy eventually as well). Additionally, I think following Apple here and using function parameters makes more sense, since it means there’s no implicit state being passed around (we want to avoid closures again).

def simt_elementwise_add[dtype: DType, size: Int](
    uniform read a: InlineArray[Scalar[dtype], size], 
    uniform read b: InlineArray[Scalar[dtype], size], 
    uniform mut c: InlineArray[Scalar[dtype], size],
    @thread_idx.x # No idea what this syntax should be 
    thread_idx_x: Int, 
) simt[simd_width_of[dtype]()]: 
    c[thread_idx_x] = a[thread_idx_x] + b[thread_idx_x]

In order to launch this, you’d do something like this:

def elementwise_add[dtype: DType, size: Int](
    read a: InlineArray[Scalar[dtype], size], 
    read b: InlineArray[Scalar[dtype], size], 
    mut c: InlineArray[Scalar[dtype], size],
): #no simt effect, so it's a normal function
    simt_vectorize[simt_elementwise_add[dtype, size]](a, b, c, size) # like vectorize, but takes a `simt` function and handles thread_idx-like things

Here’s an example of passing varying data:

def simt_elementwise_add[dtype: DType, width: Int](
    varying read a: Scalar[dtype], 
    varying read b: Scalar[dtype], 
    uniform read alpha: Scalar[dtype],
    varying out c: Scalar[dtype],
) simt[width]: 
   c = a + b + alpha

def elementwise_add[dtype: DType, size: Int](
    read a: InlineArray[Scalar[dtype], size], 
    read b: InlineArray[Scalar[dtype], size], 
    read alpha: Scalar[dtype],
    mut c: InlineArray[Scalar[dtype], size],
): #no simt effect, so it's a normal function
    def kernel[width: Int](i: Int) unified {mut}:
        c.store[width=width](i, simt_elementwise_add[dtype, width](a.load[width=width](i), b.load[width=width](i), alpha))

    vectorize[kernel](size)

I’d ideally like to fix what I see as a potential major issue before 1.0, since I think this will continue to harm Mojo’s portability. I’ve noticed in community code that people tend to go either “full SIMD” or fully scalar, with not a ton in between, which harms performance. I think some of that is related to the mental shift needed to use even portable SIMD, much less more direct intrinsics. Given that autovec is highly unreliable, I think “inline SIMT” is a good compromise that offers a more portable way to get good performance. It should also be useful for managing complex packed datatypes on GPUs, given that fp4 on NVIDIA GPUs is 8-wide per lane. The migration for existing GPU code should be simple: paste simt[32] or simt[64] on every GPU-only function as appropriate, with a version that picks based on the target for “gpu-portable” functions.

The minimal fix would be to add the simt effect and, for now, make everything that changes the lane count not work. When you’re inside a SIMT lane, you can of course call simt[1] functions, because it’s the same width as the current context, so existing GPU code would still work. This at least papers over the problem of “silently wrong” code that assumes it’s in a SIMT environment until it can be addressed properly later. One thing I’m not sure about is how to manage something like calling into a single-lane SIMD kernel from a 32-lane function, so there might need to be a special case for pushing a “reset” onto the stack of widths.


@denis I’d like your thoughts here since these are non-trivial compiler changes.


I agree that:

  • SIMT has great ergonomics, and it would be great if Mojo supported a target-independent SIMT programming model.
  • Intel has demonstrated, with ISPC, that it’s possible to compile SIMT to CPU SIMD instructions.

If Mojo were to support target-independent SIMT, this would need to be part of the near-to-mid-term roadmap, because it will require deep language support. Mojo would need “SIMT-ready” function definitions (a new form of coloring). These functions would have restrictions that regular functions don’t. It’s not clear whether this would be viable in practice. Would Mojo need to duplicate its standard library, offering both SIST and SIMT variants of functions?

I would like to see somebody explore this idea, but it will certainly be a big undertaking.

I suppose—as you say in your post Owen—that Mojo already has the problem of “CPU functions” vs. “GPU functions”. So perhaps in practice, replacing this with “SIST functions” vs. “SIMT functions” would offer better portability, without introducing a new source of code duplication.

Mojo would need “SIMT-ready” function definitions (a new form of coloring).

I’d argue we already have coloring here, because we have a lot of functions which simply may not work properly on any CPU, either due to using thread_idx or due to making SIMT-based assumptions, such as there being another 31 threads present.

Would Mojo need to duplicate its standard library, offering both SIST and SIMT variants of functions?

The initial proposal would be to enable SIMT functions to expose a SIMD-based interface (see the example of passing varying data). If anything, I’d argue that this actually removes the need for _gpu and _cpu variants of functions and reduces duplication, at least in the long term.

SIST functions

The plan is to still enable SIMD on CPU. There are certain things which are somewhat difficult to do in SIMT, especially cases where you know a load should be contiguous but the compiler can’t figure that out. Additionally, we already need to support “SIMD in SIMT” to deal with things like packed FP8 on GPUs, at which point we can just make it so that CPUs have SIMT=1, SIMD=vlen by default, but let you pick any SIMT=N, SIMD=vlen/N combination the CPU supports.

In fact, a secondary goal of this endeavor is to offer a “better autovec” for people who don’t want to program at the SIMD level; one could almost call it “democratizing CPU compute”. Intel’s ISPC is a nice tool, but it’s very limited and not really a general-purpose programming language by modern standards. It more or less expects you to write your hot loops in it and then have C or C++ do all of the actual work. Additionally, ISPC requires you to set SIMT information library-wide, which is unnecessarily restrictive in my opinion, since different kernels may warrant different SIMT widths (e.g. fp8 kernels vs. fp16 vs. fp32 vs. fp64).
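As a rough illustration of the SIMT=N, SIMD=vlen/N split (an illustrative C sketch of my own, not compiler output; the function name is hypothetical): each SIMT lane owns a 2-wide SIMD vector, adds elementwise, and then reduces it, mirroring the pairwise example from the original post.

```c
#include <stddef.h>

/* "SIMD in SIMT": each SIMT lane owns a 2-element SIMD vector, adds
 * elementwise, then does a per-lane reduce_sum into one output element.
 * On a CPU with vlen = 8 floats, this would be the SIMT=4, SIMD=2 split. */
void pairwise_add_reduce(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n / 2; ++i) {        /* one iteration per SIMT "lane" */
        float s0 = a[2 * i] + b[2 * i];         /* lane-local SIMD element 0 */
        float s1 = a[2 * i + 1] + b[2 * i + 1]; /* lane-local SIMD element 1 */
        c[i] = s0 + s1;                         /* per-lane reduce_sum */
    }
}
```

On a GPU this is exactly the packed fp8/fp16 shape: 32 threads each holding a 2-element packed register, with the reduce happening inside each thread.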