Incorrect vector register size selection in AVX-512 inline assembly

I am trying the following code to see whether an AVX-512 k-register (mask register) can be used as the output of an inline-assembly compare, but I get an error in the register assignment. Is there a good workaround?

When using AVX-512 instructions with Mojo’s inline assembler, the compiler incorrectly assigns the vector operand to a 128-bit (xmm) register instead of the expected 512-bit (zmm) register, despite the explicit declaration of a 512-bit vector (SIMD[DType.int16, 32]).

Reproduction steps:

  1. Define a function using SIMD[DType.int16, 32]:
from sys._assembly import inlined_assembly

@always_inline("nodebug")
fn cmpgt_epi16_mask(val:SIMD[DType.int16, 32], rsh:SIMD[DType.int16, 32]) -> UInt32:
    var mask: UInt32 = 0
    mask = inlined_assembly[
        """
        vpcmpw $$5, $1,$2, $0
        """,
        UInt32,
        constraints = "=k,v,v",
        #        constraints = "=k,z,z",
        has_side_effect = False
    ](mask, val, rsh)
    return mask


fn main():
    var vec1 = SIMD[DType.int16, 32](0)
    var vec2 = SIMD[DType.int16, 32](0)

    for i in range(0,32):
        vec1[i] = Int16(i-16)
        vec2[i] = Int16(16-i)

    print(vec1,vec2,hex(Int(cmpgt_epi16_mask(vec1,vec2))))

  2. Compile and observe the resulting error:
error: <inline asm>:2:26: invalid operand for instruction
        vpcmpw $5, %xmm0,%zmm1, %k0
                         ^~~~~

Expected behavior:
The compiler should assign both operands (val and rsh) to zmm registers, consistent with their declared vector size (512 bits).

Actual behavior:
The compiler incorrectly assigns one operand to a 128-bit (xmm) register, causing an instruction mismatch.
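
For reference, here is a scalar model of what the reproduction should print once the register issue is out of the way. It is a Python sketch, not Mojo, and `vpcmpw_mask` is a hypothetical helper name. The predicate table follows the imm8 encoding of VPCMPW in the Intel SDM, and the operands are in Intel order (first source compared against second source), which is an assumption about how the AT&T operands in the snippet are meant to be read.

```python
# Scalar model of AVX-512 VPCMPW: compare signed 16-bit lanes and pack
# one result bit per lane (lane i -> bit i of the mask).
import operator

# imm8 predicate encoding for VPCMP* (per the Intel SDM).
PREDICATES = {
    0: operator.eq,         # EQ
    1: operator.lt,         # LT
    2: operator.le,         # LE
    3: lambda a, b: False,  # FALSE
    4: operator.ne,         # NEQ
    5: operator.ge,         # NLT (not less than)
    6: operator.gt,         # NLE (not less than or equal)
    7: lambda a, b: True,   # TRUE
}

def vpcmpw_mask(pred, a, b):
    """Hypothetical helper: mask bit i is set when a[i] <pred> b[i]."""
    compare = PREDICATES[pred]
    mask = 0
    for lane, (x, y) in enumerate(zip(a, b)):
        if compare(x, y):
            mask |= 1 << lane
    return mask

# Same inputs as main() in the reproduction.
vec1 = [i - 16 for i in range(32)]
vec2 = [16 - i for i in range(32)]

print(hex(vpcmpw_mask(6, vec1, vec2)))  # strictly greater: 0xfffe0000
print(hex(vpcmpw_mask(5, vec1, vec2)))  # greater or equal: 0xffff0000
```

Note that predicate 5 is "not less than", so it also sets the bit for lane 16 where both inputs are 0. A strict greater-than would use predicate 6.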

Self-follow-up: Root cause of LLVM IR generation bug in Mojo

Upon further investigation of this issue, I discovered the root cause by analyzing the LLVM IR generated by Mojo for the inlined_assembly call.

Mojo incorrectly adds an extra dummy argument (i32 0) to the LLVM IR call, causing a mismatch between operands and constraints. This directly leads to the originally reported “Incorrect vector register size selection” issue.

Incorrect LLVM IR (Mojo-generated):

%22 = call i32 asm "\0A        vpcmpnltw $2, $1, $0\0A        ", "=k,v,v"(i32 0, <32 x i16> %20, <32 x i16> %21)

Correct LLVM IR (Expected):

%22 = call i32 asm "\0A        vpcmpnltw $2, $1, $0\0A        ", "=k,v,v"(<32 x i16> %20, <32 x i16> %21)

This behavior is a bug in Mojo’s LLVM IR generation logic for inline assembly, which needs to be addressed in Mojo’s compiler.

While this is a compiler bug, vec1 > vec2 generates the correct assembly and performs the equivalent operation to cmpgt_epi16_mask, at least on my Zen 4 system.

Please report the compiler bug on GitHub with the reproduction.

I’m not convinced this is actually a bug. Mojo is simply translating your code literally:

mask = inlined_assembly[
  "vpcmpnltw $2,$1,$0",
  UInt32,
  constraints="=k,v,v",
  has_side_effect=False,
](mask, val, rsh)  # passing three arguments

The generated LLVM looks like this:

%22 = call i32 asm "vpcmpnltw $2,$1,$0", "=k,v,v"(i32 0, <32 x i16> %20, <32 x i16> %21)

Notice how the types line up with your mask, val, and rsh arguments.

In SSA form the "=k" output is the value returned by the call, not one of its arguments, so you might try passing only the two inputs:

@always_inline("nodebug")
fn cmpgt_epi16_mask(
  val: SIMD[DType.int16, 32],
  rsh: SIMD[DType.int16, 32],
) -> UInt32:
  return inlined_assembly[
    "vpcmpnltw $2,$1,$0",
    UInt32,
    constraints="=k,v,v",
    has_side_effect=False,
  ](val, rsh)

I think the argument convention (“modifying mask”) is already captured by "=k,v,v".
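
To make the operand pairing concrete, here is a deliberately simplified Python model of how the constraint string lines up with the call operands. It is not LLVM's actual algorithm: `reg_for` and `bind` are hypothetical helpers, only the "k" and "v" constraints are modeled, and the width-to-register mapping is an assumption chosen to match the error message in the first post.

```python
# Simplified model of how inline-asm constraints pair with call operands.
# This is NOT LLVM's algorithm; it only mirrors the positional pairing.

def reg_for(constraint, bits):
    """Hypothetical register-class picker for the 'k' and 'v' constraints."""
    if constraint == "k":
        return "k"
    if constraint == "v":
        # Vector constraint: register width follows the operand's width.
        return "zmm" if bits > 256 else "ymm" if bits > 128 else "xmm"
    raise ValueError("constraint not modeled: " + constraint)

def bind(constraints, result_bits, arg_bits):
    """Map $0, $1, ... to register classes: outputs first, then inputs."""
    parts = constraints.split(",")
    outputs = [c[1:] for c in parts if c.startswith("=")]
    inputs = [c for c in parts if not c.startswith("=")]
    regs = [reg_for(c, result_bits) for c in outputs]
    # zip() stops at the shorter list, so a surplus argument stays unbound.
    regs += [reg_for(c, bits) for c, bits in zip(inputs, arg_bits)]
    return regs

# Original call: (mask, val, rsh) -> i32, <32 x i16>, <32 x i16>
print(bind("=k,v,v", 32, [32, 512, 512]))  # ['k', 'xmm', 'zmm']
# Suggested call: (val, rsh)
print(bind("=k,v,v", 32, [512, 512]))      # ['k', 'zmm', 'zmm']
```

In this model the 32-bit mask argument takes the first "v" slot and gets an xmm register, which matches the `%xmm0,%zmm1, %k0` operands in the reported error.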


Here are a few other ideas (untested on my machine, which doesn’t support AVX-512):

You could use:

llvm_intrinsic["llvm.x86.avx512.mask.cmp.w.512", ...]

instead (see ref).

Alternatively, I think the following will produce the expected IR:

from memory import pack_bits

alias T = SIMD[DType.int16, 32]

@always_inline("nodebug")
fn gt(a: T, b: T) -> UInt32:
  return pack_bits(a > b)

This generates:

define dso_local noundef i32 @"gt"(<32 x i16> noundef %0, <32 x i16> noundef %1) #0 {
  %3 = icmp sgt <32 x i16> %0, %1
  %4 = bitcast <32 x i1> %3 to i32
  ret i32 %4
}

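which should lower to a single AVX-512 compare into a k-register (I would expect vpcmpgtw for a signed greater-than), followed by a move of the mask into a general-purpose register.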
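
As a sanity check on the IR above, here is a small Python model of `icmp sgt` followed by the `<32 x i1>` to `i32` bitcast. It is a sketch with hypothetical helper names. It assumes lane i ends up in bit i of the result, as on little-endian x86, and it treats each lane as a two's-complement 16-bit value, which is what the signed predicate implies.

```python
# Model of `icmp sgt <N x i16>` followed by a bitcast of the i1 lanes to an integer.

def to_signed16(bits):
    """Reinterpret a raw 16-bit pattern as a two's-complement value."""
    bits &= 0xFFFF
    return bits - 0x10000 if bits & 0x8000 else bits

def icmp_sgt_mask(a_bits, b_bits):
    """Hypothetical helper: lane i of the compare lands in bit i of the result."""
    mask = 0
    for lane, (x, y) in enumerate(zip(a_bits, b_bits)):
        if to_signed16(x) > to_signed16(y):
            mask |= 1 << lane
    return mask

# Lane 0: 0xFFF0 is -16 as Int16, so it is NOT greater than 16.
# Lane 1: 1 > 0xFFFF (-1) holds under the signed interpretation.
print(bin(icmp_sgt_mask([0xFFF0, 0x0001], [0x0010, 0xFFFF])))  # 0b10
```

An unsigned compare would give the opposite answer for both lanes, so the signedness of the element type matters when checking the mask by hand.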