Question: vpermi2b inline assembly output incorrect in loop context due to register allocation

Yo-hey · March 21, 2025, 9:08am

Hi all,

I’m running into a register allocation issue when using inline assembly with vpermi2b. Here’s a minimal function that appears to work as intended when used in small, isolated cases:

@always_inline("nodebug")
fn my_vpermi2b(
    a: SIMD[DType.int8, 64],
    b: SIMD[DType.int8, 64],
    idx: SIMD[DType.int8, 64]
) -> SIMD[DType.int8, 64]:
    var scratch = SIMD[DType.int8,64]()
    var result = inlined_assembly[
        """
        vpermi2b $2, $1, $3
        """,
        SIMD[DType.int8, 64],
        constraints="=v,v,v,v",
        has_side_effect=True
    ](a, b, idx, scratch)

    return result

The goal is to implement a wrapper for the AVX-512 vpermi2b instruction.
I noticed that the function behaves correctly only in small test cases. When it is placed inside a larger loop or used in more complex control flow, the output becomes incorrect, apparently because of incorrect register allocation or register reuse by the compiler backend.

I suspect this is because I’m not expressing the read/write behavior of the destination register properly. In theory, this instruction needs only three operands (a, b, and idx), and a read/write constraint like +v on the destination should be sufficient. But when I do that, the function crashes.

Adding a fourth operand (as a dummy) seems to stabilize the allocation, possibly forcing the compiler to treat the destination register correctly. But this workaround feels fragile and could have unwanted side effects, especially in performance-sensitive code.

My question:

Is there a correct way to express vpermi2b in Mojo’s inlined_assembly using only three operands with a proper read/write constraint?
Is the =v,v,v,v constraint idiom considered valid or a workaround?
How can I ensure stability across optimization passes, particularly in loops, without relying on dummy operands or inserting unrelated instructions (e.g., print)?

Any insight from those familiar with Mojo’s LLVM integration or inline assembly behavior would be greatly appreciated.

Thanks!
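
For reference, vpermi2b overwrites its index register with the result, so the destination is both read and written. The template in the post above writes to $3 (an input) and never to $0 (the declared output), which could explain why the result depends on how registers happen to be allocated. In LLVM constraint strings a read/write operand is usually expressed by tying an input to the output with a matching-digit constraint such as 0. The following is only an untested sketch: it assumes that Mojo forwards the constraint string to LLVM unchanged, that the template uses AT&T operand order (destination last), and that inlined_assembly is importable from sys. The name my_vpermi2b_tied is hypothetical.

```mojo
from sys import inlined_assembly  # assumption: exported from `sys`

alias T = SIMD[DType.int8, 64]


@always_inline("nodebug")
fn my_vpermi2b_tied(a: T, b: T, idx: T) -> T:
    # $0 is the output. The trailing "0" constraint ties the third input
    # (idx) to the same register, so the instruction reads the indices
    # from $0 and then overwrites $0 with the permuted bytes.
    # $1 is table `a` (selected when index bit 6 is clear) and
    # $2 is table `b` (selected when index bit 6 is set).
    return inlined_assembly[
        "vpermi2b $2, $1, $0",
        T,
        constraints="=v,v,v,0",
        has_side_effect=False,  # assumption: pure register-to-register op
    ](a, b, idx)
```

If this form is accepted, no dummy operand should be needed, because the compiler is told explicitly which register the instruction modifies.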

sora · March 23, 2025, 9:53pm

Maybe you could try this

from sys import llvm_intrinsic

alias T = SIMD[DType.int8, 64]

@always_inline("nodebug")
fn vpermi2b(a: T, b: T, idx: T) -> T:
  return llvm_intrinsic["llvm.x86.avx512.vpermi2var.qi.512", T](a, idx, b)

Yo-hey · March 24, 2025, 1:41am

Thank you very much for the suggestion!
Calling the vpermi2b instruction via llvm_intrinsic["llvm.x86.avx512.vpermi2var.qi.512"] works perfectly and is much more stable than inline assembly.
I really appreciate your help — this solves the issue I was struggling with.

sora · March 24, 2025, 11:01am

I might have got the argument order wrong (in fact I’m still not sure). It’s really quite confusing.
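
One way to settle the argument-order question is to compare against a scalar reference model. The sketch below follows the instruction's documented behavior as I understand it: bits 0 to 5 of each index byte pick a byte position, bit 6 picks the table (clear for the first table, set for the second), and bit 7 is ignored. The name vpermi2b_reference is hypothetical, and the code is an illustration under those assumptions rather than a verified implementation.

```mojo
alias T = SIMD[DType.int8, 64]


fn vpermi2b_reference(a: T, b: T, idx: T) -> T:
    # Scalar model of a two-table byte permute over 128 candidate bytes.
    var out = T(0)
    for i in range(64):
        # Keep the low 7 bits: 6 for the byte position, 1 for the table.
        var sel = Int(idx[i]) & 0x7F
        if sel < 64:
            out[i] = a[sel]  # bit 6 clear: take from the first table
        else:
            out[i] = b[sel - 64]  # bit 6 set: take from the second table
    return out
```

Feeding the same inputs to this function and to the llvm_intrinsic wrapper above, with a and b filled with distinguishable values, should show whether the (a, idx, b) order matches the intended semantics.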
