Hi all,
I’m running into a register allocation issue when using inline assembly with vpermi2b. Here’s a minimal function that appears to work as intended when used in small, isolated cases:
@always_inline("nodebug")
fn my_vpermi2b(
    a: SIMD[DType.int8, 64],
    b: SIMD[DType.int8, 64],
    idx: SIMD[DType.int8, 64]
) -> SIMD[DType.int8, 64]:
    var scratch = SIMD[DType.int8, 64]()
    var result = inlined_assembly[
        """
        vpermi2b $2, $1, $3
        """,
        SIMD[DType.int8, 64],
        constraints="=v,v,v,v",
        has_side_effect=True
    ](a, b, idx, scratch)
    return result
The goal is to implement a wrapper for the AVX-512 vpermi2b instruction.
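To make the intended result concrete, here is a scalar reference for the 512-bit form, written from my reading of the instruction's documented semantics: each index byte uses only its low 7 bits, values 0..63 pick a byte from the first table (a), and values 64..127 pick a byte from the second table (b). The name vpermi2b_reference is hypothetical, and the Int(...) conversion and element assignment assume a reasonably recent Mojo release.

```mojo
fn vpermi2b_reference(
    a: SIMD[DType.int8, 64],
    b: SIMD[DType.int8, 64],
    idx: SIMD[DType.int8, 64],
) -> SIMD[DType.int8, 64]:
    # Hypothetical helper: a slow, scalar model used only for checking results.
    var out = SIMD[DType.int8, 64](0)
    for i in range(64):
        # Only the low 7 bits of each index byte are significant.
        var sel = Int(idx[i]) & 0x7F
        if sel < 64:
            out[i] = a[sel]       # bit 6 clear: take the byte from the first table
        else:
            out[i] = b[sel - 64]  # bit 6 set: take the byte from the second table
    return out
```

Comparing the inline-assembly wrapper against a model like this inside the larger loop should show exactly which lanes go wrong.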
I noticed that the function behaves correctly only in small test cases. When it is placed inside a larger loop, or used in more complex control flow, the output becomes incorrect. This appears to be caused by incorrect register allocation or register reuse in the compiler backend.
I suspect this is because I’m not expressing the read/write behavior of the destination register properly. In theory, this instruction should require only three operands (a, b, and idx), and using a read/write constraint like +v on the destination should be sufficient. But when I do that, the function crashes.
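For comparison, here is an untested sketch of how a three-operand form might look. It rests on several assumptions: that Mojo hands the constraint string to LLVM unchanged; that, as in LLVM IR, a read/write operand is written as an output plus an input tied to it by number (the "0" below) rather than with +v; that operands follow AT&T order with the destination last; and that inlined_assembly can be imported from sys. The name my_vpermi2b_tied is hypothetical.

```mojo
from sys import inlined_assembly  # assumed import path


@always_inline("nodebug")
fn my_vpermi2b_tied(
    a: SIMD[DType.int8, 64],
    b: SIMD[DType.int8, 64],
    idx: SIMD[DType.int8, 64],
) -> SIMD[DType.int8, 64]:
    # Operand numbering under the assumed LLVM-style constraint string:
    #   $0 = output register (also holds idx on entry, via the tied "0")
    #   $1 = idx (tied to $0, so it is not named in the template)
    #   $2 = a   (first table)
    #   $3 = b   (second table)
    return inlined_assembly[
        "vpermi2b $3, $2, $0",
        SIMD[DType.int8, 64],
        constraints="=v,0,v,v",
        has_side_effect=True,
    ](idx, a, b)
```

If those assumptions hold, the instruction writes into the same register the function returns, so no scratch operand is needed. This needs a CPU with AVX-512 VBMI, and I would treat it as something to verify rather than a confirmed fix.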
Adding a fourth operand (as a dummy) seems to stabilize the allocation, possibly forcing the compiler to treat the destination register correctly. But this workaround feels fragile and could have unwanted side effects, especially in performance-sensitive code.
My questions:
- Is there a correct way to express vpermi2b in Mojo’s inlined_assembly using only three operands with a proper read/write constraint?
- Is the =v,v,v,v constraint idiom considered valid, or is it a workaround?
- How can I ensure stability across optimization passes, particularly in loops, without relying on dummy operands or inserting unrelated instructions (e.g., print)?
Any insight from those familiar with Mojo’s LLVM integration or inline assembly behavior would be greatly appreciated.
Thanks!