Thanks for the tips, Billy!
Your suggestions pointed me in the right direction, and I’ve found the root cause: a mismatch in the target triple and data layout between NVIDIA’s libdevice and the code Mojo generates.
Checking the target triple was the key. Here’s what libdevice.10.bc reports:
$ llvm-dis -o - /usr/local/cuda/nvvm/libdevice/libdevice.10.bc | head -n 5
; ModuleID = '/usr/local/cuda/nvvm/libdevice/libdevice.10.bc'
source_filename = "/usr/local/cuda/nvvm/libdevice/libdevice.10.bc"
target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-gpulibs"
As you suspected, Mojo is targeting nvptx64-nvidia-cuda with a more detailed data layout: "e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-p7:32:32-i64:64-i128:128-i256:256-v16:16-v32:32-n16:32:64".
I was able to get it working with a manual workaround:
- Disassemble the bitcode:
$ llvm-dis /usr/local/cuda/nvvm/libdevice/libdevice.10.bc -o libdevice.10.ll
- Replace the target triple:
$ sed -i 's/target triple = "nvptx64-nvidia-gpulibs"/target triple = "nvptx64-nvidia-cuda"/' libdevice.10.ll
- Replace the data layout (to suppress the linker warning):
$ sed -i 's/target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"/target datalayout = "e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-p7:32:32-i64:64-i128:128-i256:256-v16:16-v32:32-n16:32:64"/' libdevice.10.ll
(Note: Mojo’s data layout appears to be a superset of libdevice’s. It adds pointer entries for address spaces 3 through 7 and alignments for i128 and i256 without changing any existing entry, so this change seems safe.)
- Reassemble the bitcode:
$ llvm-as libdevice.10.ll -o libdevice.cuda.10.bc
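The four steps above can be folded into one small script. This is only a minimal sketch: the function names patch_ll and retarget_libdevice are made up for illustration, it assumes llvm-dis and llvm-as are on PATH, and the old and new strings are exactly the ones quoted in this post, so a different CUDA or Mojo version may need different values.

```shell
#!/bin/sh
# Sketch: retarget a libdevice module so it matches Mojo's NVPTX target.
# OLD_* is what libdevice.10.bc reported above; NEW_* is what Mojo targets.
OLD_TRIPLE='nvptx64-nvidia-gpulibs'
NEW_TRIPLE='nvptx64-nvidia-cuda'
OLD_DL='e-i64:64-v16:16-v32:32-n16:32:64'
NEW_DL='e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-p7:32:32-i64:64-i128:128-i256:256-v16:16-v32:32-n16:32:64'

# patch_ll IN.ll OUT.ll (hypothetical helper):
# rewrite the triple and data layout lines of textual IR.
patch_ll() {
  sed -e "s|^target triple = \"$OLD_TRIPLE\"|target triple = \"$NEW_TRIPLE\"|" \
      -e "s|^target datalayout = \"$OLD_DL\"|target datalayout = \"$NEW_DL\"|" \
      "$1" > "$2"
}

# retarget_libdevice IN.bc OUT.bc (hypothetical helper):
# disassemble, patch, verify the new triple is present, reassemble.
retarget_libdevice() {
  tmp=$(mktemp -d) || return 1
  llvm-dis "$1" -o "$tmp/in.ll" &&
    patch_ll "$tmp/in.ll" "$tmp/out.ll" &&
    grep -q "^target triple = \"$NEW_TRIPLE\"" "$tmp/out.ll" &&
    llvm-as "$tmp/out.ll" -o "$2"
  status=$?
  rm -rf "$tmp"
  return $status
}

# Example call, using the paths from this post:
# retarget_libdevice /usr/local/cuda/nvvm/libdevice/libdevice.10.bc libdevice.cuda.10.bc
```

Writing the patched IR to a new file instead of using sed -i keeps the helper portable between GNU and BSD sed, and the grep check makes the script fail rather than silently emit an unpatched module if the triple line did not match.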
After applying these changes, the code links and runs successfully:
$ mojo run --bitcode-libs ./libdevice.cuda.10.bc main.mojo
Input buffer: HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0])
Output vector: HostBuffer([-inf, 0.0, 0.6931472, ..., 6.904751, 6.905753, 6.906755])
This got me curious about how LLVM/clang handles this automatically. A quick look at the LLVM repository confirmed our theory. The IR linker has a special case for libdevice that switches off the triple and data layout mismatch warnings:
// From llvm/lib/Linker/IRMover.cpp
Error IRLinker::run() {
  // ...
  // During CUDA compilation we have to link with the bitcode supplied with
  // CUDA. libdevice bitcode either has no data layout set (pre-CUDA-11), or has
  // the layout that is different from the one used by LLVM/clang (it does not
  // include i128). Issuing a warning is not very helpful as there's not much
  // the user can do about it.
  bool EnableDLWarning = true;
  bool EnableTripleWarning = true;
  if (SrcTriple.isNVPTX() && DstTriple.isNVPTX()) {
    bool SrcHasLibDeviceDL =
        (SrcM->getDataLayoutStr().empty() ||
         SrcM->getDataLayoutStr() == "e-i64:64-v16:16-v32:32-n16:32:64");
    // libdevice bitcode uses nvptx64-nvidia-gpulibs or just
    // 'nvptx-unknown-unknown' triple (before CUDA-10.x) and is compatible with
    // all NVPTX variants.
    bool SrcHasLibDeviceTriple = (SrcTriple.getVendor() == Triple::NVIDIA &&
                                  SrcTriple.getOSName() == "gpulibs") ||
                                 (SrcTriple.getVendorName() == "unknown" &&
                                  SrcTriple.getOSName() == "unknown");
    EnableTripleWarning = !SrcHasLibDeviceTriple;
    EnableDLWarning = !(SrcHasLibDeviceTriple && SrcHasLibDeviceDL);
  }
  // ...
}
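To make the effect of that excerpt concrete, here is a rough shell model of just the two flags. It is an illustration, not LLVM's implementation: the function names has_libdevice_triple and enabled_warnings are invented, triples are matched with simple glob patterns rather than LLVM's Triple parser, and only the source module is checked (the excerpt also requires the destination triple to be NVPTX).

```shell
#!/bin/sh
# Known libdevice data layout, copied from the IRMover.cpp excerpt.
LIBDEVICE_DL='e-i64:64-v16:16-v32:32-n16:32:64'

# has_libdevice_triple TRIPLE (hypothetical helper):
# succeeds for NVPTX triples ending in -nvidia-gpulibs or -unknown-unknown,
# mirroring SrcHasLibDeviceTriple.
has_libdevice_triple() {
  case "$1" in
    nvptx*-nvidia-gpulibs | nvptx*-unknown-unknown) return 0 ;;
    *) return 1 ;;
  esac
}

# enabled_warnings SRC_TRIPLE SRC_DATALAYOUT (hypothetical helper):
# print which mismatch warnings stay enabled, mirroring
# EnableTripleWarning and EnableDLWarning.
enabled_warnings() {
  out=''
  if has_libdevice_triple "$1"; then
    # Triple warning is off. The data layout warning is off only when the
    # layout is empty or is the known libdevice layout.
    if [ -n "$2" ] && [ "$2" != "$LIBDEVICE_DL" ]; then
      out='datalayout'
    fi
  else
    out='triple datalayout'
  fi
  printf '%s\n' "$out"
}

# The unmodified libdevice.10.bc from this post: no warnings stay enabled.
enabled_warnings 'nvptx64-nvidia-gpulibs' "$LIBDEVICE_DL"
```

Note that the flags only control whether a mismatch is reported. Nothing in the excerpt rewrites the triple or data layout of the library.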
Given this, it seems the experimental --bitcode-libs feature in Mojo doesn’t yet include this specific exception for libdevice. Should I file a GitHub issue to track this?
Thanks again for all your help!