Calling GPU Math Functions from Bitcode (CUDA libdevice/ROCm OCML)

Hi everyone,

I’m interested in calling low-level math functions directly from NVIDIA’s CUDA libdevice (e.g., __nv_logf) and AMD’s ROCm OCML (e.g., __ocml_log_f32) within Mojo. My primary goal is to benchmark these specific vendor implementations.

For context, this is something I would typically do in Clang by using the compiler flags -Xclang -mlink-builtin-bitcode -Xclang ${libpath} to link the bitcode libraries. Similarly, Triton offers a mechanism for calling external functions, as described in this tutorial.

Is there an equivalent mechanism in Mojo to link against and call functions from these pre-compiled GPU bitcode libraries?

Thanks,

There’s no “official” support yet, but we do have an experimental feature right now that lets you link bitcode libraries. In your Mojo code, use the external_call function to invoke the function you expect from the bitcode library. The usual caveats for external_call apply: you need to make sure the Mojo types used to specify the signature of the external function lower into the types declared by the function in the bitcode library (if the signatures don’t match, you’ll eventually get a linker error). Then, when building your Mojo program, pass --bitcode-libs with the path to your .bc file. Again, this is pretty experimental for now, so please bear with us as we iron out any problems before it reaches maturity 🙏


Thanks so much for the help, Billy!

I tried using the --bitcode-libs flag as you suggested, but I’m still encountering a linker error. I was hoping you could help me spot what I might be doing wrong.

Here is a minimal reproducible example of the code I’m trying to run:

from gpu.host import DeviceContext
from gpu.id import block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor
from math import ceildiv
from sys import external_call, has_accelerator

alias float_dtype = DType.float32
alias vector_size = 1000
alias layout = Layout.row_major(vector_size)

alias block_size = 256
alias num_blocks = ceildiv(vector_size, block_size)

fn apply_logf(
    in_tensor: LayoutTensor[float_dtype, layout, ImmutableAnyOrigin],
    out_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
):
    var tid = block_idx.x * block_dim.x + thread_idx.x

    if tid < vector_size:
        # float __nv_logf(float);
        out_tensor[tid] = external_call["__nv_logf", Float32](
            rebind[Float32](in_tensor[tid])
        )

def main():
    @parameter
    if not has_accelerator():
        print("No compatible GPU found")
    else:
        ctx = DeviceContext()

        in_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )
        ctx.synchronize()

        for i in range(vector_size):
            in_host_buffer[i] = Scalar[float_dtype](i)

        print("Input buffer: ", in_host_buffer)

        in_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)
        ctx.enqueue_copy(dst_buf=in_device_buffer, src_buf=in_host_buffer)

        out_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)

        in_tensor = LayoutTensor[float_dtype, layout](in_device_buffer)
        out_tensor = LayoutTensor[float_dtype, layout](out_device_buffer)

        ctx.enqueue_function[apply_logf](
            in_tensor,
            out_tensor,
            grid_dim=num_blocks,
            block_dim=block_size,
        )

        out_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )

        ctx.enqueue_copy(dst_buf=out_host_buffer, src_buf=out_device_buffer)
        ctx.synchronize()

        print("Output vector:", out_host_buffer)

When I try to build and run it with the suggested flag, I get the following ptxas fatal error:

$ mojo run --bitcode-libs /usr/local/cuda/nvvm/libdevice/libdevice.10.bc main.mojo

/home/leandro/project/libdevice/main.mojo:1:1: error: ptxas fatal   : Unresolved extern function '__nv_logf'

from gpu.host import DeviceContext
^
mojo: error: failed to run the pass manager

To double-check, I confirmed that the __nv_logf symbol does exist in the bitcode file using llvm-nm:

$ llvm-nm /usr/local/cuda/nvvm/libdevice/libdevice.10.bc | grep 'logf'
---------------- T __nv_fast_logf
---------------- T __nv_logf

Am I missing a step or perhaps using the flag incorrectly? Any insights would be greatly appreciated.

Thanks again!

Mojo version: Mojo 0.25.7.0.dev2025101405 (81e439c9)

Hi Leandro,

Do mojo run and mojo build behave the same for you? They should, but it’s worth a quick sanity check.

One thing to check first is whether the target triple used by the bitcode file is identical to the target triple used by the accelerator target (as specified in the target info struct in Mojo).

Another is to use the dump_llvm flag on enqueue_function to see the generated LLVM IR and verify that the signature of the function declaration for __nv_logf matches the one in the bitcode file.

If nothing seems wrong, feel free to file an issue for this and we can take a closer look.

Thanks for the tips, Billy!

Your suggestions pointed me in the right direction, and I’ve found the root cause: a mismatch in the target triple and data layout between NVIDIA’s libdevice and the code Mojo generates.

Your suggestion to check the target triple was the key to solving this. Here’s what libdevice.10.bc reports:

$ llvm-dis -o - /usr/local/cuda/nvvm/libdevice/libdevice.10.bc | head -n 5
; ModuleID = '/usr/local/cuda/nvvm/libdevice/libdevice.10.bc'
source_filename = "/usr/local/cuda/nvvm/libdevice/libdevice.10.bc"
target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-gpulibs"

As you suspected, Mojo is targeting nvptx64-nvidia-cuda with a more detailed data layout: "e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-p7:32:32-i64:64-i128:128-i256:256-v16:16-v32:32-n16:32:64".
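As a quick sanity check that the superset relationship actually holds, a small shell loop can verify that every component of libdevice’s data layout appears verbatim in the one Mojo emits (both strings copied from above):

```shell
# libdevice's data layout vs. the one Mojo generates (from the disassembly above).
lib_dl="e-i64:64-v16:16-v32:32-n16:32:64"
mojo_dl="e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-p7:32:32-i64:64-i128:128-i256:256-v16:16-v32:32-n16:32:64"

# Split the libdevice layout on '-' and look for each component in Mojo's layout.
for c in $(printf '%s' "$lib_dl" | tr '-' ' '); do
  case "-$mojo_dl-" in
    *"-$c-"*) ;;                 # component present in Mojo's layout
    *) echo "missing: $c" ;;     # would indicate a real incompatibility
  esac
done
```

No "missing:" lines are printed, so Mojo’s layout only adds mappings (pointer address spaces, i128, i256) without contradicting any existing one.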


I was able to get it working with a manual workaround:

  1. Disassemble the bitcode:
$ llvm-dis /usr/local/cuda/nvvm/libdevice/libdevice.10.bc -o libdevice.10.ll
  2. Replace the target triple:
$ sed -i 's/target triple = "nvptx64-nvidia-gpulibs"/target triple = "nvptx64-nvidia-cuda"/' libdevice.10.ll
  3. Replace the data layout (to suppress the linker warning):
$ sed -i 's/target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"/target datalayout = "e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-p7:32:32-i64:64-i128:128-i256:256-v16:16-v32:32-n16:32:64"/' libdevice.10.ll

(Note: The Mojo data layout appears to be a superset of libdevice’s; it adds new type mappings rather than changing existing ones, so this change seems safe.)

  4. Reassemble the bitcode:
$ llvm-as libdevice.10.ll -o libdevice.cuda.10.bc
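The two sed substitutions can be exercised without the LLVM tools on a toy stand-in for the first lines of the disassembly (sample.ll below is a hypothetical two-line file, not the real libdevice.10.ll):

```shell
# Create a stand-in for the header of libdevice.10.ll (hypothetical file).
cat > sample.ll <<'EOF'
target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-gpulibs"
EOF

# Same substitutions as the steps above, applied to the stand-in.
sed -i 's/target triple = "nvptx64-nvidia-gpulibs"/target triple = "nvptx64-nvidia-cuda"/' sample.ll
sed -i 's/target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"/target datalayout = "e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-p7:32:32-i64:64-i128:128-i256:256-v16:16-v32:32-n16:32:64"/' sample.ll

cat sample.ll
```

After the rewrite, the file carries the nvptx64-nvidia-cuda triple and the expanded data layout, which is exactly what the real workflow produces before reassembling with llvm-as.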

After applying these changes, the code links and runs successfully.

$ mojo run --bitcode-libs ./libdevice.cuda.10.bc main.mojo
Input buffer:  HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0])
Output vector: HostBuffer([-inf, 0.0, 0.6931472, ..., 6.904751, 6.905753, 6.906755])
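As a quick cross-check of those values against a CPU reference (awk’s natural log here; the float32 kernel output agrees to the printed precision):

```shell
# Reference values for log(2), log(997), and log(999) via awk's built-in log().
awk 'BEGIN { printf "log(2)=%.7f log(997)=%.6f log(999)=%.6f\n", log(2), log(997), log(999) }'
# → log(2)=0.6931472 log(997)=6.904751 log(999)=6.906755
```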

This got me curious about how LLVM/clang handles this automatically. A quick look into the LLVM repository confirmed our theory: there’s a special exception in the linker specifically for libdevice:

// From llvm/lib/Linker/IRMover.cpp

Error IRLinker::run() {
  // ...

  // During CUDA compilation we have to link with the bitcode supplied with
  // CUDA. libdevice bitcode either has no data layout set (pre-CUDA-11), or has
  // the layout that is different from the one used by LLVM/clang (it does not
  // include i128). Issuing a warning is not very helpful as there's not much
  // the user can do about it.
  bool EnableDLWarning = true;
  bool EnableTripleWarning = true;
  if (SrcTriple.isNVPTX() && DstTriple.isNVPTX()) {
    bool SrcHasLibDeviceDL =
        (SrcM->getDataLayoutStr().empty() ||
         SrcM->getDataLayoutStr() == "e-i64:64-v16:16-v32:32-n16:32:64");
    // libdevice bitcode uses nvptx64-nvidia-gpulibs or just
    // 'nvptx-unknown-unknown' triple (before CUDA-10.x) and is compatible with
    // all NVPTX variants.
    bool SrcHasLibDeviceTriple = (SrcTriple.getVendor() == Triple::NVIDIA &&
                                  SrcTriple.getOSName() == "gpulibs") ||
                                 (SrcTriple.getVendorName() == "unknown" &&
                                  SrcTriple.getOSName() == "unknown");
    EnableTripleWarning = !SrcHasLibDeviceTriple;
    EnableDLWarning = !(SrcHasLibDeviceTriple && SrcHasLibDeviceDL);
  }

  // ...
}

Given this, it seems the experimental --bitcode-libs feature in Mojo doesn’t yet include this specific exception for libdevice. Should I file a GitHub issue to track this?

Thanks again for all your help!

That’s great! Glad that you were able to get it working with a bit of hacking.

We’re aware that NVIDIA has a carve-out in LLVM for the data-layout check, so we’re explicitly not checking that yet. The triple, though, is something we do need to check, since we naturally compile for different targets during a single compilation. It should still follow whatever NVIDIA does upstream, so please file an issue about rectifying the target check to accommodate this NVIDIA special case.

Thanks for identifying this and finding the workaround!
