Wgpu-mojo: wgpu-native bindings

Preface

This endeavor is deceptively simple: enabling Mojo programs to directly invoke wgpu-native to render a triangle.

It is termed “simple” because rendering a triangle constitutes the quintessential Hello World of graphics programming; its successful display signifies that the entire pipeline—from WGSL shader compilation and RenderPipeline instantiation to GPU command submission and window presentation—has been seamlessly integrated. Conversely, it is deemed “complex” because implementing C Foreign Function Interface (FFI) bindings within Mojo presents a labyrinth of subtle pitfalls at every juncture.

This article documents these intricacies.


Results

Executing pixi run example-triangle yields a window displaying an RGB gradient triangle:

pixi run example-triangle

Beneath this visual lies a chain of operations driven entirely by Mojo: GLFW window initialization, wgpu Surface creation, ShaderModule compilation, RenderPipeline construction, draw calls, and final presentation. Furthermore, a compute shader example (pixi run example-compute) successfully performs vector addition on 1,024 elements on an RTX 3060, yielding correct results.


Usage

Prerequisites

  • Mojo ≥ 0.26.3 (nightly)
  • Pixi package manager
  • ffi/lib/libwgpu_native.so (the linux_x86_64 shared library for wgpu-native v29.0.0.0)

Quick Start

# Compile the C bridge layer (one-time setup)
pixi run build-callbacks

# Execute the triangle example
pixi run example-triangle

# Run non-GPU tests
pixi run test

Architectural Overview of the Hello Triangle

The WGSL shader is embedded directly within the Mojo source code as a comptime string constant:

comptime TRIANGLE_WGSL = """
struct VertexOutput {
    @builtin(position) pos: vec4<f32>,
    @location(0) color: vec3<f32>,
}

@vertex
fn vs_main(@builtin(vertex_index) idx: u32) -> VertexOutput {
    var positions = array<vec2<f32>, 3>(
        vec2<f32>( 0.0,  0.5),
        vec2<f32>(-0.5, -0.5),
        vec2<f32>( 0.5, -0.5),
    );
    var colors = array<vec3<f32>, 3>(
        vec3<f32>(1.0, 0.0, 0.0),
        vec3<f32>(0.0, 1.0, 0.0),
        vec3<f32>(0.0, 0.0, 1.0),
    );
    var out: VertexOutput;
    out.pos   = vec4<f32>(positions[idx], 0.0, 1.0);
    out.color = colors[idx];
    return out;
}

@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
    return vec4<f32>(in.color, 1.0);
}
"""

The main program skeleton follows:

def main() raises:
    var inst   = request_adapter()
    var device = inst.request_device()
    var canvas = RenderCanvas(inst, device, 800, 600, "wgpu-mojo: hello triangle")
    var shader = device.create_shader_module_wgsl(TRIANGLE_WGSL, "triangle")

    # ... Pipeline construction ...

    while canvas.is_open():
        canvas.poll()
        var frame = canvas.next_frame()
        # ... Encode render pass, draw(3, 1, 0, 0), submit ...
        canvas.present()

While the high-level API achieves a commendable degree of elegance, the true complexity resides within the omitted ellipses of the pipeline construction phase.


Project Structure

wgpu/           High-level RAII wrappers (Device, Buffer, Texture, Pipeline…)
wgpu/_ffi/
  lib.mojo      WGPULib — ~170 functions dispatched via OwnedDLHandle + dlsym
  types.mojo    Type aliases, enums, and bitflags
  structs.mojo  C struct layouts (descriptors, callback results)
ffi/
  wgpu_callbacks.c   C bridge layer
  include/webgpu/    webgpu.h / wgpu.h headers
  lib/               .so libraries

The wgpu-native API comprises approximately 170 functions, each requiring a manually crafted DLHandle.call[...] invocation in lib.mojo with exact type correspondence. All the subsequent challenges stem from this foundational layer.


Ten Pitfalls Encountered

1. DLHandle.call Passes Garbage for By-Value Structs Exceeding 16 Bytes

This represents the project’s most fundamental hurdle. Nearly every wgpu-native construction function accepts a large descriptor struct, such as WGPURenderPipelineDescriptor, WGPUBindGroupDescriptor, or WGPUComputePipelineDescriptor, which often span 40 to 120 bytes.

However, DLHandle.call fails to emit the byval attribute required by the SysV x86_64 ABI when a struct larger than 16 bytes is passed by value. Consequently, the C side reads unrelated stack data instead of the struct's fields.

Empirical verification (via a custom C library with printf statements):

16 bytes (2 × uint64): C receives a=10, b=20 → sum=30  ✓
24 bytes (3 × uint64): C receives a=108586124556112, b=488, c=8 → FAIL
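
A probe of this kind can be sketched in C as follows. The type and function names are hypothetical (the original probe library is not shown in this post). The point is only that a 16-byte integer struct fits in two registers under SysV x86_64, while a 24-byte one is classified as MEMORY and has to be copied onto the caller's stack, which is the case the caller's code generation must get right.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical probe types, not taken from the project. */
typedef struct { uint64_t a, b; } Pair16;       /* 16 bytes: two INTEGER eightbytes, passed in registers */
typedef struct { uint64_t a, b, c; } Triple24;  /* 24 bytes: class MEMORY, passed via the stack */

_Static_assert(sizeof(Pair16) == 16, "Pair16 must be exactly 16 bytes");
_Static_assert(sizeof(Triple24) == 24, "Triple24 must be exactly 24 bytes");

/* By-value entry points: the FFI caller must follow the ABI exactly. */
uint64_t sum_pair_by_value(Pair16 p) { return p.a + p.b; }
uint64_t sum_triple_by_value(Triple24 t) { return t.a + t.b + t.c; }

/* Pointer-based variant: a pointer is one register-sized argument,
   so the struct size no longer affects how the call is lowered. */
uint64_t sum_triple_by_pointer(const Triple24 *t) {
    assert(t != NULL);
    return t->a + t->b + t->c;
}
```

Called from C, all three functions agree. The failing case reported above corresponds to calling something like sum_triple_by_value through DLHandle.call, while the pointer variant sidesteps the problem.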

Solution: All wgpu functions accepting large structs must be wrapped in a pointer-based version within the C bridge layer:

// Pattern in ffi/wgpu_callbacks.c
void wgpuDeviceCreateRenderPipeline_ptr(
    WGPUDevice device,
    const WGPURenderPipelineDescriptor *desc,
    WGPURenderPipeline *out
) {
    *out = wgpuDeviceCreateRenderPipeline(device, desc);
}

On the Mojo side, the struct is allocated on the heap via alloc[WGPURenderPipelineDescriptor](1), populated, and then passed by pointer. This is not a temporary workaround but the only viable approach under current constraints.

2. Inability to Pass Mojo Functions as C Callbacks

The asynchronous wgpu-native API relies heavily on C function pointer callbacks, such as WGPURequestDeviceCallback passed to wgpuAdapterRequestDevice.

In Mojo 0.26 nightly, there is no mechanism to obtain a C-compatible function pointer from a Mojo function. The fn keyword is deprecated in this version, and def functions lack address-of syntax.

Solution: Maintain a separate C file (wgpu_callbacks.c) defining static C callbacks and result structures, along with a getter function for Mojo to retrieve the callback pointer:

static void _wgpu_mojo_device_cb(
    WGPURequestDeviceStatus status,
    WGPUDevice device,
    WGPUStringView message,
    void* ud1, void* ud2
) {
    MojoDeviceResult* r = (MojoDeviceResult*)ud1;
    if (r) { r->device = (void*)device; r->status = (uint32_t)status; }
}

WGPURequestDeviceCallback wgpu_mojo_get_device_cb(void) {
    return _wgpu_mojo_device_cb;
}

Mojo invokes the getter to retrieve the pointer and passes it to wgpu-native. One must write a dedicated wrapper for every distinct callback type; this is unavoidable.
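
The round trip can be exercised without a GPU. The sketch below is a stand-alone mock: every Sketch* and sketch_* name is hypothetical, the status constant is an arbitrary placeholder, and sketch_request_device merely stands in for an asynchronous wgpu-native call by firing the callback immediately. It only illustrates how the result struct passed as the first userdata pointer gets filled in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_STATUS_SUCCESS 1u  /* placeholder value, an assumption for this mock */

/* Mock callback signature: status, device, two userdata pointers. */
typedef void (*SketchDeviceCallback)(uint32_t status, void *device,
                                     void *ud1, void *ud2);

/* Result slot owned by the caller (the Mojo side in the real project). */
typedef struct {
    void *device;
    uint32_t status;
    int fired;
} SketchDeviceResult;

/* Static C callback: writes into the result struct passed as ud1. */
static void sketch_device_cb(uint32_t status, void *device,
                             void *ud1, void *ud2) {
    (void)ud2;  /* unused in this sketch */
    SketchDeviceResult *r = (SketchDeviceResult *)ud1;
    if (r) { r->device = device; r->status = status; r->fired = 1; }
}

/* Getter the foreign caller uses to obtain the function pointer. */
SketchDeviceCallback sketch_get_device_cb(void) { return sketch_device_cb; }

/* Mock of an async request: invokes the callback right away. */
static int sketch_fake_device;
void sketch_request_device(SketchDeviceCallback cb, void *ud1, void *ud2) {
    assert(cb != NULL);
    cb(SKETCH_STATUS_SUCCESS, &sketch_fake_device, ud1, ud2);
}
```

The caller allocates a SketchDeviceResult, passes its address as ud1 together with the pointer from the getter, and reads the fields once the callback has fired.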

3. ASAP Destruction: Objects Vanish After Their Last Use

Mojo’s “As Soon As Possible” (ASAP) destruction policy causes Movable types to be deallocated immediately after their final use. This clashes profoundly with the temporal semantics of GPU programming.

Consider the triangle example: creating a RenderPipeline requires passing handles to a PipelineLayout and ShaderModule. The Mojo compiler, observing the pl.handle() call, assumes pl’s lifecycle has ended. However, wgpu-native only accesses this handle internally during create_render_pipeline.

var pipeline = device.create_render_pipeline(desc)
# Without a later use, pl and shader may already be destroyed by the time the call above runs,
# which can lead to crashes or validation errors

Solution: Utilize _ = var^ to explicitly pin the destruction of the consumed value:

var pipeline = device.create_render_pipeline(desc)
_ = pl^      # Defers pl's destruction until this line, ensuring safety
_ = shader^  # Same logic

This is a valid use of Mojo's transfer semantics, not a hack. It does, however, require a mental check at every .handle() call: “How long must this wrapper survive? Has the submission completed?”

Currently, the codebase contains a dozen such pins scattered across tests and examples.

4. Counterintuitive Initialization of List[T]

The instinctive approach to creating a list with a single element is:

var cmds = List[OpaquePtr](cmd)   # Compilation error: no matching function

This syntax invokes the constructor accepting capacity, not the one initializing from elements. The correct approach is:

var cmds: List[OpaquePtr] = [cmd]  # Invokes __list_literal__, works correctly

However, a constraint exists: T must be Copyable. Callback info structs containing raw pointers (e.g., WGPUBufferMapCallbackInfo) do not implement Copyable and cannot be placed in a List; they require manual alloc/free management. This limitation, regrettably, stems from lingering Pythonic paradigms.

5. WGPUStringView Lifetime: A Silent Bomb

In wgpu-native v29, all string parameters were transitioned to WGPUStringView (pointer + length), replacing the legacy null-terminated const char*.

The str_to_sv function constructs a WGPUStringView from a Mojo String, holding a raw pointer to the string’s internal buffer. If the String’s lifetime is shorter than that of the WGPUStringView, a use-after-free occurs:

# Dangerous: String("vs_main") is a temporary that may be destroyed right after this call,
# leaving vs_entry pointing at freed memory
var vs_entry = str_to_sv(String("vs_main"))

# Safer: the String is bound to a named variable; per pitfall 3 it must still
# outlive every use of vs_entry
var vs_name  = String("vs_main")
var vs_entry = str_to_sv(vs_name)

This issue leaves no obvious trace in the code, surfacing only when the GPU receives garbled strings or triggers validation errors.
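
The underlying hazard is easiest to see on the C side. The sketch below uses a local, hypothetical SketchStringView with the same pointer + length shape. It does not include webgpu.h, and the helper names are illustrative. A borrowed view aliases the owner's bytes, so the owner must outlive it, while an explicit copy removes that dependency.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-in for a pointer + length string view (hypothetical name). */
typedef struct {
    const char *data;
    size_t length;
} SketchStringView;

/* Borrow: the view neither owns nor copies the bytes, so `s` must
   stay alive for as long as the returned view is used. */
SketchStringView sketch_sv_borrow(const char *s) {
    assert(s != NULL);
    SketchStringView sv = { s, strlen(s) };
    return sv;
}

/* Copy the viewed bytes into a caller-owned buffer and NUL-terminate.
   Returns 0 on success, -1 if the buffer is missing or too small. */
int sketch_sv_copy(SketchStringView sv, char *dst, size_t cap) {
    if (dst == NULL || cap < sv.length + 1) return -1;
    if (sv.length > 0) memcpy(dst, sv.data, sv.length);
    dst[sv.length] = '\0';
    return 0;
}
```

After sketch_sv_borrow the view's data pointer is the owner's buffer itself, which is exactly why destroying the owner early corrupts the string the GPU layer later reads.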

6. Enum Value Shifts from v27 to v29 Without Compile Errors

The wgpu-native v29 enum system inserted BindingNotUsed=0x0 at the beginning, causing all subsequent numeric values to shift by one. Code relying on hardcoded numeric constants silently fails; binding types mismatch, GPU validation errors occur, yet the error messages offer no clue regarding the enum values.

For instance, the Storage buffer binding type was 2 in v27 but became 3 in v29. Confirming such discrepancies requires consulting the webgpu.h source code.
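
One way to make such a shift fail loudly is a compile-time guard in the C bridge. The sketch below mirrors the two numberings as described in this post using local, hypothetical enum names; in a real bridge the assertion would compare the constant from webgpu.h against the value hardcoded on the Mojo side.

```c
#include <assert.h>
#include <stdint.h>

/* Older numbering as described above (local mirror, hypothetical names). */
enum SketchBufferBindingOld {
    SKETCH_OLD_UNDEFINED = 0,
    SKETCH_OLD_UNIFORM = 1,
    SKETCH_OLD_STORAGE = 2,
    SKETCH_OLD_READ_ONLY_STORAGE = 3
};

/* Newer numbering: BindingNotUsed inserted at 0, everything else moves up. */
enum SketchBufferBindingNew {
    SKETCH_NEW_BINDING_NOT_USED = 0,
    SKETCH_NEW_UNDEFINED = 1,
    SKETCH_NEW_UNIFORM = 2,
    SKETCH_NEW_STORAGE = 3,
    SKETCH_NEW_READ_ONLY_STORAGE = 4
};

/* Guard: compilation stops if the value the bindings assume ever changes. */
_Static_assert(SKETCH_NEW_STORAGE == 3,
               "binding enum layout changed; re-check webgpu.h");

/* Translate a value from the older numbering to the newer one. */
uint32_t sketch_upgrade_buffer_binding(uint32_t old_value) {
    assert(old_value <= (uint32_t)SKETCH_OLD_READ_ONLY_STORAGE);
    return old_value + 1u;
}
```

A guard like this turns a silent runtime mismatch into a build failure at the moment the header is upgraded.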

7. wgpuQuerySetDestroy Causes Double-Free in v29

In v29, wgpuQuerySetDestroy internally performs the drop operation, removing the resource from the registry. Subsequent calls to wgpuQuerySetRelease result in a double-free and program crash. The semantics of these two functions differed in v27, causing immediate breakage upon upgrade.

Fix: QuerySet.__del__ should invoke Release exclusively, omitting Destroy.

8. Manual Compilation of wgpu_callbacks.c

This C bridge layer is not built automatically by the other Pixi tasks; it requires a one-time manual step (pixi run build-callbacks in the Quick Start). The compiler invocation is:

gcc -shared -fPIC -o ffi/lib/libwgpu_mojo_cb.so ffi/wgpu_callbacks.c \
    -Iffi/include -Lffi/lib -lwgpu_native -Wl,-rpath,'$ORIGIN'

Failure to compile this beforehand results in silent dlopen failures at runtime, with error messages that may not point to the missing .so file.
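
When debugging this, a tiny C helper that reports dlerror() can confirm whether the bridge library is loadable at all. This is a sketch, not part of the project: sketch_open_or_report is a hypothetical name, and on older glibc versions the program may need to be linked with -ldl.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Try to load a shared library; on failure, print the loader's own
   explanation to stderr. Returns the handle, or NULL on failure. */
void *sketch_open_or_report(const char *path) {
    assert(path != NULL);
    dlerror();  /* clear any stale error state */
    void *handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        const char *why = dlerror();
        fprintf(stderr, "dlopen(%s) failed: %s\n",
                path, why ? why : "unknown error");
    }
    return handle;
}
```

Pointing it at ffi/lib/libwgpu_mojo_cb.so before running the examples shows the loader's message (missing file, unresolved dependency, and so on) instead of a failure that surfaces later and elsewhere.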

9. All Types Alias to OpaquePtr

Currently, all GPU handles (WGPUBufferHandle, WGPUTextureHandle, WGPUPipelineHandle, etc.) are aliased as the same OpaquePtr type within the type system. Passing an incorrect handle yields no compile-time errors; only runtime wgpu validation errors (if enabled) will surface the issue.
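
One way to regain type checking is to wrap each handle kind in its own single-member struct, so that mixing them becomes a compile error. The sketch below shows the idea in C with hypothetical names; the analogous Mojo version would be one small struct per resource kind holding the raw pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Distinct wrapper types around the same raw pointer representation. */
typedef struct { void *raw; } SketchBufferHandle;
typedef struct { void *raw; } SketchTextureHandle;

SketchBufferHandle sketch_buffer_wrap(void *raw) {
    assert(raw != NULL);
    SketchBufferHandle h = { raw };
    return h;
}

SketchTextureHandle sketch_texture_wrap(void *raw) {
    assert(raw != NULL);
    SketchTextureHandle h = { raw };
    return h;
}

/* Accepts only a buffer handle: passing a SketchTextureHandle here is
   rejected by the compiler, unlike two aliases of the same pointer type. */
void *sketch_buffer_unwrap(SketchBufferHandle h) { return h.raw; }
```

The wrappers carry no runtime cost beyond the pointer itself, since each struct has the same size and layout as the raw handle.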

10. Redundant dlopen Calls in Every Wrapper

Each GPU wrapper struct (Device, Buffer, Texture, etc.) instantiates a WGPULib object within its __init__, which in turn calls OwnedDLHandle, effectively invoking dlopen every time. Consequently, creating a TextureView in the render loop triggers a dlopen per frame. While functional, this is not a sound design.


Recommendations for the Mojo Language

Having completed this binding, several areas where language-level support would significantly simplify C FFI implementation come to mind:

1. Correct DLHandle.call to Emit SysV ABI byval
The generation of erroneous LLVM IR for structs exceeding 16 bytes passed by value is the root cause. If DLHandle.call correctly emits the byval attribute, the entire workaround of allocating descriptors to the heap and passing pointers could be dismantled.

2. Provide Syntax for Obtaining C-Callable Function Pointers
A syntax akin to C’s &my_callback to obtain a C-compatible pointer from a Mojo function would allow direct passing to C APIs requiring callbacks, obviating the need for the entire wgpu_callbacks.c bridge layer.

3. Promote the Use of @align Decorator
The @align(N) decorator allows alignment requirements to be specified directly on struct types, ensuring both stack and heap allocations adhere to them. This is particularly vital for GPU descriptors requiring specific alignment (e.g., 64-byte alignment for certain hardware TMA descriptors), offering greater safety than manual specification in every alloc call.

4. Leverage the with Statement
Context managers (__enter__ / __exit__) are fully functional in the current nightly build, with __exit__ executing even upon errors. This offers a more structured alternative to _ = var^ pinning for scope-bound resource cleanup. While the current GPU wrappers cannot yet be used as context managers because they hold a non-copyable OwnedDLHandle, adjusting the ownership design would enable idioms like with device:.

5. Enhanced Lifetime and Borrowing Tools
Currently, developers must manually track “how long this wrapper needs to survive” at every .handle() call and insert _ = var^. A mechanism akin to Rust’s borrow checker, or even lighter-weight scope annotations, would allow such lifetime errors to be caught at compile time.


So…

This research was made possible by the exhaustive work already done in the WebGPU ecosystem. The existence of the WebGPU specification, wgpu-native, wgpu-rs, and wgpu-py provided the essential map for this exploration. Without these established resources, I wouldn’t even have a starting point to test the limits of Mojo’s FFI.

I also utilized Mojo AI Skills as a practical tool to handle the more tedious aspects of the binding process. It was efficient for generating boilerplate and sanity-checking Mojo’s evolving syntax, allowing me to focus on the actual architectural hurdles.

Ultimately, my role was to synthesize these resources, navigate the pitfalls, and bridge the gap—consuming a fair number of tokens along the way.

Moving forward, I intend to explore more elegant alternatives to _ = var^ pinning (perhaps via context managers) and to investigate @align(N). I also plan to experiment with using MAX's compute capabilities in the rendering pipeline, and perhaps to implement an input system and develop a bindgen. Regardless, there remains much to learn. Until next night, Mojicians! 🪄

Please file a bug for this.

A Mojo foo: def(Int) -> None should be equivalent to a C void (*foo)(ssize_t). If it’s not, or if you find other mismatches, please file bugs.

The solution to this is to use origins to notify the compiler that the handle requires information from pl.

For var cmds = List[OpaquePtr](cmd), the variadic constructor was causing some headaches, so we’ve moved to using literals where possible, as you found. The Copyable constraint is a result of conditional conformance not fully working, and that will be fixed in the future.

“they require manual alloc/free management”

Mojo offers both RAII (which as far as I’m aware is what C++ and Rust bindings use) as well as linear types, which can be used to force users to call a specific function later on. In general, I would expect that a library like this can make the API fully safe and difficult to cause memory leaks with.

You’ll need to hold an origin to fix this, look at the stdlib StringSpan for how to do that.

Sadly we can’t really help here until automatic binding generation works.

You should be able to use pixi tasks for this.

This should definitely be fixable using the “newtype” pattern, which would be a struct that only contains a single member and acts to enhance type safety.

You can do mojo build -Xlinker -l libfoo.so to link libfoo. You might want to look at doing that to cut down on dlopen if the library is mandatory.

I strongly agree on the first two. For the third one, you should always use aligns for anything that has alignment requirements. The capability to over-align exists for reasons of things like Float32, which you sometimes want 64-byte aligned for SIMD reasons but which doesn’t need to always be 64-byte aligned.

For 4 and 5, I think you missed that Mojo has a borrow checker (part of what powers the ASAP destruction), and you aren’t integrating with it while using unsafe constructs (OpaquePointer), which is causing a lot of your headaches. Rust would cause you similar pain if you tried to do this there, although you get some leniency because drops only happen at scope close. LLMs are really bad at utilizing it, especially around low-level bindings like this, which might be why you missed it.

Overall

Overall, I think this is a fantastic project to have in the ecosystem, and I’m happy to help guide you towards more idiomatic Mojo. I know the lower-level, FFI-heavy stuff hasn’t been documented as well as it should have been, so this may be a sign we need better docs there.

I think the issue is already tracked and being worked on. Using the linker and external_call should help, but works only with mojo compile, not with mojo run.

Thanks for the feedback. I’ll be honest—the origin and borrowing system in Mojo is still a significant learning curve for me. I’m currently digging into the stdlib and StringSpan to better understand how to implement this correctly.

The C callback implementation was indeed a bit of a naive placeholder. It was the most intuitive way for me to get a working prototype, but I realize now it needs a more native approach to handle Mojo’s requirements properly.

Regarding the issues already being tracked, I’m going to try reproducing them. I’m not entirely sure if I can isolate the ABI issues on my current setup, but I’ll jump into the GitHub thread to provide whatever logs or context I can find.

Appreciate the guidance as I navigate these lower-level details!

“I’ll be honest, the origin and borrowing system in Mojo is still a significant learning curve for me.”

You can treat it as a superset of how Rust does things, albeit with a requirement that you use Pointer[T, origin] instead of &'origin T. That should give you a lot more learning materials to work with. Sadly, origins currently aren’t explained super well and the biggest place you deal with all of the footguns is in FFI code like this, so you sort-of dove into the deep end here.
