Preface
This endeavor is both simple and complex: enabling Mojo programs to invoke wgpu-native directly and render a triangle.
It is simple because rendering a triangle is the Hello World of graphics programming; once the triangle appears, the entire pipeline (WGSL shader compilation, RenderPipeline creation, GPU command submission, and window presentation) is known to work end to end. It is complex because implementing C Foreign Function Interface (FFI) bindings in Mojo hides subtle pitfalls at nearly every step.
This article documents those pitfalls.
Results
Executing pixi run example-triangle yields a window displaying an RGB gradient triangle:
pixi run example-triangle
Beneath this visual lies a chain of operations driven entirely by Mojo: GLFW window initialization, wgpu Surface creation, ShaderModule compilation, RenderPipeline construction, draw calls, and final presentation. Furthermore, a compute shader example (pixi run example-compute) successfully performs vector addition on 1,024 elements on an RTX 3060, yielding correct results.
Usage
Prerequisites
- Mojo ≥ 0.26.3 (nightly)
- Pixi package manager
- ffi/lib/libwgpu_native.so (the linux_x86_64 shared library for wgpu-native v29.0.0.0)
Quick Start
# Compile the C bridge layer (one-time setup)
pixi run build-callbacks
# Execute the triangle example
pixi run example-triangle
# Run non-GPU tests
pixi run test
Architectural Overview of the Hello Triangle
The WGSL shader is embedded directly within the Mojo source code as a comptime string constant:
comptime TRIANGLE_WGSL = """
struct VertexOutput {
    @builtin(position) pos: vec4<f32>,
    @location(0) color: vec3<f32>,
}

@vertex
fn vs_main(@builtin(vertex_index) idx: u32) -> VertexOutput {
    var positions = array<vec2<f32>, 3>(
        vec2<f32>( 0.0,  0.5),
        vec2<f32>(-0.5, -0.5),
        vec2<f32>( 0.5, -0.5),
    );
    var colors = array<vec3<f32>, 3>(
        vec3<f32>(1.0, 0.0, 0.0),
        vec3<f32>(0.0, 1.0, 0.0),
        vec3<f32>(0.0, 0.0, 1.0),
    );
    var out: VertexOutput;
    out.pos = vec4<f32>(positions[idx], 0.0, 1.0);
    out.color = colors[idx];
    return out;
}

@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4<f32> {
    return vec4<f32>(in.color, 1.0);
}
"""
The main program skeleton follows:
def main() raises:
    var inst = request_adapter()
    var device = inst.request_device()
    var canvas = RenderCanvas(inst, device, 800, 600, "wgpu-mojo: hello triangle")
    var shader = device.create_shader_module_wgsl(TRIANGLE_WGSL, "triangle")
    # ... Pipeline construction ...
    while canvas.is_open():
        canvas.poll()
        var frame = canvas.next_frame()
        # ... Encode render pass, draw(3, 1, 0, 0), submit ...
        canvas.present()
While the high-level API is reasonably elegant, the real complexity lives in the elided pipeline-construction code.
Project Structure
wgpu/                  High-level RAII wrappers (Device, Buffer, Texture, Pipeline…)
wgpu/_ffi/
    lib.mojo           WGPULib — ~170 functions dispatched via OwnedDLHandle + dlsym
    types.mojo         Type aliases, enums, and bitflags
    structs.mojo       C struct layouts (descriptors, callback results)
ffi/
    wgpu_callbacks.c   C bridge layer
    include/webgpu/    webgpu.h / wgpu.h headers
    lib/               .so libraries
The wgpu-native API comprises approximately 170 functions, each requiring a manually crafted DLHandle.call[...] invocation in lib.mojo with exact type correspondence. All the subsequent challenges stem from this foundational layer.
Ten Pitfalls Encountered
1. DLHandle.call Returns Garbage for Structs Exceeding 16 Bytes
This represents the project’s most fundamental hurdle. Nearly every wgpu-native construction function accepts a large descriptor struct, such as WGPURenderPipelineDescriptor, WGPUBindGroupDescriptor, or WGPUComputePipelineDescriptor, which often span 40 to 120 bytes.
However, DLHandle.call fails to emit the byval attribute required by the SysV x86_64 ABI when passing structs larger than 16 bytes. Consequently, the C side receives random data from the stack.
Empirical verification (via a custom C library with printf statements):
- 16 bytes (2 × uint64): C receives a=10, b=20 → sum=30 ✓
- 24 bytes (3 × uint64): C receives a=108586124556112, b=488, c=8 → FAIL
Solution: All wgpu functions accepting large structs must be wrapped in a pointer-based version within the C bridge layer:
// Pattern in ffi/wgpu_callbacks.c
void wgpuDeviceCreateRenderPipeline_ptr(
    WGPUDevice device,
    const WGPURenderPipelineDescriptor *desc,
    WGPURenderPipeline *out
) {
    *out = wgpuDeviceCreateRenderPipeline(device, desc);
}
On the Mojo side, the struct is allocated on the heap via alloc[WGPURenderPipelineDescriptor](1), populated, and then passed by pointer. This is not a temporary workaround but the only viable approach under current constraints.
2. Inability to Pass Mojo Functions as C Callbacks
The asynchronous wgpu-native API relies heavily on C function pointer callbacks, such as WGPURequestDeviceCallback passed to wgpuAdapterRequestDevice.
In Mojo 0.26 nightly, there is no mechanism to obtain a C-compatible function pointer from a Mojo function. The fn keyword is deprecated in this version, and def functions lack address-of syntax.
Solution: Maintain a separate C file (wgpu_callbacks.c) defining static C callbacks and result structures, along with a getter function for Mojo to retrieve the callback pointer:
static void _wgpu_mojo_device_cb(
    WGPURequestDeviceStatus status,
    WGPUDevice device,
    WGPUStringView message,
    void *ud1, void *ud2
) {
    MojoDeviceResult *r = (MojoDeviceResult *)ud1;
    if (r) { r->device = (void *)device; r->status = (uint32_t)status; }
}

WGPURequestDeviceCallback wgpu_mojo_get_device_cb(void) {
    return _wgpu_mojo_device_cb;
}
Mojo invokes the getter to retrieve the pointer and passes it to wgpu-native. One must write a dedicated wrapper for every distinct callback type; this is unavoidable.
3. ASAP Destruction: Objects Vanish After Their Last Use
Mojo’s “As Soon As Possible” (ASAP) destruction policy causes Movable types to be deallocated immediately after their final use. This clashes profoundly with the temporal semantics of GPU programming.
Consider the triangle example: creating a RenderPipeline requires passing handles to a PipelineLayout and ShaderModule. The Mojo compiler, observing the pl.handle() call, assumes pl’s lifecycle has ended. However, wgpu-native only accesses this handle internally during create_render_pipeline.
var pipeline = device.create_render_pipeline(desc)
# pl and shader are already ASAP-dropped before this line
# ↑ This may lead to crashes or validation errors
Solution: Utilize _ = var^ to explicitly pin the destruction of the consumed value:
var pipeline = device.create_render_pipeline(desc)
_ = pl^      # Defers pl's destruction until this line, ensuring safety
_ = shader^  # Same logic
This is legitimate transfer semantics, not a hack, but it forces a mental check at every .handle() call: how long must this wrapper survive? Has the submission completed?
Currently, the codebase contains a dozen such pins scattered across tests and examples.
4. Counterintuitive Initialization of List[T]
The instinctive approach to creating a list with a single element is:
var cmds = List[OpaquePtr](cmd)  # Compilation error: no matching function
This syntax invokes the constructor accepting capacity, not the one initializing from elements. The correct approach is:
var cmds: List[OpaquePtr] = [cmd]  # Invokes __list_literal__, works correctly
However, a constraint exists: T must be Copyable. Callback info structs holding raw pointers (e.g., WGPUBufferMapCallbackInfo) do not implement Copyable and cannot be placed in a List; they require manual alloc/free management instead. This limitation seems to be a holdover from the standard library's Python-inspired design.
5. WGPUStringView Lifetime: A Silent Bomb
In wgpu-native v29, all string parameters were transitioned to WGPUStringView (pointer + length), replacing the legacy null-terminated const char*.
The str_to_sv function constructs a WGPUStringView from a Mojo String, holding a raw pointer to the string’s internal buffer. If the String’s lifetime is shorter than that of the WGPUStringView, a use-after-free occurs:
# Dangerous: String("vs_main") is a temporary value, prone to early destruction
var vs_entry = str_to_sv(String("vs_main"))

# Safe: the String is bound to a named variable with an explicit lifetime
var vs_name = String("vs_main")
var vs_entry = str_to_sv(vs_name)
This issue leaves no obvious trace in the code, surfacing only when the GPU receives garbled strings or triggers validation errors.
6. Enum Value Shifts from v27 to v29 Without Compile Errors
The wgpu-native v29 enum system inserted BindingNotUsed=0x0 at the beginning, causing all subsequent numeric values to shift by one. Code relying on hardcoded numeric constants silently fails; binding types mismatch, GPU validation errors occur, yet the error messages offer no clue regarding the enum values.
For instance, the Storage buffer binding type was 2 in v27 but became 3 in v29. Confirming such discrepancies requires consulting the webgpu.h source code.
7. wgpuQuerySetDestroy Causes Double-Free in v29
In v29, wgpuQuerySetDestroy internally performs the drop operation, removing the resource from the registry. Subsequent calls to wgpuQuerySetRelease result in a double-free and program crash. The semantics of these two functions differed in v27, causing immediate breakage upon upgrade.
Fix: QuerySet.__del__ should invoke Release exclusively, omitting Destroy.
8. Manual Compilation of wgpu_callbacks.c
This C bridge layer is not built automatically as part of Pixi's pipeline; it requires a manual one-time compilation step:
gcc -shared -fPIC -o ffi/lib/libwgpu_mojo_cb.so ffi/wgpu_callbacks.c \
-Iffi/include -Lffi/lib -lwgpu_native -Wl,-rpath,'$ORIGIN'
Failure to compile this beforehand results in silent dlopen failures at runtime, with error messages that may not point to the missing .so file.
9. All Types Alias to OpaquePtr
Currently, all GPU handles (WGPUBufferHandle, WGPUTextureHandle, WGPUPipelineHandle, etc.) are aliased as the same OpaquePtr type within the type system. Passing an incorrect handle yields no compile-time errors; only runtime wgpu validation errors (if enabled) will surface the issue.
10. Redundant dlopen Calls in Every Wrapper
Each GPU wrapper struct (Device, Buffer, Texture, etc.) instantiates a WGPULib object within its __init__, which in turn calls OwnedDLHandle, effectively invoking dlopen every time. Consequently, creating a TextureView in the render loop triggers a dlopen per frame. While functional, this is not a sound design.
Recommendations for the Mojo Language
Having completed this binding, I can point to several areas where language-level support would substantially simplify C FFI work:
1. Correct DLHandle.call to Emit SysV ABI byval
The generation of erroneous LLVM IR for structs exceeding 16 bytes passed by value is the root cause. If DLHandle.call correctly emits the byval attribute, the entire workaround of allocating descriptors to the heap and passing pointers could be dismantled.
2. Provide Syntax for Obtaining C-Callable Function Pointers
A syntax akin to C’s &my_callback to obtain a C-compatible pointer from a Mojo function would allow direct passing to C APIs requiring callbacks, obviating the need for the entire wgpu_callbacks.c bridge layer.
3. Promote the Use of @align Decorator
The @align(N) decorator allows alignment requirements to be specified directly on struct types, ensuring both stack and heap allocations adhere to them. This is particularly vital for GPU descriptors requiring specific alignment (e.g., 64-byte alignment for certain hardware TMA descriptors), offering greater safety than manual specification in every alloc call.
4. Leverage the with Statement
Context managers (__enter__ / __exit__) are fully functional in the current nightly build, with __exit__ running even when an error is raised. This offers a more structured alternative to _ = var^ pinning for scoped resource cleanup. The current GPU wrappers cannot yet act as context managers because they hold a non-copyable OwnedDLHandle, but adjusting the ownership design would enable idioms like with device:.
5. Enhanced Lifetime and Borrowing Tools
Currently, developers must manually track “how long this wrapper needs to survive” at every .handle() call and insert _ = var^. A mechanism akin to Rust’s borrow checker, or even lighter-weight scope annotations, would allow such lifetime errors to be caught at compile time.
So…
This research was made possible by the exhaustive work already done in the WebGPU ecosystem. The existence of the WebGPU specification, wgpu-native, wgpu-rs, and wgpu-py provided the essential map for this exploration. Without these established resources, I wouldn’t even have a starting point to test the limits of Mojo’s FFI.
I also utilized Mojo AI Skills as a practical tool to handle the more tedious aspects of the binding process. It was efficient for generating boilerplate and sanity-checking Mojo’s evolving syntax, allowing me to focus on the actual architectural hurdles.
Ultimately, my role was to synthesize these resources, navigate the pitfalls, and bridge the gap—consuming a fair number of tokens along the way.
Moving forward, I intend to explore more elegant solutions for _ = var^ (perhaps via context managers) and investigate @align(N). I also plan to experiment with leveraging MAX's computational power to forge a path directly into the rendering pipeline, or perhaps implement an input system and develop a bindgen. Regardless, there remains much to learn. Until next night, Mojicians!
