Zero-copy GPU interop: export/import FD API for Mojo DeviceBuffer + DeviceContext

nilor-corp_lucas · August 23, 2025, 7:29am

Hello,

I would like to build a real-time AI engine on Mojo. To do this I need zero-copy access from a Mojo-allocated device buffer into GL/Vulkan. As far as I can tell there is currently no way to achieve this, which makes a real-time (visual) engine a non-starter.

A possible approach (and I have no idea how difficult this would be on the runtime side) would be to expose FD-based interop: exporting GPU memory as a POSIX fd and signaling a timeline semaphore fd, so that GL/VK can import and present the same buffer with zero copies.

Here is a proposed one-pager. I relied heavily on various AI agents to get to this point, so even a sanity check from someone down in the weeds would be helpful. I’ve done my best to verify these findings against the available source, docs, and Discord knowledge bases.

Request: Minimal MAX/AsyncRT C API for Zero-Copy Interop (Multi-Format)

Goal

Let Mojo kernels write into device-local VRAM, and let OpenGL/Vulkan present it without device→host copies, by sharing:

- a GPU memory handle (POSIX FD), and
- a GPU sync object (timeline semaphore, POSIX FD).

OpenGL/Vulkan already import these. We need export + signal from MAX/AsyncRT.

Proposed C ABI (exact)

Each function returns 0 on success, or a negative errno value (-errno) on failure. All parameters are required; enum entries marked optional may be left unsupported by an implementation.

// Pixel format (minimal but useful set; extensible)
typedef enum {
  MAX_FMT_R8            = 1,   // 1x8  UNORM
  MAX_FMT_RG8           = 2,   // 2x8  UNORM
  MAX_FMT_RGBA8         = 3,   // 4x8  UNORM (little-endian, RGBA order)
  MAX_FMT_BGRA8         = 4,   // 4x8  UNORM (BGRA order)       // optional
  MAX_FMT_SRGBA8        = 5,   // 4x8  sRGB                      // optional
  MAX_FMT_R16F          = 6,   // 1x16 float
  MAX_FMT_RG16F         = 7,   // 2x16 float
  MAX_FMT_RGBA16F       = 8,   // 4x16 float
  MAX_FMT_R32F          = 9,   // 1x32 float                     // optional
  MAX_FMT_RG32F         = 10,  // 2x32 float                     // optional
  MAX_FMT_RGBA32F       = 11   // 4x32 float                     // optional
} MAX_Format;
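
Each format in the enum implies a fixed pixel size, which an importer needs in order to validate the row pitch and allocation size reported by an export call. A minimal sketch, assuming the enum values above; `max_format_bytes_per_pixel` and `max_layout_is_valid` are hypothetical helper names and not part of the proposed ABI:

```c
#include <stdint.h>

/* Copy of the proposed enum so the sketch compiles on its own. */
typedef enum {
  MAX_FMT_R8 = 1, MAX_FMT_RG8 = 2, MAX_FMT_RGBA8 = 3, MAX_FMT_BGRA8 = 4,
  MAX_FMT_SRGBA8 = 5, MAX_FMT_R16F = 6, MAX_FMT_RG16F = 7,
  MAX_FMT_RGBA16F = 8, MAX_FMT_R32F = 9, MAX_FMT_RG32F = 10,
  MAX_FMT_RGBA32F = 11
} MAX_Format;

/* Hypothetical helper: bytes per pixel, or 0 for an unknown format value. */
static uint32_t max_format_bytes_per_pixel(MAX_Format fmt) {
  switch (fmt) {
    case MAX_FMT_R8:      return 1;
    case MAX_FMT_RG8:     return 2;
    case MAX_FMT_RGBA8:
    case MAX_FMT_BGRA8:
    case MAX_FMT_SRGBA8:  return 4;
    case MAX_FMT_R16F:    return 2;
    case MAX_FMT_RG16F:   return 4;
    case MAX_FMT_RGBA16F: return 8;
    case MAX_FMT_R32F:    return 4;
    case MAX_FMT_RG32F:   return 8;
    case MAX_FMT_RGBA32F: return 16;
  }
  return 0;
}

/* Hypothetical helper: returns 1 if the reported layout is self-consistent.
 * The size check is conservative: it requires room for `height` full rows
 * of `row_pitch_bytes`, including any padding after the last row. */
static int max_layout_is_valid(MAX_Format fmt, uint32_t width, uint32_t height,
                               uint32_t row_pitch_bytes, uint64_t size_bytes) {
  uint64_t bpp = max_format_bytes_per_pixel(fmt);
  if (bpp == 0 || width == 0 || height == 0) return 0;
  if ((uint64_t)row_pitch_bytes < (uint64_t)width * bpp) return 0;
  return (uint64_t)row_pitch_bytes * height <= size_bytes;
}
```

For a tightly packed 64×64 RGBA8 buffer this accepts a pitch of 256 bytes and a size of 16384 bytes, and rejects anything smaller.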

// 1) Export device-local buffer as OPAQUE_FD (single-plane, linear)
int AsyncRT_DeviceBuffer_export_fd(
    const void* device_buffer_handle,     // Mojo _DeviceBufferPtr
    int*        out_fd,                   // POSIX fd (dup'ed for caller)
    uint64_t*   out_size_bytes,           // total allocation size
    uint32_t*   out_row_pitch_bytes,      // bytes per row (>= width * bytes_per_pixel)
    uint32_t*   out_width,                // in pixels
    uint32_t*   out_height,               // in pixels
    MAX_Format* out_format                // from enum above
);

// 2) Create timeline semaphore and export as OPAQUE_FD
int AsyncRT_Create_timeline_semaphore_fd(
    const void* device_context_handle,    // Mojo _DeviceContextPtr (same device/stream)
    uint64_t    initial_value,
    int*        out_fd
);

// 3) Signal timeline semaphore FD to 'value' (enqueue on same stream as prior work)
int AsyncRT_Signal_timeline_semaphore_fd(
    int         sem_fd,
    uint64_t    value
);
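
The return convention above (0 on success, -errno on failure) lets callers reuse the standard errno text. A minimal sketch of how a caller might decode it; `fake_interop_call` is only a stand-in that mimics the convention, and `interop_strerror` is a hypothetical helper, since none of the proposed AsyncRT functions exist yet:

```c
#include <errno.h>
#include <string.h>

/* Stand-in that follows the proposed convention: 0 on success,
 * negative errno on failure. NOT a real AsyncRT function. */
static int fake_interop_call(int simulate_errno) {
  return simulate_errno ? -simulate_errno : 0;
}

/* Hypothetical helper: human-readable text for a return code. */
static const char *interop_strerror(int rc) {
  if (rc == 0) return "success";
  if (rc > 0) return "unexpected positive return code";
  return strerror(-rc);  /* negate to recover the errno value */
}
```

A caller would check `rc < 0` after each call and log `interop_strerror(rc)` before unwinding.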

Semantics (precise, minimal)

Memory:
- Single-plane, linear layout (no tiling/modifiers) for the formats above.
- Little-endian channel packing. Channel order per name (e.g., RGBA8 = R,G,B,A).
- Provide row_pitch_bytes even if tightly packed (often width * bpp), to allow alignment.
- Allocation must be exportable and compatible with queue-family ownership transfers to and from VK_QUEUE_FAMILY_EXTERNAL.
Semaphore: timeline preferred; Mojo will signal value N after writing frame N.
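Ordering: AsyncRT_Signal_timeline_semaphore_fd() must be enqueued after the kernel, on the same stream that ran it.
FD lifetime: Returned fds are new (dup’ed) and owned by the caller. Vulkan and GL OPAQUE_FD imports take ownership of the fd on success, so the caller must not close() it afterwards. EGL dma-buf import does not take ownership, so the caller should close() its copy once the EGLImage exists. After a failed import the caller still owns the fd and must close() it.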
Extensibility: New formats can be added to MAX_Format without breaking ABI.
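
The "signal value N after writing frame N" rule can be modeled on the CPU as a monotonically increasing counter. A minimal sketch of those semantics only; `timeline_model`, `timeline_signal`, and `timeline_is_ready` are hypothetical names for illustration and do not touch any GPU object:

```c
#include <errno.h>
#include <stdint.h>

/* CPU-only model of a timeline semaphore: a 64-bit counter that only
 * moves forward. */
typedef struct {
  uint64_t value;
} timeline_model;

/* Producer side: signal `value`. As with Vulkan timeline semaphores, the
 * new value must be strictly greater than the current one. */
static int timeline_signal(timeline_model *t, uint64_t value) {
  if (value <= t->value) return -EINVAL;
  t->value = value;
  return 0;
}

/* Consumer side: a wait for `value` is satisfied once the counter has
 * reached or passed it. */
static int timeline_is_ready(const timeline_model *t, uint64_t value) {
  return t->value >= value;
}
```

With this model, a consumer waiting on frame 2 is released by a signal of 2 or of any later value, which is why a dropped frame does not stall presentation.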

Optional dma-buf variant (if you prefer dma-buf)

If memory export is via dma-buf, expose this alternate (or additional) function. It carries explicit layout metadata; fourcc/modifier are required.

// Single-plane dma-buf export (multi-plane/YUV out of scope for v1)
int AsyncRT_DeviceBuffer_export_dmabuf(
    const void* device_buffer_handle,
    int*        out_dmabuf_fd,
    uint64_t*   out_size_bytes,
    uint32_t*   out_stride_bytes,     // row pitch
    uint32_t*   out_width,
    uint32_t*   out_height,
    uint32_t*   out_fourcc,           // e.g., DRM_FORMAT_ABGR8888 (bytes R,G,B,A in memory) / DRM_FORMAT_ARGB8888 (bytes B,G,R,A)
    uint64_t*   out_modifier          // DRM_FORMAT_MOD_LINEAR (0) preferred for v1
);

GL path: EGL_EXT_image_dma_buf_import → glEGLImageTargetTexture2DOES.
VK path: VK_EXT_external_memory_dma_buf where supported; otherwise use OPAQUE_FD path.
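
One detail worth pinning down for the dma-buf variant: DRM fourcc names list channels from the most significant bits of a little-endian word, so a buffer whose bytes are R,G,B,A in memory is DRM_FORMAT_ABGR8888, not DRM_FORMAT_RGBA8888. A minimal sketch of a format mapping under that reading; `max_format_to_drm_fourcc` and `SKETCH_FOURCC` are hypothetical names, and formats whose fourcc I have not verified are deliberately left unmapped:

```c
#include <stdint.h>

/* Same construction as fourcc_code() in drm_fourcc.h, redefined here so
 * the sketch compiles without kernel headers. */
#define SKETCH_FOURCC(a, b, c, d)                               \
  ((uint32_t)(a) | ((uint32_t)(b) << 8) | ((uint32_t)(c) << 16) | \
   ((uint32_t)(d) << 24))

/* Copy of the proposed enum so the sketch compiles on its own. */
typedef enum {
  MAX_FMT_R8 = 1, MAX_FMT_RG8 = 2, MAX_FMT_RGBA8 = 3, MAX_FMT_BGRA8 = 4,
  MAX_FMT_SRGBA8 = 5, MAX_FMT_R16F = 6, MAX_FMT_RG16F = 7,
  MAX_FMT_RGBA16F = 8, MAX_FMT_R32F = 9, MAX_FMT_RG32F = 10,
  MAX_FMT_RGBA32F = 11
} MAX_Format;

/* Hypothetical helper: DRM fourcc for a MAX_Format, or 0
 * (DRM_FORMAT_INVALID) when this sketch has no mapping. */
static uint32_t max_format_to_drm_fourcc(MAX_Format fmt) {
  switch (fmt) {
    case MAX_FMT_R8:      /* DRM_FORMAT_R8 */
      return SKETCH_FOURCC('R', '8', ' ', ' ');
    case MAX_FMT_RG8:     /* DRM_FORMAT_GR88: bytes R,G in memory */
      return SKETCH_FOURCC('G', 'R', '8', '8');
    case MAX_FMT_RGBA8:   /* DRM_FORMAT_ABGR8888: bytes R,G,B,A in memory */
      return SKETCH_FOURCC('A', 'B', '2', '4');
    case MAX_FMT_BGRA8:   /* DRM_FORMAT_ARGB8888: bytes B,G,R,A in memory */
      return SKETCH_FOURCC('A', 'R', '2', '4');
    case MAX_FMT_RGBA16F: /* DRM_FORMAT_ABGR16161616F */
      return SKETCH_FOURCC('A', 'B', '4', 'H');
    default:              /* sRGB and the remaining float formats: unmapped */
      return 0;
  }
}
```

DRM fourcc codes carry no sRGB flag, so MAX_FMT_SRGBA8 would need the colorspace conveyed separately on the importer side.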

Consumer expectations (we already have these)

OpenGL: map MAX_Format → internal formats (GL_R8, GL_RG8, GL_RGBA8, GL_RGBA16F, GL_RGBA32F, …). For R*/RG*, we’ll set texture swizzles to expand to RGBA if needed.
Vulkan: map to VkFormat (e.g., VK_FORMAT_R8_UNORM, VK_FORMAT_R8G8_UNORM, VK_FORMAT_R8G8B8A8_UNORM, VK_FORMAT_R16G16B16A16_SFLOAT, VK_FORMAT_R32G32B32A32_SFLOAT, …). We copy buffer → image and draw.

Mojo stdlib shim (we’ll PR)

We’ll add gpu/host/interop_fd.mojo that forwards to:

AsyncRT_DeviceBuffer_export_fd(...)
AsyncRT_Create_timeline_semaphore_fd(...)
AsyncRT_Signal_timeline_semaphore_fd(...)

so app code stays pure-Mojo and won’t change when these land.

Minimal acceptance test

Export RGBA8 64×64 → GL/VK import, present a solid color.
Repeat with R8 and RGBA16F; verify correct sampling (GL swizzle for R8).
Animate 10k frames; no CPU copies; no FD leaks.
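
The "no FD leaks" criterion can be checked by counting the process's open descriptors before and after the frame loop. A minimal Linux-only sketch, assuming /proc is mounted; `count_open_fds` is a hypothetical test helper, not part of the proposed API:

```c
#include <dirent.h>
#include <stddef.h>

/* Hypothetical test helper (Linux-only): number of file descriptors open
 * in this process, or -1 if /proc/self/fd cannot be read. */
static int count_open_fds(void) {
  DIR *dir = opendir("/proc/self/fd");
  if (dir == NULL) return -1;
  int count = 0;
  struct dirent *entry;
  while ((entry = readdir(dir)) != NULL) {
    if (entry->d_name[0] == '.') continue;  /* skip "." and ".." */
    count++;
  }
  closedir(dir);
  /* The directory stream itself holds one fd while scanning; exclude it. */
  return count - 1;
}
```

The test would record the count once after setup, run the 10k-frame loop, and require the same count afterwards.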

Answer to “Do we need more than RGBA8 right now?”

Not to ship a POC, but yes to future-proof the ABI. The enum above gives a practical superset (R8/RG8/RGBA8 + 16F/32F variants + optional BGRA8/sRGB) without dragging in multi-plane YUV.