Zero-copy GPU interop: export/import FD API for Mojo DeviceBuffer + DeviceContext

Hello,

I would like to build a real-time AI engine on Mojo. To do this I need GL/Vulkan to read a Mojo-allocated device buffer with zero copies. I have not found a way to do this today, which makes a real-time (visual) engine a non-starter.

A possible approach (and I have no idea how difficult this would be on the runtime side) would be to expose FD-based interop: export GPU memory as a POSIX fd and signal a timeline semaphore that is also exported as an fd, so GL/VK can import and present the same buffer with zero copies.

Here is a proposed one-pager. I relied heavily on different AI agents to get to this point, so even a sanity check from someone down in the weeds would be helpful. I've done my best to verify these findings against the available source, docs, and Discord knowledge bases.


Request: Minimal MAX/AsyncRT C API for Zero-Copy Interop (Multi-Format)

Goal

Let Mojo kernels write into device-local VRAM, and let OpenGL/Vulkan present it without device→host copies, by sharing:

  1. a GPU memory handle (POSIX FD), and

  2. a GPU sync object (timeline semaphore, POSIX FD).

OpenGL/Vulkan already import these. We need export + signal from MAX/AsyncRT.


Proposed C ABI (exact)

Return 0 on success, or a negative errno value (-errno) on failure. All out-parameters are required; enum values marked optional may be left unsupported by an implementation.

// Pixel format (minimal but useful set; extensible)
typedef enum {
  MAX_FMT_R8            = 1,   // 1x8  UNORM
  MAX_FMT_RG8           = 2,   // 2x8  UNORM
  MAX_FMT_RGBA8         = 3,   // 4x8  UNORM (little-endian, RGBA order)
  MAX_FMT_BGRA8         = 4,   // 4x8  UNORM (BGRA order)       // optional
  MAX_FMT_SRGBA8        = 5,   // 4x8  sRGB                      // optional
  MAX_FMT_R16F          = 6,   // 1x16 float
  MAX_FMT_RG16F         = 7,   // 2x16 float
  MAX_FMT_RGBA16F       = 8,   // 4x16 float
  MAX_FMT_R32F          = 9,   // 1x32 float                     // optional
  MAX_FMT_RG32F         = 10,  // 2x32 float                     // optional
  MAX_FMT_RGBA32F       = 11   // 4x32 float                     // optional
} MAX_Format;

// 1) Export device-local buffer as OPAQUE_FD (single-plane, linear)
int AsyncRT_DeviceBuffer_export_fd(
    const void* device_buffer_handle,     // Mojo _DeviceBufferPtr
    int*        out_fd,                   // POSIX fd (dup'ed for caller)
    uint64_t*   out_size_bytes,           // total allocation size
    uint32_t*   out_row_pitch_bytes,      // bytes per row (>= width * bytes_per_pixel)
    uint32_t*   out_width,                // in pixels
    uint32_t*   out_height,               // in pixels
    MAX_Format* out_format                // from enum above
);

// 2) Create timeline semaphore and export as OPAQUE_FD
int AsyncRT_Create_timeline_semaphore_fd(
    const void* device_context_handle,    // Mojo _DeviceContextPtr (same device/stream)
    uint64_t    initial_value,
    int*        out_fd
);

// 3) Signal timeline semaphore FD to 'value' (enqueued after prior work on the stream of the context that created it)
int AsyncRT_Signal_timeline_semaphore_fd(
    int         sem_fd,
    uint64_t    value
);
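The intended call order for the semaphore functions (create once, then signal value N after frame N) can be sketched as below. This is only an illustration: the `stub_*` functions and `run_producer` are hypothetical stand-ins so the example runs without MAX/AsyncRT, and the stub fd is a placeholder opened on /dev/null, not a GPU object. The rule that values must strictly increase is an assumption borrowed from how timeline semaphores usually behave.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* HYPOTHETICAL stand-ins for the proposed entry points. */
static uint64_t g_timeline_value = 0;

int stub_create_timeline_semaphore_fd(const void* ctx, uint64_t initial_value,
                                      int* out_fd) {
    (void)ctx;
    if (out_fd == NULL) return -EINVAL;
    int fd = open("/dev/null", O_RDWR);  /* placeholder fd, not a GPU object */
    if (fd < 0) return -errno;
    g_timeline_value = initial_value;
    *out_fd = fd;
    return 0;
}

int stub_signal_timeline_semaphore_fd(int sem_fd, uint64_t value) {
    if (fcntl(sem_fd, F_GETFD) == -1) return -EBADF;  /* fd is not open */
    if (value <= g_timeline_value) return -EINVAL;    /* assumed: must increase */
    g_timeline_value = value;
    return 0;
}

/* Produce `frames` frames, signalling value n after frame n.
   Returns 0 on success or a negative errno value. */
int run_producer(uint64_t frames, uint64_t* out_last_value) {
    assert(out_last_value != NULL);
    int sem_fd = -1;
    int rc = stub_create_timeline_semaphore_fd(NULL, 0, &sem_fd);
    if (rc < 0) return rc;

    for (uint64_t n = 1; n <= frames; ++n) {
        /* ... enqueue the kernel that writes frame n here ... */
        rc = stub_signal_timeline_semaphore_fd(sem_fd, n);
        if (rc < 0) {
            fprintf(stderr, "signal failed: %s\n", strerror(-rc));
            break;
        }
    }
    close(sem_fd);  /* the stub fd was never imported, so we still own it */
    *out_last_value = g_timeline_value;
    return rc;
}
```

A consumer would wait for value n before sampling frame n; that side is left out because it depends on the GL/VK import path.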

Semantics (precise, minimal)

  • Memory:

    • Single-plane, linear layout (no tiling/modifiers) for the formats above.

    • Little-endian channel packing. Channel order per name (e.g., RGBA8 = R,G,B,A).

    • Provide row_pitch_bytes even if tightly packed (often width * bpp), to allow alignment.

    • Allocation must be exportable and compatible with Vulkan queue-family ownership transfers from VK_QUEUE_FAMILY_EXTERNAL.

  • Semaphore: timeline preferred; Mojo will signal value N after writing frame N.

  • Ordering: Signal_timeline_semaphore_fd() must be enqueued after the kernel on the same stream, so the semaphore reaches the value only once the kernel's writes are complete.

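  • FD lifetime: Returned fds are new (dup’ed) and owned by the caller. Ownership after import depends on the importer: Vulkan (VkImportMemoryFdInfoKHR) and GL (glImportMemoryFdEXT) take ownership of the fd on a successful import, so the caller must not close() it afterwards; EGL dma-buf import does not take ownership, so the caller close()s its fd once the EGLImage exists. On a failed import the caller still owns the fd and must close() it.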

  • Extensibility: New formats can be added to MAX_Format without breaking ABI.
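The memory rules above imply a simple consistency check an importer could run on the values returned by the export call. The sketch below is one way to write it, not part of the proposal: `max_format_bytes_per_pixel` and `validate_linear_layout` are hypothetical helper names, and the enum is copied from the proposed header so the example stands alone.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef enum {
    MAX_FMT_R8 = 1, MAX_FMT_RG8 = 2, MAX_FMT_RGBA8 = 3, MAX_FMT_BGRA8 = 4,
    MAX_FMT_SRGBA8 = 5, MAX_FMT_R16F = 6, MAX_FMT_RG16F = 7,
    MAX_FMT_RGBA16F = 8, MAX_FMT_R32F = 9, MAX_FMT_RG32F = 10,
    MAX_FMT_RGBA32F = 11
} MAX_Format;

/* Bytes per pixel for each proposed format; 0 for unknown values. */
uint32_t max_format_bytes_per_pixel(MAX_Format f) {
    switch (f) {
    case MAX_FMT_R8:      return 1;
    case MAX_FMT_RG8:     return 2;
    case MAX_FMT_RGBA8:
    case MAX_FMT_BGRA8:
    case MAX_FMT_SRGBA8:  return 4;
    case MAX_FMT_R16F:    return 2;
    case MAX_FMT_RG16F:   return 4;
    case MAX_FMT_RGBA16F: return 8;
    case MAX_FMT_R32F:    return 4;
    case MAX_FMT_RG32F:   return 8;
    case MAX_FMT_RGBA32F: return 16;
    }
    return 0;
}

/* Check that a single-plane linear layout is self-consistent:
   the pitch covers one row, and the allocation covers all rows.
   Returns 0 if consistent, -EINVAL otherwise. */
int validate_linear_layout(uint64_t size_bytes, uint32_t row_pitch_bytes,
                           uint32_t width, uint32_t height, MAX_Format format) {
    uint32_t bpp = max_format_bytes_per_pixel(format);
    if (bpp == 0 || width == 0 || height == 0) return -EINVAL;

    uint64_t min_pitch = (uint64_t)width * bpp;
    if (row_pitch_bytes < min_pitch) return -EINVAL;

    /* The last row only needs min_pitch bytes, not a full pitch. */
    uint64_t min_size = (uint64_t)(height - 1) * row_pitch_bytes + min_pitch;
    if (size_bytes < min_size) return -EINVAL;
    return 0;
}

/* Usage example: a tightly packed and a padded 64x64 RGBA8 buffer. */
void example_layout_checks(void) {
    assert(validate_linear_layout(64u * 64u * 4u, 256, 64, 64, MAX_FMT_RGBA8) == 0);
    assert(validate_linear_layout(64u * 512u, 512, 64, 64, MAX_FMT_RGBA8) == 0);
    assert(validate_linear_layout(64u * 64u * 4u, 255, 64, 64, MAX_FMT_RGBA8) == -EINVAL);
}
```

Returning the pitch separately is what makes the padded case (pitch 512 for a 256-byte row) valid rather than an error.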


Optional dma-buf variant (if you prefer dma-buf)

If memory export is via dma-buf, expose this alternate (or additional) function. It carries explicit layout metadata; fourcc/modifier are required.

// Single-plane dma-buf export (multi-plane/YUV out of scope for v1)
int AsyncRT_DeviceBuffer_export_dmabuf(
    const void* device_buffer_handle,
    int*        out_dmabuf_fd,
    uint64_t*   out_size_bytes,
    uint32_t*   out_stride_bytes,     // row pitch
    uint32_t*   out_width,
    uint32_t*   out_height,
    uint32_t*   out_fourcc,           // e.g., DRM_FORMAT_ABGR8888 (bytes R,G,B,A in memory) / DRM_FORMAT_ARGB8888 (bytes B,G,R,A)
    uint64_t*   out_modifier          // DRM_FORMAT_MOD_LINEAR (0) preferred for v1
);

  • GL path: EGL_EXT_image_dma_buf_import → glEGLImageTargetTexture2DOES.

  • VK path: VK_EXT_external_memory_dma_buf where supported; otherwise use OPAQUE_FD path.
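One detail that is easy to get wrong in the dma-buf variant is that DRM fourcc names list components from the most significant bits of a little-endian word, so a buffer whose bytes in memory are R,G,B,A is DRM_FORMAT_ABGR8888, not RGBA8888. A possible mapping from MAX_Format is sketched below. `max_format_to_drm_fourcc` is a hypothetical helper; the `FOURCC` macro repeats the construction used by `fourcc_code()` in drm_fourcc.h so the example needs no kernel headers. Only the formats whose fourcc codes I am sure of are mapped, and the rest return 0 (DRM_FORMAT_INVALID).

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    MAX_FMT_R8 = 1, MAX_FMT_RG8 = 2, MAX_FMT_RGBA8 = 3, MAX_FMT_BGRA8 = 4,
    MAX_FMT_SRGBA8 = 5, MAX_FMT_R16F = 6, MAX_FMT_RG16F = 7,
    MAX_FMT_RGBA16F = 8, MAX_FMT_R32F = 9, MAX_FMT_RG32F = 10,
    MAX_FMT_RGBA32F = 11
} MAX_Format;

/* Same bit layout as fourcc_code() in drm_fourcc.h. */
#define FOURCC(a, b, c, d)                                        \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) | ((uint32_t)(c) << 16) | \
     ((uint32_t)(d) << 24))

/* Map a proposed MAX_Format to a DRM fourcc, or 0 if not mapped here. */
uint32_t max_format_to_drm_fourcc(MAX_Format f) {
    switch (f) {
    case MAX_FMT_R8:      return FOURCC('R', '8', ' ', ' ');  /* DRM_FORMAT_R8 */
    case MAX_FMT_RG8:     return FOURCC('G', 'R', '8', '8');  /* DRM_FORMAT_GR88 */
    case MAX_FMT_RGBA8:   return FOURCC('A', 'B', '2', '4');  /* DRM_FORMAT_ABGR8888 */
    case MAX_FMT_BGRA8:   return FOURCC('A', 'R', '2', '4');  /* DRM_FORMAT_ARGB8888 */
    case MAX_FMT_RGBA16F: return FOURCC('A', 'B', '4', 'H');  /* DRM_FORMAT_ABGR16161616F */
    default:              return 0;  /* left unmapped in this sketch */
    }
}

/* Usage example: the RGBA8 case resolves to the ABGR8888 code. */
void example_fourcc(void) {
    assert(max_format_to_drm_fourcc(MAX_FMT_RGBA8) == 0x34324241u);
    assert(max_format_to_drm_fourcc(MAX_FMT_R32F) == 0);
}
```

MAX_FMT_SRGBA8 is left unmapped on purpose: a fourcc describes the memory layout only, so the sRGB interpretation would have to travel some other way.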


Consumer expectations (we already have these)

  • OpenGL: map MAX_Format → internal formats (GL_R8, GL_RG8, GL_RGBA8, GL_RGBA16F, GL_RGBA32F, …). For R*/RG*, we’ll set texture swizzles to expand to RGBA if needed.

  • Vulkan: map to VkFormat (e.g., VK_FORMAT_R8_UNORM, VK_FORMAT_R8G8_UNORM, VK_FORMAT_R8G8B8A8_UNORM, VK_FORMAT_R16G16B16A16_SFLOAT, VK_FORMAT_R32G32B32A32_SFLOAT, …). We copy buffer → image and draw.
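The consumer-side mapping described above could be kept in one table, sketched here. The names `FormatMapping` and `lookup_format_mapping` are hypothetical. To stay free of GL/Vulkan headers the numeric values of the GL internal formats and VkFormat enumerants are written out with their names in comments; real code would include the headers and use the symbols instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    MAX_FMT_R8 = 1, MAX_FMT_RG8 = 2, MAX_FMT_RGBA8 = 3, MAX_FMT_BGRA8 = 4,
    MAX_FMT_SRGBA8 = 5, MAX_FMT_R16F = 6, MAX_FMT_RG16F = 7,
    MAX_FMT_RGBA16F = 8, MAX_FMT_R32F = 9, MAX_FMT_RG32F = 10,
    MAX_FMT_RGBA32F = 11
} MAX_Format;

typedef struct {
    MAX_Format format;
    uint32_t   gl_internal_format;  /* 0 = no direct core sized format */
    uint32_t   vk_format;
} FormatMapping;

static const FormatMapping k_format_map[] = {
    { MAX_FMT_R8,      0x8229 /* GL_R8 */,           9   /* VK_FORMAT_R8_UNORM */ },
    { MAX_FMT_RG8,     0x822B /* GL_RG8 */,          16  /* VK_FORMAT_R8G8_UNORM */ },
    { MAX_FMT_RGBA8,   0x8058 /* GL_RGBA8 */,        37  /* VK_FORMAT_R8G8B8A8_UNORM */ },
    /* Core GL has no sized BGRA internal format; use GL_RGBA8 plus a swizzle. */
    { MAX_FMT_BGRA8,   0,                            44  /* VK_FORMAT_B8G8R8A8_UNORM */ },
    { MAX_FMT_SRGBA8,  0x8C43 /* GL_SRGB8_ALPHA8 */, 43  /* VK_FORMAT_R8G8B8A8_SRGB */ },
    { MAX_FMT_R16F,    0x822D /* GL_R16F */,         76  /* VK_FORMAT_R16_SFLOAT */ },
    { MAX_FMT_RG16F,   0x822F /* GL_RG16F */,        83  /* VK_FORMAT_R16G16_SFLOAT */ },
    { MAX_FMT_RGBA16F, 0x881A /* GL_RGBA16F */,      97  /* VK_FORMAT_R16G16B16A16_SFLOAT */ },
    { MAX_FMT_R32F,    0x822E /* GL_R32F */,         100 /* VK_FORMAT_R32_SFLOAT */ },
    { MAX_FMT_RG32F,   0x8230 /* GL_RG32F */,        103 /* VK_FORMAT_R32G32_SFLOAT */ },
    { MAX_FMT_RGBA32F, 0x8814 /* GL_RGBA32F */,      109 /* VK_FORMAT_R32G32B32A32_SFLOAT */ },
};

/* Return the table entry for `format`, or NULL if the value is unknown. */
const FormatMapping* lookup_format_mapping(MAX_Format format) {
    size_t count = sizeof(k_format_map) / sizeof(k_format_map[0]);
    for (size_t i = 0; i < count; ++i) {
        if (k_format_map[i].format == format) return &k_format_map[i];
    }
    return NULL;
}

/* Usage example: look up the half-float RGBA entry. */
void example_lookup(void) {
    const FormatMapping* m = lookup_format_mapping(MAX_FMT_RGBA16F);
    assert(m != NULL && m->gl_internal_format == 0x881A && m->vk_format == 97);
}
```

An unknown enum value returns NULL, which lets an older consumer reject formats added to MAX_Format later instead of misreading them.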


Mojo stdlib shim (we’ll PR)

We’ll add gpu/host/interop_fd.mojo that forwards to:

  • AsyncRT_DeviceBuffer_export_fd(...)

  • AsyncRT_Create_timeline_semaphore_fd(...)

  • AsyncRT_Signal_timeline_semaphore_fd(...)

so app code stays pure-Mojo and won’t change when these land.


Minimal acceptance test

  1. Export RGBA8 64×64 → GL/VK import, present a solid color.

  2. Repeat with R8 and RGBA16F; verify correct sampling (GL swizzle for R8).

  3. Animate 10k frames; no CPU copies; no FD leaks.
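The "no FD leaks" criterion in step 3 can be checked mechanically by counting open descriptors before and after the loop. The sketch below shows one POSIX way to do that. `count_open_fds` and `leak_check` are hypothetical helpers, and the open/close of /dev/null stands in for an export plus import cycle, since the real calls do not exist yet.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Count descriptors currently open in this process by probing each
   number with fcntl(F_GETFD). Probing is capped at 65536, so fds above
   that are not counted. */
int count_open_fds(void) {
    long limit = sysconf(_SC_OPEN_MAX);
    if (limit < 0 || limit > 65536) limit = 65536;

    int count = 0;
    for (int fd = 0; fd < (int)limit; ++fd) {
        if (fcntl(fd, F_GETFD) != -1) ++count;
    }
    return count;
}

/* Run `iterations` acquire/release cycles and return how many fds were
   leaked (0 means none), or a negative errno value on failure. */
int leak_check(int iterations) {
    int before = count_open_fds();
    for (int i = 0; i < iterations; ++i) {
        int fd = open("/dev/null", O_RDONLY);  /* stand-in for an exported fd */
        if (fd < 0) return -errno;
        close(fd);
    }
    return count_open_fds() - before;
}

/* Usage example: 10k cycles should leave the fd count unchanged. */
void example_leak_check(void) {
    assert(leak_check(10000) == 0);
}
```

In the real test the release step differs per importer: where the importer takes ownership of the fd, the matching release is destroying the imported object rather than calling close().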


Answer to “Do we need more than RGBA8 right now?”

No to ship a POC; Yes to future-proof. The enum above gives a practical superset (R8/RG8/RGBA8 + 16F/32F variants + optional BGRA8/sRGB) without dragging in multi-plane YUV.