Hello,
I would like to build a real-time AI engine on Mojo. To do this I need to be able to zero-copy from the Mojo-allocated device buffer into GL/Vulkan. As far as I can tell there is currently no way to achieve this, which makes the prospect of a real-time (visual) engine a non-starter.
A possible approach (and I have no idea how difficult this would be on the runtime side) would be to expose FD-based interop: export GPU memory as a POSIX fd and signal a timeline semaphore fd, so GL/VK can import and present the same buffer with zero copies.
Here is a proposed one-pager. I relied heavily on different AI agents to get to this point, so even a sanity check from someone down in the weeds would be helpful. I’ve done my best to verify these findings against the available source, docs, and Discord knowledge bases.
Request: Minimal MAX/AsyncRT C API for Zero-Copy Interop (Multi-Format)
Goal
Let Mojo kernels write into device-local VRAM, and let OpenGL/Vulkan present it without device→host copies, by sharing:
- a GPU memory handle (POSIX FD), and
- a GPU sync object (timeline semaphore, POSIX FD).
OpenGL/Vulkan already import these. We need export + signal from MAX/AsyncRT.
Proposed C ABI (exact)
Return 0 on success; -errno on failure. All fields are required unless marked optional.
#include <stdint.h>  // uint32_t, uint64_t

// Pixel format (minimal but useful set; extensible)
typedef enum {
    MAX_FMT_R8      = 1,  // 1x8 UNORM
    MAX_FMT_RG8     = 2,  // 2x8 UNORM
    MAX_FMT_RGBA8   = 3,  // 4x8 UNORM (little-endian, RGBA order)
    MAX_FMT_BGRA8   = 4,  // 4x8 UNORM (BGRA order) // optional
    MAX_FMT_SRGBA8  = 5,  // 4x8 sRGB // optional
    MAX_FMT_R16F    = 6,  // 1x16 float
    MAX_FMT_RG16F   = 7,  // 2x16 float
    MAX_FMT_RGBA16F = 8,  // 4x16 float
    MAX_FMT_R32F    = 9,  // 1x32 float // optional
    MAX_FMT_RG32F   = 10, // 2x32 float // optional
    MAX_FMT_RGBA32F = 11  // 4x32 float // optional
} MAX_Format;

// 1) Export device-local buffer as OPAQUE_FD (single-plane, linear)
int AsyncRT_DeviceBuffer_export_fd(
    const void* device_buffer_handle, // Mojo _DeviceBufferPtr
    int* out_fd,                      // POSIX fd (dup'ed for caller)
    uint64_t* out_size_bytes,         // total allocation size
    uint32_t* out_row_pitch_bytes,    // bytes per row (>= width * bytes_per_pixel)
    uint32_t* out_width,              // in pixels
    uint32_t* out_height,             // in pixels
    MAX_Format* out_format            // from enum above
);

// 2) Create timeline semaphore and export as OPAQUE_FD
int AsyncRT_Create_timeline_semaphore_fd(
    const void* device_context_handle, // Mojo _DeviceContextPtr (same device/stream)
    uint64_t initial_value,
    int* out_fd
);

// 3) Signal timeline semaphore FD to 'value' (enqueue on same stream as prior work)
int AsyncRT_Signal_timeline_semaphore_fd(
    int sem_fd,
    uint64_t value
);
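To illustrate the proposed 0 / -errno return convention from the caller's side, here is a minimal sketch. `stub_export_fd` is a hypothetical stand-in (the real `AsyncRT_DeviceBuffer_export_fd` does not exist yet), and `asyncrt_strerror` is an illustrative helper name that is not part of the request.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Illustrative helper (not part of the proposal): turn "0 or -errno" into text. */
static const char *asyncrt_strerror(int rc) {
    return rc == 0 ? "ok" : strerror(-rc);
}

/* Hypothetical stand-in for an export call, used only to show the convention. */
static int stub_export_fd(const void *device_buffer_handle, int *out_fd) {
    if (device_buffer_handle == NULL || out_fd == NULL)
        return -EINVAL;   /* bad argument: reported as a negative errno */
    *out_fd = -1;         /* no real export happens in this stub */
    return -ENOSYS;       /* "function not implemented" */
}
```

A caller would check `rc < 0` and log `asyncrt_strerror(rc)`; the same pattern would apply to all three proposed entry points.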
Semantics (precise, minimal)
- Memory:
  - Single-plane, linear layout (no tiling/modifiers) for the formats above.
  - Little-endian channel packing. Channel order per name (e.g., RGBA8 = R,G,B,A).
  - Provide row_pitch_bytes even if tightly packed (often width * bpp), to allow alignment.
  - Allocation must be exportable and compatible with queue-family EXTERNAL ownership transfers.
- Semaphore: timeline preferred; Mojo will signal value N after writing frame N.
- Ordering: AsyncRT_Signal_timeline_semaphore_fd() must be enqueued after the kernel, on the same stream.
- FD lifetime: Returned fds are new (dup’ed) and owned by the caller. Vulkan and GL_EXT_memory_object_fd imports take ownership of the fd on success, so the caller must not close() it afterwards; EGL dma-buf import does not take ownership, so there the caller should close() its copy after import.
- Extensibility: New formats can be added to MAX_Format without breaking ABI.
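The memory rules above can be checked on the consumer side before importing. The following is a minimal sketch, assuming the enum values from this request; `max_format_bpp` and `max_layout_is_valid` are hypothetical helper names, not part of the proposed ABI.

```c
#include <stdint.h>

/* Mirrors the enum proposed in this request; not taken from a shipped header. */
typedef enum {
    MAX_FMT_R8 = 1, MAX_FMT_RG8 = 2, MAX_FMT_RGBA8 = 3, MAX_FMT_BGRA8 = 4,
    MAX_FMT_SRGBA8 = 5, MAX_FMT_R16F = 6, MAX_FMT_RG16F = 7, MAX_FMT_RGBA16F = 8,
    MAX_FMT_R32F = 9, MAX_FMT_RG32F = 10, MAX_FMT_RGBA32F = 11
} MAX_Format;

/* Bytes per pixel for each format; returns 0 for values this code does not know. */
static uint32_t max_format_bpp(MAX_Format f) {
    switch (f) {
    case MAX_FMT_R8:      return 1;
    case MAX_FMT_RG8:     return 2;
    case MAX_FMT_RGBA8:
    case MAX_FMT_BGRA8:
    case MAX_FMT_SRGBA8:  return 4;
    case MAX_FMT_R16F:    return 2;
    case MAX_FMT_RG16F:   return 4;
    case MAX_FMT_RGBA16F: return 8;
    case MAX_FMT_R32F:    return 4;
    case MAX_FMT_RG32F:   return 8;
    case MAX_FMT_RGBA32F: return 16;
    }
    return 0;
}

/* Returns 1 if the exported metadata is self-consistent, 0 otherwise. */
static int max_layout_is_valid(MAX_Format f, uint32_t width, uint32_t height,
                               uint32_t row_pitch_bytes, uint64_t size_bytes) {
    uint32_t bpp = max_format_bpp(f);
    if (bpp == 0 || width == 0 || height == 0)
        return 0;
    if ((uint64_t)row_pitch_bytes < (uint64_t)width * bpp)
        return 0;  /* pitch must cover at least one tightly packed row */
    return size_bytes >= (uint64_t)row_pitch_bytes * height;
}
```

Returning 0 for unknown enum values keeps older consumers safe if new formats are appended later, which is the extensibility property asked for above.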
Optional dma-buf variant (if you prefer dma-buf)
If memory export is via dma-buf, expose this alternate (or additional) function. It carries explicit layout metadata; fourcc/modifier are required.
// Single-plane dma-buf export (multi-plane/YUV out of scope for v1)
int AsyncRT_DeviceBuffer_export_dmabuf(
    const void* device_buffer_handle,
    int* out_dmabuf_fd,
    uint64_t* out_size_bytes,
    uint32_t* out_stride_bytes, // row pitch
    uint32_t* out_width,
    uint32_t* out_height,
    uint32_t* out_fourcc,       // e.g., DRM_FORMAT_ABGR8888 (bytes R,G,B,A) / DRM_FORMAT_ARGB8888 (bytes B,G,R,A)
    uint64_t* out_modifier      // 0 = DRM_FORMAT_MOD_LINEAR, preferred for v1
);
- GL path: EGL_EXT_image_dma_buf_import → glEGLImageTargetTexture2DOES.
- VK path: VK_EXT_external_memory_dma_buf where supported; otherwise use OPAQUE_FD path.
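DRM fourcc names describe a little-endian packed word, so the byte order in memory reads in reverse of the name. The sketch below shows one possible mapping from the proposed enum to fourcc codes. `max_format_to_drm_fourcc` is a hypothetical helper, the enum is a subset of the one in this request, and the `FOURCC` macro reproduces the packing of `fourcc_code()` from drm_fourcc.h so the example builds without kernel headers.

```c
#include <stdint.h>

/* Subset of the enum proposed in this request. */
typedef enum {
    MAX_FMT_R8 = 1, MAX_FMT_RG8 = 2, MAX_FMT_RGBA8 = 3, MAX_FMT_BGRA8 = 4,
    MAX_FMT_RGBA16F = 8, MAX_FMT_R32F = 9
} MAX_Format;

/* Same packing as fourcc_code() in drm_fourcc.h: first character in the low byte. */
#define FOURCC(a, b, c, d) \
    ((uint32_t)(a) | ((uint32_t)(b) << 8) | ((uint32_t)(c) << 16) | ((uint32_t)(d) << 24))

/* Returns a DRM fourcc, or 0 where this sketch picks no mapping. */
static uint32_t max_format_to_drm_fourcc(MAX_Format f) {
    switch (f) {
    case MAX_FMT_R8:      return FOURCC('R', '8', ' ', ' ');  /* DRM_FORMAT_R8 */
    case MAX_FMT_RG8:     return FOURCC('G', 'R', '8', '8');  /* DRM_FORMAT_GR88 */
    case MAX_FMT_RGBA8:   return FOURCC('A', 'B', '2', '4');  /* DRM_FORMAT_ABGR8888 */
    case MAX_FMT_BGRA8:   return FOURCC('A', 'R', '2', '4');  /* DRM_FORMAT_ARGB8888 */
    case MAX_FMT_RGBA16F: return FOURCC('A', 'B', '4', 'H');  /* DRM_FORMAT_ABGR16161616F */
    default:              return 0;
    }
}
```

Formats without an entry here (for example the 32-bit float ones) would likely be easier to serve through the OPAQUE_FD path in a first version.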
Consumer expectations (we already have these)
- OpenGL: map MAX_Format → internal formats (GL_R8, GL_RG8, GL_RGBA8, GL_RGBA16F, GL_RGBA32F, …). For R*/RG*, we’ll set texture swizzles to expand to RGBA if needed.
- Vulkan: map to VkFormat (e.g., VK_FORMAT_R8_UNORM, VK_FORMAT_R8G8_UNORM, VK_FORMAT_R8G8B8A8_UNORM, VK_FORMAT_R16G16B16A16_SFLOAT, VK_FORMAT_R32G32B32A32_SFLOAT, …). We copy buffer → image and draw.
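The Vulkan side of that mapping could look like the sketch below. `max_format_to_vk` is a hypothetical helper name. The numeric values are written out as I understand them from vulkan_core.h so the example builds without the SDK; real code would include `<vulkan/vulkan.h>` and return the named `VkFormat` constants instead.

```c
#include <stdint.h>

/* Mirrors the enum proposed in this request. */
typedef enum {
    MAX_FMT_R8 = 1, MAX_FMT_RG8 = 2, MAX_FMT_RGBA8 = 3, MAX_FMT_BGRA8 = 4,
    MAX_FMT_SRGBA8 = 5, MAX_FMT_R16F = 6, MAX_FMT_RG16F = 7, MAX_FMT_RGBA16F = 8,
    MAX_FMT_R32F = 9, MAX_FMT_RG32F = 10, MAX_FMT_RGBA32F = 11
} MAX_Format;

/* Returns the VkFormat value for a MAX_Format, or 0 (VK_FORMAT_UNDEFINED). */
static int32_t max_format_to_vk(MAX_Format f) {
    switch (f) {
    case MAX_FMT_R8:      return 9;    /* VK_FORMAT_R8_UNORM */
    case MAX_FMT_RG8:     return 16;   /* VK_FORMAT_R8G8_UNORM */
    case MAX_FMT_RGBA8:   return 37;   /* VK_FORMAT_R8G8B8A8_UNORM */
    case MAX_FMT_BGRA8:   return 44;   /* VK_FORMAT_B8G8R8A8_UNORM */
    case MAX_FMT_SRGBA8:  return 43;   /* VK_FORMAT_R8G8B8A8_SRGB */
    case MAX_FMT_R16F:    return 76;   /* VK_FORMAT_R16_SFLOAT */
    case MAX_FMT_RG16F:   return 83;   /* VK_FORMAT_R16G16_SFLOAT */
    case MAX_FMT_RGBA16F: return 97;   /* VK_FORMAT_R16G16B16A16_SFLOAT */
    case MAX_FMT_R32F:    return 100;  /* VK_FORMAT_R32_SFLOAT */
    case MAX_FMT_RG32F:   return 103;  /* VK_FORMAT_R32G32_SFLOAT */
    case MAX_FMT_RGBA32F: return 109;  /* VK_FORMAT_R32G32B32A32_SFLOAT */
    }
    return 0;  /* unknown or future format */
}
```

An equivalent table for GL internal formats would follow the same shape, with swizzles applied separately for the one- and two-channel formats.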
Mojo stdlib shim (we’ll PR)
We’ll add gpu/host/interop_fd.mojo that forwards to:
- AsyncRT_DeviceBuffer_export_fd(...)
- AsyncRT_Create_timeline_semaphore_fd(...)
- AsyncRT_Signal_timeline_semaphore_fd(...)
so app code stays pure-Mojo and won’t change when these land.
Minimal acceptance test
- Export RGBA8 64×64 → GL/VK import, present a solid color.
- Repeat with R8 and RGBA16F; verify correct sampling (GL swizzle for R8).
- Animate 10k frames; no CPU copies; no FD leaks.
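The "no FD leaks" criterion can be checked with plain POSIX calls. This is a rough sketch under stated assumptions: `count_open_fds` and `simulate_frames` are hypothetical names, and `dup()` only stands in for a real export so the check can run without a GPU.

```c
#include <fcntl.h>
#include <unistd.h>

/* Count descriptors currently open in [0, limit) by probing each with fcntl(). */
static int count_open_fds(int limit) {
    int n = 0;
    for (int fd = 0; fd < limit; ++fd)
        if (fcntl(fd, F_GETFD) != -1)
            ++n;
    return n;
}

/* Stand-in for "export, import, present": dup() plays the role of the exported fd. */
static int simulate_frames(int source_fd, int frames) {
    for (int i = 0; i < frames; ++i) {
        int exported = dup(source_fd);  /* a new fd per export, as proposed */
        if (exported < 0)
            return -1;
        close(exported);                /* released once the frame is done */
    }
    return 0;
}
```

In the real test, the count taken before the 10k-frame loop should equal the count taken after it; any difference points at an export or import that never released its descriptor.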
Answer to “Do we need more than RGBA8 right now?”
No to ship a POC; Yes to future-proof. The enum above gives a practical superset (R8/RG8/RGBA8 + 16F/32F variants + optional BGRA8/sRGB) without dragging in multi-plane YUV.