I was reading the Mojo documentation and didn't find anything about obtaining GPU pointers to GPU memory allocations, which I think could be useful for bona fide low-level operations. In particular, I was trying to implement a GPU-resident array of pointers to tensor objects, accessed from within a kernel. Is there some mechanism that fills this apparent gap in low-level control?
Thanks in advance.
@BradLarson might be able to give an answer on either how to do this or if/when it’s planned to be possible.
Please correct me if you're asking for something else, but if you're looking to create and get a pointer to a GPU global memory buffer, DeviceContext's enqueue_create_buffer is probably the most direct route.
An example of this can be found in the “Enqueue scheduling” section in the GPU basics guide, along with other ways to interact with DeviceContext:
from gpu import thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

alias size = 4
alias dtype_u8 = DType.uint8

fn dummy_kernel(buffer: UnsafePointer[Scalar[dtype_u8]]):
    # Each thread writes its own index into the buffer.
    buffer[thread_idx.x] = thread_idx.x

def main():
    var ctx = DeviceContext()
    # All of these method calls run in the order that they were enqueued.
    var host_buffer = ctx.enqueue_create_host_buffer[dtype_u8](size)
    var dev_buffer = ctx.enqueue_create_buffer[dtype_u8](size)
    ctx.enqueue_function[dummy_kernel](dev_buffer, grid_dim=1, block_dim=size)
    dev_buffer.enqueue_copy_to(host_buffer)
    # Have to synchronize here before printing on CPU, or else the kernel may
    # not have finished executing.
    ctx.synchronize()
    print(host_buffer)
In many of our current Mojo GPU programming examples, we’ve also used the Mojo Driver API to do host or device buffer allocation, but that effectively layers on top of the DeviceContext. We’re working to harmonize these interfaces.
There are different ways to deal with shared memory inside a GPU function (one is sketched below), and if I've misinterpreted your question, please let me know.
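As one illustration, here is a minimal sketch of statically sized shared memory carved out inside a kernel via stack_allocation, following the pattern used in current Mojo GPU examples; treat the exact import paths as an assumption, and reverse_in_block is just an invented demo kernel:

from gpu import thread_idx, barrier
from gpu.memory import AddressSpace
from memory import UnsafePointer, stack_allocation

alias TPB = 64  # threads per block
alias dt = DType.float32

fn reverse_in_block(data: UnsafePointer[Scalar[dt]]):
    # Statically sized allocation in shared memory, visible to every
    # thread in the block.
    var shared = stack_allocation[
        TPB, Scalar[dt], address_space = AddressSpace.SHARED
    ]()
    var i = Int(thread_idx.x)
    shared[i] = data[i]
    # Wait until all threads have written before reading a neighbor's slot.
    barrier()
    data[i] = shared[TPB - 1 - i]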
I think @ipoppopretu is asking how to create a list of tensors that can be passed to a kernel. If the tensors are all the same shape, that makes it easy, but if they aren't, I'm not sure of a way to pass what is effectively a List[LayoutTensor[...]] with different layouts, even if it requires runtime layouts, into MAX. This problem generalizes to "how do I set up an arbitrary data structure and pass it over to the GPU".
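One workaround that might cover the simpler cases (a rough sketch only; kernel_over_ptrs and the uint64 staging table are my own invention, and I haven't verified every call below) is to build a device buffer of raw pointer values to separately allocated buffers and hand that to the kernel:

from gpu import thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

alias dt = DType.float32

fn kernel_over_ptrs(
    ptrs: UnsafePointer[UnsafePointer[Scalar[dt]]], count: Int
):
    # Each thread chases one pointer and touches its first element.
    if Int(thread_idx.x) < count:
        var data = ptrs[Int(thread_idx.x)]
        data[0] = data[0] * 2

def main():
    var ctx = DeviceContext()
    # Two device buffers of different sizes, standing in for tensors.
    var a = ctx.enqueue_create_buffer[dt](16)
    var b = ctx.enqueue_create_buffer[dt](32)
    # Stage the pointer table on the host as raw 64-bit addresses.
    var host_ptrs = ctx.enqueue_create_host_buffer[DType.uint64](2)
    var dev_ptrs = ctx.enqueue_create_buffer[DType.uint64](2)
    # Make sure the buffers exist before touching them on the host.
    ctx.synchronize()
    host_ptrs[0] = UInt64(Int(a.unsafe_ptr()))
    host_ptrs[1] = UInt64(Int(b.unsafe_ptr()))
    host_ptrs.enqueue_copy_to(dev_ptrs)
    # Reinterpret the uint64 table as a pointer-to-pointers for the kernel.
    ctx.enqueue_function[kernel_over_ptrs](
        dev_ptrs.unsafe_ptr().bitcast[UnsafePointer[Scalar[dt]]](),
        2,
        grid_dim=1,
        block_dim=32,
    )
    ctx.synchronize()

This only handles flat buffers of one dtype, though; it doesn't carry the layout metadata that a List[LayoutTensor[...]] with heterogeneous layouts would need.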
I've spent some time thinking about this problem in the context of custom allocators, since in some cases it might make sense to use unified memory to set up a data structure such as a tree on the CPU, then run the parallel component of a computation as a GPU kernel. Custom allocators would help since you could create a wrapper for cudaMallocManaged or hipMallocManaged as one of the stdlib allocators. Yes, the usual performance hazards of cross-device memory accesses apply, but the best alternative I see is setting everything up in an arena allocator over DMA-safe memory, then doing a DMA to the GPU followed by pointer fixup. Right now, as far as I'm aware, Mojo doesn't really have a way to allocate "heap memory" on the GPU, just buffers, which is limiting for anything that isn't pure linear algebra and can't easily decompose its inputs into a set of SIMD/String parameters/arguments on an operation.