I was reading the Mojo documentation and didn't find anything about obtaining GPU pointers to GPU memory allocations, which I think could be useful for bona fide low-level operations. In particular, I was trying to implement a GPU-resident array of pointers to tensor objects, accessed from within a kernel. Is there some mechanism that fills this apparent gap in low-level control?
Thanks in advance.
@BradLarson might be able to give an answer on either how to do this or if/when it’s planned to be possible.
Please correct me if you're asking for something else, but if you're looking to create and get a pointer to a GPU global memory buffer, DeviceContext's enqueue_create_buffer is probably the most direct route.
An example of this can be found in the “Enqueue scheduling” section in the GPU basics guide, along with other ways to interact with DeviceContext:
from gpu import thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

alias size = 4
alias dtype_u8 = DType.uint8

fn dummy_kernel(buffer: UnsafePointer[Scalar[dtype_u8]]):
    # Each thread writes its own index into the buffer.
    buffer[thread_idx.x] = thread_idx.x

def main():
    var ctx = DeviceContext()
    # All of these method calls run in the order that they were enqueued.
    var host_buffer = ctx.enqueue_create_host_buffer[dtype_u8](size)
    var dev_buffer = ctx.enqueue_create_buffer[dtype_u8](size)
    ctx.enqueue_function[dummy_kernel](dev_buffer, grid_dim=1, block_dim=size)
    dev_buffer.enqueue_copy_to(host_buffer)
    # Have to synchronize here before printing on CPU, or else the kernel may
    # not have finished executing.
    ctx.synchronize()
    print(host_buffer)
In many of our current Mojo GPU programming examples, we’ve also used the Mojo Driver API to do host or device buffer allocation, but that effectively layers on top of the DeviceContext. We’re working to harmonize these interfaces.
There are different ways to deal with shared memory inside a GPU function (one is sketched below), and if I've misinterpreted your question, please let me know.
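As one illustration, here is a minimal sketch of statically sized shared memory carved out inside a kernel via stack_allocation, following the pattern used in current Mojo GPU examples; treat the exact import paths as an assumption, and reverse_in_block is just an invented demo kernel:

from gpu import thread_idx, barrier
from gpu.memory import AddressSpace
from memory import UnsafePointer, stack_allocation

alias TPB = 64  # threads per block
alias dt = DType.float32

fn reverse_in_block(data: UnsafePointer[Scalar[dt]]):
    # Statically sized allocation in shared memory, visible to every
    # thread in the block.
    var shared = stack_allocation[
        TPB, Scalar[dt], address_space = AddressSpace.SHARED
    ]()
    var i = Int(thread_idx.x)
    shared[i] = data[i]
    # Wait until all threads have written before reading a neighbor's slot.
    barrier()
    data[i] = shared[TPB - 1 - i]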
I think @ipoppopretu is asking how to create a list of tensors that can be passed to a kernel. If the tensors are all the same shape, that makes it easy, but if they aren't, I'm not sure of a way to pass what is effectively a List[LayoutTensor[...]] with different layouts, even if it requires runtime layouts, into MAX. This problem generalizes to "how do I set up an arbitrary data structure and pass it over to the GPU".
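One workaround that might cover the simpler cases (a rough sketch only; kernel_over_ptrs and the uint64 staging table are my own invention, and I haven't verified every call below) is to build a device buffer of raw pointer values to separately allocated buffers and hand that to the kernel:

from gpu import thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

alias dt = DType.float32

fn kernel_over_ptrs(
    ptrs: UnsafePointer[UnsafePointer[Scalar[dt]]], count: Int
):
    # Each thread chases one pointer and touches its first element.
    if Int(thread_idx.x) < count:
        var data = ptrs[Int(thread_idx.x)]
        data[0] = data[0] * 2

def main():
    var ctx = DeviceContext()
    # Two device buffers of different sizes, standing in for tensors.
    var a = ctx.enqueue_create_buffer[dt](16)
    var b = ctx.enqueue_create_buffer[dt](32)
    # Stage the pointer table on the host as raw 64-bit addresses.
    var host_ptrs = ctx.enqueue_create_host_buffer[DType.uint64](2)
    var dev_ptrs = ctx.enqueue_create_buffer[DType.uint64](2)
    # Make sure the buffers exist before touching them on the host.
    ctx.synchronize()
    host_ptrs[0] = UInt64(Int(a.unsafe_ptr()))
    host_ptrs[1] = UInt64(Int(b.unsafe_ptr()))
    host_ptrs.enqueue_copy_to(dev_ptrs)
    # Reinterpret the uint64 table as a pointer-to-pointers for the kernel.
    ctx.enqueue_function[kernel_over_ptrs](
        dev_ptrs.unsafe_ptr().bitcast[UnsafePointer[Scalar[dt]]](),
        2,
        grid_dim=1,
        block_dim=32,
    )
    ctx.synchronize()

This only handles flat buffers of one dtype, though; it doesn't carry the layout metadata that a List[LayoutTensor[...]] with heterogeneous layouts would need.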
I've spent some time thinking about this problem in the context of custom allocators, since in some cases it might make sense to use unified memory to set up a data structure such as a tree on the CPU, then run the parallel component of a computation as a GPU kernel. Custom allocators would help since you could create a wrapper for cudaMallocManaged or hipMallocManaged as one of the stdlib allocators. Yes, the usual performance hazards of cross-device memory accesses apply, but the best alternative I see is setting everything up in an arena allocator over DMA-safe memory, then doing a DMA to the GPU followed by pointer fixup. Right now, as far as I'm aware, Mojo doesn't really have a way to allocate "heap memory" on the GPU, just buffers, which is limiting for anything that isn't pure linear algebra and can't easily decompose its inputs into a set of SIMD/String parameters/arguments on an operation.