GPU tensor creation?

Hi!

I’m super new to Mojo and I’ve been looking through some of the examples, but I can’t quite figure out how to create new tensors on the GPU. Basically, I’m looking for the equivalent of:

torch.zeros(shape, device="cuda") or torch.randn(shape, device="cuda")

in Mojo.

I can see that in examples such as naive_matrix_multiplication, we can allocate buffers on the GPU and then overlay them with LayoutTensors, but I’m not sure if this is the intended way to create tensors on the GPU. This raises a further question: how can I create LayoutTensors with dynamic shapes? I see that RuntimeLayout exists, but there don’t seem to be many examples showing how to use it.

Sorry for the kind of vague questions, I’m clearly very confused 🙂

Thanks!


There are a couple of ways to create tensors on the GPU. Yes, you can create a device buffer (which lives in the GPU’s global memory) and create a LayoutTensor that is a view of that memory. (You can then use various methods like LayoutTensor.tile() to get different views of that memory, subsets, etc.)
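For example, a rough equivalent of torch.zeros(shape, device="cuda") might look like the sketch below. Treat it as approximate: the DeviceContext methods and the LayoutTensor constructor shown here reflect recent MAX releases and may differ in yours.

```mojo
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor


def main():
    var ctx = DeviceContext()

    # A 16x16 float32 tensor, statically shaped for simplicity.
    alias layout = Layout.row_major(16, 16)

    # Allocate a buffer in the GPU's global memory and zero-fill it,
    # roughly torch.zeros(16, 16, device="cuda").
    var buf = ctx.enqueue_create_buffer[DType.float32](
        layout.size()
    ).enqueue_fill(0)

    # Overlay a LayoutTensor view on the device buffer's memory.
    var tensor = LayoutTensor[DType.float32, layout](buf.unsafe_ptr())

    ctx.synchronize()
    _ = tensor
```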

If you want to copy data to shared memory or local (register) memory, you can use the stack_allocation() static method to create a statically-sized tensor in the appropriate memory space. You can also do this with the LayoutTensorBuild struct, which provides a means for defining and allocating tensors in different memory spaces.
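For instance, inside a kernel a shared-memory tile might be built like this (a sketch using the builder API; the names have shifted between releases, so double-check against your version):

```mojo
from layout.tensor_builder import LayoutTensorBuild as tb


fn my_kernel():
    # A statically-sized 16x16 float32 tile in shared memory,
    # visible to every thread in the block.
    var shared_tile = tb[DType.float32]().row_major[16, 16]().shared().alloc()
    _ = shared_tile
```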

You can see an example of this in Kernel 3 of the matrix multiplication walkthrough: Custom Operations: Optimizing Matrix Multiplication Recipe | MAX Builds

(Or if you want to look at the full source of those kernels, here: max-recipes/custom-ops-matrix-multiplication/operations/matrix_multiplication.mojo at main · modular/max-recipes · GitHub)

Here’s a sketch of using LayoutTensor.stack_allocation() to create a tensor tile in local (register) memory (the exact parameters, such as the origin and address_space arguments, may vary by Mojo version):
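```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor


fn my_kernel():
    # A 4x4 float32 tile held in local (register) memory,
    # private to each thread.
    var tile = LayoutTensor[
        DType.float32,
        Layout.row_major(4, 4),
        MutableAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()
    _ = tile
```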

You can create dynamic layouts with LayoutTensorBuild or directly with LayoutTensor. (I just posted a simple example of creating a dynamic layout tensor in another thread on this forum.)
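To give you a feel for the dynamic-shape case, here’s a rough sketch along the same lines (the RuntimeLayout API has moved around a bit, so treat the exact names and imports as approximate):

```mojo
from layout import Layout, LayoutTensor, RuntimeLayout
from layout.int_tuple import UNKNOWN_VALUE
from memory import UnsafePointer
from utils import IndexList


fn dynamic_tensor(rows: Int, cols: Int):
    # The rank is fixed at compile time; the dimensions are not.
    alias layout = Layout.row_major(UNKNOWN_VALUE, UNKNOWN_VALUE)

    # Bind the actual shape at runtime.
    var runtime_layout = RuntimeLayout[layout].row_major(
        IndexList[2](rows, cols)
    )

    var storage = UnsafePointer[Float32].alloc(rows * cols)
    var tensor = LayoutTensor[DType.float32, layout](storage, runtime_layout)
    _ = tensor
    storage.free()
```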

If you look through the MAX kernel library, you’ll see a lot of code using either the standalone stack_allocation() function or using NDBuffer.stack_allocation(). These are all basically doing the same thing. (We’re trying to phase out NDBuffer in favor of LayoutTensor, but right now there’s still a lot of code using NDBuffer.)
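For comparison, the older NDBuffer pattern looks roughly like this (the parameter list has changed across releases, so take it as approximate):

```mojo
from buffer import NDBuffer
from buffer.dimlist import DimList


fn old_style_kernel():
    # Older API: a statically-shaped 4x4 float32 buffer on the stack.
    var buf = NDBuffer[
        DType.float32, 2, MutableAnyOrigin, DimList(4, 4)
    ].stack_allocation()
    _ = buf
```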

Hope that helps!
