GPU tensor creation?

Hi!

I’m super new to Mojo and I’ve been looking through some of the examples, but I can’t quite figure out how to create new tensors on the GPU. Basically, I’m looking for the equivalent of:

torch.zeros(shape, device="cuda") or torch.randn(shape, device="cuda")

in Mojo.

I can see that in examples such as naive_matrix_multiplication, we can allocate buffers on the GPU and then overlay them with LayoutTensors, but I’m not sure if this is the intended way to create tensors on the GPU. This raises a further question: how can I create LayoutTensors with dynamic shapes? I see that RuntimeLayout exists, but there don’t seem to be many examples showing how to use it.

Sorry for the kind of vague questions, I’m clearly very confused 🙂

Thanks!


There are a couple of ways to create tensors on the GPU. Yes, you can create a device buffer (which lives in the GPU’s global memory) and create a LayoutTensor that is a view of that memory. (You can then use various methods like LayoutTensor.tile() to get different views of that memory, subsets, etc.)
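For example, a rough equivalent of torch.zeros(shape, device="cuda") might look like the sketch below. Treat it as approximate: the DeviceContext methods and the LayoutTensor constructor shown here reflect recent MAX releases and may differ in yours.

```mojo
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor


def main():
    var ctx = DeviceContext()

    # A 16x16 float32 tensor, statically shaped for simplicity.
    alias layout = Layout.row_major(16, 16)

    # Allocate a buffer in the GPU's global memory and zero-fill it,
    # roughly torch.zeros(16, 16, device="cuda").
    var buf = ctx.enqueue_create_buffer[DType.float32](
        layout.size()
    ).enqueue_fill(0)

    # Overlay a LayoutTensor view on the device buffer's memory.
    var tensor = LayoutTensor[DType.float32, layout](buf.unsafe_ptr())

    ctx.synchronize()
    _ = tensor
```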

If you want to copy data to shared memory or local (register) memory, you can use the stack_allocation() static method to create a statically-sized tensor in the appropriate memory space. You can also do this with the LayoutTensorBuild struct, which provides a means for defining and allocating tensors in different memory spaces.
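For instance, inside a kernel a shared-memory tile might be built like this (a sketch using the builder API; the names have shifted between releases, so double-check against your version):

```mojo
from layout.tensor_builder import LayoutTensorBuild as tb


fn my_kernel():
    # A statically-sized 16x16 float32 tile in shared memory,
    # visible to every thread in the block.
    var shared_tile = tb[DType.float32]().row_major[16, 16]().shared().alloc()
    _ = shared_tile
```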

You can see an example of this in Kernel 3 of the matrix multiplication walkthrough: Custom Operations: Optimizing Matrix Multiplication Recipe | MAX Builds

(Or if you want to look at the full source of those kernels, here: max-recipes/custom-ops-matrix-multiplication/operations/matrix_multiplication.mojo at main · modular/max-recipes · GitHub)

Here’s a sketch of using LayoutTensor.stack_allocation() to create a tensor tile in local (register) memory (the exact parameters, such as the origin and address_space arguments, may vary by Mojo version):
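```mojo
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor


fn my_kernel():
    # A 4x4 float32 tile held in local (register) memory,
    # private to each thread.
    var tile = LayoutTensor[
        DType.float32,
        Layout.row_major(4, 4),
        MutableAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()
    _ = tile
```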

You can create dynamic layouts with LayoutTensorBuild or directly with LayoutTensor. (I just posted a simple example of creating a dynamic layout tensor in another thread on this forum.)
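To give you a feel for the dynamic-shape case, here’s a rough sketch along the same lines (the RuntimeLayout API has moved around a bit, so treat the exact names and imports as approximate):

```mojo
from layout import Layout, LayoutTensor, RuntimeLayout
from layout.int_tuple import UNKNOWN_VALUE
from memory import UnsafePointer
from utils import IndexList


fn dynamic_tensor(rows: Int, cols: Int):
    # The rank is fixed at compile time; the dimensions are not.
    alias layout = Layout.row_major(UNKNOWN_VALUE, UNKNOWN_VALUE)

    # Bind the actual shape at runtime.
    var runtime_layout = RuntimeLayout[layout].row_major(
        IndexList[2](rows, cols)
    )

    var storage = UnsafePointer[Float32].alloc(rows * cols)
    var tensor = LayoutTensor[DType.float32, layout](storage, runtime_layout)
    _ = tensor
    storage.free()
```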

If you look through the MAX kernel library, you’ll see a lot of code using either the standalone stack_allocation() function or using NDBuffer.stack_allocation(). These are all basically doing the same thing. (We’re trying to phase out NDBuffer in favor of LayoutTensor, but right now there’s still a lot of code using NDBuffer.)
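For comparison, the older NDBuffer pattern looks roughly like this (the parameter list has changed across releases, so take it as approximate):

```mojo
from buffer import NDBuffer
from buffer.dimlist import DimList


fn old_style_kernel():
    # Older API: a statically-shaped 4x4 float32 buffer on the stack.
    var buf = NDBuffer[
        DType.float32, 2, MutableAnyOrigin, DimList(4, 4)
    ].stack_allocation()
    _ = buf
```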

Hope that helps!
