GPU Programming Manual

Please leave any questions, feedback, or suggestions for the GPU programming manual in this thread.


Hey @jack.clayton, awesome content! Very easy to run with the VS Code extension 🙂

I got an error on the first cell. The import is wrong: DeviceContext should come from “gpu.host” instead of “gpu”:

/tmp/mdlab/main.mojo:1:29: error: package 'gpu' does not contain 'DeviceContext'
from gpu import thread_idx, DeviceContext
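
Changing that first line to pull DeviceContext from gpu.host fixes it for me:

    from gpu import thread_idx
    from gpu.host import DeviceContext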

Thanks for that, the fix will go out in the next nightly. I’m working on getting the code blocks tested in CI.

Some other feedback:

  • We can try to have more meaningful variable names, especially regarding aggressive abbreviation.
    I had to read a few times to understand that “els” = “elements”, “in” = “input”, “dev” = “device”, and “out” = “output”. Also, “in_dev” does not mean “input_device”; it actually means “input_buffer_on_device”.
  • The tutorial has “dtype” and “type”, but I don’t think we need both; it adds unnecessary complexity.
  • We can try to stick with grid_dim and block_dim having the same number of elements in the tuple, and always use tuples. Any shortening is syntactic sugar and actually makes it harder to understand; sugar can be introduced later, once the user has grasped the concept. For example:
    ctx.enqueue_function[block_kernel](grid_dim=(2, 2), block_dim=2) can be written as ctx.enqueue_function[block_kernel](grid_dim=(2, 2), block_dim=(2, 1)), which is much easier to understand.
  • The markdown file can be split into two files, because when working in a separate Mojo file and copy-pasting, it can get really long, and we don’t necessarily reuse functions or variables. It’s also harder to run/read the specific section of the tutorial I’m interested in.
  • For thread-level SIMD, 4 seems like a magic number, and it’s not explained how to find it, or even how to query the GPU to know what SIMD width is available (the host-side query I have in mind is sketched right after this list). A small explanation would be welcome in this section.
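
For reference, this is the kind of query I mean; a sketch using Mojo’s simdwidthof, which I believe reflects the compile target (whether there is a GPU-side equivalent is exactly my question):

    from sys import simdwidthof

    # Number of lanes in the target's native SIMD register,
    # e.g. 8 for float32 on an AVX2 CPU.
    alias float32_width = simdwidthof[DType.float32]()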

Overall, great work! I was not familiar with GPU programming, and reading/running the tutorial was very smooth!


Awesome, great feedback, thanks for all that @gabrieldemarmiesse. I’ll fix these up.

I had some time to do the exercises; some small feedback:

  • Int64 is not the best data type to use for the exercise, notably because it doesn’t work with warp.sum(): we get constraint failed: unhandled shuffle type. Since the exercise encourages trying multiple methods of reduction, trying a warp reduction seems natural (roughly what I tried is sketched after this list).
  • Speaking of warps, it’s not clear why barrier() needs to be called in the example code. Maybe we could explain somewhere why it’s needed? Is the variable’s value undefined unless we call barrier()? For the exercise, I tried without barrier() and it worked.
  • I asked ChatGPT and it doesn’t understand thread-level SIMD well. Notably, it claims that GPUs have no SIMD registers. Is ChatGPT wrong? Or is a warp used behind the scenes for the reduction when calling .reduce_add()? That’s not very clear. I remember the community meeting with Daniel Lemire, when I asked whether GPUs had SIMD instructions, and I understood that no, GPUs don’t have SIMD instructions. So it’s quite confusing. Could this be clarified?
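
For context, here is roughly the warp version I tried (a hypothetical sketch: the kernel name and buffers are mine, and I’m assuming warp.sum reduces one value per lane across the warp):

    from gpu import thread_idx, warp
    from memory import UnsafePointer

    fn warp_sum_kernel(
        out: UnsafePointer[Scalar[DType.float32]],
        data: UnsafePointer[Scalar[DType.float32]],
    ):
        # Each lane contributes one element; warp.sum combines them
        # with shuffle instructions, with no barrier() involved.
        var total = warp.sum(data[thread_idx.x])
        if thread_idx.x == 0:
            out[0] = total

This compiles with Float32; switching the buffers to Int64 is what triggers the unhandled shuffle type error.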

Hi all! Thanks for the tutorial! I started on it (GPU basics | Modular), but in the first example I hit a compile error:

    ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
    # => compile error on [printing_kernel]: expected a type, not a value

I’m using Mojo v25.1.1, could that explain the problem? Do I need to update to a nightly?

I believe you may need the nightly for these; Jack made some enhancements to the function compilation/enqueueing interfaces that I think these rely on. Those enhancements have only been in the nightlies for the last week or so.

Thanks, it works with mojo 25.2.0.dev2025031605 (900b34e5)

  1. Yeah, good point, will change that.

  2. For the warp you’re right, it doesn’t need a barrier there. barrier() is about making shared-memory writes visible to the rest of the block; there’s a sketch of that case below.

  3. Thanks very much for pointing this out, you’re right: the generated PTX is doing individual scalar loads and adds (e.g. for a reduce-sum over 4 x float32):

ld.global.f32 %f1, [%rd2+8];
ld.global.f32 %f2, [%rd2];
ld.global.f32 %f3, [%rd2+12];
ld.global.f32 %f4, [%rd2+4];
add.rn.f32 %f5, %f4, %f3;
add.rn.f32 %f6, %f2, %f1;
add.rn.f32 %f7, %f6, %f5;
st.global.f32 [%rd2], %f7;

Will reword this section and point out that typical SIMD does not exist on GPUs and you need to use SIMT with warps instead. I don’t think that transforming SIMD.reduce_add etc. into warp SIMT instructions is viable, because that would require launching extra threads.
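
On the barrier() question above (point 2), here’s a minimal sketch of the case where it is needed, reusing the tutorial’s shared-memory pattern (the kernel itself is hypothetical): every thread writes one shared slot, and the barrier guarantees those writes are visible before thread 0 reads them:

    from gpu import barrier, thread_idx
    from gpu.memory import AddressSpace
    from memory import UnsafePointer, stack_allocation

    alias threads = 32
    alias dtype = DType.float32

    fn shared_sum_kernel(
        out: UnsafePointer[Scalar[dtype]],
        data: UnsafePointer[Scalar[dtype]],
    ):
        var shared = stack_allocation[
            threads,
            Scalar[dtype],
            address_space = AddressSpace.SHARED,
        ]()
        shared[thread_idx.x] = data[thread_idx.x]
        # Without this barrier, thread 0 could read slots that other
        # threads in the block haven't written yet.
        barrier()
        if thread_idx.x == 0:
            var total = Scalar[dtype](0)
            for i in range(threads):
                total += shared[i]
            out[0] = total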

All of your feedback has been very valuable, cheers. I’ll get all the fixes in today.

Thanks IvoB, I’ll add a part about updating to the latest Mojo, cheers.

I’ve made changes for all suggestions, which will go out with the next release. Just a few things where I diverged from your suggestions a bit:

  • I’ve made all the variables and aliases more descriptive, but kept “in” for “input”, so they’re now host_in_buffer, device_in_buffer, etc.
  • Using a scalar for grid_dim and block_dim is common, so I added a tip box explaining that you can use either a scalar or a tuple, and what they map to.
  • This document is used for the online docs, so it can’t be split into two files. Plus, a lot of the code blocks reuse things from earlier cells, like the imports and buffer setup, which I’d have to repeat if there were multiple markdown files.

Thanks again for all the great suggestions, everything else will be fixed when the next nightly goes out.

Sounds good, thanks for the fixes!

@jack.clayton I think I caught an error in the current tutorial:

    var shared = stack_allocation[
        threads * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()

is allocating too much memory, isn’t it? Maybe

    var shared = stack_allocation[
        threads,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()

is better?

From the docs (stack_allocation | Modular), the first parameter should be the number of elements, not the number of bytes.
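
To make the difference concrete (assuming threads = 32 and dtype = DType.float32, which I’m picking just for illustration):

    from sys import sizeof

    alias threads = 32
    alias dtype = DType.float32

    # Original: the element count gets multiplied by the element size,
    # so this asks for 32 * 4 = 128 float32 slots (512 bytes).
    alias element_count_wrong = threads * sizeof[dtype]()

    # Fixed: one slot per thread, 32 float32 slots (128 bytes).
    alias element_count_right = threads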

Actually, it seems that SIMD is possible on GPUs. Maybe the GPU you were using didn’t support it? Or maybe it’s a compiler issue?

Thanks, I asked about this. There are some SIMD instructions available, like SIMD loads, but they have alignment requirements, e.g. this:

from memory import UnsafePointer
from sys import alignof

fn foo(data: UnsafePointer[Scalar[DType.float32]]):
    # Vector load of 4 x float32, aligned to the full 16-byte vector
    var x = data.load[
        width=4, alignment = alignof[SIMD[DType.float32, 4]]()
    ]()

Generates PTX:

ld.global.v4.f32 {%f1, %f2, %f3, %f4}, [%rd2];

Will reword again, thanks, and fix that stack_allocation.