Please leave any questions, feedback, or suggestions for the GPU programming manual in this thread.
Hey @jack.clayton , awesome content! Very easy to run with the vscode extension
I got an error on the first cell; the import is wrong, it should be “gpu.host” instead of “gpu”:
/tmp/mdlab/main.mojo:1:29: error: package 'gpu' does not contain 'DeviceContext'
from gpu import thread_idx, DeviceContext
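For reference, a minimal corrected import based on the report above, assuming DeviceContext does live in gpu.host:
from gpu import thread_idx
from gpu.host import DeviceContext  # DeviceContext lives in gpu.host, not gpu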
Thanks for that, the fix will go out in the next nightly. Working on getting the code blocks tested in CI.
Some other feedback: I’d suggest giving grid_dim and block_dim the same number of elements in the tuple, and always using tuples. Any shortening is syntactic sugar and actually makes it harder to understand; the sugar can be introduced later, once the concept has been grasped by the user. For example:
ctx.enqueue_function[block_kernel](grid_dim=(2, 2), block_dim=2)
can be written as
ctx.enqueue_function[block_kernel](grid_dim=(2, 2), block_dim=(2, 1))
which is much easier to understand.
Overall, great work! I was not familiar with GPU programming and reading/running the tutorial was very smooth!
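For readers following along, here is a minimal sketch of the tuple form in a full launch. It assumes the DeviceContext / enqueue_function API used earlier in this thread and the gpu.host import discussed above; the kernel body and the synchronize call are illustrative, not the tutorial’s code.
from gpu import block_idx, thread_idx
from gpu.host import DeviceContext

fn block_kernel():
    # Each thread prints its block and thread coordinates.
    print("block:", block_idx.x, block_idx.y, "thread:", thread_idx.x)

def main():
    var ctx = DeviceContext()
    # Explicit tuples: a 2 x 2 grid of blocks, each block with 2 x 1 threads.
    ctx.enqueue_function[block_kernel](grid_dim=(2, 2), block_dim=(2, 1))
    ctx.synchronize()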
Awesome great feedback thanks for all that @gabrieldemarmiesse, I’ll fix these up
I had some time to do the exercises, some small feedback:
- When using warp.sum() we get constraint failed: unhandled shuffle type. Since the exercise encourages trying multiple methods of reduction, trying the warp seems natural.
- There is a barrier() in the example code. Maybe we could write somewhere why it’s needed? Is the variable value undefined unless we call barrier()? For the exercise, I tried without a barrier() and it worked.
- What actually happens on the GPU when we call .reduce_add()? That’s not very clear. I remember the community meeting with Daniel Lemire, when I asked if GPUs had SIMD instructions, and I understood that no, GPUs don’t have SIMD instructions. So it’s quite confusing. Could this be clarified?
Hi all! Thanks for the tutorial! I started on it (GPU basics | Modular), but in the first example I encountered a compile error:
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
# => compile error on [printing_kernel]: expected a type, not a value
I use Mojo v25.1.1, could that explain the problem? Do I need to update to nightly?
I believe you may need the nightly for these, Jack made some enhancements to the function compilation / enqueueing interfaces that I think these rely on. Those enhancements are only in the nightlies for the last ~week or so.
Thanks, it works with mojo 25.2.0.dev2025031605 (900b34e5)
Yeah, good point, will change that.
For the warp you’re right, it doesn’t need a barrier there
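For context, here is a minimal sketch of the kind of case where barrier() is needed: one thread reads shared memory written by other threads. It reuses the stack_allocation / AddressSpace.SHARED pattern quoted later in this thread; the imports and the reduction itself are an illustration, not the tutorial’s code.
from gpu import barrier, thread_idx
from gpu.memory import AddressSpace
from memory import UnsafePointer, stack_allocation

alias dtype = DType.float32
alias threads = 4

fn sum_kernel(data: UnsafePointer[Scalar[dtype]]):
    var shared = stack_allocation[
        threads,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    # Each thread copies its own element into shared memory.
    shared[thread_idx.x] = data[thread_idx.x]
    # Without this barrier, thread 0 below could read slots that the
    # other threads have not finished writing yet.
    barrier()
    if thread_idx.x == 0:
        var total = Scalar[dtype](0)
        for i in range(threads):
            total += shared[i]
        data[0] = total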
Thanks very much for pointing this out, you’re right: the generated PTX is doing individual scalar loads and adds (e.g. for a 4 x float32 reduce-sum):
ld.global.f32   %f1, [%rd2+8];
ld.global.f32   %f2, [%rd2];
ld.global.f32   %f3, [%rd2+12];
ld.global.f32   %f4, [%rd2+4];
add.rn.f32      %f5, %f4, %f3;
add.rn.f32      %f6, %f2, %f1;
add.rn.f32      %f7, %f6, %f5;
st.global.f32   [%rd2], %f7;
Will reword this section and point out that typical SIMD does not exist on GPU and you need to use SIMT with warps instead. I don’t think that transforming SIMD.reduce_add etc to warp SIMT instructions is viable because that would require launching extra threads.
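To illustrate the SIMT-with-warps point, here is a minimal sketch of a warp-level reduction. It assumes the warp.sum call mentioned above and the import paths used elsewhere in this thread; treat the exact signatures as a guess rather than the tutorial’s code.
from gpu import thread_idx, warp
from memory import UnsafePointer

fn warp_sum_kernel(output: UnsafePointer[Float32]):
    # Every lane contributes 1.0; warp.sum combines the values across the
    # warp with shuffle instructions rather than SIMD vector arithmetic.
    var total = warp.sum(Float32(1.0))
    if thread_idx.x == 0:
        output[0] = total  # expected to equal the warp size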
All of your feedback has been very valuable cheers, I’ll get all the fixes in today.
Thanks IvoB
I’ll add a part about updating to latest Mojo cheers
I’ve made changes for all suggestions that will go out with the next release, just a few things where I diverged from your suggestions a bit:
- For the buffer names I used in as input, so they’re now host_in_buffer, device_in_buffer, etc.
- Launching grid_dim and block_dim with a scalar is common, so I added a tip box explaining that you can use a scalar or a tuple and what they map to.
Thanks again for all the great suggestions, everything else will be fixed when the next nightly goes out.
Sounds good, thanks for the fixes!
@jack.clayton I think I caught an error in the current tutorial:
var shared = stack_allocation[
threads * sizeof[dtype](),
Scalar[dtype],
address_space = AddressSpace.SHARED,
]()
is allocating too much memory, isn’t it? Maybe
var shared = stack_allocation[
threads,
Scalar[dtype],
address_space = AddressSpace.SHARED,
]()
is better?
From the docs (stack_allocation | Modular), the first parameter should be the number of elements, not the number of bytes.
Actually, it seems that SIMD is possible on GPUs. Maybe the GPU you were using didn’t support this? Or maybe it’s a compiler issue?
Thanks, I asked about this; there are some SIMD instructions available, like SIMD loads, but they have alignment requirements. E.g. this:
from memory import UnsafePointer  # assumed imports to make the snippet self-contained
from sys import alignof

fn foo(data: UnsafePointer[Scalar[DType.float32]]):
    var x = data.load[
        width=4, alignment = alignof[SIMD[DType.float32, 4]]()
    ]()
Generates PTX:
ld.global.v4.f32 {%f1, %f2, %f3, %f4}, [%rd2];
Will reword again, thanks, and fix that stack_allocation.