GPU Programming Manual

Please leave any questions, feedback, or suggestions for the GPU programming manual in this thread.


Hey @jack.clayton, awesome content! Very easy to run with the VS Code extension 🙂

I got an error on the first cell. The import is wrong: DeviceContext should come from “gpu.host” instead of “gpu”:

/tmp/mdlab/main.mojo:1:29: error: package 'gpu' does not contain 'DeviceContext'
from gpu import thread_idx, DeviceContext
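
Changing that first line to pull DeviceContext from gpu.host fixes it for me:

    from gpu import thread_idx
    from gpu.host import DeviceContext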

Thanks for that, the fix will go out in the next nightly. I’m working on getting the code blocks tested in CI.

Some other feedback:

  • We can try to have more meaningful variable names, especially regarding aggressive abbreviation.
    I had to read a few times to understand that “els” = “elements”, “in” = “input”, “dev” = “device”, and “out” = “output”. Also, “in_dev” does not mean “input_device”; it actually means “input_buffer_on_device”.
  • The tutorial has “dtype” and “type”, but I don’t think we need both; it adds unnecessary complexity.
  • We can try to stick with grid_dim and block_dim having the same number of elements in the tuple, and always use tuples. Any shortening is syntactic sugar and actually makes it harder to understand; sugar can be introduced later, once the user has grasped the concept. For example:
    ctx.enqueue_function[block_kernel](grid_dim=(2, 2), block_dim=2) can be written as ctx.enqueue_function[block_kernel](grid_dim=(2, 2), block_dim=(2, 1)), which is much easier to understand.
  • The markdown file can be split into two files, because when working in a separate Mojo file and copy-pasting, it can get really long, and we don’t necessarily reuse functions or variables. It’s also harder to run/read the specific section of the tutorial I’m interested in.
  • For thread-level SIMD, 4 seems like a magic number, and it’s not explained how to find it, or even how to query the GPU to know what SIMD width is available (the host-side query I have in mind is sketched right after this list). A small explanation would be welcome in this section.
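
For reference, this is the kind of query I mean; a sketch using Mojo’s simdwidthof, which I believe reflects the compile target (whether there is a GPU-side equivalent is exactly my question):

    from sys import simdwidthof

    # Number of lanes in the target's native SIMD register,
    # e.g. 8 for float32 on an AVX2 CPU.
    alias float32_width = simdwidthof[DType.float32]()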

Overall, great work! I was not familiar with GPU programming, and reading/running the tutorial was very smooth!


Awesome, great feedback, thanks for all that @gabrieldemarmiesse. I’ll fix these up.

I had some time to do the exercises; some small feedback:

  • Int64 is not the best data type to use for the exercise, notably because it doesn’t work with warp.sum(): we get constraint failed: unhandled shuffle type. Since the exercise encourages trying multiple methods of reduction, trying a warp reduction seems natural (roughly what I tried is sketched after this list).
  • Speaking of warps, it’s not clear why barrier() needs to be called in the example code. Maybe we could explain somewhere why it’s needed? Is the variable’s value undefined unless we call barrier()? For the exercise, I tried without barrier() and it worked.
  • I asked ChatGPT and it doesn’t understand thread-level SIMD well. Notably, it claims that GPUs have no SIMD registers. Is ChatGPT wrong? Or is a warp used behind the scenes for the reduction when calling .reduce_add()? That’s not very clear. I remember the community meeting with Daniel Lemire, when I asked whether GPUs had SIMD instructions, and I understood that no, GPUs don’t have SIMD instructions. So it’s quite confusing. Could this be clarified?
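
For context, here is roughly the warp version I tried (a hypothetical sketch: the kernel name and buffers are mine, and I’m assuming warp.sum reduces one value per lane across the warp):

    from gpu import thread_idx, warp
    from memory import UnsafePointer

    fn warp_sum_kernel(
        out: UnsafePointer[Scalar[DType.float32]],
        data: UnsafePointer[Scalar[DType.float32]],
    ):
        # Each lane contributes one element; warp.sum combines them
        # with shuffle instructions, with no barrier() involved.
        var total = warp.sum(data[thread_idx.x])
        if thread_idx.x == 0:
            out[0] = total

This compiles with Float32; switching the buffers to Int64 is what triggers the unhandled shuffle type error.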

Hi all! Thanks for the tutorial! I started on it (GPU basics | Modular), but in the first example I hit a compile error:

    ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
    # => compile error on [printing_kernel]: expected a type, not a value

I’m using Mojo v25.1.1, could that explain the problem? Do I need to update to a nightly?

I believe you may need the nightly for these; Jack made some enhancements to the function compilation/enqueueing interfaces that I think these rely on. Those enhancements have only been in the nightlies for the last week or so.

Thanks, it works with mojo 25.2.0.dev2025031605 (900b34e5)

  1. Yeah, good point, will change that.

  2. For the warp you’re right, it doesn’t need a barrier there. barrier() is about making shared-memory writes visible to the rest of the block; there’s a sketch of that case below.

  3. Thanks very much for pointing this out, you’re right: the generated PTX is doing individual scalar loads and adds (e.g. for a reduce-sum over 4 x float32):

ld.global.f32 %f1, [%rd2+8];
ld.global.f32 %f2, [%rd2];
ld.global.f32 %f3, [%rd2+12];
ld.global.f32 %f4, [%rd2+4];
add.rn.f32 %f5, %f4, %f3;
add.rn.f32 %f6, %f2, %f1;
add.rn.f32 %f7, %f6, %f5;
st.global.f32 [%rd2], %f7;

Will reword this section and point out that typical SIMD does not exist on GPUs and you need to use SIMT with warps instead. I don’t think that transforming SIMD.reduce_add etc. into warp SIMT instructions is viable, because that would require launching extra threads.
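
On the barrier() question above (point 2), here’s a minimal sketch of the case where it is needed, reusing the tutorial’s shared-memory pattern (the kernel itself is hypothetical): every thread writes one shared slot, and the barrier guarantees those writes are visible before thread 0 reads them:

    from gpu import barrier, thread_idx
    from gpu.memory import AddressSpace
    from memory import UnsafePointer, stack_allocation

    alias threads = 32
    alias dtype = DType.float32

    fn shared_sum_kernel(
        out: UnsafePointer[Scalar[dtype]],
        data: UnsafePointer[Scalar[dtype]],
    ):
        var shared = stack_allocation[
            threads,
            Scalar[dtype],
            address_space = AddressSpace.SHARED,
        ]()
        shared[thread_idx.x] = data[thread_idx.x]
        # Without this barrier, thread 0 could read slots that other
        # threads in the block haven't written yet.
        barrier()
        if thread_idx.x == 0:
            var total = Scalar[dtype](0)
            for i in range(threads):
                total += shared[i]
            out[0] = total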

All of your feedback has been very valuable, cheers. I’ll get all the fixes in today.

Thanks IvoB, I’ll add a part about updating to the latest Mojo, cheers.

I’ve made changes for all suggestions, which will go out with the next release. Just a few things where I diverged from your suggestions a bit:

  • I’ve made all the variables and aliases more descriptive, but kept “in” for “input”, so they’re now host_in_buffer, device_in_buffer, etc.
  • Using a scalar for grid_dim and block_dim is common, so I added a tip box explaining that you can use either a scalar or a tuple, and what they map to.
  • This document is used for the online docs, so it can’t be split into two files. Plus, a lot of the code blocks reuse things from earlier cells, like the imports and buffer setup, which I’d have to repeat if there were multiple markdown files.

Thanks again for all the great suggestions, everything else will be fixed when the next nightly goes out.

Sounds good, thanks for the fixes!

@jack.clayton I think I caught an error in the current tutorial:

    var shared = stack_allocation[
        threads * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()

is allocating too much memory, isn’t it? Maybe

    var shared = stack_allocation[
        threads,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()

is better?

From the docs (stack_allocation | Modular), the first parameter should be the number of elements, not the number of bytes.
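
To make the difference concrete (assuming threads = 32 and dtype = DType.float32, which I’m picking just for illustration):

    from sys import sizeof

    alias threads = 32
    alias dtype = DType.float32

    # Original: the element count gets multiplied by the element size,
    # so this asks for 32 * 4 = 128 float32 slots (512 bytes).
    alias element_count_wrong = threads * sizeof[dtype]()

    # Fixed: one slot per thread, 32 float32 slots (128 bytes).
    alias element_count_right = threads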

Actually, it seems that SIMD is possible on GPUs. Maybe the GPU you were using didn’t support it? Or maybe it’s a compiler issue?

Thanks, I asked about this. There are some SIMD instructions available, like SIMD loads, but they have alignment requirements, e.g. this:

from memory import UnsafePointer
from sys import alignof

fn foo(data: UnsafePointer[Scalar[DType.float32]]):
    # Vector load of 4 x float32, aligned to the full 16-byte vector
    var x = data.load[
        width=4, alignment = alignof[SIMD[DType.float32, 4]]()
    ]()

Generates PTX:

ld.global.v4.f32 {%f1, %f2, %f3, %f4}, [%rd2];

Will reword again, thanks, and fix that stack_allocation.