Spoiler alert: I’m asking a question that references the solution to Puzzle 4: 2D Map from the Modular puzzle exercises.
Modular’s puzzle series is my first time working with any sort of GPU programming. I’ve been reading through the puzzle docs along with doing the puzzles. I’m now trying to understand the “Raw Memory Approach” solution to Puzzle 4.
When talking about GPU threads, I think I understand the basic concept of (i, j) coordinates. Perhaps not identical to, but similar to, x, y coordinates on a map or other “chart-like” object: rows and columns. However, two things about the listed solution confuse me:
1. row * size + col seems like it would evaluate to a single value, not a pair of “coordinate values” I was expecting.
2. I’m also not sure I understand what size is doing here (I mean functionally, not the basic multiplication) and why we need it.
Taken together, these make me think my mental model of GPU threads is at least off the mark, if not completely wrong. I was wondering if anyone might be able to help or point me towards other useful resources.
Thanks!
And also thanks to those on the Modular team for making these puzzles!
Someone else can probably give a more detailed answer. I haven’t worked deeply with the Mojo puzzles, but I have been working through a CUDA C programming textbook and writing the Mojo equivalents of its CUDA C code.
The 2D/3D coordinates are purely logical; the actual memory is one-dimensional. At the end of the day you’re working with 1D global memory, and x, y, z is a mental model for humans. That model matters because of GPUs’ original purpose: addressing x, y pixels in an image. When you’re writing an image-processing algorithm, x, y is more intuitive than raw 1D offsets, and in ML, convolutional neural networks likewise operate over multiple dimensions.
For example, below is CUDA C that sums two “2D” matrices together, where nx and ny are the width and height.
__global__ void sumMatrixOnGPU2D(float *MatA, float *MatB, float *MatC,
                                 int nx, int ny) {
    // Global 2D coordinates of this thread.
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
    // idx is the global linear memory index: row iy, column ix,
    // in a row-major array of width nx.
    unsigned int idx = iy * nx + ix;
    if (ix < nx && iy < ny) {
        MatC[idx] = MatA[idx] + MatB[idx];
    }
}
Additionally, I think in later puzzles they present LayoutTensor, which probably does what you’re looking for with x/y indexing; however, its underlying storage is still a linear layout.
If you look at the explicit declarations of out and a:
var a: DeviceBuffer[dtype] = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
var out: DeviceBuffer[dtype] = ctx.enqueue_create_buffer[dtype](SIZE * SIZE).enqueue_fill(0)
it’s clear that both are contiguous blocks of GPU memory, each large enough to hold SIZE × SIZE elements.
The kernel uses two-dimensional thread indices (thread_idx.x, thread_idx.y), so you need to map those 2D coordinates into a single linear memory address. That’s what size is for: it’s the row stride, the number of elements in one full row, i.e. how far you have to jump in the flat buffer to move down one row.
var row: UInt = thread_idx.y  # logical row of this thread
var col: UInt = thread_idx.x  # logical column
if (row < size) and (col < size):
    # Collapse (row, col) into one flat index: skip `row` full rows
    # of `size` elements, then move `col` along the current row.
    var i: UInt = row * size + col
    output[i] = a[i] + 10