How to distribute data into blocks?

sashang · June 1, 2026, 2:01pm

Hello

I was going through the GPU puzzles and got to p06, where the input array is split across multiple thread blocks. It looks like each thread block is filled with data from the input array before moving on to the next thread block.

comptime SIZE = 9
comptime BLOCKS_PER_GRID = (3, 1)
comptime THREADS_PER_BLOCK = (4, 1)
comptime dtype = DType.float32

def add_10_blocks(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    var i = block_dim.x * block_idx.x + thread_idx.x

So the input array fills up each block before moving on to the next one. Is there a way to specify how the data from the array is distributed into a block? For example, what if I wanted only every 2nd thread of the block to have data? So, for example, given

a = [1..9]

block 0, thread 0 reads index a[0]
block 0, thread 1 reads nothing
block 0, thread 2 reads index a[1]
block 0, thread 3 reads nothing

owenhilyard · June 1, 2026, 4:50pm

You can use the values exposed by std.gpu.primitives.id as well as a bit of math to have each thread come up with an arbitrary index. Mapping the thread count and 1d array length directly to each other is more of a convention than an actual rule.

For your desired pattern, you can do this:

var i = block_dim.x * block_idx.x + thread_idx.x
if i % 2 == 0:
    var a_value = a.load(i // 2)
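
To check which thread ends up reading which element, the index math above can be simulated on the host. This is a plain Python sketch, not Mojo kernel code. It assumes the launch configuration from the question (3 blocks of 4 threads, 9 elements). The helper name `strided_reads` is made up for illustration, and the `i // 2 < size` bounds check is an addition that is not in the snippet above.

```python
# Host-side simulation of the index math; this is not GPU code.
SIZE = 9
BLOCKS_PER_GRID = 3
THREADS_PER_BLOCK = 4


def strided_reads(size, blocks, threads_per_block):
    """Return {(block, thread): index of `a` read} when every 2nd thread reads."""
    reads = {}
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Same global thread index as in the kernel.
            i = threads_per_block * block_idx + thread_idx
            # Even global threads read a[i // 2]; odd ones stay idle.
            # The bounds check is an added assumption, not part of the original snippet.
            if i % 2 == 0 and i // 2 < size:
                reads[(block_idx, thread_idx)] = i // 2
    return reads


reads = strided_reads(SIZE, BLOCKS_PER_GRID, THREADS_PER_BLOCK)
for (block, thread), index in sorted(reads.items()):
    print(f"block {block}, thread {thread} reads a[{index}]")
```

With 12 threads in total, only the 6 even-numbered threads read, so only a[0] through a[5] are touched. Covering all 9 elements with this pattern would need at least 18 threads, for example 5 blocks of 4.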

If you have experience with SIMD on CPUs, GPU “threads” are actually just SIMD lanes on a 1024-bit SIMD unit (for NVIDIA and AMD RDNA). So, if you want to do tile-level programming you need to take a bit of a mental step back from what CUDA is trying to get you to do.
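
One way to picture the lane view is to work through the arithmetic. The sketch below is plain Python and rests on two assumptions: 32 lanes per SIMD group (a warp on NVIDIA) and one 32-bit value such as a float32 per lane. The constant names are made up for illustration. It also assumes a block of at least 32 threads, so that the first 32 global indices fall into one group.

```python
LANES_PER_GROUP = 32  # assumption: 32 lanes per warp
BITS_PER_LANE = 32    # assumption: one float32 per lane

# Total width of the SIMD unit under these assumptions.
simd_width_bits = LANES_PER_GROUP * BITS_PER_LANE

# Lanes in the first group that pass the `i % 2 == 0` test.
active_lanes = [lane for lane in range(LANES_PER_GROUP) if lane % 2 == 0]

print(simd_width_bits)    # 1024
print(len(active_lanes))  # 16
```

Under these assumptions, the every-2nd-thread pattern leaves half of the lanes in a group idle while the load executes.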
