How to distribute data into blocks?

Hello

I was going through the GPU puzzles and got to p06, where the input array is split across multiple thread blocks. It looks like each thread block is filled with data from the input array before the next thread block is used.

comptime SIZE = 9
comptime BLOCKS_PER_GRID = (3, 1)
comptime THREADS_PER_BLOCK = (4, 1)
comptime dtype = DType.float32

def add_10_blocks(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    var i = block_dim.x * block_idx.x + thread_idx.x

And then the input array fills up each block before going to the next one. Is there a way to specify how the data from the array is distributed into a block? For example, what if I wanted only every 2nd thread of the block to have data? So, for example, given

a = [1..9]

block 0, thread 0 reads index a[0]
block 0, thread 1 reads nothing
block 0, thread 2 reads index a[1]
block 0, thread 3 reads nothing

You can use the values exposed by std.gpu.primitives.id as well as a bit of math to have each thread come up with an arbitrary index. Mapping the thread count and 1d array length directly to each other is more of a convention than an actual rule.
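To make the "convention, not a rule" point concrete, here is a minimal host-side sketch in plain Python (not Mojo and not GPU code) that replays the 3 x 4 launch from the question and records which element each thread would read. The names `simulate_launch` and `index_for` are hypothetical helpers invented for this illustration; only the formula `block_dim.x * block_idx.x + thread_idx.x` comes from the thread.

```python
def simulate_launch(blocks_per_grid, threads_per_block, size, index_for):
    """Replay a 1D launch on the host.

    index_for(global_id) returns the element index a thread should read,
    or None if that thread should stay idle. (Hypothetical helper.)
    """
    reads = {}
    for block in range(blocks_per_grid):
        for thread in range(threads_per_block):
            # Same math as block_dim.x * block_idx.x + thread_idx.x
            global_id = threads_per_block * block + thread
            idx = index_for(global_id)
            # Bounds guard: threads that map outside the array do nothing
            if idx is not None and 0 <= idx < size:
                reads[(block, thread)] = idx
            else:
                reads[(block, thread)] = None
    return reads


# The conventional mapping: thread i reads a[i].
identity = simulate_launch(3, 4, 9, lambda i: i)

# An arbitrary alternative: walk the array back to front.
reverse = simulate_launch(3, 4, 9, lambda i: 9 - 1 - i)
```

Both launches use the same 12 threads; only the function from global thread id to array index changes. In either case the last three threads of block 2 fall outside the 9-element array and are masked off by the bounds guard.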

For your desired pattern, you can do this:

var i = block_dim.x * block_idx.x + thread_idx.x
# Only even-numbered threads load; thread i reads element i // 2.
if i % 2 == 0:
    var a_value = a.load(i // 2)
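One consequence worth checking is coverage: if only every 2nd thread reads, the original 3 x 4 launch no longer reaches all 9 elements. The sketch below is a plain-Python host-side simulation (again, not Mojo or GPU code) of the `i % 2 == 0` pattern. `every_other_reads` and `blocks_needed` are hypothetical names made up for this example, and the `i // 2 < size` guard is an assumption added here rather than part of the snippet above.

```python
def every_other_reads(blocks_per_grid, threads_per_block, size):
    """List (block, thread, element index or None) for the i % 2 == 0 pattern."""
    reads = []
    for block in range(blocks_per_grid):
        for thread in range(threads_per_block):
            i = threads_per_block * block + thread
            # Even global ids read a[i // 2]; odd ids (and out-of-range ones) idle
            if i % 2 == 0 and i // 2 < size:
                reads.append((block, thread, i // 2))
            else:
                reads.append((block, thread, None))
    return reads


def blocks_needed(size, threads_per_block, stride=2):
    """Blocks required so the last element still has a reading thread."""
    # Element size - 1 is read by global id stride * (size - 1)
    threads = stride * (size - 1) + 1
    return -(-threads // threads_per_block)  # ceiling division


reads = every_other_reads(3, 4, 9)
covered = {idx for _, _, idx in reads if idx is not None}
```

With 3 blocks of 4 threads, `covered` contains only elements 0 through 5. `blocks_needed(9, 4)` gives 5, so under these assumptions the grid would need to grow to 5 blocks for all 9 elements to be read.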

If you have experience with SIMD on CPUs, GPU “threads” are essentially SIMD lanes on a 1024-bit-wide SIMD unit (32 lanes of 32 bits each, on NVIDIA and AMD RDNA). So, if you want to do tile-level programming, you need to take a bit of a mental step back from the thread-per-element picture that CUDA encourages.
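As a rough illustration of the lane view, the following plain-Python sketch (not GPU code) splits a thread index into a SIMD-group number and a lane, and builds the active-lane mask for the every-2nd-thread pattern. `WARP_SIZE = 32` and `LANE_BITS = 32` are assumptions matching NVIDIA warps and RDNA wave32 with one float32 per lane; the helper names are hypothetical.

```python
WARP_SIZE = 32  # assumption: lanes per SIMD group (NVIDIA warp, RDNA wave32)
LANE_BITS = 32  # assumption: one float32 value per lane


def warp_and_lane(thread_idx_x):
    """Split a 1D thread index within a block into (warp, lane)."""
    return divmod(thread_idx_x, WARP_SIZE)


def active_mask(predicate, warp_size=WARP_SIZE):
    """Per-lane booleans: True where the lane would execute the branch."""
    return [predicate(lane) for lane in range(warp_size)]


simd_width_bits = WARP_SIZE * LANE_BITS  # 32 lanes * 32 bits
mask = active_mask(lambda lane: lane % 2 == 0)
```

Under these assumptions, `simd_width_bits` is 1024 and `mask` has 16 of its 32 lanes set. Seen this way, an `if i % 2 == 0` branch behaves like a masked SIMD operation: the odd lanes are still part of the same unit, they are just switched off for that instruction.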