Hello
I was going through the GPU puzzles and got to p06, where the input array is split across multiple thread blocks. It looks like each thread block is filled with data from the input array before moving on to the next thread block.
comptime SIZE = 9
comptime BLOCKS_PER_GRID = (3, 1)
comptime THREADS_PER_BLOCK = (4, 1)
comptime dtype = DType.float32
def add_10_blocks(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    var i = block_dim.x * block_idx.x + thread_idx.x
So the input array fills up each block before moving on to the next one. Is there a way to control how the array's data is distributed within a block? For example, what if I wanted only every 2nd thread of each block to have data? Given
a = [1..9]
block 0, thread 0 reads index a[0]
block 0, thread 1 reads nothing
block 0, thread 2 reads index a[1]
block 0, thread 3 reads nothing
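
To make the mapping above concrete, here is a rough plain-Python model of the index math only. It is not Mojo and does not launch anything on a GPU. The names `STRIDE`, `source_index` and `blocks_needed` are hypothetical, made up for this sketch. It assumes "every 2nd thread" means threads 0, 2, ... within each block, and that the active threads across blocks read consecutive elements of `a`.

```python
# Plain-Python model of the thread-to-element mapping; an illustration, not a GPU kernel.
SIZE = 9
THREADS_PER_BLOCK = 4
STRIDE = 2  # hypothetical: only every 2nd thread in a block reads data


def source_index(block_idx, thread_idx,
                 block_dim=THREADS_PER_BLOCK, size=SIZE, stride=STRIDE):
    """Return the index of `a` this thread would read, or None if it reads nothing."""
    if thread_idx % stride != 0:
        return None  # skipped thread
    # Number of active threads in one block (ceiling division).
    active_per_block = (block_dim + stride - 1) // stride
    j = block_idx * active_per_block + thread_idx // stride
    return j if j < size else None  # bounds guard, like `if i < size` in a kernel


def blocks_needed(size=SIZE, block_dim=THREADS_PER_BLOCK, stride=STRIDE):
    """Blocks required so that every element of `a` gets an active thread."""
    active_per_block = (block_dim + stride - 1) // stride
    return (size + active_per_block - 1) // active_per_block


if __name__ == "__main__":
    for b in range(blocks_needed()):
        for t in range(THREADS_PER_BLOCK):
            j = source_index(b, t)
            what = "nothing" if j is None else f"index a[{j}]"
            print(f"block {b}, thread {t} reads {what}")
```

Under these assumptions, each block of 4 threads only covers 2 elements, so the 3 blocks from `BLOCKS_PER_GRID = (3, 1)` would reach only the first 6 of the 9 elements; the model suggests the grid would have to grow for such a pattern to cover the whole array.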