GPU Puzzles P09 Shared memory indexing issue

I am looking at GPU problem p09 with LayoutTensor, and I have this solution:

fn pooling[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    if global_i < size:
        acc = Float32(0)
        for j in range(max(local_i-2, 0), local_i+1):
            acc += shared[j][0]

        output[global_i] = acc

For some reason, it fails for thread_idx = 1:

out: HostBuffer([0.0, 0.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]) 
Unhandled exception caught during execution: At /home/arseni/repositories/mojo-gpu-puzzles/problems/p09/p09_layout_tensor.mojo:80:29: AssertionError: left == right comparison failed: left: 0.0 

I can see the suggested solution in the docs, but I want to understand why this does not work, or how you would debug something like this in Mojo/lldb. Is there some crux behind indexing into shared/SIMD memory? Or is it just some indexing issue?

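local_i is a UInt, so local_i-2 can underflow. For local_i = 1 the subtraction wraps around to a very large unsigned value, max(..., 0) keeps that value, the range is empty, and acc stays 0. Thread 0 hits the same wraparound, but its expected output is 0.0 anyway, so the failure first shows up at index 1.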
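To make the fix concrete, here is a minimal CPU-only sketch of the window-start computation, assuming a recent Mojo toolchain where `Int(...)` accepts a `UInt` and `UInt(...)` accepts an `Int`. The helper name `window_start` is hypothetical and not part of the puzzle code; it only isolates the `max(local_i-2, 0)` expression from the kernel.

```mojo
fn window_start(local_i: UInt) -> Int:
    # Convert to signed Int *before* subtracting, so the result can
    # go negative and be clamped to 0 by max() instead of wrapping.
    return max(Int(local_i) - 2, 0)


fn main():
    # Print (thread index, window start, window end) for the first
    # few threads; the end bound mirrors local_i+1 from the kernel.
    for i in range(4):
        var local_i = UInt(i)
        print(i, window_start(local_i), Int(local_i) + 1)
```

In the kernel itself, the same idea would mean writing the loop bounds as `range(max(Int(local_i) - 2, 0), Int(local_i) + 1)`, so that the signed conversion happens before the subtraction.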


You're right: wrapping the local_i-2 in Int solves the problem. Thank you.