Idioms for using an index tensor in kernel computations

Hi all!

I’m trying to express a particular idiom within a GPU kernel: grab an index out of one tensor, then use it to index into another tensor.

The full code is below for reference; the relevant segment is repeated after it.

from gpu import barrier, block_idx, thread_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from memory import stack_allocation

# Block size; the value of this constant isn't shown in the original post.
alias THREADS_PER_BLOCK = 256

# GPU kernel for computing cross-covariance matrix elements
fn compute_cross_covariance_kernel[
    n: Int
](
    W_tensor: LayoutTensor[mut=True, DType.float32, Layout.row_major(3, 3)],
    correspondences_tensor: LayoutTensor[
        mut=True, DType.int32, Layout.row_major(n)
    ],
    distances_tensor: LayoutTensor[
        mut=True, DType.float32, Layout.row_major(n)
    ],
    source_centered_tensor: LayoutTensor[
        mut=False, DType.float32, Layout.row_major(n, 3)
    ],
    target_centered_tensor: LayoutTensor[
        mut=False, DType.float32, Layout.row_major(n, 3)
    ],
    max_dist: Float32,
):
    # Each block computes one element of the 3x3 matrix
    row = block_idx.x
    col = block_idx.y

    if row >= 3 or col >= 3:
        return

    # Shared stack memory for partial sums
    var shared = stack_allocation[
        THREADS_PER_BLOCK,
        Scalar[DType.float32],
        address_space = AddressSpace.SHARED,
    ]()

    tid = thread_idx.x
    var sum: Float32 = 0.0

    # Each thread processes multiple correspondences
    var idx = tid
    while idx < n:
        # Check if correspondence is valid (within distance threshold)
        if distances_tensor[idx] < max_dist:
            var target_idx = correspondences_tensor[idx]
            s_val = source_centered_tensor[idx, row]
            t_val = target_centered_tensor[target_idx, col]
            sum += s_val * t_val

        idx += THREADS_PER_BLOCK

    # Store in shared memory and reduce
    shared[tid] = sum
    barrier()

    # Parallel reduction
    var stride = THREADS_PER_BLOCK // 2
    while stride > 0:
        if tid < stride:
            shared[tid] = shared[tid] + shared[tid + stride]
        barrier()
        stride //= 2

    # Thread 0 writes final result
    if tid == 0:
        W_tensor[row, col] = shared[0]

The relevant part is here:

var target_idx: Int = correspondences_tensor[idx]
s_val = source_centered_tensor[idx, row]
t_val = target_centered_tensor[target_idx, col]
sum += s_val * t_val

Trying this as-is runs into a resolution error (as far as I can tell).

Is there a known idiom for this? I did a cursory search through the GPU puzzles, but I’m not experienced enough to find a good match.

The issue here is that LayoutTensor.__getitem__ returns a SIMD[dtype, Self.element_size] (basically a Tuple of size 1 in your case). You can see that from the error message (though we should clean that up a little). If you wrote correspondences_tensor[idx][0] or Int(correspondences_tensor[idx]) it would work as an index. We are actively exploring how to make that API better.
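Concretely, applied to the failing lines, either suggested form looks like this (a sketch):

# Extract lane 0 of the SIMD result so it can serve as an index...
var target_idx = correspondences_tensor[idx][0]
# ...or convert the 1-element SIMD result to an Int explicitly:
# var target_idx = Int(correspondences_tensor[idx])
t_val = target_centered_tensor[target_idx, col]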


Doing some rebind hacking allows me to compile, but I’m running into

double free or corruption (fasttop)

when attempting to run. I’m debugging that now (assuming my usage of rebind is okay):

target_idx = rebind[Int](correspondences_tensor[idx])
s_val = source_centered_tensor[idx, row]
t_val = target_centered_tensor[target_idx, col]
sum += rebind[Float32](s_val * t_val)

Edit: @lukas’s fix above worked and removed my memory problem.

Surprised to find that Int(...) works, but Float32(...) doesn’t when used in the same capacity, requiring a rebind to work.

If you think about it, Float32(source_centered_tensor[idx, row]) amounts to writing:

SIMD[DType.float32, 1](SIMD[DType.float32, N](...))

which doesn’t really make sense.
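Extracting a lane first sidesteps this, since indexing into the SIMD value yields a single scalar (a sketch, using the tensors from the kernel above):

var v = source_centered_tensor[idx, row]  # SIMD[DType.float32, N]
var f = Float32(v[0])                     # fine: v[0] is one scalar lane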


Int isn’t an alias for SIMD; it’s its own type, closer to isize in Rust (or ssize_t in C++).
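For reference, the standard library defines Float32 as a one-lane SIMD alias, whereas Int is a standalone struct (paraphrased; the exact spelling may differ by stdlib version):

# Paraphrased from the Mojo standard library:
#   alias Scalar = SIMD[size=1]
#   alias Float32 = Scalar[DType.float32]  # i.e. SIMD[DType.float32, 1]
# Int is its own struct, so Int(...) is a genuine conversion rather than
# a SIMD-of-SIMD construction:
var i = Int(SIMD[DType.int32, 1](7))  # works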


Excellent, thanks – I was just about to ask what the difference between Int and Float32 is.

That being said, rather than rebind, you can write source_centered_tensor[idx, row][0].
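Putting the suggestions together, the inner loop of the original kernel can be written without rebind (a sketch; sum is assumed to be declared as a Float32, as in the listing above):

if distances_tensor[idx] < max_dist:
    # Int(...) turns the 1-element SIMD load into a usable index:
    var target_idx = Int(correspondences_tensor[idx])
    # [0] extracts the scalar lane from each element load:
    s_val = source_centered_tensor[idx, row][0]
    t_val = target_centered_tensor[target_idx, col][0]
    sum += s_val * t_val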