Idioms for using an index tensor in kernel computations

Hi all!

I’m trying to express a particular idiom within a GPU kernel: grab an index out of one tensor, then use it to index into another tensor.

The full code is below for reference; the relevant segment is repeated after it.

from gpu import barrier, block_idx, thread_idx
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from memory import stack_allocation

# Block size; the value of this constant isn't shown in the original post.
alias THREADS_PER_BLOCK = 256

# GPU kernel for computing cross-covariance matrix elements
fn compute_cross_covariance_kernel[
    n: Int
](
    W_tensor: LayoutTensor[mut=True, DType.float32, Layout.row_major(3, 3)],
    correspondences_tensor: LayoutTensor[
        mut=True, DType.int32, Layout.row_major(n)
    ],
    distances_tensor: LayoutTensor[
        mut=True, DType.float32, Layout.row_major(n)
    ],
    source_centered_tensor: LayoutTensor[
        mut=False, DType.float32, Layout.row_major(n, 3)
    ],
    target_centered_tensor: LayoutTensor[
        mut=False, DType.float32, Layout.row_major(n, 3)
    ],
    max_dist: Float32,
):
    # Each block computes one element of the 3x3 matrix
    row = block_idx.x
    col = block_idx.y

    if row >= 3 or col >= 3:
        return

    # Shared stack memory for partial sums
    var shared = stack_allocation[
        THREADS_PER_BLOCK,
        Scalar[DType.float32],
        address_space = AddressSpace.SHARED,
    ]()

    tid = thread_idx.x
    var sum: Float32 = 0.0

    # Each thread processes multiple correspondences
    var idx = tid
    while idx < n:
        # Check if correspondence is valid (within distance threshold)
        if distances_tensor[idx] < max_dist:
            var target_idx = correspondences_tensor[idx]
            s_val = source_centered_tensor[idx, row]
            t_val = target_centered_tensor[target_idx, col]
            sum += s_val * t_val

        idx += THREADS_PER_BLOCK

    # Store in shared memory and reduce
    shared[tid] = sum
    barrier()

    # Parallel reduction
    var stride = THREADS_PER_BLOCK // 2
    while stride > 0:
        if tid < stride:
            shared[tid] = shared[tid] + shared[tid + stride]
        barrier()
        stride //= 2

    # Thread 0 writes final result
    if tid == 0:
        W_tensor[row, col] = shared[0]

The relevant part is here:

var target_idx: Int = correspondences_tensor[idx]
s_val = source_centered_tensor[idx, row]
t_val = target_centered_tensor[target_idx, col]
sum += s_val * t_val

Trying this as-is runs into a resolution error (as far as I can tell).

Is there a known idiom for this? I did a cursory search through the GPU puzzles, but I’m not experienced enough to find a good match.

The issue here is that LayoutTensor.__getitem__ returns a SIMD[dtype, Self.element_size] (basically a Tuple of size 1 in your case). You can see that from the error message (though we should clean that up a little). If you wrote correspondences_tensor[idx][0] or Int(correspondences_tensor[idx]) it would work as an index. We are actively exploring how to make that API better.
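Concretely, applied to the failing lines, either suggested form looks like this (a sketch):

# Extract lane 0 of the SIMD result so it can serve as an index...
var target_idx = correspondences_tensor[idx][0]
# ...or convert the 1-element SIMD result to an Int explicitly:
# var target_idx = Int(correspondences_tensor[idx])
t_val = target_centered_tensor[target_idx, col]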


Doing some rebind hacking allows me to compile, but I’m running into

double free or corruption (fasttop)

when attempting to run. I’m debugging that now (assuming my usage of rebind is okay):

target_idx = rebind[Int](correspondences_tensor[idx])
s_val = source_centered_tensor[idx, row]
t_val = target_centered_tensor[target_idx, col]
sum += rebind[Float32](s_val * t_val)

Edit: @lukas’s fix above worked and removed my memory problem.

Surprised to find that Int(...) works, but Float32(...) doesn’t when used in the same capacity, requiring a rebind to work.

If you think about it, Float32(source_centered_tensor[idx, row]) amounts to writing:

SIMD[DType.float32, 1](SIMD[DType.float32, N](...))

which doesn’t really make sense.
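Extracting a lane first sidesteps this, since indexing into the SIMD value yields a single scalar (a sketch, using the tensors from the kernel above):

var v = source_centered_tensor[idx, row]  # SIMD[DType.float32, N]
var f = Float32(v[0])                     # fine: v[0] is one scalar lane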


Int isn’t an alias for SIMD; it’s its own type, closer to isize in Rust (or ssize_t in C++).
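For reference, the standard library defines Float32 as a one-lane SIMD alias, whereas Int is a standalone struct (paraphrased; the exact spelling may differ by stdlib version):

# Paraphrased from the Mojo standard library:
#   alias Scalar = SIMD[size=1]
#   alias Float32 = Scalar[DType.float32]  # i.e. SIMD[DType.float32, 1]
# Int is its own struct, so Int(...) is a genuine conversion rather than
# a SIMD-of-SIMD construction:
var i = Int(SIMD[DType.int32, 1](7))  # works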


Excellent, thanks – I was just about to ask what the difference between Int and Float32 is.

That being said, rather than rebind, you can write source_centered_tensor[idx, row][0].
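Putting the suggestions together, the inner loop of the original kernel can be written without rebind (a sketch; sum is assumed to be declared as a Float32, as in the listing above):

if distances_tensor[idx] < max_dist:
    # Int(...) turns the 1-element SIMD load into a usable index:
    var target_idx = Int(correspondences_tensor[idx])
    # [0] extracts the scalar lane from each element load:
    s_val = source_centered_tensor[idx, row][0]
    t_val = target_centered_tensor[target_idx, col][0]
    sum += s_val * t_val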