I am looking at GPU problem p09 with LayoutTensor, and I have this solution:

fn pooling[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i]
    barrier()
    if global_i < size:
        acc = Float32(0)
        for j in range(max(local_i-2, 0), local_i+1):
            acc += shared[j][0]
        output[global_i] = acc

For some reason, it seems to fail for thread_idx.x = 1:
out: HostBuffer([0.0, 0.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])
Unhandled exception caught during execution: At /home/arseni/repositories/mojo-gpu-puzzles/problems/p09/p09_layout_tensor.mojo:80:29: AssertionError: left == right comparison failed: left: 0.0
I can see the suggested solution in the docs, but I want to understand why this does not work, or how you would debug something like this in Mojo/lldb. Is there some crux behind indexing into shared/SIMD memory? Or is it just some indexing issue?
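
One way to narrow this down without a debugger is to reproduce the loop-bound arithmetic on the host. The sketch below is only a hypothesis check, not a confirmed diagnosis: it assumes that thread_idx.x (and therefore local_i) is an unsigned UInt, so that local_i-2 wraps around instead of going negative. The variable names start_unsigned and start_signed are made up for illustration.

```mojo
fn main():
    # Assumption: in the kernel, local_i = thread_idx.x is a UInt (unsigned).
    var local_i: UInt = 1

    # Same expression as the loop's lower bound in the kernel.
    # If local_i is unsigned, 1 - 2 cannot be negative and wraps around.
    var start_unsigned = max(local_i - 2, 0)

    # Converting to a signed Int first keeps the subtraction signed,
    # so max(..., 0) can clamp it as intended.
    var start_signed = max(Int(local_i) - 2, 0)

    print("lower bound with UInt:", start_unsigned)
    print("lower bound with Int: ", start_signed)
```

If the first value comes out as a very large number rather than 0, the range in the kernel would be empty for local_i = 0 and local_i = 1, which would leave acc at 0. That would match the output above: index 0 happens to be correct because the expected value is also 0, so only index 1 shows up as a failure. Under that assumption, writing local_i = Int(thread_idx.x) in the kernel would be the change to try.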