I am going through the GPU puzzles and really enjoying them, thank you very much for this awesome treat!
I am currently at Puzzle 11. The solution is correct, but I find it a bit confusing around the initialization of shared memory and the performance claims. Maybe it is my inexperience, so please correct me if I am wrong.
Since the condition `if global_i + j < SIZE_2` is checked inside the convolution loop itself, we don't strictly need to initialize shared_a with zeroes at all. The solution nevertheless makes a point of being extra safe by zero-filling anyway, but only for shared_a at indexes [TPB, TPB + CONV_2 - 1). I believe that, to be consistent, the same should be done for indexes [0, TPB) (see the commented-out `else` branches in the code below).
The solution also claims that adding the same condition `if global_i + j < SIZE_2` makes the kernel faster, but that does not feel correct either. For large enough inputs the condition will almost always evaluate to True, so it does very little pruning. And even when it evaluates to False for some thread, there will likely be another thread in the same warp for which it is True, and the whole warp would have to step through the if branch anyway. So in the end the if seems redundant as an optimization. I am not 100% sure my understanding of warp divergence is correct, so please correct me if I am wrong.
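To make the pruning argument concrete, assuming the puzzle's test sizes of TPB = 8, SIZE_2 = 15 and CONV_2 = 4 (please correct me if I misremember those): the condition `global_i + j < SIZE_2` can only be False for threads with `global_i >= SIZE_2 - (CONV_2 - 1) = 12`, i.e. at most 3 threads in the last block. Those threads sit in the same warp as threads 8..11, for which the condition holds for every `j`, so that warp steps through the loop body all CONV_2 times regardless.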
The idea in code:
```mojo
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb


fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()

    # NOTE: initialize shared_a on indexes [0, TPB + CONV_2 - 1) with a non-zero value
    # shared_a[local_i] = 10
    # if local_i + TPB < TPB + CONV_2 - 1:
    #     shared_a[local_i + TPB] = 10

    # NOTE: initialize shared_a on indexes [0, TPB)
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    # else:
    #     shared_a[local_i] = 0

    # NOTE: initialize shared_a on indexes [TPB, TPB + CONV_2 - 1)
    if local_i < CONV_2 - 1:
        if global_i + TPB < SIZE_2:
            shared_a[local_i + TPB] = a[global_i + TPB]
        # else:
        #     shared_a[local_i + TPB] = 0

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < SIZE_2:
        var local_sum: output.element_type = 0

        @parameter
        for j in range(CONV_2):
            if global_i + j < SIZE_2:
                local_sum += shared_a[local_i + j] * shared_b[j]
        output[global_i] = local_sum
```
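And for comparison, a minimal sketch of what I mean by the if being redundant: assuming shared_a is zero-filled on both ranges (i.e. the commented-out else branches above are enabled), the out-of-range taps multiply against 0 and the inner loop can drop the bounds check entirely:

```mojo
    # Sketch only: this requires shared_a to be zero-padded past the
    # valid data, so reads beyond SIZE_2 contribute 0 to the sum.
    if global_i < SIZE_2:
        var local_sum: output.element_type = 0

        @parameter
        for j in range(CONV_2):
            # no `global_i + j < SIZE_2` check needed here
            local_sum += shared_a[local_i + j] * shared_b[j]
        output[global_i] = local_sum
```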