I am going through the GPU puzzles and really enjoying them, thank you very much for this awesome treat!
I am currently at Puzzle 11. The solution is correct, but I find it a bit confusing around the initialization of shared memory and the performance claims. Maybe it is my inexperience, so please correct me if I am wrong.
Since the condition `if global_i + j < SIZE_2` is checked inside the convolution loop itself, we don't strictly need to initialize shared_a with zeroes at all. The solution nevertheless makes a point of being extra safe by zero-filling anyway, but only for shared_a at indexes [TPB, TPB + CONV_2 - 1). I believe that, to be consistent, the same should be done for indexes [0, TPB) (see the commented-out `else` branches in the code below).
The solution also claims that adding the same condition `if global_i + j < SIZE_2` makes the kernel faster, but that does not feel correct either. For large enough inputs the condition will almost always evaluate to True, so it does very little pruning. And even when it evaluates to False for some thread, there will likely be another thread in the same warp for which it is True, and the whole warp would have to step through the if branch anyway. So in the end the if seems redundant as an optimization. I am not 100% sure my understanding of warp divergence is correct, so please correct me if I am wrong.
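To make the pruning argument concrete, assuming the puzzle's test sizes of TPB = 8, SIZE_2 = 15 and CONV_2 = 4 (please correct me if I misremember those): the condition `global_i + j < SIZE_2` can only be False for threads with `global_i >= SIZE_2 - (CONV_2 - 1) = 12`, i.e. at most 3 threads in the last block. Those threads sit in the same warp as threads 8..11, for which the condition holds for every `j`, so that warp steps through the loop body all CONV_2 times regardless.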
The idea in code:
```mojo
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb


fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()

    # NOTE: initialize shared_a on indexes [0, TPB + CONV_2 - 1) with a non-zero value
    # shared_a[local_i] = 10
    # if local_i + TPB < TPB + CONV_2 - 1:
    #     shared_a[local_i + TPB] = 10

    # NOTE: initialize shared_a on indexes [0, TPB)
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    # else:
    #     shared_a[local_i] = 0

    # NOTE: initialize shared_a on indexes [TPB, TPB + CONV_2 - 1)
    if local_i < CONV_2 - 1:
        if global_i + TPB < SIZE_2:
            shared_a[local_i + TPB] = a[global_i + TPB]
        # else:
        #     shared_a[local_i + TPB] = 0

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < SIZE_2:
        var local_sum: output.element_type = 0

        @parameter
        for j in range(CONV_2):
            if global_i + j < SIZE_2:
                local_sum += shared_a[local_i + j] * shared_b[j]
        output[global_i] = local_sum
```
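And for comparison, a minimal sketch of what I mean by the if being redundant: assuming shared_a is zero-filled on both ranges (i.e. the commented-out else branches above are enabled), the out-of-range taps multiply against 0 and the inner loop can drop the bounds check entirely:

```mojo
    # Sketch only: this requires shared_a to be zero-padded past the
    # valid data, so reads beyond SIZE_2 contribute 0 to the sum.
    if global_i < SIZE_2:
        var local_sum: output.element_type = 0

        @parameter
        for j in range(CONV_2):
            # no `global_i + j < SIZE_2` check needed here
            local_sum += shared_a[local_i + j] * shared_b[j]
        output[global_i] = local_sum
```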