Questions regarding puzzle 14

I have a number of questions related to the solution to this puzzle. Some of them may seem pedantic, but I'm honestly not trying to be; I just want to make sure my understanding is correct.

  1. In the solution to the tiled example, is the initialization/resetting of the shared memory (and the resulting extra `barrier()` synchronization) redundant, given that we defensively check the bounds for each operation? i.e. can this be removed?

     # Reset shared memory tiles
     if local_row < TPB and local_col < TPB:
         a_shared[local_row, local_col] = 0
         b_shared[local_row, local_col] = 0
    
     barrier()
    
  2. The "5. Computation within tile" section of the explanation talks about memory coalescing, but it is unclear how this relates to shared memory reads, where it is bank conflicts that increase the number of memory transactions. Am I missing something here?

  3. Technically, don’t standard loads from global memory operate asynchronously? That is, the warp can continue with instructions that are not dependent on the loaded memory after the load instruction issues; then, once the warp needs to use that loaded memory, it stalls until the data is available. The text

    Launch asynchronous memory transfers that may overlap with computation via …

    would imply to a new user that this is not the case under normal circumstances. I was also under the impression that one of the main advantages of using this separate copy engine in CUDA is that it avoids the intermediate copy to registers. I may be wrong, but this is one of the things that tripped me up when pipelining in CUDA: I couldn’t work out why a pipeline with asynchronous (cp.async) copies on its own didn’t magically increase performance. The reason: the existing pipeline with standard copies was already asynchronous.
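To convince myself on question 1, I sketched a quick CPU analogue in Python (a sketch only; `tiled_matmul` and all the names in it are my own stand-ins, not from the puzzle code). The "shared memory" tiles start out full of NaN garbage and are deliberately never reset, yet the bounds-checked loads, the k-guard in the accumulate loop, and the guarded write-back keep stale values out of the final result:

```python
def tiled_matmul(a, b, m, n, k, tpb=4):
    # CPU analogue of the tiled kernel. The "shared memory" tiles start
    # full of NaN garbage and are deliberately NEVER reset between tile
    # iterations; every load and every accumulate is bounds-checked instead.
    c = [[0.0] * n for _ in range(m)]
    a_sh = [[float("nan")] * tpb for _ in range(tpb)]
    b_sh = [[float("nan")] * tpb for _ in range(tpb)]
    for br in range((m + tpb - 1) // tpb):           # output tile rows
        for bc in range((n + tpb - 1) // tpb):       # output tile cols
            acc = [[0.0] * tpb for _ in range(tpb)]  # per-"thread" accumulators
            for kt in range((k + tpb - 1) // tpb):   # walk tiles along k
                # Load phase: one (ly, lx) pair per "thread", bounds-checked,
                # so out-of-range elements are simply left stale.
                for ly in range(tpb):
                    for lx in range(tpb):
                        row, col = br * tpb + ly, bc * tpb + lx
                        if row < m and kt * tpb + lx < k:
                            a_sh[ly][lx] = a[row][kt * tpb + lx]
                        if kt * tpb + ly < k and col < n:
                            b_sh[ly][lx] = b[kt * tpb + ly][col]
                # (barrier() would go here on the GPU)
                # Compute phase: the k-guard below is what makes the zero-init
                # unnecessary -- stale entries along k are never read.
                for ly in range(tpb):
                    for lx in range(tpb):
                        for kk in range(tpb):
                            if kt * tpb + kk < k:
                                acc[ly][lx] += a_sh[ly][kk] * b_sh[kk][lx]
                # (second barrier() would go here)
            # Write-back, bounds-checked, so "threads" whose accumulator
            # picked up garbage from an out-of-range row/col never store it.
            for ly in range(tpb):
                for lx in range(tpb):
                    row, col = br * tpb + ly, bc * tpb + lx
                    if row < m and col < n:
                        c[row][col] = acc[ly][lx]
    return c
```

With dimensions that are not multiples of the tile size (e.g. 5×6 times 6×7 with `tpb=4`) this matches a naive matmul exactly, which is what made me suspect the reset is removable.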


For reference: Click through Solution: Manual tiling of Puzzle 14

Great points!

  1. Yes, you’re right! That’s redundant because the subsequent bounds checks cover it, so it should be cleaned up.
  2. Yes, it should be corrected to say avoiding bank conflicts rather than memory coalescing.
  3. Correct! Global memory loads are async; I will adjust the explanation. I believe cp.async is the same as copy_dram_to_sram_async, which we use in the idiomatic implementation right after the manual one. Note that cp.async has more benefits than just being async: it bypasses registers, i.e. goes directly from global memory to shared memory, and it has better resource utilization. Why don’t you see a speedup? I think because there isn’t enough computation overlap to hide the memory latency during the copy operations.

Added as an issue in the repo to fix.


@Ehsan In the idiomatic tiled solution, transposed access is used for accessing the elements of matrix b, but there is no comment to explain why. While this is a great demonstration of the flexibility of the layout approach, I think there should be a note explaining that this access pattern is not optimal (coalesced) for this puzzle but will be very useful in other circumstances, e.g. if matrix b is not already transposed and you want to load it so that the shared memory accesses will be “coalesced” (free of bank conflicts), which I assume from the documentation would look like

alias load_b_layout = Layout.row_major(TPB, 1)
alias store_b_layout = Layout.row_major(1, TPB)
copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile)
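To check my understanding of the effect of swapping the two layouts, here is a toy Python model (a sketch only; `row_major_coords` and `copy_transposed` are my own stand-ins and do not capture the real thread-partitioning semantics of `copy_dram_to_sram_async`). The idea is that reading with one index pattern while writing with the swapped pattern lands the tile transposed in shared memory:

```python
TPB = 4

def row_major_coords(idx, cols):
    # linear "thread" index -> (row, col) in a row-major layout
    return idx // cols, idx % cols

def copy_transposed(tile, tpb):
    # Each "thread" idx reads the tile with a row-major pattern but
    # writes "shared memory" with the swapped (column-major) pattern,
    # so the tile lands in shared memory transposed.
    shared = [[0] * tpb for _ in range(tpb)]
    for idx in range(tpb * tpb):
        src_r, src_c = row_major_coords(idx, tpb)  # read pattern
        dst_c, dst_r = row_major_coords(idx, tpb)  # write pattern, swapped
        shared[dst_r][dst_c] = tile[src_r][src_c]
    return shared

b_tile = [[r * TPB + c for c in range(TPB)] for r in range(TPB)]
b_shared = copy_transposed(b_tile, TPB)
```

If this mental model is right, successive "threads" read consecutive global addresses (coalesced) while scattering the writes, which is exactly the trade you'd want when b is not already transposed.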

What do you think?