Questions regarding puzzle 14

I have a number of questions related to the solution to this puzzle. It may seem like I am being pedantic in some of them, but I’m honestly not; I just want to make sure my understanding is correct.

  1. In the solution to the tiled example, is the initialization/resetting of the shared memory, and the resulting extra synchronization barrier(), redundant given that we are defensively checking the bounds for each operation? i.e. can this be removed?

     # Reset shared memory tiles
     if local_row < TPB and local_col < TPB:
         a_shared[local_row, local_col] = 0
         b_shared[local_row, local_col] = 0
    
     barrier()
    
  2. The “5. Computation within tile” section of the explanation talks about memory coalescing, but it seems unclear how this relates to shared memory reads, where it is bank conflicts that increase the number of memory transactions. Am I missing something here?

  3. Technically, don’t standard memory loads from global memory operate asynchronously? That is, after the load instruction the warp can continue with instructions that are not dependent on the loaded data, and it only stalls once it needs to use that data. The text

    Launch asynchronous memory transfers that may overlap with computation via …

     would imply to a new user that this is not the case under normal circumstances. I was also under the impression that one of the main advantages of using this separate copy engine in CUDA is that it avoids the intermediate copy to registers. I may be wrong, but this is one of the things that tripped me up when pipelining in CUDA: I couldn’t work out why a pipeline with asynchronous (cp.async) copies on their own didn’t magically increase performance. The reason: the existing pipeline with standard copies was already asynchronous.

1 Like

For reference, click through to the solution: Manual tiling of Puzzle 14

Great points!

  1. Yes, you’re right! That reset is redundant because the subsequent bounds checks already cover it, so it needs to be cleaned up (see the sketch after this list).
  2. Yes, the explanation should say avoiding bank conflicts rather than coalescing; it will be corrected.
  3. Correct! Global memory loads are async; I will adjust the explanation. I believe cp.async is the same as copy_dram_to_sram_async, which we use in the idiomatic implementation right after the manual one. Note that cp.async has more benefits than just being asynchronous: it bypasses registers, i.e. goes directly from global memory to shared memory, and has better resource utilization. Why don’t you see a speedup? I think because there isn’t enough computation overlap to hide the memory latency during the copy operations.
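
For reference, a minimal sketch of what the cleaned-up inner loop could look like. This is a sketch only, not the exact solution code: names such as a, b, tile, size, global_row, global_col and acc come from the surrounding puzzle kernel, and the final store of acc is assumed to be guarded by the usual global_row < size and global_col < size check.

# Sketch: the loads and the k-loop share the same bounds checks, and out-of-range
# threads never store their acc, so the zero-reset pass and its extra barrier()
# add nothing.
if global_row < size and (tile * TPB + local_col) < size:
    a_shared[local_row, local_col] = a[global_row, tile * TPB + local_col]
if (tile * TPB + local_row) < size and global_col < size:
    b_shared[local_row, local_col] = b[tile * TPB + local_row, global_col]

barrier()  # one barrier so the whole tile is visible before computing

for k in range(TPB):
    if (tile * TPB + k) < size:  # same guard as the loads above
        acc += a_shared[local_row, k] * b_shared[k, local_col]

barrier()  # keep the next tile iteration from overwriting data that is still being read
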
2 Likes

Added as an issue in the repo to fix.

1 Like

@Ehsan In the idiomatic tiled solution, a transposed access pattern is used for loading the elements of matrix b, but there is no comment to explain why. Whilst this is a great demonstration of the flexibility of the layout approach, I think there should be a note explaining that this access pattern is not optimal (coalesced) for this puzzle but will be very useful in other circumstances, e.g. if matrix b is not already transposed and you want to load it so that the shared memory access will be “coalesced” (not have bank conflicts), which I assume from the documentation would look like

alias load_b_layout = Layout.row_major(TPB, 1)
alias store_b_layout = Layout.row_major(1, TPB)
copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile)

What do you think?

2 Likes

Hi all, I was also puzzled by this. I don’t understand why we needed to use:

alias load_b_layout = Layout.row_major(TPB, 1)
copy_dram_to_sram_async[thread_layout=load_b_layout](b_shared, b_tile)

For b_shared. My understanding is that the “thread_layout” is only used for copying the data, but has nothing to do with the final layout of b_shared. In the end, b_shared is still not transposed, since we are accessing it as:

acc += a_shared[local_row, k] * b_shared[k, local_col]

In this case, it seems to me that using load_b_layout = load_a_layout = Layout.row_major(1, TPB) would be more efficient. Am I missing something? @Ehsan @cudawarped

1 Like

@cudawarped you’re correct about the non-coalesced load. I just realized that, so thank you!

alias load_b_layout = Layout.row_major(TPB, 1)
alias store_b_layout = Layout.row_major(1, TPB)
copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile)

However, this transposes the data during the copy, so b_shared[local_col, k] actually represents B[local_col, k], since the puzzle is doing AB^T.

Instead, as @juaneco suggested, both matrices can use the same simple coalesced loading:

alias load_b_layout = Layout.row_major(1, TPB)

This works because the thread layout determines how threads cooperate during the copy, not the final data layout. With (1, TPB), threads load consecutive elements from the same row.

Your suggested pattern would be useful when B is already transposed in global memory and you want to un-transpose it during loading for correct matrix multiplication.
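
For completeness, roughly what that simpler loading path looks like for both tiles. This is a sketch: a_tile and the exact async-wait call may differ slightly from the actual idiomatic solution.

alias load_a_layout = Layout.row_major(1, TPB)
alias load_b_layout = Layout.row_major(1, TPB)  # same coalesced pattern as for A

# Each group of TPB threads copies TPB consecutive elements of a row: coalesced
# global reads, no data transformation, and both tiles keep their declared
# row_major(TPB, TPB) shared-memory layout.
copy_dram_to_sram_async[thread_layout=load_a_layout](a_shared, a_tile)
copy_dram_to_sram_async[thread_layout=load_b_layout](b_shared, b_tile)

async_copy_wait_all()  # wait for the async copies issued above
barrier()              # then make both tiles visible to every thread in the block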

For b_shared. My understanding is that the “thread_layout” is only used for copying the data, but has nothing to do with the final layout of b_shared.

@juaneco That is correct. In the solution it was unclear to me what the author was trying to show by purposefully using uncoalesced (“transposed”) loads. My suggestion was based on a guess that the author was trying to demonstrate the flexibility of copy_dram_to_sram_async in some way with a transposition (or there was an error).

Given that the puzzle assumes that matrix b is already transposed (acc += a_shared[local_row, k] * b_shared[k, local_col], not acc += a_shared[local_row, k] * b_shared[local_col, k]), I was suggesting that the author might be trying to show how to perform the transposition while loading from global memory, which would be useful if matrix b were not already transposed.


@Ehsan I still think the solution is a bit confusing for a new user. Given the importance of understanding this fundamental operation I just wanted to clarify a few things.

Do you mean not already transposed?

From the solution

Key insight: Thread layout determines cooperation during copy, not final data layout

Does this mean that it is not advisable to use the src_thread_layout and dst_thread_layout flags to transpose during loading?

Result: Coalesced access for both A and B with Layout.row_major(1, TPB)

The layout used for loading from global memory is independent of the way the individual threads load from shared memory. Avoiding bank conflicts is determined by the layout of shared memory (row or column major) and the access pattern in the loop.

To be clear, we’re doing A x B and not A x B^T. My point is that the thread layout isn’t the same as the data layout, so loading with Layout.row_major(1, TPB) is optimal for coalesced access. Let me explain in detail:

First, regarding the transpose, there are three cases normally implemented in libraries (see the sketch after this list):

  • Matmul in puzzle 14: B is stored normally in global memory and we load it normally to do A x B.
  • Transposed loading: B is stored normally in global memory but we want to load it as B^T to do A x B^T.
  • Un-transposed: B is already stored transposed (B^T) in global memory but we want to load it as normal B for A x B.
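
As a rough sketch, these map onto the copy call like this (you would use one form or the other, not both; the transposing form is just the src/dst thread-layout pair shown earlier in the thread):

# Case 1 (puzzle 14): B is stored normally and loaded normally -> standard A x B.
copy_dram_to_sram_async[thread_layout=Layout.row_major(1, TPB)](b_shared, b_tile)

# Cases 2 and 3: transpose (or un-transpose) B while copying, by giving the source
# and destination different thread layouts.
alias load_b_layout = Layout.row_major(TPB, 1)
alias store_b_layout = Layout.row_major(1, TPB)
copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile)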

Hope this part is clear now.

Does this mean that it is not advisable to use the src_thread_layout and dst_thread_layout flags to transpose during loading?

No, it’s a perfectly fine option. Sorry for the confusion! What I meant earlier is that thread cooperation with Layout.row_major(1, TPB) means threads work together to do a coalesced load. If we do the data transformation

copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile)

we’d end up with transposed data stored in shared memory, which still has its declared row_major[TPB, TPB] layout. The thread layouts (1, TPB) and (TPB, 1) are just about how threads cooperate during the copy.

The layout used for loading from global memory is independent of the way the individual threads load from shared memory. Avoiding bank conflicts is determined by the layout of shared memory (row or column major) and the access pattern in the loop.

Exactly! There are these separate concerns:

  1. Thread cooperation layout: Layout.row_major(1, TPB) determines how threads work together during the copy (always aim for coalesced access).

  2. Data transformation: src_thread_layout vs dst_thread_layout, which can reshape/transpose data during the copy.

  3. Shared memory layout: row_major[TPB, TPB] which is the actual tensor layout in shared memory (declared when creating the tensor)

  4. Bank conflicts: Determined by the shared memory access pattern in the computation loop (a_shared[local_row, k] * b_shared[k, local_col])

In our case, both matrices use Layout.row_major(1, TPB) for optimal coalesced loading, and the computation pattern a_shared[local_row, k] * b_shared[k, local_col] avoids bank conflicts while performing standard A x B multiplication.

The confusion in the original documentation came from mixing up these different concepts and not being clear upfront that this is standard A x B multiplication. Hope this part is clear now. I’ll enhance the explanation of Puzzle 14 given these discussions.

1 Like

@Ehsan Thank you for taking the time to clarify that. It seems that I managed to confuse myself there for a moment :grinning_face:

1 Like