I have a number of questions related to the solution to this puzzle. It may seem like I am being pedantic in some of them, but I'm honestly not; I just want to make sure my understanding is correct.
-
In the solution to the tiled example, is the initialization/resetting of the shared memory, and the resulting extra `barrier()` synchronization, redundant given that we defensively check the bounds for each operation? i.e. can the following be removed?

```
# Reset shared memory tiles
if local_row < TPB and local_col < TPB:
    a_shared[local_row, local_col] = 0
    b_shared[local_row, local_col] = 0

barrier()
```
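In CUDA terms (which I'm more familiar with), the alternative I have in mind looks roughly like the sketch below. To be clear, this is my own sketch, not the puzzle's code: the kernel name, the `TPB` value, and the square-matrix assumption are all mine. The point is that a load which zero-fills out-of-bounds elements never leaves stale data in the tile, so a separate reset pass and its barrier seem unnecessary.

```
#define TPB 16

// My own sketch, not the puzzle's solution: bounds-guarded loads that
// write 0 for out-of-range elements, so no separate reset is needed.
__global__ void tiled_matmul(const float* a, const float* b,
                             float* out, int size) {
    __shared__ float a_shared[TPB][TPB];
    __shared__ float b_shared[TPB][TPB];

    int local_row = threadIdx.y;
    int local_col = threadIdx.x;
    int row = blockIdx.y * TPB + local_row;
    int col = blockIdx.x * TPB + local_col;

    float acc = 0.0f;
    for (int tile = 0; tile < (size + TPB - 1) / TPB; ++tile) {
        int a_col = tile * TPB + local_col;
        int b_row = tile * TPB + local_row;
        // The guard itself zeroes out-of-bounds elements, so the tile
        // never holds stale data and no reset pass + barrier is required.
        a_shared[local_row][local_col] =
            (row < size && a_col < size) ? a[row * size + a_col] : 0.0f;
        b_shared[local_row][local_col] =
            (b_row < size && col < size) ? b[b_row * size + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TPB; ++k)
            acc += a_shared[local_row][k] * b_shared[k][local_col];
        __syncthreads();
    }
    if (row < size && col < size)
        out[row * size + col] = acc;
}
```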
-
The "5. Computation within tile" section of the explanation talks about memory coalescing, but it is unclear how this relates to shared memory reads, where it is bank conflicts (not coalescing) that increase the number of memory transactions. Am I missing something here?
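To illustrate the distinction I'm drawing, here is a rough CUDA sketch of the classic tiled transpose (entirely my own example, not from the explanation). Coalescing is a property of the two global accesses; bank conflicts are a property of the shared-memory column read, and padding the tile is what removes them:

```
#define TILE 32

// My own sketch, assuming a square matrix and 32x32 thread blocks.
__global__ void transpose(const float* in, float* out) {
    // TILE + 1 padding shifts each row by one bank, so the column read
    // below touches 32 different banks instead of one bank 32 times.
    __shared__ float tile[TILE][TILE + 1];

    int width = gridDim.x * TILE;  // square matrix assumed
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Coalesced global read: consecutive threads in a warp read
    // consecutive addresses.
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;  // swap block indices for output
    y = blockIdx.x * TILE + threadIdx.y;

    // Shared-memory column read: without the +1 padding this would be a
    // 32-way bank conflict, regardless of how the global accesses coalesce.
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```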
-
Technically, don't standard memory loads from global memory already operate asynchronously? That is, the warp can continue with instructions that do not depend on the loaded memory after the load instruction issues; only once the warp needs to use that loaded value is it stalled until the data is available. The text

> Launch asynchronous memory transfers that may overlap with computation via …

would imply to a new user that this is not the case under normal circumstances. I was also under the impression that one of the main advantages of using this separate copy engine in CUDA is that it avoids the intermediate copy to registers. I may be wrong, but this is one of the things that tripped me up when pipelining in CUDA: I couldn't work out why a pipeline with asynchronous (cp.async) copies on its own didn't magically increase performance. The reason: the existing pipeline with standard copies was already asynchronous.
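For concreteness, here is a rough CUDA sketch of the two cases as I understand them. Both kernels and their names are my own, assuming a block size of 256 and a CUDA 11+ toolchain for `<cuda_pipeline.h>`:

```
#include <cuda_pipeline.h>

// Standard load (my sketch): the warp issues the load and keeps executing;
// it only stalls at the first instruction that actually depends on `v`.
// Data path: global -> register -> shared. Assumes blockDim.x == 256.
__global__ void standard_copy(const float* in, float* out, int n) {
    __shared__ float staging[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float v = (i < n) ? in[i] : 0.0f;
    // ... independent instructions can execute here before `v` is used ...
    staging[threadIdx.x] = v;  // the stall happens here at the latest
    __syncthreads();

    if (i < n) out[i] = staging[threadIdx.x];
}

// cp.async via the pipeline primitives (my sketch): the copy goes from
// global to shared directly, skipping the intermediate register, and
// completion is tracked explicitly. Assumes blockDim.x == 256.
__global__ void async_copy(const float* in, float* out, int n) {
    __shared__ float staging[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        __pipeline_memcpy_async(&staging[threadIdx.x], &in[i], sizeof(float));
    __pipeline_commit();
    // ... independent instructions can execute here as well ...
    __pipeline_wait_prior(0);  // wait for the committed copy to land
    __syncthreads();

    if (i < n) out[i] = staging[threadIdx.x];
}
```

In both kernels the warp can make progress between issuing the copy and consuming it; the difference, as far as I can tell, is only the data path and how completion is tracked.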