Synopsis:
- Quick recap *
- Data transfer optimizations (pinned memory, zero-copy, CUDA managed memory) *
- Concurrent execution (streams, events, levels of synchronization across warps/blocks) *
- Kernel optimizations (warps, impact of branches, global/constant/shared memory in detail, especially bank conflicts)
- Overall GPU efficiency (occupancy, roofline model)
- Hardware-specific behaviours (Kepler, Pascal, Volta): key differences for the programmer
- Compilation of CUDA in detail (execution model)
- Multi-GPU (device management, CUDA context, peer-to-peer in CUDA, NVLink, CUDA + MPI (GDRCopy), Multi-Process Service (MPS))
- Advanced profiling (nvidia-smi, nvprof, nvvp)
* = overlaps with the "CUDA basics" training