- Educational Resources
- GPU HW Overview
- Key Performance Drivers and Bottlenecks
- Techniques for Accelerating Code
- CUDA Profiling Tools Overview
- Optimization Examples
The CUDA Documentation contains best practices and a programming guide that establishes many of the patterns that result in optimal CUDA performance. It is a critical reference and daily resource for developers of any level.
NVIDIA On Demand is a great resource for video-format learning. It contains recordings of all past GTC (GPU Technology Conference) talks, covering topics from beginner to the most advanced.
NVIDIA Blogs is the written-format equivalent of the GTC talks. Most blogs focus on a single technical topic and provide a detailed tutorial, write-up, or demonstration of the given technology. Similarly to the GTC talks, topics range from introductory to deeply technical, from CUDA itself to specific applications in a scientific field.
An architecture whitepaper is released with every new GPU architecture and is the "ground truth" for HW changes and their associated capability changes.
Comparison of HW diagrams from the whitepapers:

| GA100 | GA102 |
|---|---|
| ![]() | ![]() |
Ampere Compute Capability Core Description
- CUDA Cores
- Tensor Cores
- Special Function Unit
- Raytracing Cores
HW Memory Types and Hierarchy
- Global(Device) Memory
- L1/L2 Cache
- Shared Memory
- Registers
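As a concrete illustration of this hierarchy, the minimal sketch below (kernel name, block size, and data layout are illustrative assumptions, not from any particular codebase) stages data from global memory into shared memory, accumulates in a register, and writes the result back to global memory; the L1/L2 caches service the global accesses implicitly.

```cpp
// Illustrative kernel: each level of the memory hierarchy appears once.
// Global memory: `in` and `out`; shared memory: `tile`; registers: `sum`.
// Assumes the kernel is launched with blockDim.x == 256.
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float tile[256];                        // shared memory: per-block, lives on the SM

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;    // global -> shared (cached in L1/L2 on the way)
    __syncthreads();

    if (threadIdx.x == 0) {
        float sum = 0.0f;                              // register: private to this thread
        for (int i = 0; i < blockDim.x; ++i)
            sum += tile[i];                            // shared -> register
        out[blockIdx.x] = sum;                         // register -> global
    }
}
```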
Specialty Memory Types
CUDA Documentation on Memory Accesses
- Constant Memory
- Textures
- Local Memory
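As a hedged sketch of one of these specialty types (the coefficient count and symbol names are illustrative), constant memory is a good fit when every thread in a warp reads the same value on a given cycle, such as filter coefficients:

```cpp
// Filter coefficients placed in constant memory: cached and broadcast
// efficiently when all threads in a warp read the same element.
__constant__ float d_coeffs[25];

__global__ void applyFilter(const float* in, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < 25; ++k)
        acc += d_coeffs[k] * in[min(idx + k, n - 1)];  // every thread reads the same d_coeffs[k]
    out[idx] = acc;
}

// Host side: coefficients are uploaded once before launch, e.g.
// cudaMemcpyToSymbol(d_coeffs, h_coeffs, 25 * sizeof(float));
```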
Memory Alignment and Optimization (see the sketch after this list)
- Copy Engines
- Codecs
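For the memory alignment point above, one common technique (a hedged, illustrative sketch, not tied to the linked documentation's exact example) is to use naturally aligned vector types such as float4 so each thread issues a single 16-byte, fully coalesced transaction:

```cpp
// Copy using float4 loads/stores: 16-byte aligned, vectorized accesses.
// Assumes `in` and `out` are 16-byte aligned and the element count is a multiple of 4.
__global__ void copyVec4(const float4* __restrict__ in,
                         float4* __restrict__ out, int n4)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n4)
        out[idx] = in[idx];   // one 16-byte transaction per thread, coalesced across the warp
}
```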
HOW GPU COMPUTING WORKS Stephen Jones, GTC 2021: 64-66
HOW CUDA PROGRAMMING WORKS Stephen Jones, GTC 2022: 49-84
Compute Capabilities Doc: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
NVIDIA Concept of "How fast can something possibly be done". Said another way, "what is the absolute limit of performance for a given problem on a given device".
SoL is very useful as a metric for understanding your current solution. To calculate SoL, you need to understand the type of compute you have, the type and size of data it requires, and the relationship of the data and compute relative to each other. This, combined with the possible FLOPS/bandwidth of a system, establishes a possible SoL for a given problem. That SoL can then be used as a yardstick for your own solutions.
Often we use CUB or other highly optimized CUDA libraries as a default "SoL"; however, in many cases those may still be some percent off the true SoL.
Fundamentally defines: is my performance limited by the speed of my memory, or the speed of my processor?
- Compute Bound: when the compute HW cannot produce results for a given block of data as fast as the memory system can deliver new input.
- Memory Bound: when the memory bandwidth cannot deliver new input within the time it takes the compute units to finish the computation on a given segment of memory.
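A compact way to state this distinction is the standard roofline-style rule of thumb (not from the original material): compare the problem's arithmetic intensity to the machine's FLOP-to-byte ratio.

```latex
\mathrm{AI} = \frac{\text{FLOPs performed}}{\text{bytes moved}},
\qquad
B_{\text{machine}} = \frac{\text{peak FLOP/s}}{\text{peak bytes/s}}
```

If AI > B_machine the problem is compute bound; if AI < B_machine it is memory bound. The worked examples below apply exactly this reasoning.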
4096×4096 image with a 10×10 kernel, on an H100 (51,200 GFLOP/s FP32, 2.04 TB/s memory bandwidth):
- Each pixel requires 100 multiplies + 99 adds = 199 ops per pixel (≈200)
- 16.77M pixels in the image → ≈3.35 GFLOP per image
- At 51,200 GFLOP/s, the GPU can process ≈15,250 images per second
- Each image is 16.77M FP32 values (4 bytes each) ≈ 67 MB
- ≈15,250 images/s × 67 MB ≈ 1.02 TB/s of bandwidth needed (< the 2.04 TB/s the HW can provide)
- Compute Limited!
4096×4096 image with a 5×5 kernel, on an H100 (51,200 GFLOP/s FP32, 2.04 TB/s memory bandwidth):
- Each pixel requires 25 multiplies + 24 adds = 49 ops per pixel
- 16.77M pixels in the image → ≈0.822 GFLOP per image
- At 51,200 GFLOP/s, the GPU can process ≈62,287 images per second
- Each image is 16.77M FP32 values (4 bytes each) ≈ 67 MB
- ≈62,287 images/s × 67 MB ≈ 4.18 TB/s of bandwidth needed (> the 2.04 TB/s the HW can provide)
- Bandwidth Limited!
These examples are very "macroscopic", and real-world performance is further limited by global memory latencies, the efficiency of the work breakdown, and other factors of GPU execution. They are still a useful baseline for understanding the performance limiters we can expect in a real kernel, since we will see scaled-down versions of these bottlenecks in practice.
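The same back-of-envelope arithmetic is easy to script so it can be re-run for other kernel sizes or devices. The sketch below simply reproduces the numbers above using the quoted H100 peak figures; it is an estimate, not a measurement.

```cpp
// Back-of-envelope compute-vs-bandwidth estimate for a KxK convolution
// over a 4096x4096 FP32 image, using the peak numbers quoted above.
#include <cstdio>

int main()
{
    const double peakGflops = 51200.0;   // FP32 GFLOP/s (H100 figure used above)
    const double peakTBs    = 2.04;      // memory bandwidth in TB/s

    const int kernels[] = {10, 5};
    for (int K : kernels) {
        double pixels      = 4096.0 * 4096.0;
        double opsPerPixel = K * K + (K * K - 1);              // multiplies + adds
        double gflopPerImg = pixels * opsPerPixel / 1e9;
        double imgsPerSec  = peakGflops / gflopPerImg;
        double tbPerSec    = imgsPerSec * pixels * 4.0 / 1e12; // 4 bytes per FP32 pixel

        printf("%dx%d kernel: %.0f images/s, needs %.2f TB/s -> %s limited\n",
               K, K, imgsPerSec, tbPerSec,
               tbPerSec > peakTBs ? "bandwidth" : "compute");
    }
    return 0;
}
```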
HOW GPU COMPUTING WORKS Stephen Jones, GTC 2021: 41-55
HOW TO WRITE A CUDA PROGRAM: THE NINJA EDITION Stephen Jones, NVIDIA | GTC 2024: 12-28
- Using the correct type of memory
- Creating Plans and Data allocations ahead of time
- Conscientious Copies
- Pinned vs. unpinned copies
- Impact of Types of CUDA Memory
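For the pinned vs. unpinned point above, here is a minimal sketch (buffer names and sizes are illustrative): pinned (page-locked) host memory lets the copy engine DMA directly and allows the copy to overlap with other work in a stream, while pageable memory forces an extra staging copy.

```cpp
#include <cuda_runtime.h>

// Pinned host memory enables truly asynchronous copies that the copy engine
// can overlap with kernels running in the same or other streams.
void copyExample(size_t n)
{
    float *h_pinned = nullptr, *d_buf = nullptr;
    cudaMallocHost((void**)&h_pinned, n * sizeof(float));  // page-locked (pinned) allocation
    cudaMalloc((void**)&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous H2D copy: returns immediately and overlaps with work in the stream.
    cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... launch kernels in `stream` that consume d_buf ...

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
}
```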
Most CUDA libraries have a version you can call directly from the host (e.g. cuFFT) and a version you can call from the device (e.g. cuFFTDx).
In general, the host API is intended as the easier-to-use, good-to-SoL performance entry point for a developer. For "good" sizes and sufficiently large work, the host API can provide the best performance.
In many scenarios, however, you may have problem sizes that are too small to saturate a GPU, have awkward and non-performant sizes, or simply hit a non-optimized corner case of the host API. In these situations, the device API enables developers to create a more precise launch configuration that appropriately matches the problem. It also allows developers to fuse other operations with the core mathematical operation, providing additional speedup.
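As a hedged sketch of the host-API pattern (function names, sizes, and iteration structure are illustrative), note that the plan is created once, outside the critical loop, and only the execution call is repeated, which also matches the checklist later in this section:

```cpp
#include <cufft.h>

// Host-API usage of cuFFT: plan once, execute many times.
void runFFTs(cufftComplex* d_data, int nfft, int batches, int iterations)
{
    cufftHandle plan;
    cufftPlan1d(&plan, nfft, CUFFT_C2C, batches);   // setup cost paid once, outside the hot loop

    for (int i = 0; i < iterations; ++i) {
        // In-place forward transform; only the cheap execution call sits in the loop.
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        // ... other kernels operating on d_data ...
    }

    cufftDestroy(plan);
}
```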
Kernel stalls are the distinct, instruction-level impact of memory, compute, or other resource contention that slows down a kernel. By evaluating (and eliminating) kernel stalls, we can further accelerate a well-designed kernel.
Disclaimer: this level of analysis should only be conducted once you have refined a good kernel-level solution. If you have poorly thought-out or unoptimized data access or algorithmic patterns, optimizing at the instruction level will only yield mediocre speedups on a poor design; it cannot fundamentally correct those higher-level issues. A sketch of one common mitigation appears after the stall list below.
Table of Stalls from NCU Documentation
- Short/Long Scoreboard Stall
- Stall MIO Throttle
- Stall Math Throttle
- Stall Wait / not selected
- Atomics
- Trig
- Synchronization
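As the forward reference above promised, here is one hedged, illustrative example of instruction-level reasoning (not tied to any specific profile): dependent global loads produce long-scoreboard stalls, and issuing several independent loads before consuming any of them gives the warp scheduler more in-flight work with which to hide that latency.

```cpp
// Each thread sums 4 elements. Issuing the four independent loads back to back
// lets them all be in flight before the adds consume them, reducing the time
// spent stalled on the long scoreboard.
__global__ void sum4(const float* __restrict__ in, float* out, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (base + 3 >= n) return;

    float a = in[base + 0];   // independent loads: all issued before any is used
    float b = in[base + 1];
    float c = in[base + 2];
    float d = in[base + 3];

    out[base / 4] = (a + b) + (c + d);   // consumers run after all loads are in flight
}
```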
Two pieces of functionality to be familiar with:
- Command Line collection (nsys)
- In-App Review (Full Windows App)
(In Person Demonstration)
Two pieces of functionality to be familiar with:
- Command Line collection (ncu)
- In-App Review (Full Windows App)
(In Person Demonstration)
- Start with the Host API
- Move plan and setup code out of the critical loop as much as possible
- Evaluate performance relative to Speed of Light
- Determine if additional operations can be fused
Below is a set of common, useful command-line options. They can be combined and all enabled in a single report.
| Command | Notes |
|---|---|
| `nsys profile -o outputName ./myExec` | Basic command. Will not automatically overwrite an existing report; requires the `-f` flag to do so. |
| `nsys profile --gpu-metrics-device=0 ./myExec` | Adds the GPU Metrics section to the report. Requires elevated permissions. |
| `nsys profile --cuda-graph-trace=node ./myExec` | Shows kernel information internal to CUDA graph nodes. |
| `nsys profile --cuda-memory-usage=true ./myExec` | Adds the GPU memory usage section to the report. |
Below is a set of common, useful command-line options. They can be combined and all enabled in a single report. I suggest collecting the "full" set unless you know you want a specific subset of the report.
| Command | Notes |
|---|---|
| `ncu --set=full -o outputName ./myExec` | Basic command. Will not automatically overwrite an existing report; requires the `-f` flag to do so. |
| `ncu --set=full --import-source=true -o outputName ./myExec` | Adds source collection. Requires `-lineinfo` during compilation. |
| `ncu --set=full -k kernelName -o outputName ./myExec` | Only collects a profile for the specified kernel in the execution. |