Block dimension calculation can lead to non-optimal occupancy #266

Open · 5 tasks
denisalevi opened this issue Feb 13, 2022 · 1 comment

@denisalevi (Member)
The stateupdater for the neurons in our COBAHHUncoupled benchmark (in single-precision) seems to have been simulated with only 640 threads per block even though it could use 1024 threads per block... This leads to a theoretical occupancy of only 62.5%.

We choose the number of threads based on cudaOccupancyMaxPotentialBlockSize, which happens here. The documentation says:

Returns grid and block size that achieves maximum potential occupancy for a device function.

But in this blog post it sounds like the returned number of threads is based on heuristics and is not necessarily best suited for performance-critical kernels:

cudaOccupancyMaxPotentialBlockSize makes it possible to compute a reasonably efficient execution configuration for a kernel without having to directly query the kernel’s attributes or the device properties, regardless of what device is present or any compilation details. This can greatly simplify the task of frameworks (such as Thrust), that must launch user-defined kernels. This is also handy for kernels that are not primary performance bottlenecks, where the programmer just wants a simple way to run the kernel with correct results, rather than hand-tuning the execution configuration.
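For reference, this is roughly how such a call looks (a minimal sketch, not the actual Brian2CUDA code; `stateupdater_kernel` is a stand-in for the generated kernel):

```cpp
#include <cstdio>

// Stand-in for the generated stateupdater kernel.
__global__ void stateupdater_kernel(float *state, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        state[i] *= 0.5f;  // placeholder for the actual state update
}

int main()
{
    int min_grid_size = 0;  // minimum grid size needed to saturate the device
    int block_size = 0;     // suggested threads per block
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                       stateupdater_kernel,
                                       0 /* dynamic smem */,
                                       0 /* no block size limit */);
    printf("suggested block size: %d (min grid size: %d)\n",
           block_size, min_grid_size);

    // Launch with the suggested block size.
    int n = 1 << 20;
    int grid_size = (n + block_size - 1) / block_size;
    float *state;
    cudaMalloc((void **)&state, n * sizeof(float));
    stateupdater_kernel<<<grid_size, block_size>>>(state, n);
    cudaDeviceSynchronize();
    cudaFree(state);
    return 0;
}
```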

Until now, I hadn't seen cudaOccupancyMaxPotentialBlockSize return a configuration that didn't give optimal occupancy. And I checked the logs for all other benchmarks, which seem to run the maximal number of threads per block where possible. The only occasions where the number of threads is lower are when the kernel hits hardware limits (e.g. for COBAHH and Mushroom body in double precision). But that is not the case in single precision, where fewer than 64 registers per thread are required. From my MX150 (same on the A100):

```
|| INFO _run_kernel_neurongroup_stateupdater_codeobject
|| 	7 blocks
|| 	1024 threads
|| 	44 registers per thread
|| 	0 bytes statically-allocated shared memory per block
|| 	0 bytes local memory per thread
|| 	1512 bytes user-allocated constant memory
|| 	0.625 theoretical occupancy
```

This needs some digging into. I think we could just get rid of cudaOccupancyMaxPotentialBlockSize altogether? The only reason to keep it would be if, for very small networks, it were more efficient to launch more small blocks instead of a few larger ones. I'm not sure that is ever the case? Only if scheduler overheads favor many small blocks over a few large ones?

For now I will add a preference to manually override the thread number given by cudaOccupancyMaxPotentialBlockSize and rerun the COBAHHUncoupled benchmark.
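As a sketch of what that could look like: a manually chosen block size can be checked against cudaOccupancyMaxActiveBlocksPerMultiprocessor, which returns how many blocks of that size fit on one SM (again with a stand-in kernel; the preference name and its wiring into the code generation are left out):

```cpp
#include <cstdio>

// Stand-in for the generated stateupdater kernel.
__global__ void my_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Block size set manually (e.g. via a user preference) instead of
    // taking the value suggested by cudaOccupancyMaxPotentialBlockSize.
    int block_size = 1024;
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm,
                                                  my_kernel, block_size,
                                                  0 /* dynamic smem */);
    float occupancy = (float)(max_blocks_per_sm * block_size)
                      / prop.maxThreadsPerMultiProcessor;
    printf("%d threads/block -> %d block(s)/SM -> theoretical occupancy %.3f\n",
           block_size, max_blocks_per_sm, occupancy);
    return 0;
}
```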

When fixing this, also do:

@denisalevi denisalevi added the bug label Feb 13, 2022
denisalevi added a commit that referenced this issue Feb 13, 2022
It appears that the number of threads we pick is not always giving
optimal occupancy, see #266
@denisalevi (Member Author)

Alright, I think I got it. There are three hardware limits for the execution of blocks on SMs:

  1. A maximal number of threads per SM; for the A100 (and, I think, most other GPUs as well) this is 2048. This alone would allow two active blocks of 1024 threads each.
  2. A maximal number of registers per SM; for the A100 (and most other GPUs as well) this is 65536.
  3. And finally, a maximal number of blocks that can be active at once. I haven't found a CUDA API call yet that would return this number, but it is stored in the deprecated occupancy calculator Excel sheet, second page. For the A100, this is 32 blocks.

That means in order to run 2048 threads in two blocks concurrently on a single SM, the kernel can use at most 32 registers per thread (32 registers * 2048 threads = 65536 registers). I was thinking of this limit on a per-block basis, assuming that up to 64 registers would be fine. But it is per SM! In the example above, the stateupdater needs 44 registers. cudaOccupancyMaxPotentialBlockSize chose a number of threads per block that allows two blocks to execute concurrently (44 registers * 640 threads * 2 blocks = 56320 < 65536 registers). If one chose 32 threads (one warp) more per block, the hardware limit on the number of registers per SM would be reached.
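A back-of-the-envelope version of this arithmetic (a sketch that ignores register allocation granularity and the max-blocks-per-SM limit, neither of which is binding here; the A100 numbers are taken from the list above and can also be queried via cudaGetDeviceProperties):

```cpp
#include <algorithm>
#include <cstdio>

int main()
{
    const int regs_per_sm        = 65536;  // limit 2 above
    const int max_threads_per_sm = 2048;   // limit 1 above
    const int regs_per_thread    = 44;     // from the stateupdater log above

    const int candidates[] = {640, 1024};
    for (int block_size : candidates) {
        // How many blocks fit within the register budget vs. the thread budget.
        int blocks_by_regs    = regs_per_sm / (regs_per_thread * block_size);
        int blocks_by_threads = max_threads_per_sm / block_size;
        int active_blocks     = std::min(blocks_by_regs, blocks_by_threads);
        printf("%4d threads/block -> %d active block(s) -> occupancy %.3f\n",
               block_size, active_blocks,
               (float)(active_blocks * block_size) / max_threads_per_sm);
        // 640:  min(65536/28160 = 2, 2048/640 = 3)   = 2 blocks -> 0.625
        // 1024: min(65536/45056 = 1, 2048/1024 = 2)  = 1 block  -> 0.500
    }
    return 0;
}
```

This reproduces the 0.625 theoretical occupancy from the log: two 640-thread blocks (1280 of 2048 threads) fit the register budget, whereas at 1024 threads per block only a single block fits.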

I'll leave this issue open for the other points I marked above. Also, we should investigate how the register usage depends on the neuron model definition. The mushroom body benchmark, which has a very similar model to the COBAHH benchmark, requires only 32 registers on the A100 GPU, allowing it to reach 100% occupancy.
