Low GPU utilization on 1080ti, 2080ti and TitanX #80

Open

DiamonDinoia opened this issue Jun 13, 2020 · 4 comments

Comments
@DiamonDinoia

DiamonDinoia commented Jun 13, 2020

Hello,
I have been using the library for one of my research projects and noticed that the GPU is not fully utilized. Reading the code, I found some hardcoded values. For example:

dim3 block_dim(64, 1, 8);

inline dim3 getOptimalGridDim(long N, long thread_count)

#define THREAD_BLOCK_SIZE 256

Do you know if it is possible to change these values to increase the parallelism?
Or is there another way to do so? I'm happy to spend some time making these values parametric based on the architecture.

Another possible strategy, if these values cannot be changed, would be "CUDA Dynamic Parallelism" (https://devblogs.nvidia.com/cuda-dynamic-parallelism-api-principles/).
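
For illustration, a minimal sketch of that idea; the kernels here are hypothetical placeholders, not gpuNUFFT code, and device-side launches require compute capability >= 3.5 plus compilation with -rdc=true:

```cuda
// Hypothetical sketch of CUDA dynamic parallelism (not gpuNUFFT code):
// a parent kernel sizes and launches a child grid from a value that is
// only known on the device.
__global__ void childKernel(float *out, long N)
{
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = 1.0f;  // per-element work would go here
}

__global__ void parentKernel(float *out, long N)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int threads = 256;                                  // illustrative value
        unsigned blocks = (unsigned)((N + threads - 1) / threads);
        childKernel<<<blocks, threads>>>(out, N);
        // No explicit device-side sync needed here: a parent grid is not
        // considered complete until all of its child grids have completed.
    }
}
```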

Thanks.
Marco

@andyschwarzl
Owner

Hi,

thanks for pointing that out. The code was written to support most of today's "old" GPUs, going all the way back to compute capability 1.3 :P

Feel free to modify the code; I'd appreciate any pull request that makes GPU utilization more dynamic.

Thanks!

Best regards,

Andreas

@DiamonDinoia
Author

Do you know of any problems with that, and how did you derive those numbers?
That would be a good starting point: if they depend on the input, it might be possible to use dynamic parallelism; if they depend on the hardware, I can use the runtime to determine them.
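
If they turn out to be hardware-dependent, I imagine something along these lines; this is a minimal sketch with a placeholder kernel, not the library's actual code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for one of the library's kernels.
__global__ void placeholderKernel(float *out, long N)
{
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = 0.0f;
}

int main()
{
    // Query the device instead of relying on hardcoded assumptions.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d, max threads/block: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerBlock);

    // Let the occupancy API suggest a block size for this device,
    // replacing a fixed THREAD_BLOCK_SIZE.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       placeholderKernel, 0, 0);

    long N = 1 << 20;
    long gridSize = (N + blockSize - 1) / blockSize;
    printf("suggested block size: %d, grid size: %ld\n", blockSize, gridSize);
    return 0;
}
```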

@andyschwarzl
Owner

The parameters above are basically hardware-dependent; I used the occupancy tool to derive these values. Another parameter is sectorWidth, which essentially defines the amount of shared memory used per thread block. So I guess a good starting point would be to increase the thread count and sectorWidth and see whether performance/utilization improves.
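
For example, the occupancy API can show how the shared memory a block requests (which is what sectorWidth effectively controls) limits the number of resident blocks per SM. A minimal sketch with a toy kernel, assuming nothing about gpuNUFFT's internals:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel using dynamically sized shared memory, standing in for a
// kernel whose shared-memory footprint is set by sectorWidth.
__global__ void sharedMemKernel(float *out)
{
    extern __shared__ float tile[];
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    const int blockSize = 256;
    // Sweep the dynamic shared-memory request and report how many
    // blocks can be resident on one SM at each size.
    for (size_t smem = 1024; smem <= 32 * 1024; smem *= 2) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, sharedMemKernel, blockSize, smem);
        printf("%6zu B shared memory -> %d resident blocks per SM\n",
               smem, blocksPerSM);
    }
    return 0;
}
```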

@chaithyagr
Contributor

Perhaps we could add code based on CUDA capability, protected with #ifdef.
Further, I noticed that n_coils_cc is always 1 for 3D; maybe we could use the same function we use for GPU memory estimation in 2D to obtain a better n_coils_cc.
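
A minimal sketch of that idea, using the __CUDA_ARCH__ macro with illustrative, untuned constants (for host-side choices such as block dimensions, the same split could be made at run time from cudaDeviceProp::major/minor):

```cuda
// Hypothetical sketch: pick per-architecture constants at compile time.
// __CUDA_ARCH__ is only defined during device compilation, and the
// values below are illustrative, not tuned for gpuNUFFT.
__global__ void scaleKernel(float *out, const float *in, long N)
{
#if __CUDA_ARCH__ >= 700      // Volta/Turing and newer (e.g. 2080 Ti)
    const int UNROLL = 8;
#elif __CUDA_ARCH__ >= 600    // Pascal (e.g. 1080 Ti, TITAN X)
    const int UNROLL = 4;
#else                         // older architectures
    const int UNROLL = 2;
#endif
    // Each thread processes UNROLL consecutive elements.
    long base = (blockIdx.x * (long)blockDim.x + threadIdx.x) * UNROLL;
    for (int k = 0; k < UNROLL; ++k) {
        long i = base + k;
        if (i < N)
            out[i] = 2.0f * in[i];
    }
}
```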
