Slow memory management on Nvidia GPUs #841
Comments
We do test LS, namely H2O-DFT-LS. I don't see any connection with the GPU type; the data movement is GPU-agnostic. Specifically, for GPU data allocation we use memory pools, so I would not expect any big impact from that. I assume these are allocations of the indices, which are async, so the effect should be minimal.
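To illustrate why a memory pool should keep allocation cost low, here is a minimal sketch in Python. It is not DBCSR's implementation: the `BufferPool` class, its methods, and `run_demo` are hypothetical names, and a `bytearray` stands in for a device buffer. The point is only that a released buffer is reused, so the expensive allocation path runs once per distinct size.

```python
from collections import defaultdict


class BufferPool:
    """Toy pool (hypothetical, not DBCSR code): idle buffers are reused by size."""

    def __init__(self):
        self._free = defaultdict(list)  # size in bytes -> list of idle buffers
        self.fresh_allocations = 0      # how often the expensive path ran

    def acquire(self, nbytes):
        idle = self._free[nbytes]
        if idle:
            return idle.pop()           # cheap path: hand out an idle buffer
        self.fresh_allocations += 1     # expensive path: stands in for a device malloc
        return bytearray(nbytes)

    def release(self, buf):
        # Keep the buffer for later reuse instead of freeing it.
        self._free[len(buf)].append(buf)


def run_demo(iterations=100, nbytes=1024):
    """Acquire and release the same size repeatedly; count fresh allocations."""
    pool = BufferPool()
    for _ in range(iterations):
        buf = pool.acquire(nbytes)
        pool.release(buf)
    return pool.fresh_allocations
```

In this sketch, `run_demo()` performs a single fresh allocation no matter how many iterations run, which is the behavior one would hope for from pooled GPU buffers of a stable size.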
I recently ran tests on a GH200 system with the OpenCL backend. The OpenCL backend supports profiling, with results appearing in DBCSR's / CP2K's regular profile (end of execution). The allocations were visible for both host- and GPU-backed memory. However, this can also depend on the node's configuration, such as the amount of page-lockable memory. Still, the time spent was negligible compared to the total time to solution (wall time).
These are the prototypes that allow calling CP2K/DBCSR's timer facility, for instance in
@alazzaro I have sent you a mail.
OK, so I've checked the slides, and my understanding is that the problem appears in the first multiplications, which is expected. See dbcsr/src/data/dbcsr_mem_methods.F, line 207 in f4e8c38.
Then, there is a function to ensure that the size of the buffers is OK. This function is called for the C matrix at dbcsr/src/mm/dbcsr_mm_cannon.F, line 1199 in f4e8c38,
where we also try to make an educated guess of the final size (per thread). Now, the occupancies of the matrices increase with the multiplications, up to a given plateau. So, in the first multiplications the memory is reallocated, but after that we use the memory pool and do not reallocate. The benchmark itself can therefore show a bit of overhead, but in real production runs (with many more multiplications) the effect is negligible. I can imagine making resize_factor an external parameter so that we can avoid reallocations (at the cost of a larger memory footprint). Using cudaMallocAsync would require some refactoring, and I don't think it is worth the pain.
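The trade-off between reallocations and memory footprint can be sketched with a toy model. This is not the DBCSR code: `GrowOnlyBuffer`, `ensure_size`, `count_reallocations`, and the element counts in `sizes` are all hypothetical, and the only assumption taken from the discussion is that the required size grows over the first multiplications and then plateaus.

```python
import math


class GrowOnlyBuffer:
    """Toy buffer (hypothetical): grows when too small and never shrinks."""

    def __init__(self, resize_factor=1.0):
        self.resize_factor = resize_factor
        self.capacity = 0
        self.reallocations = 0

    def ensure_size(self, needed):
        """Reallocate only if the current capacity is insufficient."""
        if needed <= self.capacity:
            return False
        # Over-allocate by resize_factor to absorb future growth.
        self.capacity = math.ceil(needed * self.resize_factor)
        self.reallocations += 1
        return True


def count_reallocations(sizes, resize_factor):
    """Replay a sequence of required sizes and count the reallocations."""
    buf = GrowOnlyBuffer(resize_factor)
    for n in sizes:
        buf.ensure_size(n)
    return buf.reallocations


# Made-up sizes: occupancy grows over the first multiplications, then plateaus.
sizes = [100, 150, 200, 240, 250] + [250] * 20
```

With these made-up sizes, a factor of 1.0 reallocates on each of the five growth steps and never afterwards, a factor of 2.0 reallocates twice, and a factor of 3.0 reallocates once, at the cost of a final capacity of 300 elements instead of 250. In every case the plateau phase is free of reallocations, which matches the expectation that long production runs amortize the initial overhead.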
@fstein93 was the issue discovered on a GH200 system like Alps?
It was 8xH100 with 2 ranks per GPU. I did not run the tests.
If a DBCSR-heavy calculation in CP2K (LS_SCF) is profiled on NVIDIA GPUs, it turns out that DBCSR spends a lot (most) of its time allocating and freeing GPU memory (tested on H100). PM for additional data. This may also be the case on AMD hardware.