Slowdown of GPU Data Transfers in Python Threads #75
@dialecticDolt please feel free to add more info here. Where do we have example code to reproduce this?
I've added the examples to reproduce this with/without VECs in https://github.com/ut-parla/Parla.py/tree/master/benchmarks/gpu_threading, as well as the MPI and C++ OpenMP comparisons. As a log, I'm also copying the performance numbers here (from the Slack discussion). All rows are pageable host-to-device transfers (`[CUDA memcpy HtoD]`) of 7.4506 GB, timed with nvprof. In the C++ OpenMP runs, allocations and deallocations are done ahead of time; in the Python runs they are done with cupy. Only the memcpy is timed:

| System | Configuration | Before warmup | Warmed up | Throughput | Device | Ctx/Stream |
|---|---|---|---|---|---|---|
| Zemaitis | OpenMP in C++ | 19.0741 s | 1.56790 s | 4.7520 GB/s | Tesla P100-SXM2 | 3/27 |
| Zemaitis | Multithreading in Python | 14.5530 s | 2.51998 s | 2.9566 GB/s | Tesla P100-SXM2 | 1/7 |
| Zemaitis | Multithreading in Python (second trial) | 33.5803 s | 2.30019 s | 3.2391 GB/s | Tesla P100-SXM2 | 2/17 |
| Zemaitis | MPI in Python | 3.69050 s | 1.42256 s | 5.2374 GB/s | Tesla P100-SXM2 | 1/7 |
| Frontera | OpenMP in C++ | 829.65 ms | 889.74 ms | 8.3739 GB/s | Quadro RTX 5000 | 1/7 |
| Frontera | Multiprocess with MPI in Python | 3.38638 s | 1.13888 s | 6.5420 GB/s | Quadro RTX 5000 | 1/7 |
| Frontera | Multithreading in Python | 11.5841 s | 1.40261 s | 5.3119 GB/s | Quadro RTX 5000 | 1/7 |

Note that the first Zemaitis multithreading row is on the low end of the variance; the second trial is just another sample of the same configuration.
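For readers who don't want to dig through the repo, the threaded configuration boils down to something like the sketch below. This is a minimal reconstruction, not the actual benchmark code; the array size and the `cp.empty`/`ndarray.set` pattern are my assumptions. The point it illustrates is the one measured above: cupy allocation happens outside the timed region, so only the pageable HtoD memcpy itself is timed, once from the main thread and once from a Python-created thread.

```python
import threading
import time

import cupy as cp
import numpy as np

# Size is illustrative only; the real benchmarks may use other sizes.
N = 100 * 1024 * 1024  # ~400 MB of float32, pageable host memory

host = np.ones(N, dtype=np.float32)

def timed_copy(label):
    # Allocation with cupy happens before the timed region.
    dev = cp.empty(host.shape, dtype=host.dtype)
    start = time.perf_counter()
    dev.set(host)                        # pageable HtoD cudaMemcpy
    cp.cuda.Stream.null.synchronize()    # wait for the transfer to finish
    print(f"{label}: {time.perf_counter() - start:.4f}s")

# Baseline: copy issued from the main thread.
timed_copy("main thread")

# The problematic case: the same copy issued from a Python-created thread.
t = threading.Thread(target=timed_copy, args=("python thread",))
t.start()
t.join()
```

Running this under `nvprof python ...` should attribute the `[CUDA memcpy HtoD]` rows the same way as in the table above.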
Creating this as a placeholder to track progress while we figure out where to even submit this upstream.
Currently, GPU transfers in Python threads exhibit unexplained, erratic slowdowns. We originally thought these overheads were caused by VECs; however, @dialecticDolt did some additional investigation and found that they are caused entirely by the use of `cudaMemcpy` from within threads created by Python. He has verified that this issue does not affect OpenMP's thread pool. We haven't yet verified whether it affects threads created through the pthreads interface or C++'s `std::thread` interface, so it is possible that OpenMP is simply doing something special rather than Python conflicting with CUDA.
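For contrast, the process-based configuration that did not show the slowdown (the "MPI in Python" rows above) can be approximated as follows. This is a sketch under the same assumptions as the threading sketch; mpi4py and the exact launch line are my guesses, and the actual MPI comparison lives in `benchmarks/gpu_threading`:

```python
# Launch with something like: mpiexec -n 2 nvprof python mpi_copy.py
import time

import cupy as cp
import numpy as np
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()

N = 100 * 1024 * 1024  # ~400 MB of float32; size is illustrative
host = np.ones(N, dtype=np.float32)

# As before, allocation with cupy stays outside the timed region.
dev = cp.empty(host.shape, dtype=host.dtype)

MPI.COMM_WORLD.Barrier()
start = time.perf_counter()
dev.set(host)                        # pageable HtoD cudaMemcpy
cp.cuda.Stream.null.synchronize()    # wait for the transfer to finish
print(f"rank {rank}: {time.perf_counter() - start:.4f}s")
```

Each MPI rank is a separate process issuing the copy from its own main thread, so if the slowdown is specific to threads created by the Python interpreter, this configuration should avoid it, which matches the numbers reported above.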