Slowdown of GPU Data Transfers in Python Threads #75

insertinterestingnamehere · 2021-05-11T19:04:08Z

Creating this as a placeholder to track progress while we figure out where to even submit this upstream.

Currently GPU transfers in Python threads exhibit unexplained erratic slowdowns. We originally thought these overheads were caused by VECs, however @dialecticDolt did some additional investigation and found that they were caused entirely by use of cudaMemcpy from within threads created by Python. He's verified that this issue does not affect OpenMP's thread pool. We haven't yet verified if this affects threads created using the pthreads interfaces or c++'s std::thread interface, so it is possible that OpenMP is just doing something special instead of Python conflicting with CUDA.

The text was updated successfully, but these errors were encountered:

insertinterestingnamehere · 2021-05-11T19:04:13Z

@dialecticDolt please feel free to add more info here. Where do we have example code to reproduce this?

wlruys · 2021-05-14T04:33:43Z

I've added the examples to reproduce this with/without VECs in https://github.com/ut-parla/Parla.py/tree/master/benchmarks/gpu_threading, as well as the MPI and CPP OpenMP comparisons.

As a log I'm also copying the performance numbers here (from the slack discussion):

`The reported times for the Memcpy (timed with nvprof) are:
Start Time, Duration, Size of Transfer, Transfer Speed, Device Details

On Zemaitis (Openmp in CPP, Allocations and Deallocations done ahead of time, only timing the memcpy)Before warmup:
1.20125s 2.87918s - 7.4506GB 2.5877GB/s Pageable Device Tesla P100-SXM2 4 17 [CUDA memcpy HtoD]
1.47087s 2.79998s - 7.4506GB 2.6609GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
1.47157s 2.86431s - 7.4506GB 2.6012GB/s Pageable Device Tesla P100-SXM2 2 35 [CUDA memcpy HtoD]
1.47182s 2.75349s - 7.4506GB 2.7059GB/s Pageable Device Tesla P100-SXM2 3 28 [CUDA memcpy HtoD]

Warmed Up:

19.0741s 1.56790s - 7.4506GB 4.7520GB/s Pageable Device Tesla P100-SXM2 3 27 [CUDA memcpy HtoD]
19.0742s 1.50439s - 7.4506GB 4.9525GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
19.0747s 1.46896s - 7.4506GB 5.0720GB/s Pageable Device Tesla P100-SXM2 4 17 [CUDA memcpy HtoD]
19.0767s 1.52812s - 7.4506GB 4.8756GB/s Pageable Device Tesla P100-SXM2 2 37 [CUDA memcpy HtoD]

On Zemaitis (Multithreading in Python, Allocations and Deallocations done with cupy, only timing the memcpy)

14.5530s 2.51998s - 7.4506GB 2.9566GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
14.5531s 1.89952s - 7.4506GB 3.9223GB/s Pageable Device Tesla P100-SXM2 2 17 [CUDA memcpy HtoD]
14.5535s 2.10205s - 7.4506GB 3.5444GB/s Pageable Device Tesla P100-SXM2 3 27 [CUDA memcpy HtoD]
14.5537s 1.79802s - 7.4506GB 4.1438GB/s Pageable Device Tesla P100-SXM2 4 37 [CUDA memcpy HtoD]

Just another sample/trial of the same (the first one ^ is on the low end of the variance):

33.5803s 2.30019s - 7.4506GB 3.2391GB/s Pageable Device Tesla P100-SXM2 2 17 [CUDA memcpy HtoD]
33.5807s 2.17479s - 7.4506GB 3.4259GB/s Pageable Device Tesla P100-SXM2 3 27 [CUDA memcpy HtoD]
33.5807s 2.39922s - 7.4506GB 3.1054GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
33.5808s 2.40678s - 7.4506GB 3.0957GB/s Pageable Device Tesla P100-SXM2 4 37 [CUDA memcpy HtoD]

On Zemaitis (MPI in Python, Allocations and Deallocations done with cupy, only timing the memcpy)

3.69050s 1.42256s - 7.4506GB 5.2374GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
3.78333s 1.43109s - 7.4506GB 5.2062GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
3.83678s 1.44673s - 7.4506GB 5.1499GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]
3.88229s 1.44153s - 7.4506GB 5.1685GB/s Pageable Device Tesla P100-SXM2 1 7 [CUDA memcpy HtoD]

On Frontera (Openmp in CPP, Allocations and Deallocations done ahead of time, only timing the memcpy)

829.65ms 889.74ms - 7.4506GB 8.3739GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]
882.30ms 1.13772s - 7.4506GB 6.5487GB/s Pageable Device Quadro RTX 5000 4 17 [CUDA memcpy HtoD]
997.03ms 1.20776s - 7.4506GB 6.1689GB/s Pageable Device Quadro RTX 5000 2 37 [CUDA memcpy HtoD]
997.41ms 1.21396s - 7.4506GB 6.1374GB/s Pageable Device Quadro RTX 5000 3 27 [CUDA memcpy HtoD]

On Frontera (Multiprocess with MPI in Python, Allocations and Deallocations done with cupy, only timing the Memcpy.)

3.38638s 1.13888s - 7.4506GB 6.5420GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]
3.58987s 1.13825s - 7.4506GB 6.5457GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]
3.69940s 1.21641s - 7.4506GB 6.1251GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]
3.73599s 1.21368s - 7.4506GB 6.1388GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]

On Frontera ( Multithreading in Python, Allocations and Deallocations done with cupy, only timing the Memcpy. )

11.5841s 1.40261s - 7.4506GB 5.3119GB/s Pageable Device Quadro RTX 5000 1 7 [CUDA memcpy HtoD]
11.5843s 1.57895s - 7.4506GB 4.7187GB/s Pageable Device Quadro RTX 5000 2 18 [CUDA memcpy HtoD]
11.5845s 1.64461s - 7.4506GB 4.5303GB/s Pageable Device Quadro RTX 5000 3 29 [CUDA memcpy HtoD]
11.5846s 1.28430s - 7.4506GB 5.8013GB/s Pageable Device Quadro RTX 5000 4 40 [CUDA memcpy HtoD]`

insertinterestingnamehere added performance Runtime performance of Parla or Parla programs bug Something isn't working labels May 11, 2021

insertinterestingnamehere added this to the VEC Paper Revisions milestone Jun 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slowdown of GPU Data Transfers in Python Threads #75

Slowdown of GPU Data Transfers in Python Threads #75

insertinterestingnamehere commented May 11, 2021

insertinterestingnamehere commented May 11, 2021

wlruys commented May 14, 2021 •

edited

Loading

Slowdown of GPU Data Transfers in Python Threads #75

Slowdown of GPU Data Transfers in Python Threads #75

Comments

insertinterestingnamehere commented May 11, 2021

insertinterestingnamehere commented May 11, 2021

wlruys commented May 14, 2021 • edited Loading

wlruys commented May 14, 2021 •

edited

Loading