-
Notifications
You must be signed in to change notification settings - Fork 26
Use threads instead of processes in Dataset.summaries #242
Conversation
Dataset.summaries uses a concurrent.futures.ProcessPoolExecutor to fetch multiple files from S3 at once. ProcessPoolExecutor uses multiprocessing underneath, which defaults to using fork() on Unix. Using fork() is dangerous and prone to deadlocks: https://codewithoutrules.com/2018/09/04/python-multiprocessing/ This is a possible source of observed deadlocks during calls to Dataset.records. Using threads should not be a performance regression since the operation we're parallelizing over is network-bound, not CPU-bound, so there should not be much contention for the GIL.
Codecov Report
@@ Coverage Diff @@
## master #242 +/- ##
=======================================
Coverage 80.05% 80.05%
=======================================
Files 11 11
Lines 1053 1053
=======================================
Hits 843 843
Misses 210 210
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussing on IRC right now. Before merging, we should verify that this actually solves the problem by running an affected job with these changes.
I don't have a clear idea of the potential risks of the change, but likely seems worth it if we have have a concrete case where it relieves pain.
I did a run with this branch and it didn't fail: https://dbc-caf9527b-e073.cloud.databricks.com/#job/753/run/3 That's not strong evidence that I actually fixed anything, since this job is flaky but sometimes succeeds. I can keep my job pinned on this branch for a week or so if you'd like to shake it out, though I don't think it should break anything. |
If it's not a significant inconvenience, let's do that. Then we avoid the confusion of rolling back the change if we find it's not sufficient to make performance predictable. |
There have been no additional related failures since pinning this branch on the 21st: https://dbc-caf9527b-e073.cloud.databricks.com/#job/715 (the failure of run 47 is because I was in the middle of revising the notebook when the job triggered, oops) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Merging.
🎉 Thanks! |
You should be able to update the library in databricks once it's published. |
Now published: https://pypi.org/project/python_moztelemetry/0.10.5/ |
Dataset.summaries uses a concurrent.futures.ProcessPoolExecutor to fetch multiple files from S3 at once.
ProcessPoolExecutor uses multiprocessing underneath, which defaults to using fork() on Unix.
Using fork() is dangerous and prone to deadlocks: https://codewithoutrules.com/2018/09/04/python-multiprocessing/
This is a possible source of observed deadlocks during calls to Dataset.records.
Using threads should not be a performance regression since the operation we're parallelizing over is network-bound,
not CPU-bound, so there should not be much contention for the GIL.