Skip to content
This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Use threads instead of processes in Dataset.summaries #242

Merged
merged 1 commit into from
Nov 28, 2018

Conversation

tdsmith
Copy link
Contributor

@tdsmith tdsmith commented Nov 21, 2018

Dataset.summaries uses a concurrent.futures.ProcessPoolExecutor to fetch multiple files from S3 at once.
ProcessPoolExecutor uses multiprocessing underneath, which defaults to using fork() on Unix.
Using fork() is dangerous and prone to deadlocks: https://codewithoutrules.com/2018/09/04/python-multiprocessing/

This is a possible source of observed deadlocks during calls to Dataset.records.

Using threads should not be a performance regression since the operation we're parallelizing over is network-bound,
not CPU-bound, so there should not be much contention for the GIL.

Dataset.summaries uses a concurrent.futures.ProcessPoolExecutor to fetch multiple files from S3 at once.
ProcessPoolExecutor uses multiprocessing underneath, which defaults to using fork() on Unix.
Using fork() is dangerous and prone to deadlocks: https://codewithoutrules.com/2018/09/04/python-multiprocessing/

This is a possible source of observed deadlocks during calls to Dataset.records.

Using threads should not be a performance regression since the operation we're parallelizing over is network-bound,
not CPU-bound, so there should not be much contention for the GIL.
@tdsmith tdsmith requested review from jklukas and sunahsuh November 21, 2018 20:09
@codecov-io
Copy link

codecov-io commented Nov 21, 2018

Codecov Report

Merging #242 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #242   +/-   ##
=======================================
  Coverage   80.05%   80.05%           
=======================================
  Files          11       11           
  Lines        1053     1053           
=======================================
  Hits          843      843           
  Misses        210      210
Flag Coverage Δ
#py27 79.86% <100%> (ø) ⬆️
#py35 79.01% <100%> (ø) ⬆️
#py36 79.01% <100%> (ø) ⬆️
Impacted Files Coverage Δ
moztelemetry/dataset.py 95% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb68074...75aa74f. Read the comment docs.

Copy link
Contributor

@jklukas jklukas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussing on IRC right now. Before merging, we should verify that this actually solves the problem by running an affected job with these changes.

I don't have a clear idea of the potential risks of the change, but likely seems worth it if we have have a concrete case where it relieves pain.

@tdsmith tdsmith removed the request for review from sunahsuh November 21, 2018 21:27
@tdsmith
Copy link
Contributor Author

tdsmith commented Nov 21, 2018

I did a run with this branch and it didn't fail: https://dbc-caf9527b-e073.cloud.databricks.com/#job/753/run/3

That's not strong evidence that I actually fixed anything, since this job is flaky but sometimes succeeds.

I can keep my job pinned on this branch for a week or so if you'd like to shake it out, though I don't think it should break anything.

@jklukas
Copy link
Contributor

jklukas commented Nov 21, 2018

I can keep my job pinned on this branch for a week or so if you'd like to shake it out, though I don't think it should break anything.

If it's not a significant inconvenience, let's do that. Then we avoid the confusion of rolling back the change if we find it's not sufficient to make performance predictable.

@tdsmith
Copy link
Contributor Author

tdsmith commented Nov 28, 2018

There have been no additional related failures since pinning this branch on the 21st: https://dbc-caf9527b-e073.cloud.databricks.com/#job/715

(the failure of run 47 is because I was in the middle of revising the notebook when the job triggered, oops)

Copy link
Contributor

@jklukas jklukas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Merging.

@jklukas jklukas merged commit 2f030ed into mozilla:master Nov 28, 2018
@tdsmith
Copy link
Contributor Author

tdsmith commented Nov 28, 2018

🎉 Thanks!

@tdsmith tdsmith deleted the dataset-deadlocks branch November 28, 2018 17:53
@jklukas
Copy link
Contributor

jklukas commented Nov 28, 2018

You should be able to update the library in databricks once it's published.

@jklukas
Copy link
Contributor

jklukas commented Nov 28, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants