Use threads instead of processes in Dataset.summaries #242

tdsmith · 2018-11-21T20:09:12Z

Dataset.summaries uses a concurrent.futures.ProcessPoolExecutor to fetch multiple files from S3 at once.
ProcessPoolExecutor uses multiprocessing underneath, which defaults to using fork() on Unix.
Using fork() is dangerous and prone to deadlocks: https://codewithoutrules.com/2018/09/04/python-multiprocessing/

This is a possible source of observed deadlocks during calls to Dataset.records.

Using threads should not be a performance regression since the operation we're parallelizing over is network-bound,
not CPU-bound, so there should not be much contention for the GIL.
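As a rough illustration of the pattern (not the actual `Dataset.summaries` code), the sketch below fans a network-bound fetch out over a `concurrent.futures.ThreadPoolExecutor`. The names `fetch_summary` and `fetch_all` are hypothetical, and the S3 request is simulated with a short sleep, so treat it as a sketch under those assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_summary(key):
    """Hypothetical stand-in for an S3 fetch; the sleep simulates network latency."""
    time.sleep(0.05)  # blocking waits like this release the GIL
    return {"key": key, "size": len(key)}


def fetch_all(keys, max_workers=8):
    """Fetch every key concurrently using threads rather than forked processes."""
    # Threads run inside the current process, so fork() is never called and
    # no locks are copied into a child in a held state.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input keys.
        return list(executor.map(fetch_summary, keys))


keys = ["a/1", "b/22", "c/333"]
summaries = fetch_all(keys)
```

Because each worker spends nearly all of its time waiting on I/O, the threads overlap their waits much as separate processes would.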

codecov-io · 2018-11-21T20:11:09Z

Codecov Report

Merging #242 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master     #242   +/-   ##
=======================================
  Coverage   80.05%   80.05%           
=======================================
  Files          11       11           
  Lines        1053     1053           
=======================================
  Hits          843      843           
  Misses        210      210

Flag	Coverage Δ
#py27	`79.86% <100%> (ø)`	⬆️
#py35	`79.01% <100%> (ø)`	⬆️
#py36	`79.01% <100%> (ø)`	⬆️

Impacted Files	Coverage Δ
moztelemetry/dataset.py	`95% <100%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fb68074...75aa74f.

jklukas

Discussing on IRC right now. Before merging, we should verify that this actually solves the problem by running an affected job with these changes.

I don't have a clear idea of the potential risks of the change, but it seems likely worth it if we have a concrete case where it relieves pain.

tdsmith · 2018-11-21T21:35:58Z

I did a run with this branch and it didn't fail: https://dbc-caf9527b-e073.cloud.databricks.com/#job/753/run/3

That's not strong evidence that I actually fixed anything, since this job is flaky but sometimes succeeds.

I can keep my job pinned on this branch for a week or so if you'd like to shake it out, though I don't think it should break anything.

jklukas · 2018-11-21T21:38:06Z

> I can keep my job pinned on this branch for a week or so if you'd like to shake it out, though I don't think it should break anything.

If it's not a significant inconvenience, let's do that. Then we avoid the confusion of rolling back the change if we find it's not sufficient to make performance predictable.

tdsmith · 2018-11-28T17:51:29Z

There have been no additional related failures since pinning this branch on the 21st: https://dbc-caf9527b-e073.cloud.databricks.com/#job/715

(the failure of run 47 is because I was in the middle of revising the notebook when the job triggered, oops)

jklukas

Sounds good. Merging.

tdsmith · 2018-11-28T17:53:40Z

🎉 Thanks!

jklukas · 2018-11-28T17:53:53Z

You should be able to update the library in databricks once it's published.

jklukas · 2018-11-28T18:05:57Z

Now published: https://pypi.org/project/python_moztelemetry/0.10.5/

tdsmith requested review from jklukas and sunahsuh November 21, 2018 20:09

jklukas reviewed Nov 21, 2018

tdsmith removed the request for review from sunahsuh November 21, 2018 21:27

jklukas approved these changes Nov 28, 2018

jklukas merged commit 2f030ed into mozilla:master Nov 28, 2018

tdsmith deleted the dataset-deadlocks branch November 28, 2018 17:53
