Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cryptic BrokenProcessPool exception #441

Closed
SpheMakh opened this issue Mar 4, 2021 · 6 comments
Closed

Cryptic BrokenProcessPool exception #441

SpheMakh opened this issue Mar 4, 2021 · 6 comments

Comments

@SpheMakh
Copy link
Collaborator

SpheMakh commented Mar 4, 2021

I'm not sure how to describe this, except for saying it happened. Has anyone seen it before

# INFO      23:09:58 - main               [io] [6.9/7.1 9.7/12.9 40.4Gb] loading tile 1/4
# INFO      23:09:58 - ms_tile            [io] [6.9/7.1 9.7/12.9 40.4Gb] tile 1/4: reading MS rows 109800~204959
# INFO      23:09:58 - main               [0.2/7.1 3.1/12.9 40.4Gb] submitting solver jobs for tile 0/4
# INFO      23:09:58 - data_handler       [io] [6.9/7.1 9.7/12.9 40.4Gb] reading DATA
# INFO      23:11:43 - data_handler       [io] [30.1/111.8 35.9/135.4 39.4Gb] reading BITFLAG
# INFO      23:13:45 - ms_tile            [io] [43.2/134.4 46.0/153.5 32.6Gb]   74.30% input visibilities flagged and/or deselected
# INFO      23:17:25 - solver             [x04] [21.9/89.3 25.0/105.7 41.8Gb] D0T0F3 has 2244 (0.00%) new data flags: MadMax took out 2244 visibilities (0 in final round)
# INFO      23:18:00 - solver             [x03] [21.9/73.2 25.0/104.5 44.6Gb] D0T0F2 has 2902 (0.00%) new data flags: MadMax took out 2902 visibilities (0 in final round)
# INFO      23:18:29 - solver             [x01] [21.9/75.3 25.0/141.6 47.7Gb] D0T0F0 has 396 (0.00%) new data flags: MadMax took out 396 visibilities (0 in final round)
# INFO      23:19:21 - solver             [x02] [21.9/76.4 25.1/119.7 51.9Gb] D0T0F1 has 33056 (0.03%) new data flags: MadMax took out 33056 visibilities (0 in final round)
# INFO      23:28:13 - solver             [x04] [22.0/94.0 25.1/109.5 55.7Gb] D0T0F4 has 478680 (0.43%) new data flags: MadMax took out 478680 visibilities (0 in final round)
# INFO      23:28:41 - solver             [x03] [22.0/69.7 25.1/85.3 57.4Gb] D0T0F5 has 476000 (0.43%) new data flags: MadMax took out 476000 visibilities (0 in final round)
# INFO      23:31:41 - solver             [x01] [22.0/53.9 25.1/69.7 59.2Gb] D0T0F6 has 152152 (0.14%) new data flags: MadMax took out 152152 visibilities (0 in final round)
# INFO      23:34:48 - solver             [x02] [22.0/32.7 25.1/48.2 61.0Gb] D0T0F7 has 293768 (0.27%) new data flags: MadMax took out 293768 visibilities (0 in final round)
# INFO      23:34:54 - main               [0.1/14.2 3.3/29.7 62.7Gb] finished processing tile 0/4
# INFO      23:34:54 - main               [0.1/14.2 3.3/29.7 62.7Gb] waiting for I/O on tile 1/4
# INFO      23:34:55 - main               [0.1/14.2 3.3/29.7 62.7Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
#  Traceback (most recent call last):
#   File "/usr/local/lib/python3.6/dist-packages/cubical/main.py", line 578, in main
#     stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
#   File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 226, in run_process_loop
#     return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
#   File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 286, in _run_multi_process_loop
#     if not io_futures[itile].result():
#   File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
#     return self.__get_result()
#   File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
#     raise self._exception
# concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
@o-smirnov
Copy link
Collaborator

Any errors above that? (Maybe put the full log on pastebin?)

BrokenProcessPool is just fancy speak for a subprocess (probably the I/O one, looking at the context) segfaulting or running out of memory (how much you got?) or otherwise blowing up. It may be cryptic, but it's already an improvement on just hanging there (see e.g. #439), which older versions of concurrent.futures did. We should add an exception handler to print a prettier message though.

@SpheMakh
Copy link
Collaborator Author

SpheMakh commented Mar 4, 2021

I had memory on the previous run, then tweaked a few but got this instead. The machine has 188Gb total memory

Full log is here https://pastebin.com/Mhiu6M9s

@SpheMakh
Copy link
Collaborator Author

SpheMakh commented Mar 5, 2021

It goes away after I tweaked some of the dist-<foo> options. Indeed, a prettier error message telling the user to adjust dist-<foo> options will do

@o-smirnov
Copy link
Collaborator

I'll work on a better error as part of #439. It's a bit tricky to catch when subprocesses just die on you with no warning, as they do for out-of-memory...

@SpheMakh
Copy link
Collaborator Author

SpheMakh commented Mar 5, 2021

Should I close this?

@o-smirnov
Copy link
Collaborator

Yeah it's a duplicate of #439 effectively.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants