
Running a Beaker Executor job leaves loads of uncommitted datasets in the workspace #386

Open
dirkgr opened this issue Sep 5, 2022 · 6 comments
Labels: bug (Something isn't working), integration: beaker (Relates to the Beaker integration)

dirkgr (Member) commented Sep 5, 2022

🐛 Describe the bug

Run the catwalk training job specified here: allenai/catwalk@5ba0192

The command line is `tango --settings experiments/train_all_the_things/tango.yml run experiments/train_all_the_things/train_all_the_things.jsonnet`.

It will run for quite a while. Two jobs will usually fail. I don't know why it's always two.

  1. When the run is complete, it tells you that two jobs failed, but there is no way to find the logs for the failed jobs (tracked separately as "When Beaker jobs fail, the final output must contain a link to the logs of the failed job" #387).
  2. When you run it again, it finds all the succeeded jobs, as it should. The failed ones start up again, but then hang indefinitely on something in the Beaker workspace.
  3. When you check your workspace, you will see hundreds of uncommitted datasets. Many of these contain actual, successful results, but two of them are the cause of the hang. It's impossible to tell which ones to delete to let the run finish. You can see many of these uncommitted datasets in the ai2/task-complexity workspace.

The actual problem here is the undiagnosable failure of the original jobs, and then the undiagnosable hang (which can be fixed if you remove the right uncommitted datasets). From a user perspective, having all those uncommitted datasets doesn't matter as such, but I suspect that it points to a deeper issue.
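For cleanup, something along these lines with beaker-py is roughly what I would want to run. Treat it as a sketch: the `committed` field and the delete call are my assumptions about the client API, not verified.

```python
from beaker import Beaker

# Connect using the credentials in the environment; the workspace name is the
# one from this issue.
beaker = Beaker.from_env(default_workspace="ai2/task-complexity")

# Collect datasets that were never committed. Assumption: the Dataset model
# exposes a `committed` timestamp that stays None for uncommitted datasets.
uncommitted = [
    ds
    for ds in beaker.workspace.datasets(workspace="ai2/task-complexity")
    if ds.committed is None
]

for ds in uncommitted:
    print(ds.id, ds.name)
    # beaker.dataset.delete(ds)  # uncomment to actually delete them

```

The trouble, as noted above, is that most of these uncommitted datasets hold real results, so deleting them wholesale is not an option.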

Versions

asd

dirkgr (Member, Author) commented Sep 6, 2022

I did another run like this. This time only one of them failed. The results table points me to this dataset: https://beaker.org/ds/01GC7PCX5M9B357GX6YJFY7C5R/details.

This is clearly incomplete. Its presence will prevent a re-run from succeeding. I know of no way to find the logs that would show me why it failed.

dirkgr (Member, Author) commented Sep 6, 2022

Ah! I can search experiments by name, which reveals this error message:

2022-09-05T21:18:13.535899802Z {"name": "root", "msg": "[step trained_model_arc_challenge_bert-base-uncased_2147483647] Uncaught exception", "args": {"py/tuple": []}, "levelname": "ERROR", "levelno": 40, "pathname": "/opt/conda/lib/python3.9/site-packages/tango/common/logging.py", "filename": "logging.py", "module": "logging", "exc_info": {"py/tuple": [{"py/type": "beaker.exceptions.BeakerError"}, {"py/reduce": [{"py/type": "beaker.exceptions.BeakerError"}, {"py/tuple": ["appendErr after appendLimiter.Wait (chunkSize=2316455): googleapi: Error 429: The rate of change requests to the object fileheap-public/uploads/400c75/01gc7r8yz8thyxx16sbej4xg28_$folder$ exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded"]}, {}]}, {"tb_frame": {"f_globals": {"__name__": "__main__", "__file__": "/opt/conda/bin/tango"}, "f_code": {"co_filename": "/opt/conda/bin/tango", "co_name": "<module>"}, "f_lineno": 8}, "tb_lineno": 8, "tb_next": {"tb_frame": {"f_globals": {"__name__": "click.core", "__file__": "/opt/conda/lib/python3.9/site-packages/click/core.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/click/core.py", "co_name": "__call__"}, "f_lineno": 1130}, "tb_lineno": 1130, "tb_next": {"tb_frame": {"f_globals": {"__name__": "click.core", "__file__": "/opt/conda/lib/python3.9/site-packages/click/core.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/click/core.py", "co_name": "main"}, "f_lineno": 1098}, "tb_lineno": 1055, "tb_next": {"tb_frame": {"f_globals": {"__name__": "click.core", "__file__": "/opt/conda/lib/python3.9/site-packages/click/core.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/click/core.py", "co_name": "invoke"}, "f_lineno": 1657}, "tb_lineno": 1657, "tb_next": {"tb_frame": {"f_globals": {"__name__": "click.core", "__file__": "/opt/conda/lib/python3.9/site-packages/click/core.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/click/core.py", "co_name": "invoke"}, "f_lineno": 1404}, "tb_lineno": 1404, "tb_next": {"tb_frame": {"f_globals": {"__name__": "click.core", "__file__": "/opt/conda/lib/python3.9/site-packages/click/core.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/click/core.py", "co_name": "invoke"}, "f_lineno": 760}, "tb_lineno": 760, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.__main__", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/__main__.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/__main__.py", "co_name": "beaker_executor_run"}, "f_lineno": 393}, "tb_lineno": 393, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.executor", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/executor.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/executor.py", "co_name": "execute_step"}, "f_lineno": 68}, "tb_lineno": 68, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.step", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/step.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/step.py", "co_name": "ensure_result"}, "f_lineno": 615}, "tb_lineno": 615, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.step", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/step.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/step.py", "co_name": "result"}, "f_lineno": 589}, "tb_lineno": 586, "tb_next": {"tb_frame": {"f_globals": {"__name__": 
"tango.step", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/step.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/step.py", "co_name": "_run_with_work_dir"}, "f_lineno": 431}, "tb_lineno": 424, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.integrations.beaker.workspace", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/workspace.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/workspace.py", "co_name": "step_finished"}, "f_lineno": 163}, "tb_lineno": 163, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.integrations.beaker.step_cache", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/step_cache.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/step_cache.py", "co_name": "__setitem__"}, "f_lineno": 150}, "tb_lineno": 144, "tb_next": {"tb_frame": {"f_globals": {"__name__": "tango.integrations.beaker.step_cache", "__file__": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/step_cache.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/tango/integrations/beaker/step_cache.py", "co_name": "_sync_step_dataset"}, "f_lineno": 75}, "tb_lineno": 72, "tb_next": {"tb_frame": {"f_globals": {"__name__": "beaker.services.dataset", "__file__": "/opt/conda/lib/python3.9/site-packages/beaker/services/dataset.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/beaker/services/dataset.py", "co_name": "sync"}, "f_lineno": 468}, "tb_lineno": 464, "tb_next": {"tb_frame": {"f_globals": {"__name__": "concurrent.futures._base", "__file__": "/opt/conda/lib/python3.9/concurrent/futures/_base.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/concurrent/futures/_base.py", "co_name": "result"}, "f_lineno": 451}, "tb_lineno": 439, "tb_next": {"tb_frame": {"f_globals": {"__name__": "concurrent.futures._base", "__file__": "/opt/conda/lib/python3.9/concurrent/futures/_base.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/concurrent/futures/_base.py", "co_name": "__get_result"}, "f_lineno": 394}, "tb_lineno": 391, "tb_next": {"tb_frame": {"f_globals": {"__name__": "concurrent.futures.thread", "__file__": "/opt/conda/lib/python3.9/concurrent/futures/thread.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/concurrent/futures/thread.py", "co_name": "run"}, "f_lineno": 64}, "tb_lineno": 58, "tb_next": {"tb_frame": {"f_globals": {"__name__": "beaker.services.dataset", "__file__": "/opt/conda/lib/python3.9/site-packages/beaker/services/dataset.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/beaker/services/dataset.py", "co_name": "_upload_file"}, "f_lineno": 623}, "tb_lineno": 587, "tb_next": {"tb_frame": {"f_globals": {"__name__": "beaker.services.service_client", "__file__": "/opt/conda/lib/python3.9/site-packages/beaker/services/service_client.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/beaker/services/service_client.py", "co_name": "request"}, "f_lineno": 98}, "tb_lineno": 98, "tb_next": {"tb_frame": {"f_globals": {"__name__": "beaker.services.service_client", "__file__": "/opt/conda/lib/python3.9/site-packages/beaker/services/service_client.py"}, "f_code": {"co_filename": "/opt/conda/lib/python3.9/site-packages/beaker/services/service_client.py", "co_name": "make_request"}, "f_lineno": 92}, "tb_lineno": 90, "tb_next": null}}}}}}}}}}}}}}}}}}}}}]}, "exc_text": null, "stack_info": 
null, "lineno": 381, "funcName": "log_exc_info", "created": 1662412693.5311759, "msecs": 531.1758518218994, "relativeCreated": 1685521.075963974, "thread": 140370629374528, "threadName": "MainThread", "processName": "MainProcess", "process": 1, "highlighter": {"py/object": "rich.highlighter.ReprHighlighter"}}

So that explains why it's always two that fail: we're writing to Beaker too quickly and hitting its rate limit.
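Until the upload path handles this itself, a client-side backoff would probably mask the 429s. This is only a sketch: the `upload` callable is a stand-in for whatever the step cache calls to push a file, and matching on the error string is an assumption based on the message above.

```python
import random
import time

from beaker.exceptions import BeakerError


def call_with_backoff(upload, max_attempts=5):
    """Retry `upload` with exponential backoff when Beaker reports a rate limit."""
    for attempt in range(max_attempts):
        try:
            return upload()
        except BeakerError as exc:
            # "rateLimitExceeded" is the marker in the 429 message quoted above;
            # any other error, or the final attempt, is re-raised immediately.
            if "rateLimitExceeded" not in str(exc) or attempt == max_attempts - 1:
                raise
            time.sleep(min(2**attempt, 30) + random.random())
```

Lowering the upload concurrency would likely have the same effect.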

dirkgr (Member, Author) commented Sep 6, 2022

I will leave all of this up as-is in case it helps with debugging.

epwalsh (Member) commented Sep 6, 2022

> I did another run like this. This time only one of them failed. The results table points me to this dataset: https://beaker.org/ds/01GC7PCX5M9B357GX6YJFY7C5R/details.
>
> This is clearly incomplete. Its presence will prevent a re-run from succeeding. I know of no way to find the logs that would show me why it failed.

What happens here? Does it hang? Do you know where it hangs? I'm guessing the 429 error you discovered leaves this dataset in a bad state somehow.

dirkgr (Member, Author) commented Sep 6, 2022

The job finishes with (in this case) one failure and a few dependent steps not run. The problems with how this goes down are these:

  1. It shouldn't have failed in the first place. The error was a Beaker error that happened while writing the results, I think. Details are in the comment above (Running a Beaker Executor job leaves loads of uncommitted datasets in the workspace #386 (comment)).
  2. When it fails, it should give me an easy click through to the logs that show the error. Instead I get a link to the dataset, which is incomplete. I can only find the logs if I search for the step part of the dataset name in my list of experiments, which is not very discoverable (a rough sketch of that search follows this list).
  3. When I re-run, it should not get stuck. Currently, the second run gets stuck when it's time to start the step. This might be caused by the 429 error, so it might already be fixed.
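This is roughly the by-hand search from point 2, written out with beaker-py. The `match` filter and the `logs()` call are my best guesses at the client API, so treat it as a sketch rather than a recipe.

```python
from beaker import Beaker

beaker = Beaker.from_env()

# Step name taken from the dataset name of the failed step (example from the
# log in the earlier comment).
step_name = "trained_model_arc_challenge_bert-base-uncased_2147483647"

# Assumption: workspace.experiments() accepts a `match` filter and
# experiment.logs() streams the raw job logs as bytes.
for exp in beaker.workspace.experiments(workspace="ai2/task-complexity", match=step_name):
    print(exp.id, exp.name)
    for chunk in beaker.experiment.logs(exp):
        print(chunk.decode(errors="replace"), end="")
```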

epwalsh (Member) commented Sep 6, 2022

> Currently, the second run will get stuck when it's time to start the step.

What do you mean by "get stuck"?

epwalsh added the integration: beaker (Relates to the Beaker integration) label on Nov 8, 2022