Running a Beaker Executor job leaves loads of uncommitted datasets in the workspace #386
Comments
I did another run like this. This time only one of them failed. The results table points me to this dataset: https://beaker.org/ds/01GC7PCX5M9B357GX6YJFY7C5R/details. It is clearly incomplete, and its presence will prevent a re-run from succeeding. I know of no way to find the logs that would show me why it failed.
Ah! I can search experiments by name, which reveals this error message:
So that explains why it's always two that fail: we're writing to Beaker too quickly.
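The rate-limit explanation above suggests an obvious client-side mitigation: back off and retry on HTTP 429 instead of letting the step fail. A minimal sketch, using a `RateLimitError` stand-in for whatever exception the Beaker client actually raises on a 429 (the real exception type and any `Retry-After` handling may differ):

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the HTTP 429 error raised by the Beaker client."""


def with_backoff(fn, max_retries=5, base_delay=0.1):
    """Call fn(), retrying on RateLimitError with exponential backoff.

    Sleeps base_delay * 2**attempt seconds (plus jitter) between tries,
    and re-raises once max_retries is exhausted.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Wrapping each write to Beaker in something like `with_backoff` would smooth out bursts without changing the executor's logic.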
I will leave all this up as it is, in case it helps with debugging.
What happens here? Does it hang? Do you know where it hangs? I'm guessing the 429 error you discovered leaves this dataset in a bad state somehow.
The job finishes with (in this case) one failure and a few dependent steps not run. The problems with how this goes down are these:
What do you mean by "get stuck"?
🐛 Describe the bug
Run the catwalk training job specified here: allenai/catwalk@5ba0192
The command line is:
tango --settings experiments/train_all_the_things/tango.yml run experiments/train_all_the_things/train_all_the_things.jsonnet
It will run for quite a while. Two jobs will usually fail; I don't know why it's always two.
The actual problem here is the undiagnosable failure of the original jobs, and then the undiagnosable hang (which can be fixed if you remove the right uncommitted datasets). From a user perspective, having all those uncommitted datasets doesn't matter as such, but I suspect that it points to a deeper issue.
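Since removing the right uncommitted datasets fixes the hang, a re-run could in principle unblock itself by finding them first. A minimal sketch of the filtering step, assuming each dataset record exposes an `id` and an optional `committed` timestamp (these field names are assumptions for illustration, not the real beaker-py schema):

```python
from typing import List


def find_uncommitted(datasets: List[dict]) -> List[str]:
    """Return the IDs of datasets that were never committed.

    Each dict stands in for a Beaker dataset record; 'committed' is
    assumed to be absent or None for datasets left uncommitted by a
    failed or interrupted job.
    """
    return [d["id"] for d in datasets if d.get("committed") is None]
```

The resulting IDs could then be deleted by hand (or by the executor itself) before re-running, rather than hunting for the stale datasets in the workspace UI.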
Versions
asd