Running a Beaker Executor job leaves loads of uncommitted datasets in the workspace #386
Comments
I did another run like this. This time only one of them failed. The results table points me to this dataset: https://beaker.org/ds/01GC7PCX5M9B357GX6YJFY7C5R/details. It is clearly incomplete, and its presence will prevent a re-run from succeeding. I know of no way to find the logs that would show me why it failed.
Ah! I can search experiments by name, which reveals this error message:
So that explains why it's always two that fail: we're writing to Beaker too quickly.
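The rate-limit explanation above suggests an obvious client-side mitigation: back off and retry on HTTP 429 instead of letting the step fail. A minimal sketch, using a `RateLimitError` stand-in for whatever exception the Beaker client actually raises on a 429 (the real exception type and any `Retry-After` handling may differ):

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the HTTP 429 error raised by the Beaker client."""


def with_backoff(fn, max_retries=5, base_delay=0.1):
    """Call fn(), retrying on RateLimitError with exponential backoff.

    Sleeps base_delay * 2**attempt seconds (plus jitter) between tries,
    and re-raises once max_retries is exhausted.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Wrapping each write to Beaker in something like `with_backoff` would smooth out bursts without changing the executor's logic.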
I will leave all this up as it is, in case it helps with debugging.
What happens here? Does it hang? Do you know where it hangs? I'm guessing the 429 error you discovered leaves this dataset in a bad state somehow.
The job finishes with (in this case) one failure and a few dependent steps not run. The problems with how this goes down are these:
What do you mean by "get stuck"?
🐛 Describe the bug
Run the catwalk training job specified here: allenai/catwalk@5ba0192
The command line is:
tango --settings experiments/train_all_the_things/tango.yml run experiments/train_all_the_things/train_all_the_things.jsonnet
It will run for quite a while. Two jobs will usually fail; I don't know why it's always two.
The actual problem here is the undiagnosable failure of the original jobs, and then the undiagnosable hang (which can be fixed if you remove the right uncommitted datasets). From a user perspective, having all those uncommitted datasets doesn't matter as such, but I suspect that it points to a deeper issue.
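Since removing the right uncommitted datasets fixes the hang, a re-run could in principle unblock itself by finding them first. A minimal sketch of the filtering step, assuming each dataset record exposes an `id` and an optional `committed` timestamp (these field names are assumptions for illustration, not the real beaker-py schema):

```python
from typing import List


def find_uncommitted(datasets: List[dict]) -> List[str]:
    """Return the IDs of datasets that were never committed.

    Each dict stands in for a Beaker dataset record; 'committed' is
    assumed to be absent or None for datasets left uncommitted by a
    failed or interrupted job.
    """
    return [d["id"] for d in datasets if d.get("committed") is None]
```

The resulting IDs could then be deleted by hand (or by the executor itself) before re-running, rather than hunting for the stale datasets in the workspace UI.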
Versions
asd