Observed several workflow status update failures on a REANA 0.8.0 cluster. The workflow runtime pod run-batch-... died, but the workflow is still reported as "running" in the DB.
Here are the logs from the job-status-consumer for one such example:
```console
$ kubectl logs reana-workflow-controller-7cf76bc4d9-gvxfs job-status-consumer | grep -C 10 wwwwww
2022-01-11 18:47:15,763 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:48:02,674 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:48:49,676 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:49:05,179 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:52:26,842 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:52:26,884 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.failed
du: cannot access '/var/reana/users/uuuuuu/workflows/wwwwww/workflow_ph/pythia_16/mg_processes/signal/Events/run_01/unweighted_events.lhe': No such file or directory
2022-01-11 18:52:34,644 | root | MainThread | ERROR | Unexpected error while processing workflow: Command '['du', '-s', '-b', '/var/reana/users/uuuuuu/workflows/wwwwww']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 104, in on_message
    _update_workflow_status(workflow, next_status, logs)
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 142, in _update_workflow_status
    Workflow.update_workflow_status(Session, workflow.id_, status, logs, None)
  File "/usr/local/lib/python3.8/site-packages/reana_db/models.py", line 664, in update_workflow_status
    raise e
  File "/usr/local/lib/python3.8/site-packages/reana_db/models.py", line 659, in update_workflow_status
    workflow.status = status
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 279, in __set__
--
  File "/usr/local/lib/python3.8/site-packages/reana_db/utils.py", line 273, in get_disk_usage_or_zero
    disk_bytes = get_disk_usage(workspace_path, summarize=True)
  File "/usr/local/lib/python3.8/site-packages/reana_commons/utils.py", line 305, in get_disk_usage
    disk_usage_info = get_disk_usage_info_paths(directory, command, name_filter)
  File "/usr/local/lib/python3.8/site-packages/reana_commons/utils.py", line 270, in get_disk_usage_info_paths
    disk_usage_info = subprocess.check_output(command).decode().split()
  File "/usr/local/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/local/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['du', '-s', '-b', '/var/reana/users/uuuuuu/workflows/wwwwww']' returned non-zero exit status 1.
```
Note the exception raised by the du command. As a result, the status of workflow wwwwww was not updated in the DB, and it is still reported as "running" there.
Since we consider the DB to be the single source of truth for workflow statuses, it would be good to catch all such problems, not only the du-style exception but any other exception that may occur, so that the workflow status in the DB is always updated and matches the K8s pod status.
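To illustrate the defensive direction being asked for, here is a minimal sketch (not the actual reana_db/reana_commons code) of a disk-usage helper that falls back to zero when `du` fails, so that quota bookkeeping cannot abort the status update. The function name echoes get_disk_usage_or_zero from the traceback, but the body is only an assumption of what such a fallback could look like:

```python
# Minimal sketch, standard library only: measure the workspace size
# defensively so a failing `du` (e.g. files removed while the workflow
# is being cleaned up) cannot abort the workflow status update.
import logging
import subprocess


def get_disk_usage_or_zero(workspace_path):
    """Return the workspace size in bytes, or 0 if `du` fails."""
    try:
        output = subprocess.check_output(["du", "-s", "-b", workspace_path])
        return int(output.decode().split()[0])
    except (subprocess.CalledProcessError, OSError, ValueError):
        logging.exception("Could not measure disk usage of %s", workspace_path)
        return 0
```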
The fastest solution for this particular issue is to catch exceptions in the workflow_status_change_listener function in reana_db/models.py; this looks like the only hook on the Workflow model.
Ensuring that we never hit such exceptions and that the DB always matches the K8s pod status will require some more work.
I will prepare a PR to fix workflow_status_change_listener, and I will also check job-status-consumer and other places where we might end up with an inconsistent state.
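For concreteness, a minimal sketch of what catching exceptions in the listener could look like, assuming workflow_status_change_listener is (roughly) a SQLAlchemy "set" event hook on Workflow.status, as the traceback suggests; the _update_quotas placeholder stands in for the real bookkeeping:

```python
# Sketch, not the actual reana_db code: wrap the status-change hook so
# that any failure in its bookkeeping is logged instead of aborting the
# DB commit that records the new workflow status.
import logging

from sqlalchemy import event

from reana_db.models import Workflow


def _update_quotas(workflow):
    """Placeholder for the real bookkeeping (disk usage via `du`, etc.)."""


@event.listens_for(Workflow.status, "set")
def workflow_status_change_listener(workflow, new_status, old_status, initiator):
    try:
        _update_quotas(workflow)
    except Exception:
        # Never let bookkeeping failures block the status transition itself.
        logging.exception(
            "Post-status-change bookkeeping failed for workflow %s "
            "(new status: %s); updating the status anyway",
            workflow.id_,
            new_status,
        )
```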
Instead of catching exceptions in the event listener, another possibility is to improve consumer.py.
If the event listener fails and aborts the DB commit, we could catch this in consumer.py and re-queue the message. This assumes that the error will be gone on the next try. In addition, we would also need a mechanism to track how many times a message has been retried and do something when a limit is reached (possibly related to #423); see the sketch below.
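A rough sketch of the re-queue-with-retry-limit idea, assuming a kombu-style on_message() callback; MAX_RETRIES, the "x-retry-count" header, and _process_status_update() are illustrative assumptions, not existing code:

```python
# Sketch of re-queueing a failed status update with a bounded retry count.
import logging

MAX_RETRIES = 3  # assumed limit before giving up on a status update


def _process_status_update(body):
    """Placeholder for the real status-update logic in consumer.py."""


def on_message(body, message):
    retries = (message.headers or {}).get("x-retry-count", 0)
    try:
        _process_status_update(body)
        message.ack()
    except Exception:
        if retries < MAX_RETRIES:
            logging.exception(
                "Status update failed, re-queueing (retry %d/%d)",
                retries + 1,
                MAX_RETRIES,
            )
            # kombu's Message.requeue() puts the message back unchanged; to
            # actually bump "x-retry-count" one would re-publish a copy of
            # the message with the incremented header instead.
            message.requeue()
        else:
            # Retry budget exhausted: acknowledge so the message does not
            # loop forever, and leave a trace for operators (cf. #423).
            logging.error("Giving up on status update after %d retries", retries)
            message.ack()
```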
I think the easier solution is better for now, though then, in case of an error, the disk quota will only be updated on the next workflow run or by the nightly updater, which might not be optimal.