Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workflows status update issue: DB says 'running' even though K8S pod failed #430

Closed
tiborsimko opened this issue Jan 25, 2022 · 3 comments · Fixed by reanahub/reana-db#159
Assignees

Comments

@tiborsimko
Copy link
Member

Observed several workflow status update failures on a REANA 0.8.0 cluster. The workflow runtime pod run-batch-... died, but the workflow is still reported as being "running" in the DB.

Here are logs from job status consumer for one such example:

$ kubectl logs reana-workflow-controller-7cf76bc4d9-gvxfs job-status-consumer | grep -C 10 wwwwww
2022-01-11 18:47:15,763 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:48:02,674 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:48:49,676 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:49:05,179 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:52:26,842 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:52:26,884 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.failed
du: cannot access '/var/reana/users/uuuuuu/workflows/wwwwww/workflow_ph/pythia_16/mg_processes/signal/Events/run_01/unweighted_events.lhe': No such file or directory
2022-01-11 18:52:34,644 | root | MainThread | ERROR | Unexpected error while processing workflow: Command '['du', '-s', '-b', '/var/reana/users/uuuuuu/workflows/wwwwww']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 104, in on_message
    _update_workflow_status(workflow, next_status, logs)
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 142, in _update_workflow_status
    Workflow.update_workflow_status(Session, workflow.id_, status, logs, None)
  File "/usr/local/lib/python3.8/site-packages/reana_db/models.py", line 664, in update_workflow_status
    raise e
  File "/usr/local/lib/python3.8/site-packages/reana_db/models.py", line 659, in update_workflow_status
    workflow.status = status
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 279, in __set__
--
  File "/usr/local/lib/python3.8/site-packages/reana_db/utils.py", line 273, in get_disk_usage_or_zero
    disk_bytes = get_disk_usage(workspace_path, summarize=True)
  File "/usr/local/lib/python3.8/site-packages/reana_commons/utils.py", line 305, in get_disk_usage
    disk_usage_info = get_disk_usage_info_paths(directory, command, name_filter)
  File "/usr/local/lib/python3.8/site-packages/reana_commons/utils.py", line 270, in get_disk_usage_info_paths
    disk_usage_info = subprocess.check_output(command).decode().split()
  File "/usr/local/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/local/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['du', '-s', '-b', '/var/reana/users/uuuuuu/workflows/wwwwww']' returned non-zero exit status 1.

Note the exception due to du command. As a result, the status of workflow wwwwww wasn't updated in the DB, and it is still reported as "running" there.

Since we consider DB to be the single-source-of-truth for the status of workflows, it'll be good to catch all these possible problems, not only for the du style exception, but for any other exception that may occur, so that the workflow status in DB is properly updated and corresponds to the K8S pod status.

@VMois
Copy link

VMois commented Jan 25, 2022

Possibly related issue - reanahub/reana-commons#303

@VMois VMois self-assigned this Jan 28, 2022
@VMois
Copy link

VMois commented Jan 28, 2022

The fastest solution for this particular issue is to catch exceptions in the workflow_status_change_listener function in reana_db/models.py. This looks like the only hook on the Workflow model.

Ensuring that we do not have any such exception and DB always equals k8s pod status will require some more work.

I will prepare a PR to fix workflow_status_change_listener and, also, I will check job-status-consumer and where we might have problems with an inconsistent state.

@VMois
Copy link

VMois commented Jan 31, 2022

Instead of catching exceptions in the event listener, another possibility is to improve consumer.py.

In case the event listener fails and aborts DB commit, we can catch this in consumer.py and re-queue a message. This assumes that the error will be gone on the next try. In addition, we will also need to implement the mechanism to check how many times the message was retried and do something when a limit is reached. (can be related to #423)

I think an easier solution is better for now. But, in case of error, disk quota will only be updated on the next workflow run or by nightly updater which might not be optimal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants