Observed several workflow status update failures on a REANA 0.8.0 cluster. The workflow runtime pod run-batch-... died, but the workflow is still reported as "running" in the DB.
Here are the logs from the job-status-consumer for one such example:
```console
$ kubectl logs reana-workflow-controller-7cf76bc4d9-gvxfs job-status-consumer | grep -C 10 wwwwww
2022-01-11 18:47:15,763 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:48:02,674 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:48:49,676 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:49:05,179 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:52:26,842 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:52:26,884 | root | MainThread | INFO | [x] Received workflow_uuid: wwwwww status: RunStatus.failed
du: cannot access '/var/reana/users/uuuuuu/workflows/wwwwww/workflow_ph/pythia_16/mg_processes/signal/Events/run_01/unweighted_events.lhe': No such file or directory
2022-01-11 18:52:34,644 | root | MainThread | ERROR | Unexpected error while processing workflow: Command '['du', '-s', '-b', '/var/reana/users/uuuuuu/workflows/wwwwww']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 104, in on_message
    _update_workflow_status(workflow, next_status, logs)
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 142, in _update_workflow_status
    Workflow.update_workflow_status(Session, workflow.id_, status, logs, None)
  File "/usr/local/lib/python3.8/site-packages/reana_db/models.py", line 664, in update_workflow_status
    raise e
  File "/usr/local/lib/python3.8/site-packages/reana_db/models.py", line 659, in update_workflow_status
    workflow.status = status
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 279, in __set__
--
  File "/usr/local/lib/python3.8/site-packages/reana_db/utils.py", line 273, in get_disk_usage_or_zero
    disk_bytes = get_disk_usage(workspace_path, summarize=True)
  File "/usr/local/lib/python3.8/site-packages/reana_commons/utils.py", line 305, in get_disk_usage
    disk_usage_info = get_disk_usage_info_paths(directory, command, name_filter)
  File "/usr/local/lib/python3.8/site-packages/reana_commons/utils.py", line 270, in get_disk_usage_info_paths
    disk_usage_info = subprocess.check_output(command).decode().split()
  File "/usr/local/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/local/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['du', '-s', '-b', '/var/reana/users/uuuuuu/workflows/wwwwww']' returned non-zero exit status 1.
```
Note the exception raised by the du command. As a result, the status of workflow wwwwww was not updated in the DB, and it is still reported as "running" there.
Since we consider the DB to be the single source of truth for workflow statuses, it would be good to catch all such problems, not only the du-style exception but any other exception that may occur, so that the workflow status in the DB is always updated and matches the K8s pod status.
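To illustrate the defensive direction being asked for, here is a minimal sketch (not the actual reana_db/reana_commons code) of a disk-usage helper that falls back to zero when `du` fails, so that quota bookkeeping cannot abort the status update. The function name echoes get_disk_usage_or_zero from the traceback, but the body is only an assumption of what such a fallback could look like:

```python
# Minimal sketch, standard library only: measure the workspace size
# defensively so a failing `du` (e.g. files removed while the workflow
# is being cleaned up) cannot abort the workflow status update.
import logging
import subprocess


def get_disk_usage_or_zero(workspace_path):
    """Return the workspace size in bytes, or 0 if `du` fails."""
    try:
        output = subprocess.check_output(["du", "-s", "-b", workspace_path])
        return int(output.decode().split()[0])
    except (subprocess.CalledProcessError, OSError, ValueError):
        logging.exception("Could not measure disk usage of %s", workspace_path)
        return 0
```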
The fastest solution for this particular issue is to catch exceptions in the workflow_status_change_listener function in reana_db/models.py; this looks like the only hook on the Workflow model.
Ensuring that we never hit such exceptions and that the DB always matches the K8s pod status will require some more work.
I will prepare a PR to fix workflow_status_change_listener, and I will also check job-status-consumer and other places where we might end up with an inconsistent state.
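For concreteness, a minimal sketch of what catching exceptions in the listener could look like, assuming workflow_status_change_listener is (roughly) a SQLAlchemy "set" event hook on Workflow.status, as the traceback suggests; the _update_quotas placeholder stands in for the real bookkeeping:

```python
# Sketch, not the actual reana_db code: wrap the status-change hook so
# that any failure in its bookkeeping is logged instead of aborting the
# DB commit that records the new workflow status.
import logging

from sqlalchemy import event

from reana_db.models import Workflow


def _update_quotas(workflow):
    """Placeholder for the real bookkeeping (disk usage via `du`, etc.)."""


@event.listens_for(Workflow.status, "set")
def workflow_status_change_listener(workflow, new_status, old_status, initiator):
    try:
        _update_quotas(workflow)
    except Exception:
        # Never let bookkeeping failures block the status transition itself.
        logging.exception(
            "Post-status-change bookkeeping failed for workflow %s "
            "(new status: %s); updating the status anyway",
            workflow.id_,
            new_status,
        )
```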
Instead of catching exceptions in the event listener, another possibility is to improve consumer.py.
If the event listener fails and aborts the DB commit, we could catch this in consumer.py and re-queue the message. This assumes that the error will be gone on the next try. In addition, we would also need a mechanism to track how many times a message has been retried and do something when a limit is reached (possibly related to #423); see the sketch below.
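A rough sketch of the re-queue-with-retry-limit idea, assuming a kombu-style on_message() callback; MAX_RETRIES, the "x-retry-count" header, and _process_status_update() are illustrative assumptions, not existing code:

```python
# Sketch of re-queueing a failed status update with a bounded retry count.
import logging

MAX_RETRIES = 3  # assumed limit before giving up on a status update


def _process_status_update(body):
    """Placeholder for the real status-update logic in consumer.py."""


def on_message(body, message):
    retries = (message.headers or {}).get("x-retry-count", 0)
    try:
        _process_status_update(body)
        message.ack()
    except Exception:
        if retries < MAX_RETRIES:
            logging.exception(
                "Status update failed, re-queueing (retry %d/%d)",
                retries + 1,
                MAX_RETRIES,
            )
            # kombu's Message.requeue() puts the message back unchanged; to
            # actually bump "x-retry-count" one would re-publish a copy of
            # the message with the incremented header instead.
            message.requeue()
        else:
            # Retry budget exhausted: acknowledge so the message does not
            # loop forever, and leave a trace for operators (cf. #423).
            logging.error("Giving up on status update after %d retries", retries)
            message.ack()
```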
I think the easier solution is better for now, though then, in case of an error, the disk quota will only be updated on the next workflow run or by the nightly updater, which might not be optimal.