reana-run-batch NotReady due to job-controller monitor TypeError #357

Open
tiborsimko opened this issue Mar 23, 2022 · 3 comments
@tiborsimko
Member

Seen with REANA 0.8.

A workflow finished fine at time T1, meaning that the workspace was properly populated, and the reana-run-batch workflow-engine logs say:

2022-03-22 20:36:00,185 | adage | MainThread | INFO | workflow completed successfully.
2022-03-22 20:36:00,185 | yadage.steering_api | MainThread | INFO | done. dumping workflow to disk.
2022-03-22 20:36:00,191 | yadage.steering_api | MainThread | INFO | visualizing workflow.
2022-03-22 20:36:01,135 | reana-workflow-engine-yadage | MainThread | INFO | Workflow finished. Publishing status...
2022-03-22 20:36:01,137 | root | MainThread | INFO | Workflow 2e443b44-89da-447b-8898-7c9890fa3569 finished. Files available at /var/reana/users/UUUU/workflows/2e443b44-89da-447b-8898-7c9890fa3569.

However, the job-controller logs say there was an error:

2022-03-22 20:34:57,271 | werkzeug | Thread-3574 | INFO | 127.0.0.1 - - [22/Mar/2022 20:34:57] "GET /jobs/5119bd03-6045-409a-9ea5-0552be0832f7 HTTP/1.1" 200 -
2022-03-22 20:35:14,187 | werkzeug | Thread-3575 | INFO | 127.0.0.1 - - [22/Mar/2022 20:35:14] "GET /jobs/5119bd03-6045-409a-9ea5-0552be0832f7 HTTP/1.1" 200 -
2022-03-22 20:35:23,799 | root | kubernetes_job_monitor | INFO | New Pod event received: MODIFIED
2022-03-22 20:35:23,799 | root | kubernetes_job_monitor | INFO | Kubernetes job id: reana-run-job-29e54c41-3385-4d7d-963d-1689693b1547 succeeded.
2022-03-22 20:35:23,799 | root | kubernetes_job_monitor | INFO | Kubernetes job id: reana-run-job-29e54c41-3385-4d7d-963d-1689693b1547 succeeded.
2022-03-22 20:35:23,799 | root | kubernetes_job_monitor | INFO | Grabbing pod reana-run-job-29e54c41-3385-4d7d-963d-1689693b1547-smxk7 logs ...
2022-03-22 20:35:23,916 | root | kubernetes_job_monitor | INFO | Storing job logs: 5119bd03-6045-409a-9ea5-0552be0832f7
2022-03-22 20:35:23,959 | root | kubernetes_job_monitor | INFO | Cleaning Kubernetes job reana-run-job-29e54c41-3385-4d7d-963d-1689693b1547 ...
2022-03-22 20:35:24,014 | root | kubernetes_job_monitor | INFO | New Pod event received: MODIFIED
2022-03-22 20:35:24,014 | root | kubernetes_job_monitor | INFO | Kubernetes job id: reana-run-job-29e54c41-3385-4d7d-963d-1689693b1547 succeeded.
2022-03-22 20:35:24,018 | root | kubernetes_job_monitor | INFO | New Pod event received: DELETED
2022-03-22 20:35:24,018 | root | kubernetes_job_monitor | INFO | Kubernetes job id: reana-run-job-29e54c41-3385-4d7d-963d-1689693b1547 succeeded.
2022-03-22 20:35:31,097 | werkzeug | Thread-3576 | INFO | 127.0.0.1 - - [22/Mar/2022 20:35:31] "GET /jobs/5119bd03-6045-409a-9ea5-0552be0832f7 HTTP/1.1" 200 -
2022-03-22 21:28:26,138 | root | kubernetes_job_monitor | INFO | New Pod event received: ERROR
2022-03-22 21:28:26,140 | root | kubernetes_job_monitor | ERROR | Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/reana_job_controller/job_monitor.py", line 281, in watch_jobs
    if self.should_process_job(job_pod):
  File "/usr/local/lib/python3.8/site-packages/reana_job_controller/job_monitor.py", line 129, in should_process_job
    backend_job_id = self.get_backend_job_id(job_pod)
  File "/usr/local/lib/python3.8/site-packages/reana_job_controller/job_monitor.py", line 97, in get_backend_job_id
    return job_pod.metadata.labels["job-name"]
TypeError: 'NoneType' object is not subscriptable

2022-03-22 21:28:26,140 | root | kubernetes_job_monitor | ERROR | Unexpected error: 'NoneType' object is not subscriptable

Due to this, the reana-run-batch pod stays in the NotReady state and the DB still shows the workflow as "running".

Such a "stuck" workflow leads to scheduling troubles.

We should make the job monitor handle this exception gracefully.
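
A minimal defensive sketch of what such handling could look like, written here as standalone functions rather than the actual monitor methods shown in the traceback (how exactly this would be wired into `job_monitor.py` is an assumption):

```python
import logging


def get_backend_job_id(job_pod):
    """Return the Kubernetes job name from the pod's labels, or None if absent."""
    # ERROR watch events can deliver an object whose labels are None,
    # which is what triggered the TypeError above.
    labels = getattr(job_pod.metadata, "labels", None) or {}
    return labels.get("job-name")


def should_process_job(job_pod):
    """Skip watch events that do not carry a 'job-name' label."""
    backend_job_id = get_backend_job_id(job_pod)
    if backend_job_id is None:
        logging.warning("Received pod event without a 'job-name' label; skipping.")
        return False
    # ... the original checks on backend_job_id would follow here ...
    return True
```

Alternatively, the watch loop could check the event type and skip ERROR events before inspecting the pod object at all.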

Note also the huge time delay between the workflow termination (20:36:01) and the pod ERROR event (21:28:26).

@tiborsimko
Member Author

Note: I saw the same problem for several workflows whose status was still "pending".

@tiborsimko
Member Author

There were also a few cases of the following error, which does not help either:

2022-03-22 21:31:24,944 | root | MainThread | INFO |  [x] Received workflow_uuid: 30c66edb-75f7-4760-92a6-f883c56686af status: RunStatus.running
2022-03-22 21:31:24,970 | root | MainThread | INFO |  [x] Received workflow_uuid: 30c66edb-75f7-4760-92a6-f883c56686af status: RunStatus.finished
2022-03-22 21:34:05,407 | root | MainThread | ERROR | Unexpected error while processing workflow: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 83, in on_message
    message.ack()
  File "/usr/local/lib/python3.8/site-packages/kombu/message.py", line 123, in ack
    self.channel.basic_ack(self.delivery_tag, multiple=multiple)
  File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 1395, in basic_ack
    return self.send_method(
  File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 57, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/usr/local/lib/python3.8/site-packages/amqp/method_framing.py", line 183, in write_frame
    write(view[:offset])
  File "/usr/local/lib/python3.8/site-packages/amqp/transport.py", line 363, in write
    self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
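
As a hedged sketch (not the actual fix), the ack step could tolerate a dropped broker connection instead of letting the error bubble up; the broker would then redeliver the unacked message, which is only safe if the consumer is idempotent, an assumption here:

```python
import logging


def safe_ack(message):
    """Acknowledge a kombu message, tolerating a reset broker connection.

    If the connection was reset, the delivery tag is no longer valid, so the
    message is left unacked and the broker will redeliver it later.
    """
    try:
        message.ack()
    except ConnectionResetError as exc:
        logging.error("Could not ack message, it will be redelivered: %s", exc)
```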

@VMois

VMois commented Mar 24, 2022

It looks like the job-controller container keeps running after the error (it has a "Running" status). It is actually the workflow-engine container that has finished, which causes the pod's "Completed" or "NotReady" state.

My assumption is that the workflow-engine sends a message to the job-status queue at the end, but the job-status-consumer fails to process it. We have examples where the quota update fails (related: reanahub/reana-commons#303); this might be the cause of some of the "NotReady" batch pods.
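
If that is the case, one mitigation could be to decouple the two steps so that a failing quota update cannot prevent the final status from being recorded. A rough sketch with placeholder callables (`update_status` and `update_quota` are not the real function names):

```python
import logging


def process_status_message(workflow_uuid, status, update_status, update_quota):
    """Record the final workflow status even if the quota update fails.

    `update_status` and `update_quota` are placeholder callables standing in
    for whatever the real job-status consumer does.
    """
    update_status(workflow_uuid, status)
    try:
        update_quota(workflow_uuid)
    except Exception:
        logging.exception(
            "Quota update failed for workflow %s; status %s was still recorded",
            workflow_uuid,
            status,
        )
```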
