executor, add job status polling #34

VMois · 2021-12-16T17:16:45Z

closes #33

Initial implementation of this PR relied on both file checking and polling in two separate threads. I don't think it was a good idea, so the final version completely relies on the polling job controller.

How to test:

Check out this branch and ensure that snakemake-engine Docker image is updated.
There are 3 main scenarios:

workflow success,
workflow failed due to an error inside workflow (e.g, incorrect command),
workflow failed because k8s or something else decided to stop it (e.g, due to timeout).

You can easily test the first two scenarios. The third one requires more setup, check the review notes below for more details. The most important thing is to check that no bugs were introduced.

Check if the progress state is displayed correctly in reana-client list --include-progress
Check if engine and jobs logs are displayed

Review notes:

I checked that the kubernetes_job_timeout parameter will work with this PR
I tested it on reana-demo-cms-h4l and reana-demo-root6 workflows

codecov-commenter · 2021-12-16T17:18:49Z

Codecov Report

Merging #34 (ae76a58) into master (7269f5a) will decrease coverage by 0.66%.
The diff coverage is 0.00%.

@@            Coverage Diff            @@
##           master     #34      +/-   ##
=========================================
- Coverage    4.92%   4.26%   -0.67%     
=========================================
  Files           6       6              
  Lines         142     164      +22     
=========================================
  Hits            7       7              
- Misses        135     157      +22

Impacted Files	Coverage Δ
reana_workflow_engine_snakemake/executor.py	`0.00% <0.00%> (ø)`

reana_workflow_engine_snakemake/executor.py

reana_workflow_engine_snakemake/config.py

reana_workflow_engine_snakemake/executor.py

mvidalgarcia

Job status polling works as expected, left some commets.

The commit message doesn't follow our convention.

Tested the following scenarios:

Successful workflow
Failed workflow due to wrong command in a step
Failed workflow due to timeout (I had to copy-paste the diff from here, I guess it's OK?)
Failed workflow due to wrong params in Snakefile: The logs are not properly propagated to reana-client.

Workflow engine logs:

...
2022-01-07 15:26:49,318 | snakemake.logging | MainThread | ERROR | RuleException in line 19 of /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/206c2327-0cf7-4ba7-b580-6e08413d72dd/workflow/snakemake/Snakefile:
'Params' object has no attribute 'events'  File "/code/reana_workflow_engine_snakemake/executor.py", line 60, in run
...

reana-client logs:

==> Workflow engine logs
Workflow exited unexpectedly.

The logs from failed jobs are still gone, I assume is part of this issue.

mvidalgarcia · 2022-01-07T12:37:13Z

CHANGES.rst

+Version 0.8.1 (UNRELEASED)
+---------------------------
+
+- Uses polling job-controller to obtain job statuses instead of checking files.


"Uses" is not part of our changelog vocabulary (Adds/Changes/Fixes/Removes).

What do you propose in this case? Some ideas:

two lines like:

- Removes checking files to determine job status - Adds polling job-controller to determine job statuses

one like:

- Adds polling job-controller to determine job statuses instead of checking files

I went with the second point to have related changes in one line.

mvidalgarcia · 2022-01-07T12:43:05Z

reana_workflow_engine_snakemake/executor.py

@@ -154,36 +143,79 @@ def _get_container_image(self, job: Job) -> str:
            log.info(f"No environment specified, falling back to: {container_image}")
        return container_image

-    def _handle_job_status(self, job, status):
+    def _handle_job_status(
+        self, job, job_status: JobStatus, workflow_status: RunStatus


Can we have the job type here? Same applies for other methods.

mvidalgarcia · 2022-01-07T12:47:06Z

reana_workflow_engine_snakemake/executor.py

+
+    def _get_job_status_from_controller(self, job_id: str) -> str:
+        response = self.rjc_api_client.check_status(job_id)
+        return response["status"]


I would use response.get("status") just in case.

I think if r-j-c returns 500 error, response["status"] will fail and trigger workflow error. If response.get("status") is used None will be returned and workflow will be stuck in running status. Let me check when r-j-c might throw 500 errors.

And a fun thing I have just discovered, you cannot use response.get cause response is not a dictionary. It is the object Job generated by bravado I guess.

I will use try/catch with AttributeError to check if the status field exists. This is a rare event as it can only be caused by a response that doesn't contain 200 or 404 errors.

reana_workflow_engine_snakemake/executor.py

mvidalgarcia · 2022-01-07T12:53:41Z

reana_workflow_engine_snakemake/executor.py

+
+                try:
+                    status = self._get_job_status_from_controller(job_id)
+                except Exception as error:


Too broad, maybe we can be more specific catching the exceptions?

I do not know what more specific exception to use here as I don't know what possible errors _get_job_status_from_controller can throw. I will try to find something, but if not, I will keep a broad exception for safety. Do you have any ideas?

Three scenarios I can think of:

network failure between engine and r-j-c, unlikely as they are on the same machine;

response 404, handled by r-j-c client so need to catch "HTTPNotFound" exception

response 500, not handled by r-j-c client, it is actually silenced (line); in this case, the response content will not be correct, so response["status"] might throw a key error. But, it can be fixed with .get("status").

I tried to make try/catch blocks more specific and clear. I moved this logic to get_job_status_from_controller function. In case of an error, the function will return failed status for a job.

Note: rjc_api_client.check_status method is misleading. Docstring says that method checks the status of a job, but it actually returns the response of the get_job API with some additional checks for 404 errors. Just wanted to point that out.

reana_workflow_engine_snakemake/executor.py

VMois · 2022-01-09T21:54:38Z

The logs from failed jobs are still gone, I assume is part of this issue.

Not sure about that, might be more related to this issue. Could you, please, share what did you do exactly to get this error?

mvidalgarcia · 2022-01-10T11:22:43Z

The logs from failed jobs are still gone, I assume is part of this issue.

Not sure about that, might be more related to this issue. Could you, please, share what did you do exactly to get this error?

Yeah, might be. This problem wasn't there before because we performed many tests involving wrong memory values and it was reported back to the user. Here is the validation logic. Could it be related to the new changes?

In addition, checking the workflow-engine logs, I see that when the job is reported as failed, we don't capture its ID properly.

$ kubectl logs reana-run-batch-foo workflow-engine
...
2022-01-10 11:17:44,222 | reana-workflow-engine-snakemake | MainThread | INFO | Job 'helloworld' received, command: python code/helloworld.py --inputfile data/names.txt --outputfile results/greetings.txt --sleeptime 0
2022-01-10 11:17:44,222 | reana-workflow-engine-snakemake | MainThread | INFO | Environment: python:2.7-slim
2022-01-10 11:17:44,504 | reana-workflow-engine-snakemake | MainThread | ERROR | Error submitting job helloworld: Job submission error: The "kubernetes_memory_limit" provided 128MB has wrong format.
2022-01-10 11:17:44,507 | reana-workflow-engine-snakemake | MainThread | INFO | failed job: None
...

VMois · 2022-01-10T12:03:40Z

This problem wasn't there before because we performed many tests involving wrong memory values and it was reported back to the user.

It is reported in Yadage and Serial but not in Snakemake (and, I think, also CWL). You can test a wrong value with Snakemake master branch. There will be no logs with reana-client logs.

In addition, checking the workflow-engine logs, I see that when the job is reported as failed, we don't capture its ID properly.

Same, it happens with the master branch too. This is due to a reason that submission fails so the job doesn't have an ID.

mvidalgarcia · 2022-01-11T11:47:10Z

reana_workflow_engine_snakemake/executor.py

        workflow_uuid = os.getenv("workflow_uuid", "default")
        job_id = job.reana_job_id
-        log.info(f"{status} job: {job_id}")
+        log.info(f"{job_status} job: {job_id}")


This prints something like JobStatus.finished job: af5b1b8d-6683-46ec-9fd9-f40b27e7b0b9 or JobStatus.failed job: None. Should we access .name?

Perhaps for the failed ones we can print the name instead, or always both id and name? It's not very helpful to get None there.

Good catch. I aligned JobStatus here to match JobStatus enum in reana-db in case we will unify stuff later and forgot to update this line.

It should be JobStatus.finished job: <job-id> but I will change message to f"{job.name} job is {job_status.name}. job id: {job_id}". It will look like "helloworld job is failed. job_id: af5b1b8d-6683-46ec-9fd9-f40b27e7b0b9" or "helloworld job is failed. job_id: None".

mvidalgarcia · 2022-01-11T11:51:19Z

reana_workflow_engine_snakemake/executor.py

        """Override job success method to publish job status."""
        # override handle_touch = True, to enable `touch()` in Snakefiles
+        # `touch()` is responsible for checking output files existence


Not really, touch() is just the Unix touch that creates empty files, so it can denote that a certain task was completed.

If I will disable the touch() (handle_touch=True) then the CMS-h4l demo is failing so I think it is checking output files.

Found handle_touch function in Snakemake - https://github.com/snakemake/snakemake/blob/987282dde8a2db5174414988c134a39ae8836a61/snakemake/dag.py#L580

I meant the docstring, it doesn't seem 100% accurate to me. No comments about functionality, we need to keep handle_touch=True as you've described.

# `touch()` is responsible for checking output files existence

So, what do you propose, remove the comment or rewrite it to be more accurate?

- poll job-controller for job statuses instead of checking .jobfinished/.jobfailed files. closes reanahub#33

VMois force-pushed the poll-job-status branch from ae76a58 to 8c8172a Compare December 16, 2021 17:58

VMois commented Dec 16, 2021

View reviewed changes

reana_workflow_engine_snakemake/executor.py Outdated Show resolved Hide resolved

VMois force-pushed the poll-job-status branch 2 times, most recently from 5622fbe to 45d6aa3 Compare December 17, 2021 11:59

mvidalgarcia reviewed Dec 17, 2021

View reviewed changes

reana_workflow_engine_snakemake/config.py Show resolved Hide resolved

VMois force-pushed the poll-job-status branch 3 times, most recently from 14ecae2 to bf5c9b6 Compare December 20, 2021 09:51

VMois requested a review from mvidalgarcia December 20, 2021 09:55

VMois marked this pull request as ready for review December 20, 2021 09:55

VMois force-pushed the poll-job-status branch from bf5c9b6 to 3d2b469 Compare January 5, 2022 14:29

VMois changed the title ~~executor, add job status polling~~ WIP: executor, add job status polling Jan 5, 2022

VMois removed the request for review from mvidalgarcia January 5, 2022 14:32

VMois force-pushed the poll-job-status branch 2 times, most recently from b40ed11 to aef617c Compare January 5, 2022 17:09

VMois commented Jan 5, 2022

View reviewed changes

reana_workflow_engine_snakemake/executor.py Outdated Show resolved Hide resolved

VMois commented Jan 5, 2022

View reviewed changes

reana_workflow_engine_snakemake/executor.py Outdated Show resolved Hide resolved

VMois force-pushed the poll-job-status branch 2 times, most recently from 7e0b09d to 7f4a8b4 Compare January 6, 2022 10:36

VMois changed the title ~~WIP: executor, add job status polling~~ executor, add job status polling Jan 6, 2022

VMois commented Jan 6, 2022

View reviewed changes

reana_workflow_engine_snakemake/executor.py Show resolved Hide resolved

VMois requested a review from mvidalgarcia January 6, 2022 14:13

mvidalgarcia reviewed Jan 7, 2022

View reviewed changes

VMois force-pushed the poll-job-status branch 2 times, most recently from 5f21b69 to a2d8b26 Compare January 10, 2022 09:27

VMois force-pushed the poll-job-status branch from a2d8b26 to c54be8a Compare January 10, 2022 11:39

VMois force-pushed the poll-job-status branch 2 times, most recently from 46c722a to 552b1e4 Compare January 10, 2022 13:17

VMois requested a review from mvidalgarcia January 10, 2022 15:38

mvidalgarcia approved these changes Jan 11, 2022

View reviewed changes

executor: poll job-controller for job statuses

a008afc

- poll job-controller for job statuses instead of checking .jobfinished/.jobfailed files. closes reanahub#33

VMois force-pushed the poll-job-status branch from 552b1e4 to a008afc Compare January 11, 2022 12:23

mvidalgarcia merged commit a008afc into reanahub:master Jan 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

executor, add job status polling #34

executor, add job status polling #34

VMois commented Dec 16, 2021 •

edited

Loading

codecov-commenter commented Dec 16, 2021

mvidalgarcia left a comment

mvidalgarcia Jan 7, 2022

VMois Jan 8, 2022

VMois Jan 10, 2022

mvidalgarcia Jan 7, 2022

mvidalgarcia Jan 7, 2022

VMois Jan 10, 2022

VMois Jan 10, 2022 •

edited

Loading

VMois Jan 10, 2022

mvidalgarcia Jan 7, 2022

VMois Jan 8, 2022 •

edited

Loading

VMois Jan 9, 2022

VMois Jan 10, 2022 •

edited

Loading

VMois Jan 10, 2022

VMois commented Jan 9, 2022

mvidalgarcia commented Jan 10, 2022

VMois commented Jan 10, 2022

mvidalgarcia Jan 11, 2022 •

edited

Loading

VMois Jan 11, 2022 •

edited

Loading

VMois Jan 11, 2022

mvidalgarcia Jan 11, 2022

VMois Jan 11, 2022 •

edited

Loading

VMois Jan 11, 2022

mvidalgarcia Jan 11, 2022

VMois Jan 11, 2022

executor, add job status polling #34

executor, add job status polling #34

Conversation

VMois commented Dec 16, 2021 • edited Loading

codecov-commenter commented Dec 16, 2021

Codecov Report

mvidalgarcia left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VMois Jan 10, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VMois Jan 8, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VMois Jan 10, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VMois commented Jan 9, 2022

mvidalgarcia commented Jan 10, 2022

VMois commented Jan 10, 2022

mvidalgarcia Jan 11, 2022 • edited Loading

Choose a reason for hiding this comment

VMois Jan 11, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VMois Jan 11, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VMois commented Dec 16, 2021 •

edited

Loading

VMois Jan 10, 2022 •

edited

Loading

VMois Jan 8, 2022 •

edited

Loading

VMois Jan 10, 2022 •

edited

Loading

mvidalgarcia Jan 11, 2022 •

edited

Loading

VMois Jan 11, 2022 •

edited

Loading

VMois Jan 11, 2022 •

edited

Loading