Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executor, add job status polling #34

Merged
merged 1 commit into from
Jan 11, 2022

Conversation

VMois
Copy link

@VMois VMois commented Dec 16, 2021

closes #33

Initial implementation of this PR relied on both file checking and polling in two separate threads. I don't think it was a good idea, so the final version completely relies on the polling job controller.

How to test:

  1. Check out this branch and ensure that snakemake-engine Docker image is updated.

  2. There are 3 main scenarios:

  • workflow success,
  • workflow failed due to an error inside workflow (e.g, incorrect command),
  • workflow failed because k8s or something else decided to stop it (e.g, due to timeout).

You can easily test the first two scenarios. The third one requires more setup, check the review notes below for more details. The most important thing is to check that no bugs were introduced.

  1. Check if the progress state is displayed correctly in reana-client list --include-progress

  2. Check if engine and jobs logs are displayed

Review notes:

  • I checked that the kubernetes_job_timeout parameter will work with this PR

  • I tested it on reana-demo-cms-h4l and reana-demo-root6 workflows

@codecov-commenter
Copy link

Codecov Report

Merging #34 (ae76a58) into master (7269f5a) will decrease coverage by 0.66%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #34      +/-   ##
=========================================
- Coverage    4.92%   4.26%   -0.67%     
=========================================
  Files           6       6              
  Lines         142     164      +22     
=========================================
  Hits            7       7              
- Misses        135     157      +22     
Impacted Files Coverage Δ
reana_workflow_engine_snakemake/executor.py 0.00% <0.00%> (ø)

@VMois VMois force-pushed the poll-job-status branch 2 times, most recently from 5622fbe to 45d6aa3 Compare December 17, 2021 11:59
@VMois VMois force-pushed the poll-job-status branch 3 times, most recently from 14ecae2 to bf5c9b6 Compare December 20, 2021 09:51
@VMois VMois requested a review from mvidalgarcia December 20, 2021 09:55
@VMois VMois marked this pull request as ready for review December 20, 2021 09:55
@VMois VMois changed the title executor, add job status polling WIP: executor, add job status polling Jan 5, 2022
@VMois VMois removed the request for review from mvidalgarcia January 5, 2022 14:32
@VMois VMois force-pushed the poll-job-status branch 2 times, most recently from b40ed11 to aef617c Compare January 5, 2022 17:09
@VMois VMois force-pushed the poll-job-status branch 2 times, most recently from 7e0b09d to 7f4a8b4 Compare January 6, 2022 10:36
@VMois VMois changed the title WIP: executor, add job status polling executor, add job status polling Jan 6, 2022
@VMois VMois requested a review from mvidalgarcia January 6, 2022 14:13
Copy link
Member

@mvidalgarcia mvidalgarcia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Job status polling works as expected, left some commets.

The commit message doesn't follow our convention.

Tested the following scenarios:

  • Successful workflow
  • Failed workflow due to wrong command in a step
  • Failed workflow due to timeout (I had to copy-paste the diff from here, I guess it's OK?)
  • Failed workflow due to wrong params in Snakefile: The logs are not properly propagated to reana-client.

Workflow engine logs:

...
2022-01-07 15:26:49,318 | snakemake.logging | MainThread | ERROR | RuleException in line 19 of /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/206c2327-0cf7-4ba7-b580-6e08413d72dd/workflow/snakemake/Snakefile:
'Params' object has no attribute 'events'  File "/code/reana_workflow_engine_snakemake/executor.py", line 60, in run
...

reana-client logs:

==> Workflow engine logs
Workflow exited unexpectedly.

The logs from failed jobs are still gone, I assume is part of this issue.

CHANGES.rst Outdated
Version 0.8.1 (UNRELEASED)
---------------------------

- Uses polling job-controller to obtain job statuses instead of checking files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Uses" is not part of our changelog vocabulary (Adds/Changes/Fixes/Removes).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you propose in this case? Some ideas:

  1. two lines like:
- Removes checking files to determine job status
- Adds polling job-controller to determine job statuses
  1. one like:
- Adds polling job-controller to determine job statuses instead of checking files

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the second point to have related changes in one line.

@@ -154,36 +143,79 @@ def _get_container_image(self, job: Job) -> str:
log.info(f"No environment specified, falling back to: {container_image}")
return container_image

def _handle_job_status(self, job, status):
def _handle_job_status(
self, job, job_status: JobStatus, workflow_status: RunStatus
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have the job type here? Same applies for other methods.


def _get_job_status_from_controller(self, job_id: str) -> str:
response = self.rjc_api_client.check_status(job_id)
return response["status"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use response.get("status") just in case.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if r-j-c returns 500 error, response["status"] will fail and trigger workflow error. If response.get("status") is used None will be returned and workflow will be stuck in running status. Let me check when r-j-c might throw 500 errors.

Copy link
Author

@VMois VMois Jan 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And a fun thing I have just discovered, you cannot use response.get cause response is not a dictionary. It is the object Job generated by bravado I guess.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will use try/catch with AttributeError to check if the status field exists. This is a rare event as it can only be caused by a response that doesn't contain 200 or 404 errors.

reana_workflow_engine_snakemake/executor.py Outdated Show resolved Hide resolved

try:
status = self._get_job_status_from_controller(job_id)
except Exception as error:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too broad, maybe we can be more specific catching the exceptions?

Copy link
Author

@VMois VMois Jan 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know what more specific exception to use here as I don't know what possible errors _get_job_status_from_controller can throw. I will try to find something, but if not, I will keep a broad exception for safety. Do you have any ideas?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three scenarios I can think of:

  • network failure between engine and r-j-c, unlikely as they are on the same machine;

  • response 404, handled by r-j-c client so need to catch "HTTPNotFound" exception

  • response 500, not handled by r-j-c client, it is actually silenced (line); in this case, the response content will not be correct, so response["status"] might throw a key error. But, it can be fixed with .get("status").

Copy link
Author

@VMois VMois Jan 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to make try/catch blocks more specific and clear. I moved this logic to get_job_status_from_controller function. In case of an error, the function will return failed status for a job.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: rjc_api_client.check_status method is misleading. Docstring says that method checks the status of a job, but it actually returns the response of the get_job API with some additional checks for 404 errors. Just wanted to point that out.

reana_workflow_engine_snakemake/executor.py Outdated Show resolved Hide resolved
@VMois
Copy link
Author

VMois commented Jan 9, 2022

The logs from failed jobs are still gone, I assume is part of this issue.

Not sure about that, might be more related to this issue. Could you, please, share what did you do exactly to get this error?

@VMois VMois force-pushed the poll-job-status branch 2 times, most recently from 5f21b69 to a2d8b26 Compare January 10, 2022 09:27
@mvidalgarcia
Copy link
Member

The logs from failed jobs are still gone, I assume is part of this issue.

Not sure about that, might be more related to this issue. Could you, please, share what did you do exactly to get this error?

Yeah, might be. This problem wasn't there before because we performed many tests involving wrong memory values and it was reported back to the user. Here is the validation logic. Could it be related to the new changes?

In addition, checking the workflow-engine logs, I see that when the job is reported as failed, we don't capture its ID properly.

$ kubectl logs reana-run-batch-foo workflow-engine
...
2022-01-10 11:17:44,222 | reana-workflow-engine-snakemake | MainThread | INFO | Job 'helloworld' received, command: python code/helloworld.py --inputfile data/names.txt --outputfile results/greetings.txt --sleeptime 0
2022-01-10 11:17:44,222 | reana-workflow-engine-snakemake | MainThread | INFO | Environment: python:2.7-slim
2022-01-10 11:17:44,504 | reana-workflow-engine-snakemake | MainThread | ERROR | Error submitting job helloworld: Job submission error: The "kubernetes_memory_limit" provided 128MB has wrong format.
2022-01-10 11:17:44,507 | reana-workflow-engine-snakemake | MainThread | INFO | failed job: None
...

@VMois
Copy link
Author

VMois commented Jan 10, 2022

This problem wasn't there before because we performed many tests involving wrong memory values and it was reported back to the user.

It is reported in Yadage and Serial but not in Snakemake (and, I think, also CWL). You can test a wrong value with Snakemake master branch. There will be no logs with reana-client logs.

In addition, checking the workflow-engine logs, I see that when the job is reported as failed, we don't capture its ID properly.

Same, it happens with the master branch too. This is due to a reason that submission fails so the job doesn't have an ID.

@VMois VMois force-pushed the poll-job-status branch 2 times, most recently from 46c722a to 552b1e4 Compare January 10, 2022 13:17
@VMois VMois requested a review from mvidalgarcia January 10, 2022 15:38
workflow_uuid = os.getenv("workflow_uuid", "default")
job_id = job.reana_job_id
log.info(f"{status} job: {job_id}")
log.info(f"{job_status} job: {job_id}")
Copy link
Member

@mvidalgarcia mvidalgarcia Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prints something like JobStatus.finished job: af5b1b8d-6683-46ec-9fd9-f40b27e7b0b9 or JobStatus.failed job: None. Should we access .name?

Perhaps for the failed ones we can print the name instead, or always both id and name? It's not very helpful to get None there.

Copy link
Author

@VMois VMois Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I aligned JobStatus here to match JobStatus enum in reana-db in case we will unify stuff later and forgot to update this line.

It should be JobStatus.finished job: <job-id> but I will change message to f"{job.name} job is {job_status.name}. job id: {job_id}". It will look like "helloworld job is failed. job_id: af5b1b8d-6683-46ec-9fd9-f40b27e7b0b9" or "helloworld job is failed. job_id: None".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"""Override job success method to publish job status."""
# override handle_touch = True, to enable `touch()` in Snakefiles
# `touch()` is responsible for checking output files existence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, touch() is just the Unix touch that creates empty files, so it can denote that a certain task was completed.

Copy link
Author

@VMois VMois Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I will disable the touch() (handle_touch=True) then the CMS-h4l demo is failing so I think it is checking output files.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant the docstring, it doesn't seem 100% accurate to me. No comments about functionality, we need to keep handle_touch=True as you've described.

# `touch()` is responsible for checking output files existence

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what do you propose, remove the comment or rewrite it to be more accurate?

- poll job-controller for job statuses instead of checking .jobfinished/.jobfailed files.

closes reanahub#33
@mvidalgarcia mvidalgarcia merged commit a008afc into reanahub:master Jan 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add job status polling
3 participants