-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
executor, add job status polling #34
Conversation
Codecov Report
@@ Coverage Diff @@
## master #34 +/- ##
=========================================
- Coverage 4.92% 4.26% -0.67%
=========================================
Files 6 6
Lines 142 164 +22
=========================================
Hits 7 7
- Misses 135 157 +22
|
ae76a58
to
8c8172a
Compare
5622fbe
to
45d6aa3
Compare
14ecae2
to
bf5c9b6
Compare
b40ed11
to
aef617c
Compare
7e0b09d
to
7f4a8b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Job status polling works as expected, left some commets.
The commit message doesn't follow our convention.
Tested the following scenarios:
- Successful workflow
- Failed workflow due to wrong command in a step
- Failed workflow due to timeout (I had to copy-paste the diff from here, I guess it's OK?)
- Failed workflow due to wrong
params
in Snakefile: The logs are not properly propagated to reana-client.
Workflow engine logs:
...
2022-01-07 15:26:49,318 | snakemake.logging | MainThread | ERROR | RuleException in line 19 of /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/206c2327-0cf7-4ba7-b580-6e08413d72dd/workflow/snakemake/Snakefile:
'Params' object has no attribute 'events' File "/code/reana_workflow_engine_snakemake/executor.py", line 60, in run
...
reana-client logs:
==> Workflow engine logs
Workflow exited unexpectedly.
The logs from failed jobs are still gone, I assume is part of this issue.
CHANGES.rst
Outdated
Version 0.8.1 (UNRELEASED) | ||
--------------------------- | ||
|
||
- Uses polling job-controller to obtain job statuses instead of checking files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Uses" is not part of our changelog vocabulary (Adds/Changes/Fixes/Removes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you propose in this case? Some ideas:
- two lines like:
- Removes checking files to determine job status
- Adds polling job-controller to determine job statuses
- one like:
- Adds polling job-controller to determine job statuses instead of checking files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with the second point to have related changes in one line.
@@ -154,36 +143,79 @@ def _get_container_image(self, job: Job) -> str: | |||
log.info(f"No environment specified, falling back to: {container_image}") | |||
return container_image | |||
|
|||
def _handle_job_status(self, job, status): | |||
def _handle_job_status( | |||
self, job, job_status: JobStatus, workflow_status: RunStatus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have the job
type here? Same applies for other methods.
|
||
def _get_job_status_from_controller(self, job_id: str) -> str: | ||
response = self.rjc_api_client.check_status(job_id) | ||
return response["status"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use response.get("status")
just in case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if r-j-c returns 500 error, response["status"]
will fail and trigger workflow error. If response.get("status")
is used None
will be returned and workflow will be stuck in running
status. Let me check when r-j-c might throw 500 errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And a fun thing I have just discovered, you cannot use response.get
cause response
is not a dictionary. It is the object Job
generated by bravado I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will use try/catch
with AttributeError
to check if the status
field exists. This is a rare event as it can only be caused by a response that doesn't contain 200 or 404 errors.
|
||
try: | ||
status = self._get_job_status_from_controller(job_id) | ||
except Exception as error: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too broad, maybe we can be more specific catching the exceptions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know what more specific exception to use here as I don't know what possible errors _get_job_status_from_controller
can throw. I will try to find something, but if not, I will keep a broad exception for safety. Do you have any ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Three scenarios I can think of:
-
network failure between engine and r-j-c, unlikely as they are on the same machine;
-
response 404, handled by r-j-c client so need to catch "HTTPNotFound" exception
-
response 500, not handled by r-j-c client, it is actually silenced (line); in this case, the response content will not be correct, so
response["status"]
might throw a key error. But, it can be fixed with.get("status")
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to make try/catch blocks more specific and clear. I moved this logic to get_job_status_from_controller
function. In case of an error, the function will return failed status for a job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: rjc_api_client.check_status
method is misleading. Docstring says that method checks the status of a job, but it actually returns the response of the get_job
API with some additional checks for 404 errors. Just wanted to point that out.
Not sure about that, might be more related to this issue. Could you, please, share what did you do exactly to get this error? |
5f21b69
to
a2d8b26
Compare
Yeah, might be. This problem wasn't there before because we performed many tests involving wrong memory values and it was reported back to the user. Here is the validation logic. Could it be related to the new changes? In addition, checking the $ kubectl logs reana-run-batch-foo workflow-engine
...
2022-01-10 11:17:44,222 | reana-workflow-engine-snakemake | MainThread | INFO | Job 'helloworld' received, command: python code/helloworld.py --inputfile data/names.txt --outputfile results/greetings.txt --sleeptime 0
2022-01-10 11:17:44,222 | reana-workflow-engine-snakemake | MainThread | INFO | Environment: python:2.7-slim
2022-01-10 11:17:44,504 | reana-workflow-engine-snakemake | MainThread | ERROR | Error submitting job helloworld: Job submission error: The "kubernetes_memory_limit" provided 128MB has wrong format.
2022-01-10 11:17:44,507 | reana-workflow-engine-snakemake | MainThread | INFO | failed job: None
... |
a2d8b26
to
c54be8a
Compare
It is reported in Yadage and Serial but not in Snakemake (and, I think, also CWL). You can test a wrong value with Snakemake
Same, it happens with the |
46c722a
to
552b1e4
Compare
workflow_uuid = os.getenv("workflow_uuid", "default") | ||
job_id = job.reana_job_id | ||
log.info(f"{status} job: {job_id}") | ||
log.info(f"{job_status} job: {job_id}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This prints something like JobStatus.finished job: af5b1b8d-6683-46ec-9fd9-f40b27e7b0b9
or JobStatus.failed job: None
. Should we access .name
?
Perhaps for the failed ones we can print the name instead, or always both id and name? It's not very helpful to get None
there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I aligned JobStatus here to match JobStatus enum in reana-db in case we will unify stuff later and forgot to update this line.
It should be JobStatus.finished job: <job-id>
but I will change message to f"{job.name} job is {job_status.name}. job id: {job_id}"
. It will look like "helloworld job is failed. job_id: af5b1b8d-6683-46ec-9fd9-f40b27e7b0b9" or "helloworld job is failed. job_id: None".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
"""Override job success method to publish job status.""" | ||
# override handle_touch = True, to enable `touch()` in Snakefiles | ||
# `touch()` is responsible for checking output files existence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, touch()
is just the Unix touch
that creates empty files, so it can denote that a certain task was completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I will disable the touch()
(handle_touch=True
) then the CMS-h4l demo is failing so I think it is checking output files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found handle_touch
function in Snakemake - https://github.com/snakemake/snakemake/blob/987282dde8a2db5174414988c134a39ae8836a61/snakemake/dag.py#L580
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant the docstring, it doesn't seem 100% accurate to me. No comments about functionality, we need to keep handle_touch=True
as you've described.
# `touch()` is responsible for checking output files existence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, what do you propose, remove the comment or rewrite it to be more accurate?
- poll job-controller for job statuses instead of checking .jobfinished/.jobfailed files. closes reanahub#33
552b1e4
to
a008afc
Compare
closes #33
Initial implementation of this PR relied on both file checking and polling in two separate threads. I don't think it was a good idea, so the final version completely relies on the polling job controller.
How to test:
Check out this branch and ensure that
snakemake-engine
Docker image is updated.There are 3 main scenarios:
success
,failed
due to an error inside workflow (e.g, incorrect command),failed
because k8s or something else decided to stop it (e.g, due to timeout).You can easily test the first two scenarios. The third one requires more setup, check the review notes below for more details. The most important thing is to check that no bugs were introduced.
Check if the progress state is displayed correctly in
reana-client list --include-progress
Check if engine and jobs logs are displayed
Review notes:
I checked that the
kubernetes_job_timeout
parameter will work with this PRI tested it on
reana-demo-cms-h4l
andreana-demo-root6
workflows