Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bugfix] Check response status in the httpjson log handler and retry in case of TOO_MANY_REQUESTS #3356

Merged
merged 11 commits into from
Jan 17, 2025

Conversation

ekouts
Copy link
Contributor

@ekouts ekouts commented Jan 8, 2025

Reopening #3354 on a branch based on master.
Fixes #3353 .

@ekouts ekouts added this to the ReFrame 4.8 milestone Jan 8, 2025
@ekouts ekouts requested review from vkarak and teojgo January 8, 2025 08:18
@ekouts ekouts self-assigned this Jan 8, 2025
@ekouts
Copy link
Contributor Author

ekouts commented Jan 8, 2025

@vkarak I have addressed the comments from the previous PR. Before catching the logging error, Reframe would stop the execution of the tests with a message like this:

...
P: available_nodes_percentage: 87.32394366197182 % (r:10.0, l:-0.0001, u:None)
[  PASSED  ] Ran 1/2 test case(s) from 2 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Wed Jan  8 09:11:44 2025+0100
ERROR: run session stopped: logging error: HTTPJSONhandler logging failed: HTTP response code 429
Log file(s) saved in '/users/eirinik/reframe/reframe.log', '/users/eirinik/reframe/reframe.out'

Now we simply get a message but the tests continue to run:

...
[       OK ] (1/2) SlurmQueueStatusCheck %slurm_partition=normal* /72a54254 @daint:login+builtin
P: available_nodes_percentage: 87.32394366197182 % (r:10.0, l:-0.0001, u:None)
WARNING: could not log performance data for SlurmQueueStatusCheck %slurm_partition=normal* @daint:login+builtin: HTTPJSONhandler logging failed: HTTP response code 429
[       OK ] (2/2) SlurmQueueStatusCheck %slurm_partition=debug /67512ae1 @daint:login+builtin
P: available_nodes_percentage: 96.875 % (r:10.0, l:-0.0001, u:None)
WARNING: could not log performance data for SlurmQueueStatusCheck %slurm_partition=debug @daint:login+builtin: HTTPJSONhandler logging failed: HTTP response code 429
[----------] all spawned checks have finished

[  PASSED  ] Ran 2/2 test case(s) from 2 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Wed Jan  8 09:13:02 2025+0100
Log file(s) saved in '/users/eirinik/reframe/reframe.log', '/users/eirinik/reframe/reframe.out'

(I triggered it with 429 and we actually handle this normally, but it should be similar for an error code 500)

@vkarak
Copy link
Contributor

vkarak commented Jan 8, 2025

WARNING: could not log performance data for SlurmQueueStatusCheck %slurm_partition=debug @daint:login+builtin: HTTPJSONhandler logging failed: HTTP response code 429

@ekouts Looking in the code, how can this message (I mean with code 429) can be produced? Because we keep poking if we get 429. Unless, we should not retry infinitely with iterools.cycle() (which I think is a good idea) and break the loop once we consume the list of intervals.

@ekouts
Copy link
Contributor Author

ekouts commented Jan 9, 2025

@ekouts Looking in the code, how can this message (I mean with code 429) can be produced? Because we keep poking if we get 429.

It can't 😛 I commented out lines 701-703 so that you see the output.

Unless, we should not retry infinitely with iterools.cycle() (which I think is a good idea) and break the loop once we consume the list of intervals.

Hm you are probably right. But I am not sure how much is too long, maybe 1 minute? Some times the rate limit is set per minute so it may take some time to "reset". I cannot tell how much it is in logstash for us.

@vkarak
Copy link
Contributor

vkarak commented Jan 9, 2025

Hm you are probably right. But I am not sure how much is too long, maybe 1 minute?

What if we make it a configuration parameter along with the list of intervals?

@ekouts
Copy link
Contributor Author

ekouts commented Jan 9, 2025

What if we make it a configuration parameter along with the list of intervals?

Makes sense, then the user can add a finite list or a cycle if they want to wait forever. I will leave as default the [.1, .2, .4, .8, 1.6, 3.2] list.

@vkarak
Copy link
Contributor

vkarak commented Jan 9, 2025

Makes sense, then the user can add a finite list or a cycle if they want to wait forever. I will leave as default the [.1, .2, .4, .8, 1.6, 3.2] list.

I would rather have the cycle always in the code and a timeout as a separate parameter, as this is more intuitive. If timeout=0 then it would mean wait until you get served. I think also that these two parameters can be specific to that log handler only.

@ekouts
Copy link
Contributor Author

ekouts commented Jan 14, 2025

I think also that these two parameters can be specific to that log handler only.

Just to be clear, the first parameter is the timeout (float in seconds). Which one is the second parameter, the list that we cycle over?

@vkarak
Copy link
Contributor

vkarak commented Jan 14, 2025

Just to be clear, the first parameter is the timeout (float in seconds). Which one is the second parameter, the list that we cycle over?

Yes, the first one is the timeout after which we give up and issue a warning and the second one are the wait/sleep intervals.

@vkarak vkarak modified the milestones: ReFrame 4.8, ReFrame 4.7.3 Jan 16, 2025
Copy link
Contributor

@vkarak vkarak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed sleep_intervals to backoff_intervals as it's more accurate, enhanced a bit the docs and fixed some coding style issues. Lgtm now.

Try it once more tomorrow so as to be sure that my changes haven't broken anything and we're good to merge it!

@vkarak
Copy link
Contributor

vkarak commented Jan 16, 2025

I've also renamed timeout to retry_timeout and updated the schema accordingly.

@ekouts
Copy link
Contributor Author

ekouts commented Jan 17, 2025

@vkarak I tried it and still works as expected, thanks for the fixes :)

@vkarak vkarak changed the title [bugfix] Improve handling of requests in httpjson logger [bugfix] Check response status in the httpjson log handler and retry in case of TOO_MANY_REQUESTS Jan 17, 2025
@vkarak vkarak enabled auto-merge January 17, 2025 10:47
@vkarak vkarak merged commit 8bccc24 into reframe-hpc:master Jan 17, 2025
36 checks passed
@vkarak vkarak linked an issue Jan 17, 2025 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Check response status code in httpjson logger
3 participants