-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bugfix] Check response status in the httpjson
log handler and retry in case of TOO_MANY_REQUESTS
#3356
[bugfix] Check response status in the httpjson
log handler and retry in case of TOO_MANY_REQUESTS
#3356
Conversation
@vkarak I have addressed the comments from the previous PR. Before catching the logging error, Reframe would stop the execution of the tests with a message like this: ...
P: available_nodes_percentage: 87.32394366197182 % (r:10.0, l:-0.0001, u:None)
[ PASSED ] Ran 1/2 test case(s) from 2 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Wed Jan 8 09:11:44 2025+0100
ERROR: run session stopped: logging error: HTTPJSONhandler logging failed: HTTP response code 429
Log file(s) saved in '/users/eirinik/reframe/reframe.log', '/users/eirinik/reframe/reframe.out' Now we simply get a message but the tests continue to run: ...
[ OK ] (1/2) SlurmQueueStatusCheck %slurm_partition=normal* /72a54254 @daint:login+builtin
P: available_nodes_percentage: 87.32394366197182 % (r:10.0, l:-0.0001, u:None)
WARNING: could not log performance data for SlurmQueueStatusCheck %slurm_partition=normal* @daint:login+builtin: HTTPJSONhandler logging failed: HTTP response code 429
[ OK ] (2/2) SlurmQueueStatusCheck %slurm_partition=debug /67512ae1 @daint:login+builtin
P: available_nodes_percentage: 96.875 % (r:10.0, l:-0.0001, u:None)
WARNING: could not log performance data for SlurmQueueStatusCheck %slurm_partition=debug @daint:login+builtin: HTTPJSONhandler logging failed: HTTP response code 429
[----------] all spawned checks have finished
[ PASSED ] Ran 2/2 test case(s) from 2 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Wed Jan 8 09:13:02 2025+0100
Log file(s) saved in '/users/eirinik/reframe/reframe.log', '/users/eirinik/reframe/reframe.out' (I triggered it with 429 and we actually handle this normally, but it should be similar for an error code 500) |
@ekouts Looking in the code, how can this message (I mean with code 429) can be produced? Because we keep poking if we get 429. Unless, we should not retry infinitely with |
It can't 😛 I commented out lines 701-703 so that you see the output.
Hm you are probably right. But I am not sure how much is too long, maybe 1 minute? Some times the rate limit is set per minute so it may take some time to "reset". I cannot tell how much it is in logstash for us. |
What if we make it a configuration parameter along with the list of intervals? |
Makes sense, then the user can add a finite list or a cycle if they want to wait forever. I will leave as default the |
I would rather have the cycle always in the code and a timeout as a separate parameter, as this is more intuitive. If |
Just to be clear, the first parameter is the timeout (float in seconds). Which one is the second parameter, the list that we cycle over? |
Yes, the first one is the timeout after which we give up and issue a warning and the second one are the wait/sleep intervals. |
…bugfix/log_httpjson_errors_2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed sleep_intervals
to backoff_intervals
as it's more accurate, enhanced a bit the docs and fixed some coding style issues. Lgtm now.
Try it once more tomorrow so as to be sure that my changes haven't broken anything and we're good to merge it!
I've also renamed |
@vkarak I tried it and still works as expected, thanks for the fixes :) |
httpjson
log handler and retry in case of TOO_MANY_REQUESTS
Reopening #3354 on a branch based on master.
Fixes #3353 .