Skip to content

api: Add random exponential backoff to do_api_query. #538

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

orientor
Copy link
Collaborator

@orientor orientor commented Feb 23, 2020

Set delay time on failure in do_api_query according to random exponential backoff.

Fixes #537 .

@orientor
Copy link
Collaborator Author

Made a PR for do_api_query. But base and cap need to be decided. Also if this is done, then I can change class RandomExponentialBackoff accordingly.

time.sleep(1)
delay_cap = 10
delay_base = 0.5
delay_time = random.random() * min(delay_cap, delay_base * (2 ** query_state["failures"]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orientor Thanks for working on this! The code looks nice and clean, but I have one concern.

The expression delay_base * (2 ** query_state{'failures']) will start overwhelming delay_cap by the fifth failure, and then we'll be computing exponential factors of 2 for no reason, and they can get big, well, exponentially.

I think the better strategy here is to tweak it so that after N failures, we just using the delay_cap value. I also think there is some point after which we should simply quit, but that can be for a future PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@showell changed the PR. Now it doesn't compute after 5th power of 2. Regarding stopping do_api_query stops after 10 failures.

Copy link
Collaborator Author

@orientor orientor Feb 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@showell should delay_cap and delay_base be added as function parameters so that it can be decided by user?

@orientor orientor changed the title API: Added random exponential backoff to do_api_query API: Add random exponential backoff to do_api_query. Feb 24, 2020
@zulipbot zulipbot added size: M and removed size: S labels Feb 25, 2020
message = "Sleeping for %ss [max %s] before retrying." % (delay, delay_scale * 2)
delay_base = 0.5
delay = random.random() * self.delay_cap
if math.log2(self.delay_cap) > self.number_of_retries:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, I think my prior suggestion was a bit misleading. I didn't want us to actually calculate the logarithm at run time; I was just explaining mathematically that after log2N delays, we're always gonna get delay_cap here instead of 2 ** retries, but I didn't mean to actually code it this way.

It's my fault here for probably prematurely optimizing. (Well, I wasn't prematurely optimizing, I was more mistakenly worried about overflow bugs.) I think JS can handle large exponents of 2.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@showell changed the code accordingly. The optimization could be added later if needed.

@timabbott
Copy link
Member

Can you provide terminal output showing you've tested the error handling behaves correctly?

@orientor
Copy link
Collaborator Author

Screenshot from 2020-02-27 07-15-53

I have tested the condition when retry is required. But in the original code retry happens only in two cases:

  1. When we receive a 50x status code.
  2. If we have connected to the server before but couldn't connect now.

Hence in ALL other cases (with no retry) my code works same as the original code.(Tested it and also an error gets raised in these cases so changing loop doesn't effect them)

For the first 2 cases the output will be like that shown in the image.

@orientor orientor force-pushed the Exponential branch 3 times, most recently from 91cd715 to 967487a Compare March 1, 2020 09:24
@orientor orientor changed the title API: Add random exponential backoff to do_api_query. api: Add random exponential backoff to do_api_query. Mar 1, 2020
@orientor
Copy link
Collaborator Author

orientor commented Mar 2, 2020

I also checked if it works normally in normal conditions.
Screenshot from 2020-03-03 02-19-31

Screenshot from 2020-03-03 02-19-50

@orientor
Copy link
Collaborator Author

orientor commented Mar 2, 2020

Some of the other errors:

Screenshot from 2020-03-03 02-20-12
Screenshot from 2020-03-03 02-22-45

@orientor
Copy link
Collaborator Author

orientor commented Mar 2, 2020

I have thoroughly checked most of the cases. Code working fine in all.

@timabbott
Copy link
Member

@orientor can you rework your commits here to follow our commit style? I think what makes sense are:

  • A first commit that adjusts the behavior of RandomExponentialBackoff to add a delay_cap, with a default value of 90 (10 is too short).
  • A second commit that does the delay_base refactor (nonfunctional).
  • A third commit that changes the randomization approach in RandomExponentialBackoff, which we might choose to skip.
  • A next commit that migrates the error_retry logic to use RandomExponentialBackoff
  • Etc.

That approach will be reviewable, biseactable, and mergable.

@orientor
Copy link
Collaborator Author

orientor commented Mar 4, 2020

@timabbott
I have reworked my changes into two commits.
In the first commit I am improving the Random Exponential Backoff class.
In the second commit I am updating the do_api_query method to use Random Exponential Backoff.

Now going to update it to 4.

@orientor
Copy link
Collaborator Author

orientor commented Mar 4, 2020

@timabbott removed delay_base completely as it changed only one iteration(it was set to 1/2 so it would basically reduce power of 2 by 1) and I think was not required.

Reworked changes into 3 commits.

  1. Added delay_cap as class variable to the CountingBackoff class, the superclass for RandomExponentialBackoff. Changed the default delay_cap to 90.

  2. Changed the algorithm for Random Exponential Backoff.

  3. Changed do_api_query so as to use Random Exponential backoff.

@timabbott
Copy link
Member

The first commit adds the delay_cap field but doesn't have it do anything. That isn't a coherent commit -- it's just confusing were to merge just that.

Zulip's development model is around every commit in every PR being mergable incrementally, which is important for bisecting as well as efficient code review.

Can you fix the first commit to fully implement delay_cap?

@orientor
Copy link
Collaborator Author

orientor commented Mar 5, 2020

@timabbott Improved the commits.
The first commit now basically adds delay_cap properly, with the Random Exponential Backoff now using delay_cap.
The second commit improves Random Exponential Backoff algorithm.
The third commit adds Random Exponential Backoff to do_api_query method.

@@ -526,9 +523,6 @@ def error_retry(error_string):
sys.stdout.write(".")
sys.stdout.flush()
query_state["request"]["dont_block"] = json.dumps(True)
time.sleep(1)
query_state["failures"] += 1
return True
Copy link
Member

@timabbott timabbott Mar 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just have this function do the backoff.fail() and return backoff.keep_going() ? It's deduplicate code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really nice idea. Implemented this and one other change and now my do_api_query commit is way more readable and has minimal changes.

@@ -611,6 +610,7 @@ def end_error_retry(succeeded):
end_error_retry(False)
return {'msg': "Unexpected error from the server", "result": "http-error",
"status_code": res.status_code}
return {'msg': "Unexpected error from the server", "result": "unexpected-error"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to add this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed after previous changes.

@timabbott
Copy link
Member

I merged the first commit and posted a couple comments. I'll also note that this doesn't replace the time.sleep(1) in call_on_each_event, nor does it add a change to detect "invalid authentication" failures and not retry those, which I think we should do.

@orientor orientor force-pushed the Exponential branch 2 times, most recently from 4fc8183 to 62f6f7c Compare March 6, 2020 18:51
@zulipbot zulipbot added size: S and removed size: M labels Mar 6, 2020
@orientor
Copy link
Collaborator Author

orientor commented Mar 6, 2020

@timabbott Changed the code to be more readable and have minimal changes. I was not aware of call_on_each_event. Will make the necessary commits soon. Also can you shed some light on invalid authentication? According to me the only way of having invalid authentication is wrong API key. Are there any other methods? I will make the changes accordingly.

@orientor
Copy link
Collaborator Author

orientor commented Mar 15, 2020

Added exponential backoff to call_on_each_event. But the function was originally intended to keep retrying to connect the server until successful. So an infinite loop. @timabbott @showell Is that intended or should I add a max number of retries?

@zulipbot
Copy link
Member

Heads up @orientor, we just merged some commits that conflict with the changes your made in this pull request! You can review this repository's recent commits to see where the conflicts occur. Please rebase your feature branch against the upstream/master branch and resolve your pull request's merge conflicts accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: Improve exponential backoff implementation (add maximum)
4 participants