Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

retry.WithBackoff: QuickContextExit, Documentation, Tests #69

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

oxzi
Copy link
Member

@oxzi oxzi commented Sep 4, 2024

The retry.WithBackoff function only loosely respected the provided context. In particular, the RetryableFunc could survive the context termination and block as long as it desires.

For situations where timing is critical and a deadline bound context is used, this behavior might be undesirable. That's why the new Settings.QuickContextExit option has been introduced, which effectively wraps time-consuming, blocking functions with an additional context check and bails out directly when the context is done.

This change was reflected in the documentation, which was generally extended, trying to explain more about the implementation details.

While working on the function, some minor refactorings were done. Notable is the simplified check if the context has finished.

Finally, tests were written for retry.WithBackoff to verify the expected behavior and ease future changes.

@oxzi oxzi added enhancement New feature or request tests Adding tests to existing code labels Sep 4, 2024
@oxzi oxzi added this to the 0.4.0 milestone Sep 4, 2024
@cla-bot cla-bot bot added the cla/signed CLA is signed by all contributors of a PR label Sep 4, 2024
oxzi added a commit to Icinga/icingadb that referenced this pull request Sep 5, 2024
A strange HA behavior was reported in [icingadb-787], resulting in both
instances being active.

The logs contained an entry of the previous active instance exiting the
HA.realize() method successfully after 1m9s. This, however, should not
be possible as the method's context is deadlined to a minute after the
heartbeat was received.

With introducing Settings.QuickContextExit for retry.WithBackoff in
[icinga-go-library-69] and using it here, the function directly returns
a context.DeadlineExceeded error the moment the context has expired.
Doing so allows directly handing over, while the other instance can now
take over due to the expired heartbeat in the database.

As an related change, the HA.insertEnvironment() method was inlined into
the retryable function to use the deadlined context. Otherwise, this
might block afterwards, as it was used within HA.realize(), but without
the passed context.

In addition, the main loop select cases for hactx.Done() and ctx.Done()
were unified, as hactx is a derived ctx. A closed ctx case may be lost
as the hactx case could have been chosen.

[icinga-go-library-69]: Icinga/icinga-go-library#69
[icingadb-787]: #787
oxzi added a commit to Icinga/icingadb that referenced this pull request Sep 5, 2024
A strange HA behavior was reported in [icingadb-787], resulting in both
instances being active.

The logs contained an entry of the previous active instance exiting the
HA.realize() method successfully after 1m9s. This, however, should not
be possible as the method's context is deadlined to a minute after the
heartbeat was received.

With introducing Settings.QuickContextExit for retry.WithBackoff in
[icinga-go-library-69] and using it here, the function directly returns
a context.DeadlineExceeded error the moment the context has expired.
Doing so allows directly handing over, while the other instance can now
take over due to the expired heartbeat in the database.

As a related change, the HA.insertEnvironment() method was inlined into
the retryable function to use the deadlined context. Otherwise, this
might block afterwards, as it was used within HA.realize(), but without
the passed context.

In addition, the main loop select cases for hactx.Done() and ctx.Done()
were unified, as hactx is a derived ctx. A closed ctx case may be lost
as the hactx case could have been chosen.

[icinga-go-library-69]: Icinga/icinga-go-library#69
[icingadb-787]: #787
@oxzi oxzi requested a review from Al2Klimov September 6, 2024 08:44
retry/retry.go Outdated Show resolved Hide resolved
Comment on lines +123 to +141
err = funcWrapper(func(ctx context.Context) error {
err := retryableFunc(ctx)
if err == nil {
if settings.OnSuccess != nil {
settings.OnSuccess(time.Since(start), attempt, prevErr)
}
}

return err
})(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the purpose of calling OnSuccess within the wrapped function so that this callback couldn't delay the return of this function either?

However, this seems to move calling the callbacks to another goroutine, which might be surprising and could introduce race conditions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Theoretically, both callback functions may block as well, while, unfortunately, they do not even receive a context to be bound to.

I am a bit unhappy with this architecture, but at least it does not break the API and needs a check for every retry.WithBackoff call. Another option would be to exclude the callbacks from the context-wrapper and stating this in the documentation. In most cases, the callbacks are only used for logging which should not block.

retry/retry.go Outdated Show resolved Hide resolved
retry/retry.go Outdated Show resolved Hide resolved
retry/retry.go Outdated Show resolved Hide resolved
oxzi added a commit to Icinga/icingadb that referenced this pull request Sep 10, 2024
A strange HA behavior was reported in [icingadb-787], resulting in both
instances being active.

The logs contained an entry of the previous active instance exiting the
HA.realize() method successfully after 1m9s. This, however, should not
be possible as the method's context is deadlined to a minute after the
heartbeat was received.

With introducing Settings.QuickContextExit for retry.WithBackoff in
[icinga-go-library-69] and using it here, the function directly returns
a context.DeadlineExceeded error the moment the context has expired.
Doing so allows directly handing over, while the other instance can now
take over due to the expired heartbeat in the database.

As a related change, the HA.insertEnvironment() method was inlined into
the retryable function to use the deadlined context. Otherwise, this
might block afterwards, as it was used within HA.realize(), but without
the passed context.

In addition, the main loop select cases for hactx.Done() and ctx.Done()
were unified, as hactx is a derived ctx. A closed ctx case may be lost
as the hactx case could have been chosen.

[icinga-go-library-69]: Icinga/icinga-go-library#69
[icingadb-787]: #787
The retry.WithBackoff function only loosely respected the provided
context. In particular, the RetryableFunc could survive the context
termination and block as long as it desires.

For situations where timing is critical and a deadline bound context is
used, this behavior might be undesirable. That's why the new
Settings.QuickContextExit option has been introduced, which effectively
wraps time-consuming, blocking functions with an additional context
check and bails out directly when the context is done.

This change was reflected in the documentation, which was generally
extended, trying to explain more about the implementation details.

While working on the function, some minor refactorings were done.
Notable is the simplified check if the context has finished.

Finally, tests were written for retry.WithBackoff to verify the expected
behavior and ease future changes.
@oxzi oxzi force-pushed the retry-withbackoff-quickcontextexit branch from ec93b83 to 45adf81 Compare September 10, 2024 11:41
@oxzi oxzi marked this pull request as draft September 23, 2024 07:40
@yhabteab
Copy link
Member

Since Icinga/icingadb#800 doesn't make use of this, is it still necessary? If we still intend to include it, it can be shipped in later releases, but it's not required for the 0.4.0 release. Thus, please close or drop it from the milestone based on your decision.

@oxzi oxzi removed this from the 0.4.0 milestone Oct 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla/signed CLA is signed by all contributors of a PR enhancement New feature or request tests Adding tests to existing code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants