Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abort HA Realization Logic After Timeout #800

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

oxzi
Copy link
Member

@oxzi oxzi commented Sep 5, 2024

A strange HA behavior was reported in #787, resulting in both instances being active.

The logs contained an entry of the previous active instance exiting the HA.realize() method successfully after 1m9s. This, however, should not be possible as the method's context is deadlined to a minute after the heartbeat was received.

With introducing Settings.QuickContextExit for retry.WithBackoff in Icinga/icinga-go-library#69 and using it here, the function directly returns a context.DeadlineExceeded error the moment the context has expired. Doing so allows directly handing over, while the other instance can now take over due to the expired heartbeat in the database.

As a related change, the HA.insertEnvironment() method was inlined into the retryable function to use the deadlined context. Otherwise, this might block afterwards, as it was used within HA.realize(), but without the passed context.

In addition, the main loop select cases for hactx.Done() and ctx.Done() were unified, as hactx is a derived ctx. A closed ctx case may be lost as the hactx case could have been chosen.


This PR was marked as a draft as it requires Icinga/icinga-go-library#69 and a bumped icinga-go-library version.

@oxzi oxzi added the bug Something isn't working label Sep 5, 2024
@oxzi oxzi added this to the 1.3.0 milestone Sep 5, 2024
@cla-bot cla-bot bot added the cla/signed label Sep 5, 2024
pkg/icingadb/ha.go Outdated Show resolved Hide resolved
A strange HA behavior was reported in [icingadb-787], resulting in both
instances being active.

The logs contained an entry of the previous active instance exiting the
HA.realize() method successfully after 1m9s. This, however, should not
be possible as the method's context is deadlined to a minute after the
heartbeat was received.

With introducing Settings.QuickContextExit for retry.WithBackoff in
[icinga-go-library-69] and using it here, the function directly returns
a context.DeadlineExceeded error the moment the context has expired.
Doing so allows directly handing over, while the other instance can now
take over due to the expired heartbeat in the database.

As a related change, the HA.insertEnvironment() method was inlined into
the retryable function to use the deadlined context. Otherwise, this
might block afterwards, as it was used within HA.realize(), but without
the passed context.

In addition, the main loop select cases for hactx.Done() and ctx.Done()
were unified, as hactx is a derived ctx. A closed ctx case may be lost
as the hactx case could have been chosen.

[icinga-go-library-69]: Icinga/icinga-go-library#69
[icingadb-787]: #787
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cla/signed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants