docs: add High Availability information for s3gw and Longhorn #845

Closed
wants to merge 1 commit into from

@giubacc giubacc commented Nov 28, 2023

add High Availability information for s3gw and Longhorn

Fixes: https://github.com/aquarist-labs/s3gw/issues/841

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • CHANGELOG.md has been updated should there be relevant changes in this PR.

@giubacc giubacc self-assigned this Nov 28, 2023
@giubacc giubacc added the kind/documentation and priority/0 labels Nov 28, 2023
@giubacc giubacc added this to the v0.24.0 milestone Nov 28, 2023
docs/high-availability.md (outdated, resolved)
@giubacc giubacc force-pushed the docs-current-HA-s3gw-LH branch from 1fd7299 to 126b7a0 Compare November 29, 2023 14:13
@giubacc giubacc marked this pull request as ready for review November 29, 2023 14:17
Comment on lines +18 to +19
s3gw can reasonably protect against; that's all undefined behavior and "restore
from backup" time.
Contributor

I think we can drop the "that's all undefined behavior ..." part, or maybe replace it with just "In this case, restoring from backup might be the only option."

s3gw can reasonably protect against; that's all undefined behavior and "restore
from backup" time.

The *Active/Standby* model claims the following characteristics:
Contributor

s/claims/offers/

This has the advantage of being schedulable, so it can happen at times of low load
if these exist.

When any of these scenarios should happen, Kubernetes restarts the s3gw pod and we
Contributor

s/should happen/happens/

Comment on lines +47 to +48
writing this), does not automatically restart a pod attached to a RWO volume
in the event that the node running it suffers a failure.
Contributor

I'd write "in case the node it is running on suffers from a failure"

Currently, Kubernetes ([1.28](https://kubernetes.io/releases/) at the time of
writing this), does not automatically restart a pod attached to a RWO volume
in the event that the node running it suffers a failure.
Reasons behind this behavior is that workloads, such as RWO volumes require
Contributor

Add a comma between "RWO volumes" and "require"?

Failures affecting these kind of workloads risk data loss and/or corruption
if nodes (and the workloads running on them) are wrongly assumed to be dead.
For this reason it is crucial to know that the node has reached a safe state
before initiating recovery of the workload.
Contributor

s/recovery of the workload/workload recovery/

Comment on lines +56 to +57
Longhorn offers the option to perform a [Pod Deletion Policy][pod-deletion-policy]
when a node should go down unexpectedly.
Contributor

I'd write instead:

"offers the option to define a Pod Deletion Policy when the node goes down unexpectedly"


Longhorn offers the option to perform a [Pod Deletion Policy][pod-deletion-policy]
when a node should go down unexpectedly.
This means that Longhorn will force delete StatefulSet/Deployment terminating pods
Contributor

Do you mean "force-delete" or something? Or maybe "forcefully delete"?

on nodes that are down to release Longhorn volumes so that Kubernetes
can spin up replacement pods.
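
For readers of this hunk, the policy it describes can be modeled as a small decision function. This is only an illustrative sketch, not Longhorn's implementation: the `Pod` class, its fields, and `should_force_delete` are hypothetical names, and the four policy strings are the values of Longhorn's `node-down-pod-deletion-policy` setting as recalled from the Longhorn documentation, so they should be checked against the Longhorn version in use.

```python
from dataclasses import dataclass

# Assumed values of Longhorn's "node-down-pod-deletion-policy" setting.
DO_NOTHING = "do-nothing"
DELETE_STATEFULSET = "delete-statefulset-pod"
DELETE_DEPLOYMENT = "delete-deployment-pod"
DELETE_BOTH = "delete-both-statefulset-and-deployment-pod"

# Which workload kinds each policy allows to be force-deleted.
ALLOWED_KINDS = {
    DO_NOTHING: set(),
    DELETE_STATEFULSET: {"StatefulSet"},
    DELETE_DEPLOYMENT: {"Deployment"},
    DELETE_BOTH: {"StatefulSet", "Deployment"},
}


@dataclass
class Pod:
    """Hypothetical, simplified view of a pod (not a Kubernetes API type)."""
    workload_kind: str  # "StatefulSet" or "Deployment"
    terminating: bool   # the pod is stuck in the Terminating state
    node_down: bool     # the node hosting the pod is considered down


def should_force_delete(policy: str, pod: Pod) -> bool:
    """Return True if the pod would be force-deleted under `policy`."""
    if policy not in ALLOWED_KINDS:
        raise ValueError(f"unknown policy: {policy!r}")
    # Only terminating pods on nodes that are down are candidates.
    if not (pod.terminating and pod.node_down):
        return False
    return pod.workload_kind in ALLOWED_KINDS[policy]


if __name__ == "__main__":
    # An s3gw-like Deployment pod stuck terminating on a failed node.
    s3gw_pod = Pod(workload_kind="Deployment", terminating=True, node_down=True)
    for policy in ALLOWED_KINDS:
        print(policy, should_force_delete(policy, s3gw_pod))
```

The model makes the trade-off explicit: the decision depends entirely on `node_down` being true in reality, which is exactly the assumption the following paragraph warns about.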

Anyway, when employing this mitigation, the user must be aware that assuming a node
Contributor

Maybe drop the "Anyway" and start with "When employing ..." ?


The s3gw and the Longhorn team is currently investigating some
[hypotheses of solutions][longhorn-issue-1]
to address this problem at its roots.
Contributor

s/roots/root/

@jecluis jecluis removed this from the v0.24.0 milestone Mar 21, 2024
@jecluis jecluis closed this Apr 1, 2024