Add "2ha.sh" script, managing 2-node Canonical K8s HA AA clusters #692

Merged
merged 9 commits into from
Oct 2, 2024

@petrutlucian94 petrutlucian94 commented Sep 23, 2024

Scenario overview:

  • Canonical K8s cluster containing 2 nodes
  • Dqlite data store (cannot keep quorum once one of the two nodes is lost)
  • Primary node Dqlite files stored on DRBD
    • synchronous block-level replication between the two nodes
  • cluster monitoring and failover handled through Pacemaker

Script functionality:

  • bootstrap the service
    • wait for a DRBD primary to be elected
    • detect the node role based on the DRBD status and Dqlite state
      • have the replica wait for the primary to be ready before continuing
    • recover Dqlite after failovers
    • transfer and apply recovery files to secondary nodes
    • transfer Dqlite files to DRBD and other backup locations, creating necessary symlinks
  • install required packages
  • purge all K8s data
  • clear Pacemaker taints
  • remove recovery data
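
As an illustration of the role-detection step listed above, here is a minimal sketch. It is not the code from this PR. It assumes the role is read from `drbdadm role <resource>`, whose output is either `Local/Peer` (for example `Primary/Secondary`) or only the local role, depending on the DRBD version. The function names `node_role_from_drbd` and `wait_for_role`, the resource name `r0`, and the timeout values are hypothetical, and the sketch ignores the Dqlite state that the real script also takes into account.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names and defaults are assumptions, not taken from the PR.

# Map the output of a DRBD role query to a node role.
# Accepts "Local/Peer" (e.g. "Primary/Secondary") or just "Local".
node_role_from_drbd() {
    local role_output="$1"
    # Keep only the part before the first "/", i.e. the local role.
    local local_role="${role_output%%/*}"
    case "$local_role" in
        Primary)   echo "primary" ;;
        Secondary) echo "replica" ;;
        *)         echo "unknown" ;;
    esac
}

# Poll a role query command until it reports a known role or the
# timeout (in seconds) expires. Prints the role on success.
wait_for_role() {
    local query_cmd="$1"
    local timeout="${2:-60}"
    local interval="${3:-1}"
    local elapsed=0
    local role
    while (( elapsed < timeout )); do
        # A failing or missing query command yields empty output -> "unknown".
        role="$(node_role_from_drbd "$($query_cmd 2>/dev/null)")"
        if [[ "$role" != "unknown" ]]; then
            echo "$role"
            return 0
        fi
        sleep "$interval"
        (( elapsed += interval ))
    done
    return 1
}
```

A caller could then run something like `wait_for_role "drbdadm role r0" 120 2` and branch on the printed value. Taking the query command as a parameter keeps the parsing logic testable on machines without DRBD.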

"2ha.sh start_service" is intended to be used as part of a systemd unit that bootstraps the k8s services, coordinating with the other node and taking any necessary steps to recover Dqlite.

This PR also adds a "how-to" guide for the 2-node A-A HA setup.
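
To make the `start_service` usage described above more concrete, here is a hedged sketch of a helper that prints a minimal systemd unit wrapping the script. The unit contents, the description text, the function name `render_start_service_unit`, and the install path are all assumptions for illustration. The unit shipped with the how-to guide may differ, for example in its service type, in its ordering against DRBD, or in whether Pacemaker starts it instead of it being enabled at boot.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the unit layout below is an assumption, not the PR's unit file.

# Print a minimal systemd unit that runs "<script> start_service".
render_start_service_unit() {
    local script_path="$1"
    cat <<EOF
[Unit]
Description=Two-node HA bootstrap for Canonical K8s (example)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=${script_path} start_service
EOF
}
```

For example, `render_start_service_unit /usr/local/bin/2ha.sh` prints the unit text, which could be reviewed and adapted before being saved under `/etc/systemd/system/`. No `[Install]` section is included because how the unit gets started is a deployment decision this sketch does not make.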

@petrutlucian94 petrutlucian94 requested a review from a team as a code owner September 23, 2024 13:46
We're adding a guide that covers the 2-node A-A HA scenario.

@bschimke95 bschimke95 left a comment

Did an initial pass. Consider the rephrasing as suggestions and feel free to ignore them

The script looks mostly fine (you already know my opinion on large bash scripts. Fine for now but should eventually be moved to Python or Go IMHO)

Resolved review threads: docs/src/snap/howto/2-node-ha.md (9, outdated); k8s/hack/2ha.sh (1, outdated)

@petrutlucian94

Thanks for reviewing this PR! I'll address the comments right away.

> The script looks mostly fine (you already know my opinion on large bash scripts. Fine for now but should eventually be moved to Python or Go IMHO)

I admit that Openstack Devstack changed my perception of what a large bash script means but I see your point.

@petrutlucian94 petrutlucian94 force-pushed the KU-1606/2ha_script branch 3 times, most recently from 57bd703 to aa01e0e Compare September 25, 2024 13:02

@bschimke95 bschimke95 left a comment

LGTM


@louiseschmidtgen louiseschmidtgen left a comment

Amazing work on the 2-node HA set-up @petrutlucian94!
Please iterate over my polishing comments.
I am requesting changes because I would like to discuss the alternative solution with PostgreSQL.

Resolved review threads: docs/src/snap/howto/2-node-ha.md (10, outdated)

@petrutlucian94

@louiseschmidtgen Thanks for reviewing the docs! I've addressed most comments and left a few questions.


@louiseschmidtgen louiseschmidtgen left a comment

A couple more small comments need to be addressed; afterwards we are good to go. Thank you, @petrutlucian94!

Resolved review threads: docs/src/snap/howto/two-node-ha.md (5, of which 4 outdated); k8s/hack/two-node-ha.sh (5, of which 4 outdated)

@bschimke95

Great work @petrutlucian94!

@bschimke95 bschimke95 merged commit 5af076a into canonical:main Oct 2, 2024
18 of 19 checks passed
evilnick pushed a commit to evilnick/k8s-snap that referenced this pull request Nov 14, 2024
…nonical#692)
