Add "2ha.sh" script, managing 2-node Canonical K8s HA AA clusters #692

petrutlucian94 · 2024-09-23T13:46:34Z

Scenario overview:

- Canonical K8s cluster containing 2 nodes
- Dqlite data store (unable to obtain quorum)
- Primary node Dqlite files stored on DRBD
  - synchronous block-level replication between the two nodes
- cluster monitoring and failover handled through Pacemaker

Script functionality:

- bootstrap the service
  - wait for a DRBD primary to be elected
  - detect the node role based on the DRBD status and Dqlite state
    - have the replica wait for the primary to be ready before continuing
  - recover Dqlite after failovers
  - transfer and apply recovery files to secondary nodes
  - transfer Dqlite files to DRBD and other backup locations, creating necessary symlinks
- install required packages
- purge all K8s data
- clear Pacemaker taints
- remove recovery data
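
To illustrate the "wait for a DRBD primary" and "detect the node role" steps, here is a minimal sketch. The function names `wait_until` and `node_role` are hypothetical and are not taken from the script. `Primary` and `Secondary` are the role strings DRBD reports, but how the script obtains and combines them with the Dqlite state is not shown here.

```shell
# Hypothetical helpers for illustration only; not the code from 2ha.sh.

# Retry a command until it succeeds or the timeout (in seconds) expires.
# Usage: wait_until <timeout> <interval> <command> [args...]
wait_until() {
    local timeout="$1" interval="$2"
    shift 2
    local deadline=$(( $(date +%s) + timeout ))
    until "$@"; do
        # Give up once the deadline has passed.
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}

# Map a DRBD role string to the role this node should assume.
node_role() {
    case "$1" in
        Primary)   echo "primary" ;;
        Secondary) echo "replica" ;;
        *)         echo "unknown" ;;
    esac
}
```

For example, `wait_until 300 5 some_check` would retry `some_check` every 5 seconds for up to 5 minutes, where `some_check` stands for whatever command reports that a DRBD primary exists.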

"2ha.sh start_service" is intended to be used as part of a systemd unit that bootstraps the k8s services, coordinating with the other node and taking any necessary steps to recover Dqlite.
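
As a rough sketch of how such a unit might look: the description text, the script path and the service type below are assumptions made for illustration, not values taken from this PR. The how-to guide added here is the reference for the actual unit.

```ini
# Hypothetical unit file; the path and options are assumptions.
[Unit]
Description=Canonical K8s two-node HA bootstrap wrapper (illustrative)
After=network-online.target
Wants=network-online.target

[Service]
# Assumes the script returns once the services have been started.
Type=oneshot
RemainAfterExit=yes
# Assumed install location of the script.
ExecStart=/bin/bash /usr/local/bin/2ha.sh start_service
```

The `[Install]` section is left out on purpose, on the assumption that Pacemaker, rather than systemd's boot targets, decides when the unit runs.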

This PR also adds a "how-to" guide for the 2-node A-A HA setup.

bschimke95

Did an initial pass. Consider the rephrasing as suggestions and feel free to ignore them

The script looks mostly fine (you already know my opinion on large bash scripts. Fine for now but should eventually be moved to Python or Go IMHO)

docs/src/snap/howto/2-node-ha.md

k8s/hack/2ha.sh

petrutlucian94 · 2024-09-25T08:17:24Z

Thanks for reviewing this PR! I'll address the comments right away.

> The script looks mostly fine (you already know my opinion on large bash scripts. Fine for now but should eventually be moved to Python or Go IMHO)

I admit that OpenStack DevStack changed my perception of what a large bash script means, but I see your point.

bschimke95

LGTM

louiseschmidtgen

Amazing work on the 2-node HA set-up @petrutlucian94!
Please iterate over my polishing comments.
I am requesting changes because I would like to discuss the alternative solution with PostgreSQL.

docs/src/snap/howto/2-node-ha.md

petrutlucian94 · 2024-09-30T07:47:16Z

@louiseschmidtgen Thanks for reviewing the docs! I've addressed most comments and left a few questions.

docs/src/snap/howto/two-node-ha.md

louiseschmidtgen

A couple more small comments need to be addressed; afterwards we are good to go. Thank you, @petrutlucian94!

docs/src/snap/howto/two-node-ha.md

k8s/hack/two-node-ha.sh

bschimke95 · 2024-10-02T08:46:46Z

Great work @petrutlucian94!

petrutlucian94 requested a review from a team as a code owner September 23, 2024 13:46

Add 2-node HA guide

a4e8828

We're adding a guide that covers the 2-node A-A HA scenario.

bschimke95 reviewed Sep 25, 2024

petrutlucian94 force-pushed the KU-1606/2ha_script branch 3 times, most recently from 57bd703 to aa01e0e Compare September 25, 2024 13:02

Update docs as per PR comments

33ed437

petrutlucian94 force-pushed the KU-1606/2ha_script branch from aa01e0e to 33ed437 Compare September 25, 2024 13:08

bschimke95 approved these changes Sep 25, 2024

louiseschmidtgen requested changes Sep 26, 2024

petrutlucian94 added 3 commits September 30, 2024 09:50

Rename 2ha.sh to two-node-ha.sh

0f8bbca

s/2-node/two-node

b1925f2

Address comments

8d690c3

petrutlucian94 force-pushed the KU-1606/2ha_script branch from 30d0367 to 8d690c3 Compare September 30, 2024 07:36

louiseschmidtgen reviewed Sep 30, 2024

Remove empty lines and add separate note about the A-A cluster

97c287b

louiseschmidtgen approved these changes Sep 30, 2024

petrutlucian94 added 2 commits September 30, 2024 16:33

Address comments

3729dc2

add warning to troubleshooting section

1136524

bschimke95 merged commit 5af076a into canonical:main Oct 2, 2024
18 of 19 checks passed
