k8sd cluster-recover: add non-interactive mode (#662)

At the moment, the "k8sd cluster-recover" displays interactive prompts and text editors that assist the user in updating the dqlite configuration. We need to be able to run the command non-interactively in order to automate the cluster recovery procedure. This change adds a "--non-interactive" flag. If set, we'll no longer show confirmation prompts and we'll assume that the configuration files have already been updated, proceeding with the dqlite recovery.
canonical · Sep 16, 2024 · a47d342 · a47d342
1 parent 522a161
commit a47d342
Show file tree

Hide file tree

Showing 2 changed files with 183 additions and 123 deletions.
diff --git a/docs/src/snap/howto/restore-quorum.md b/docs/src/snap/howto/restore-quorum.md
@@ -1,9 +1,9 @@
 # Recovering a Cluster After Quorum Loss
 
 Highly available {{product}} clusters can survive losing one or more
-nodes. [Dqlite], the default datastore, implements a [Raft] based protocol where
-an elected leader holds the definitive copy of the database, which is then
-replicated on two or more secondary nodes.
+nodes. [Dqlite], the default datastore, implements a [Raft] based protocol
+where an elected leader holds the definitive copy of the database, which is
+then replicated on two or more secondary nodes.
 
 When the a majority of the nodes are lost, the cluster becomes unavailable.
 If at least one database node survived, the cluster can be recovered using the
@@ -64,16 +64,17 @@ sudo snap stop k8s
 
 ## Recover the Database
 
-Choose one of the remaining alive cluster nodes that has the most recent version
-of the Raft log.
+Choose one of the remaining alive cluster nodes that has the most recent
+version of the Raft log.
 
 Update the ``cluster.yaml`` files, changing the role of the lost nodes to
 "spare" (2). Additionally, double check the addresses and IDs specified in
 ``cluster.yaml``, ``info.yaml`` and ``daemon.yaml``, especially if database
 files were moved across nodes.
 
 The following command guides us through the recovery process, prompting a text
-editor with informative inline comments for each of the dqlite configuration files.
+editor with informative inline comments for each of the dqlite configuration
+files.
 
 ```
 sudo /snap/k8s/current/bin/k8sd cluster-recover \
@@ -82,29 +83,40 @@ sudo /snap/k8s/current/bin/k8sd cluster-recover \
     --log-level 0
 ```
 
-Please adjust the log level for additional debug messages by increasing its value.
-The command creates database backups before making any changes.
+Please adjust the log level for additional debug messages by increasing its
+value. The command creates database backups before making any changes.
 
-The above command will reconfigure the Raft members and create recovery tarballs
-that are used to restore the lost nodes, once the Dqlite configuration is updated.
+The above command will reconfigure the Raft members and create recovery
+tarballs that are used to restore the lost nodes, once the Dqlite
+configuration is updated.
 
 ```{note}
-By default, the command will recover both Dqlite databases. If one of the databases
-needs to be skipped, use the ``--skip-k8sd`` or ``--skip-k8s-dqlite`` flags.
-This can be useful when using an external Etcd database.
+By default, the command will recover both Dqlite databases. If one of the
+databases needs to be skipped, use the ``--skip-k8sd`` or ``--skip-k8s-dqlite``
+flags. This can be useful when using an external Etcd database.
 ```
 
-Once the "cluster-recover" command completes, restart the k8s services on the node:
+```{note}
+Non-interactive mode can be requested using the ``--non-interactive`` flag.
+In this case, no interactive prompts or text editors will be displayed and
+the command will assume that the configuration files have already been updated.
+
+This allows automating the recovery procedure.
+```
+
+Once the "cluster-recover" command completes, restart the k8s services on the
+node:
 
 ```
 sudo snap start k8s
 ```
 
-Ensure that the services started successfully by using ``sudo snap services k8s``.
-Use ``k8s status --wait-ready`` to wait for the cluster to become ready.
+Ensure that the services started successfully by using
+``sudo snap services k8s``. Use ``k8s status --wait-ready`` to wait for the
+cluster to become ready.
 
-You may notice that we have not returned to an HA cluster yet: ``high availability: no``.
-This is expected as we need to recover
+You may notice that we have not returned to an HA cluster yet:
+``high availability: no``. This is expected as we need to recover
 
 ## Recover the remaining nodes
 
@@ -113,28 +125,34 @@ nodes.
 
 For k8sd, copy ``recovery_db.tar.gz`` to
 ``/var/snap/k8s/common/var/lib/k8sd/state/recovery_db.tar.gz``. When the k8sd
-service starts, it will load the archive and perform the necessary recovery steps.
+service starts, it will load the archive and perform the necessary recovery
+steps.
 
 The k8s-dqlite archive needs to be extracted manually. First, create a backup
 of the current k8s-dqlite state directory:
 
 ```
-sudo mv /var/snap/k8s/common/var/lib/k8s-dqlite /var/snap/k8s/common/var/lib/k8s-dqlite.bkp
+sudo mv /var/snap/k8s/common/var/lib/k8s-dqlite \
+  /var/snap/k8s/common/var/lib/k8s-dqlite.bkp
 ```
 
 Then, extract the backup archive:
 
 ```
 sudo mkdir /var/snap/k8s/common/var/lib/k8s-dqlite
-sudo tar xf  recovery-k8s-dqlite-$timestamp-post-recovery.tar.gz -C /var/snap/k8s/common/var/lib/k8s-dqlite
+sudo tar xf  recovery-k8s-dqlite-$timestamp-post-recovery.tar.gz \
+  -C /var/snap/k8s/common/var/lib/k8s-dqlite
 ```
 
 Node specific files need to be copied back to the k8s-dqlite state dir:
 
 ```
-sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/cluster.crt /var/snap/k8s/common/var/lib/k8s-dqlite
-sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/cluster.key /var/snap/k8s/common/var/lib/k8s-dqlite
-sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/info.yaml /var/snap/k8s/common/var/lib/k8s-dqlite
+sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/cluster.crt \
+  /var/snap/k8s/common/var/lib/k8s-dqlite
+sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/cluster.key \
+  /var/snap/k8s/common/var/lib/k8s-dqlite
+sudo cp /var/snap/k8s/common/var/lib/k8s-dqlite.bkp/info.yaml \
+  /var/snap/k8s/common/var/lib/k8s-dqlite
 ```
 
 Once these steps are completed, restart the k8s services:
@@ -143,13 +161,15 @@ Once these steps are completed, restart the k8s services:
 sudo snap start k8s
 ```
 
-Repeat these steps for all remaining nodes. Once a quorum is achieved, the cluster
-will be reported as "highly available":
+Repeat these steps for all remaining nodes. Once a quorum is achieved,
+the cluster will be reported as "highly available":
 
 ```
 $ sudo k8s status
 cluster status:           ready
-control plane nodes:      10.80.130.168:6400 (voter), 10.80.130.167:6400 (voter), 10.80.130.164:6400 (voter)
+control plane nodes:      10.80.130.168:6400 (voter),
+                          10.80.130.167:6400 (voter),
+                          10.80.130.164:6400 (voter)
 high availability:        yes
 datastore:                k8s-dqlite
 network:                  enabled