Added support to perform cluster promotion/demotion
Signed-off-by: Utkarsh Bhatt <[email protected]>
UtkarshBhatthere committed Oct 14, 2024
1 parent 38f0840 commit 94ae9b5
Showing 18 changed files with 648 additions and 32 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/tests.yml
@@ -709,6 +709,12 @@ jobs:
- name: Verify RBD mirror
run : ~/actionutils.sh remote_verify_rbd_mirroring

- name: Failover site A to Site B
run : ~/actionutils.sh remote_failover_to_siteb

- name: Failback to Site A
run : ~/actionutils.sh remote_failback_to_sitea

- name: Disable RBD mirror
run : ~/actionutils.sh remote_disable_rbd_mirroring
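
The remote_failover_to_siteb and remote_failback_to_sitea helpers are defined in actionutils.sh, which is not part of this diff. Below is a minimal, hypothetical sketch of what such a helper might wrap, assuming it simply drives the promote and list commands documented in this commit against an assumed remote name; the real helper likely also targets the correct node and asserts on the output.

    # hypothetical sketch -- the real helper lives in ~/actionutils.sh
    function remote_failover_to_siteb() {
        # promote all replicated RBD images on site B, treating site A as the remote
        sudo microceph remote replication rbd promote --remote siteA --force
        # confirm that the replicated images now report as primary
        sudo microceph remote replication rbd list
    }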

71 changes: 71 additions & 0 deletions docs/how-to/perform-site-failover.rst
@@ -0,0 +1,71 @@
=============================================
Perform failover for replicated RBD resources
=============================================

In case of a disaster, all replicated RBD pools can be failed over to a non-primary remote.

An operator can perform a promotion on the non-primary cluster; this in turn promotes all replicated RBD
images in all RBD pools and makes them primary, allowing them to be consumed by VMs and other workloads.

Prerequisites
--------------
1. A primary and a secondary MicroCeph cluster, for example named "primary_cluster" and "secondary_cluster".
2. primary_cluster has imported configurations from secondary_cluster and vice versa. Refer to :doc:`import remote <./import-remote-cluster>`.
3. RBD remote replication is configured for at least one RBD image. Refer to :doc:`configure RBD replication <./configure-rbd-mirroring>`. A quick check is shown below.
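
Before starting a failover, the remote configuration and the replicated images can be double-checked.
This is an optional sanity check; the ``remote list`` subcommand is assumed here to be available from
the remote-import feature referenced above.

.. code-block:: none

   # show the imported remote clusters
   sudo microceph remote list

   # show the RBD resources under remote replication
   sudo microceph remote replication rbd list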

Failover to a non-primary remote cluster
-----------------------------------------
List all the resources on 'secondary_cluster' to check primary status.

.. code-block:: none

   sudo microceph remote replication rbd list
   +-----------+------------+------------+---------------------+
   | POOL NAME | IMAGE NAME | IS PRIMARY | LAST LOCAL UPDATE   |
   +-----------+------------+------------+---------------------+
   | pool_one  | image_one  | false      | 2024-10-14 09:03:17 |
   | pool_one  | image_two  | false      | 2024-10-14 09:03:17 |
   +-----------+------------+------------+---------------------+

An operator can perform a cluster-wide promotion as follows:

.. code-block:: none

   sudo microceph remote replication rbd promote --remote primary_cluster --force

Here, the ``--remote`` parameter helps MicroCeph filter which resources to promote.
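
Without the ``--force`` flag, the promotion may be refused while the peer site still considers its images
primary (or its demotion has not yet propagated). In that case the command fails with an error roughly of
the following form (illustrative output):

.. code-block:: none

   sudo microceph remote replication rbd promote --remote primary_cluster
   Error: unable to promote pool_one, use --force if you understand the risks of this operation: ...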

Verify RBD remote replication primary status
---------------------------------------------

Check the replication status of 'pool_one' on 'secondary_cluster' again to verify the primary status.

.. code-block:: none

   sudo microceph remote replication rbd status pool_one
   +-----------+------------+------------+---------------------+
   | POOL NAME | IMAGE NAME | IS PRIMARY | LAST LOCAL UPDATE   |
   +-----------+------------+------------+---------------------+
   | pool_one  | image_one  | true       | 2024-10-14 09:06:12 |
   | pool_one  | image_two  | true       | 2024-10-14 09:06:12 |
   +-----------+------------+------------+---------------------+

The status shows that there are two replicated images and that both are now primary.

Failback to old primary
------------------------

Once the disaster-struck cluster (primary_cluster) is back online, the RBD resources can be failed
back to it. However, by this time the RBD images at the current primary (secondary_cluster) will have
diverged from primary_cluster. To achieve a clean sync, the operator must decide which cluster is to
be demoted to non-primary status. This cluster will then receive the RBD mirror updates from the
standing primary.

Note: Demotion can cause data loss and hence can only be performed with the 'force' flag.

At primary_cluster (which was the primary before the disaster), perform the demotion:

.. code-block:: none

   sudo microceph remote replication rbd demote
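
As an optional follow-up check, the replication status can be listed again (with the ``list`` command
shown earlier) to confirm the primary status of each image:

.. code-block:: none

   sudo microceph remote replication rbd list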
29 changes: 29 additions & 0 deletions docs/reference/commands/remote-replication-rbd.rst
@@ -96,3 +96,32 @@ Usage:
--force forcefully disable replication for rbd resource
``promote``
-----------

Promote local cluster to primary

Usage:

.. code-block:: none

   microceph remote replication rbd promote [flags]

.. code-block:: none

   --remote remote MicroCeph cluster name
   --force forcefully promote site to primary
``demote``
------------

Demote local cluster to secondary

Usage:

.. code-block:: none

   microceph remote replication rbd demote [flags]

.. code-block:: none

   --remote remote MicroCeph cluster name
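
An illustrative invocation, assuming a remote named "secondary_cluster" has been imported:

.. code-block:: none

   microceph remote replication rbd demote --remote secondary_cluster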
10 changes: 10 additions & 0 deletions microceph/api/ops_replication.go
@@ -31,6 +31,7 @@ var opsReplicationCmd = rest.Endpoint{
var opsReplicationWorkloadCmd = rest.Endpoint{
Path: "ops/replication/{wl}",
Get: rest.EndpointAction{Handler: getOpsReplicationWorkload, ProxyTarget: false},
Put: rest.EndpointAction{Handler: putOpsReplicationWorkload, ProxyTarget: false},
}

// CRUD Replication
@@ -47,6 +48,12 @@ func getOpsReplicationWorkload(s state.State, r *http.Request) response.Response
return cmdOpsReplication(s, r, types.ListReplicationRequest)
}

// putOpsReplicationWorkload handles promote and demote operations for the requested workload
func putOpsReplicationWorkload(s state.State, r *http.Request) response.Response {
// either promote or demote (already encoded in request)
return cmdOpsReplication(s, r, "")
}

// getOpsReplicationResource handles status operation for a certain resource.
func getOpsReplicationResource(s state.State, r *http.Request) response.Response {
return cmdOpsReplication(s, r, types.StatusReplicationRequest)
@@ -104,6 +111,9 @@ func cmdOpsReplication(s state.State, r *http.Request, patchRequest types.Replic
return response.SmartError(fmt.Errorf(""))
}

// TODO: convert this to debug
logger.Infof("REPOPS: %s received for %s: %s", req.GetWorkloadRequestType(), wl, resource)

return handleReplicationRequest(s, r.Context(), req)
}

10 changes: 7 additions & 3 deletions microceph/api/types/replication.go
@@ -11,9 +11,13 @@ type ReplicationRequestType string
const (
EnableReplicationRequest ReplicationRequestType = "POST-" + constants.EventEnableReplication
ConfigureReplicationRequest ReplicationRequestType = "PUT-" + constants.EventConfigureReplication
DisableReplicationRequest ReplicationRequestType = "DELETE-" + constants.EventDisableReplication
StatusReplicationRequest ReplicationRequestType = "GET-" + constants.EventStatusReplication
ListReplicationRequest ReplicationRequestType = "GET-" + constants.EventListReplication
PromoteReplicationRequest ReplicationRequestType = "PUT-" + constants.EventPromoteReplication
DemoteReplicationRequest ReplicationRequestType = "PUT-" + constants.EventDemoteReplication
// Delete Requests
DisableReplicationRequest ReplicationRequestType = "DELETE-" + constants.EventDisableReplication
// Get Requests
StatusReplicationRequest ReplicationRequestType = "GET-" + constants.EventStatusReplication
ListReplicationRequest ReplicationRequestType = "GET-" + constants.EventListReplication
)

type CephWorkloadType string
120 changes: 118 additions & 2 deletions microceph/ceph/rbd_mirror.go
@@ -211,8 +211,8 @@ func DisablePoolMirroring(pool string, peer RbdReplicationPeer, localName string
return nil
}

// DisableMirroringAllImagesInPool disables mirroring for all images for a pool enabled in pool mirroring mode.
func DisableMirroringAllImagesInPool(poolName string) error {
// DisableAllMirroringImagesInPool disables mirroring for all images for a pool enabled in pool mirroring mode.
func DisableAllMirroringImagesInPool(poolName string) error {
poolStatus, err := GetRbdMirrorVerbosePoolStatus(poolName, "", "")
if err != nil {
err := fmt.Errorf("failed to fetch status for %s pool: %v", poolName, err)
@@ -233,6 +233,28 @@ func DisableMirroringAllImagesInPool(poolName string) error {
return nil
}

// ResyncAllMirroringImagesInPool flags all mirroring images in the given pool for resync.
func ResyncAllMirroringImagesInPool(poolName string) error {
poolStatus, err := GetRbdMirrorVerbosePoolStatus(poolName, "", "")
if err != nil {
err := fmt.Errorf("failed to fetch status for %s pool: %v", poolName, err)
logger.Error(err.Error())
return err
}

flaggedImages := []string{}
for _, image := range poolStatus.Images {
err := flagImageForResync(poolName, image.Name)
if err != nil {
return fmt.Errorf("failed to resync %s/%s", poolName, image.Name)
}
flaggedImages = append(flaggedImages, image.Name)
}

logger.Infof("REPRBD: Resynced %v images in %s pool.", flaggedImages, poolName)
return nil
}

// getPeerUUID returns the peer ID for the requested peer name.
func getPeerUUID(pool string, peerName string, client string, cluster string) string {
poolInfo, err := GetRbdMirrorPoolInfo(pool, cluster, client)
@@ -464,6 +486,60 @@ func configureImageFeatures(pool string, image string, op string, feature string
return nil
}

// enableRbdImageFeatures enables the list of rbd features on the requested resource.
func enableRbdImageFeatures(poolName string, imageName string, features []string) error {
for _, feature := range features {
err := configureImageFeatures(poolName, imageName, "enable", feature)
if err != nil && !strings.Contains(err.Error(), "one or more requested features are already enabled") {
return err
}
}
return nil
}

// disableRbdImageFeatures disables the list of rbd features on the requested resource.
func disableRbdImageFeatures(poolName string, imageName string, features []string) error {
for _, feature := range features {
err := configureImageFeatures(poolName, imageName, "disable", feature)
if err != nil {
return err
}
}
return nil
}

// handlePoolPromotion promotes the local pool to primary.
func handlePoolPromotion(rh *RbdReplicationHandler, poolName string) error {
err := promotePool(poolName, rh.Request.IsForceOp, "", "")
if err != nil {
if strings.Contains(err.Error(), "image is primary within a remote cluster or demotion is not propagated yet") {
return fmt.Errorf("unable to promote %s, use --force if you understand the risks of this operation: %v", poolName, err)
}

return err
}
return nil
}

// handlePoolDemotion demotes the local pool to secondary.
func handlePoolDemotion(_ *RbdReplicationHandler, poolName string) error {
return demotePool(poolName, "", "")
}

// flagImageForResync flags requested mirroring image in the given pool for resync.
func flagImageForResync(poolName string, imageName string) error {
args := []string{
"mirror", "image", "resync", fmt.Sprintf("%s/%s", poolName, imageName),
}

_, err := processExec.RunCommand("rbd", args...)
if err != nil {
return err
}

return nil
}

// peerBootstrapCreate generates peer bootstrap token on remote ceph cluster.
func peerBootstrapCreate(pool string, client string, cluster string) (string, error) {
args := []string{
@@ -525,6 +601,46 @@ func peerRemove(pool string, peerId string, localName string, remoteName string)
return nil
}

func promotePool(poolName string, isForce bool, remoteName string, localName string) error {
args := []string{
"mirror", "pool", "promote", poolName,
}

if isForce {
args = append(args, "--force")
}

// add --cluster and --id args
args = appendRemoteClusterArgs(args, remoteName, localName)

output, err := processExec.RunCommand("rbd", args...)
if err != nil {
return fmt.Errorf("failed to promote pool(%s): %v", poolName, err)
}

// TODO: Change to debugf
logger.Infof("REPRBD: Promotion Output: %s", output)
return nil
}

func demotePool(poolName string, remoteName string, localName string) error {
args := []string{
"mirror", "pool", "demote", poolName,
}

// add --cluster and --id args
args = appendRemoteClusterArgs(args, remoteName, localName)

output, err := processExec.RunCommand("rbd", args...)
if err != nil {
return fmt.Errorf("failed to promote pool(%s): %v", poolName, err)
}

// TODO: Change to debugf
logger.Infof("REPRBD: Demotion Output: %s", output)
return nil
}

// ########################### HELPERS ###########################

func IsRemoteConfiguredForRbdMirror(remoteName string) bool {
40 changes: 40 additions & 0 deletions microceph/ceph/rbd_mirror_test.go
@@ -1,6 +1,7 @@
package ceph

import (
"fmt"
"os"
"testing"

@@ -93,3 +94,42 @@ func (ks *RbdMirrorSuite) TestPoolInfo() {
assert.Equal(ks.T(), resp.LocalSiteName, "magical")
assert.Equal(ks.T(), resp.Peers[0].RemoteName, "simple")
}
func (ks *RbdMirrorSuite) TestPromotePoolOnSecondary() {
r := mocks.NewRunner(ks.T())
output, _ := os.ReadFile("./test_assets/rbd_mirror_promote_secondary_failure.txt")

// mocks and expectations
r.On("RunCommand", []interface{}{
"rbd", "mirror", "pool", "promote", "pool"}...).Return("", fmt.Errorf("%s", string(output))).Once()
r.On("RunCommand", []interface{}{
"rbd", "mirror", "pool", "promote", "pool", "--force"}...).Return("ok", nil).Once()
processExec = r

// Method call
rh := RbdReplicationHandler{}

// Test standard promotion (without --force).
rh.Request.IsForceOp = false
err := handlePoolPromotion(&rh, "pool")
assert.ErrorContains(ks.T(), err, "use --force if you understand the risks of this operation")

rh.Request.IsForceOp = true
err = handlePoolPromotion(&rh, "pool")
assert.NoError(ks.T(), err)
}

func (ks *RbdMirrorSuite) TestDemotePoolOnSecondary() {
r := mocks.NewRunner(ks.T())

// mocks and expectations
r.On("RunCommand", []interface{}{
"rbd", "mirror", "pool", "demote", "pool"}...).Return("ok", nil).Once()
processExec = r

// Method call
rh := RbdReplicationHandler{}

// Test standard demotion.
err := handlePoolDemotion(&rh, "pool")
assert.NoError(ks.T(), err)
}