
Non-zero downtime for failover mechanism master-replica #275

Open
nikatar opened this issue Dec 13, 2024 · 2 comments

Comments

@nikatar commented Dec 13, 2024

How to Reproduce

  1. Deploy Dragonfly in a master-replica configuration using the operator with the following configuration:

    apiVersion: dragonflydb.io/v1alpha1
    kind: Dragonfly
    metadata:
      name: dragonfly-clicker
      namespace: redis
      labels:
        app.kubernetes.io/name: dragonfly
        app.kubernetes.io/instance: dragonfly-clicker
        app.kubernetes.io/part-of: dragonfly-operator
        app.kubernetes.io/managed-by: kustomize
        app.kubernetes.io/created-by: dragonfly-operator
    spec:
      image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.25.5
      replicas: 2
      resources:
        requests:
          cpu: 600m
          memory: 4Gi
        limits:
          memory: 6Gi
      args: ["--dbfilename=backup"]
      snapshot:
        cron: "*/5 * * * *"
        persistentVolumeClaimSpec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
  2. Connect an application or a simple script to Dragonfly that regularly accesses it (for example, via the service dragonfly-clicker.redis.svc.cluster.local:6379); a minimal client sketch follows this list.

  3. Delete the pod with the role of "master".
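
For reference, a minimal sketch of the kind of script meant in step 2, assuming go-redis v9 (the client library, key name, and interval are illustrative, not necessarily what was actually run):

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/redis/go-redis/v9"
    )

    func main() {
        // Hypothetical probe: write a key every 500ms and log any failure,
        // so the failover window shows up as a burst of errors.
        rdb := redis.NewClient(&redis.Options{
            Addr:        "dragonfly-clicker.redis.svc.cluster.local:6379",
            DialTimeout: 2 * time.Second,
        })
        ctx := context.Background()
        for {
            if err := rdb.Set(ctx, "probe", time.Now().Unix(), 0).Err(); err != nil {
                log.Printf("click worker error: %v", err)
            }
            time.Sleep(500 * time.Millisecond)
        }
    }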

After this, for some time, you will see errors like:

"level":"ERROR","ts":"2024-12-13T12:33:19.406Z","caller":"click/service.go:137","msg":"click worker","error":"connection to redis: dial tcp 172.27.233.230:6379

Current Behavior

  • The pod with the role of "master" is terminated.
  • A new pod is started.
  • The new pod becomes a replica.
  • The old replica is promoted to "master".
  • Dragonfly becomes available only after the master and the replica have synchronized (i.e., after the endpoint has been updated).
  • Until this process completes (which can take a long time with large data volumes; with 10GB of cache, for example, it took a minute or more), Dragonfly is unavailable to the application because the endpoint has not been updated.
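
One way to see exactly when the endpoint flips (and to time the unavailable window) is to poll INFO replication through the service. A hedged sketch, reusing rdb and ctx from the script above and adding "strings" to its imports:

    // Poll the role reported through the service. Errors mark the downtime
    // window; the role line shows when the promoted node becomes reachable.
    for {
        info, err := rdb.Info(ctx, "replication").Result()
        switch {
        case err != nil:
            log.Printf("endpoint unreachable: %v", err)
        case strings.Contains(info, "role:master"):
            log.Printf("endpoint now serves a master")
        default:
            log.Printf("endpoint still serves a replica")
        }
        time.Sleep(250 * time.Millisecond)
    }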

Expected Behavior

  • When the master pod fails, the remaining replica immediately becomes the "master".
  • The endpoint updates instantly.
  • Downtime is either nonexistent or minimal.
  • A new pod is started after the switch, connects to the new master as a replica, and synchronizes without causing any downtime in serving requests.
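
For comparison, the immediate promotion expected above corresponds to running REPLICAOF NO ONE on the surviving replica before the replacement pod is ready. A hedged sketch of doing that manually with the same client (here rdb must point at the replica pod directly, not at the service):

    // Manually promote the surviving replica; this is what the operator is
    // expected to do immediately when the master pod disappears.
    if err := rdb.Do(ctx, "REPLICAOF", "NO", "ONE").Err(); err != nil {
        log.Printf("manual promotion failed: %v", err)
    }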

Environment

  • EKS 1.29
  • dragonfly-operator v1.1.18
  • dragonfly v1.25.5
@Pothulapati (Collaborator)

The behaviour you describe is indeed expected. There will be some downtime, as you mentioned, because Kubernetes service updates are eventually consistent, and it takes a moment before all traffic is routed to the "new" master (i.e. the old replica).

Currently, I don't think the service update waits for all the replicas to be synchronized. How did you determine that this is the case?
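
For context, a sketch of the routing involved: I believe the operator labels the current master pod (e.g. with a role label) and the Service selects on it, so traffic only moves once the label change and the resulting endpoint update propagate. The field values below are illustrative, not the operator's exact manifest:

    apiVersion: v1
    kind: Service
    metadata:
      name: dragonfly-clicker
      namespace: redis
    spec:
      selector:
        app: dragonfly-clicker
        role: master        # only the pod currently labeled as master receives traffic
      ports:
        - port: 6379
          targetPort: 6379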

Thanks

@K-MeLeOn

On my side, the failover is immediate, but my application that uses the database crashes immediately and restarts when an HTTP request is made. In my specific case this is an application problem: I need to set a timeout. In other cases it may be a question of liveness probes or health checks.
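
For what it's worth, a hedged sketch of the client-side timeout and retry settings I mean, assuming go-redis v9 (the values are illustrative):

    // Bounded timeouts and retries so a brief failover surfaces as a few
    // slow or failed commands instead of crashing the application.
    rdb := redis.NewClient(&redis.Options{
        Addr:            "dragonfly-clicker.redis.svc.cluster.local:6379",
        DialTimeout:     2 * time.Second,
        ReadTimeout:     2 * time.Second,
        WriteTimeout:    2 * time.Second,
        MaxRetries:      5,
        MinRetryBackoff: 100 * time.Millisecond,
        MaxRetryBackoff: 2 * time.Second,
    })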

Maybe that's where the problem is, @nikatar?
