diff --git a/commands/bitop.md b/commands/bitop.md index 354e196d..0e67dff0 100644 --- a/commands/bitop.md +++ b/commands/bitop.md @@ -56,4 +56,4 @@ Care should be taken when running it against long input strings. For real-time metrics and statistics involving large inputs a good approach is to use a replica (with replica-read-only option enabled) where the bit-wise -operations are performed to avoid blocking the master instance. +operations are performed to avoid blocking the primary instance. diff --git a/commands/client-list.md b/commands/client-list.md index a50f4c1c..8afb985a 100644 --- a/commands/client-list.md +++ b/commands/client-list.md @@ -49,7 +49,7 @@ c: connection to be closed after writing entire reply d: a watched keys has been modified - EXEC will fail e: the client is excluded from the client eviction mechanism i: the client is waiting for a VM I/O (deprecated) -M: the client is a master +M: the client is a primary N: no specific flag set O: the client is a client in MONITOR mode P: the client is a Pub/Sub subscriber diff --git a/commands/client-pause.md b/commands/client-pause.md index 18d974dc..4decf975 100644 --- a/commands/client-pause.md +++ b/commands/client-pause.md @@ -21,12 +21,12 @@ For the `WRITE` mode, some commands have special behavior: This command is useful as it makes able to switch clients from a Valkey instance to another one in a controlled way. For example during an instance upgrade the system administrator could do the following: * Pause the clients using `CLIENT PAUSE` -* Wait a few seconds to make sure the replicas processed the latest replication stream from the master. -* Turn one of the replicas into a master. -* Reconfigure clients to connect with the new master. +* Wait a few seconds to make sure the replicas processed the latest replication stream from the primary. +* Turn one of the replicas into a primary. +* Reconfigure clients to connect with the new primary. The recommended mode for client pause is `WRITE`. This mode will stop all replication traffic, can be -aborted with the `CLIENT UNPAUSE` command, and allows reconfiguring the old master without risking accepting writes after the +aborted with the `CLIENT UNPAUSE` command, and allows reconfiguring the old primary without risking accepting writes after the failover. This is also the mode used during cluster failover. This command also prevents keys to be evicted or expired during the time clients are paused. diff --git a/commands/cluster-addslots.md b/commands/cluster-addslots.md index d5001e77..dd650a81 100644 --- a/commands/cluster-addslots.md +++ b/commands/cluster-addslots.md @@ -29,7 +29,7 @@ are already assigned: This command only works in cluster mode and is useful in the following Valkey Cluster operations: -1. To create a new `cluster ADDSLOTS` is used in order to initially setup master nodes splitting the available hash slots among them. +1. To create a new `cluster ADDSLOTS` is used in order to initially setup primary nodes splitting the available hash slots among them. 2. In order to fix a broken cluster where certain slots are unassigned. 
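As an illustration of the first case, here is a minimal sketch of an initial slot split across three primaries (the hostnames and ports are placeholders, and `valkey-cli` with `seq` is just one way to expand the ranges, since `CLUSTER ADDSLOTS` accepts individual slot numbers rather than ranges):

```
valkey-cli -h node-a -p 6379 CLUSTER ADDSLOTS $(seq 0 5460)
valkey-cli -h node-b -p 6379 CLUSTER ADDSLOTS $(seq 5461 10922)
valkey-cli -h node-c -p 6379 CLUSTER ADDSLOTS $(seq 10923 16383)
```

On versions that support it, `CLUSTER ADDSLOTSRANGE` achieves the same result without expanding the ranges.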
## Information about slots propagation and warnings diff --git a/commands/cluster-addslotsrange.md b/commands/cluster-addslotsrange.md index eb5816ab..77f8c991 100644 --- a/commands/cluster-addslotsrange.md +++ b/commands/cluster-addslotsrange.md @@ -19,5 +19,5 @@ The same operation can be completed with the following `CLUSTER ADDSLOTSRANGE` c This command only works in cluster mode and is useful in the following Valkey Cluster operations: -1. To create a new cluster, `CLUSTER ADDSLOTSRANGE` is used to initially set up master nodes splitting the available hash slots among them. +1. To create a new cluster, `CLUSTER ADDSLOTSRANGE` is used to initially set up primary nodes splitting the available hash slots among them. 2. In order to fix a broken cluster where certain slots are unassigned. diff --git a/commands/cluster-count-failure-reports.md b/commands/cluster-count-failure-reports.md index 5cd7541e..61c80226 100644 --- a/commands/cluster-count-failure-reports.md +++ b/commands/cluster-count-failure-reports.md @@ -1,7 +1,7 @@ The command returns the number of *failure reports* for the specified node. Failure reports are the way Valkey Cluster uses in order to promote a `PFAIL` state, that means a node is not reachable, to a `FAIL` state, -that means that the majority of masters in the cluster agreed within +that means that the majority of primaries in the cluster agreed within a window of time that the node is not reachable. A few more details: @@ -10,7 +10,7 @@ A few more details: * Nodes in `PFAIL` state are provided in gossip sections of heartbeat packets. * Every time a node processes gossip packets from other nodes, it creates (and refreshes the TTL if needed) **failure reports**, remembering that a given node said another given node is in `PFAIL` condition. * Each failure report has a time to live of two times the *node timeout* time. -* If at a given time a node has another node flagged with `PFAIL`, and at the same time collected the majority of other master nodes *failure reports* about this node (including itself if it is a master), then it elevates the failure state of the node from `PFAIL` to `FAIL`, and broadcasts a message forcing all the nodes that can be reached to flag the node as `FAIL`. +* If at a given time a node has another node flagged with `PFAIL`, and at the same time collected the majority of other primary nodes *failure reports* about this node (including itself if it is a primary), then it elevates the failure state of the node from `PFAIL` to `FAIL`, and broadcasts a message forcing all the nodes that can be reached to flag the node as `FAIL`. This command returns the number of failure reports for the current node which are currently not expired (so received within two times the *node timeout* time). The count does not include what the node we are asking this count believes about the node ID we pass as argument, the count *only* includes the failure reports the node received from other nodes. diff --git a/commands/cluster-delslots.md b/commands/cluster-delslots.md index 8c37619f..e7c7e9ac 100644 --- a/commands/cluster-delslots.md +++ b/commands/cluster-delslots.md @@ -1,8 +1,8 @@ -In Valkey Cluster, each node keeps track of which master is serving +In Valkey Cluster, each node keeps track of which primary is serving a particular hash slot. The `CLUSTER DELSLOTS` command asks a particular Valkey Cluster node to -forget which master is serving the hash slots specified as arguments. +forget which primary is serving the hash slots specified as arguments. 
In the context of a node that has received a `CLUSTER DELSLOTS` command and has consequently removed the associations for the passed hash slots, diff --git a/commands/cluster-failover.md b/commands/cluster-failover.md index 05e6360e..ebdf5da2 100644 --- a/commands/cluster-failover.md +++ b/commands/cluster-failover.md @@ -1,40 +1,40 @@ This command, that can only be sent to a Valkey Cluster replica node, forces -the replica to start a manual failover of its master instance. +the replica to start a manual failover of its primary instance. A manual failover is a special kind of failover that is usually executed when -there are no actual failures, but we wish to swap the current master with one +there are no actual failures, but we wish to swap the current primary with one of its replicas (which is the node we send the command to), in a safe way, without any window for data loss. It works in the following way: -1. The replica tells the master to stop processing queries from clients. -2. The master replies to the replica with the current *replication offset*. -3. The replica waits for the replication offset to match on its side, to make sure it processed all the data from the master before it continues. -4. The replica starts a failover, obtains a new configuration epoch from the majority of the masters, and broadcasts the new configuration. -5. The old master receives the configuration update: unblocks its clients and starts replying with redirection messages so that they'll continue the chat with the new master. +1. The replica tells the primary to stop processing queries from clients. +2. The primary replies to the replica with the current *replication offset*. +3. The replica waits for the replication offset to match on its side, to make sure it processed all the data from the primary before it continues. +4. The replica starts a failover, obtains a new configuration epoch from the majority of the primaries, and broadcasts the new configuration. +5. The old primary receives the configuration update: unblocks its clients and starts replying with redirection messages so that they'll continue the chat with the new primary. -This way clients are moved away from the old master to the new master -atomically and only when the replica that is turning into the new master -has processed all of the replication stream from the old master. +This way clients are moved away from the old primary to the new primary +atomically and only when the replica that is turning into the new primary +has processed all of the replication stream from the old primary. -## FORCE option: manual failover when the master is down +## FORCE option: manual failover when the primary is down The command behavior can be modified by two options: **FORCE** and **TAKEOVER**. If the **FORCE** option is given, the replica does not perform any handshake -with the master, that may be not reachable, but instead just starts a +with the primary, that may be not reachable, but instead just starts a failover ASAP starting from point 4. This is useful when we want to start -a manual failover while the master is no longer reachable. +a manual failover while the primary is no longer reachable. -However using **FORCE** we still need the majority of masters to be available +However using **FORCE** we still need the majority of primaries to be available in order to authorize the failover and generate a new configuration epoch -for the replica that is going to become master. +for the replica that is going to become primary. 
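For example, a forced manual failover might look like the following sketch, run against the replica that should take over (the prompt and output are illustrative only; remember that the `OK` reply only means the failover was scheduled, not that it succeeded):

```
> CLUSTER FAILOVER FORCE
OK
> INFO replication
# Replication
role:master
...
```

The promotion can also be verified with `ROLE` or `CLUSTER NODES`, as described below.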
## TAKEOVER option: manual failover without cluster consensus There are situations where this is not enough, and we want a replica to failover without any agreement with the rest of the cluster. A real world use case -for this is to mass promote replicas in a different data center to masters -in order to perform a data center switch, while all the masters are down +for this is to mass promote replicas in a different data center to primaries +in order to perform a data center switch, while all the primaries are down or partitioned away. The **TAKEOVER** option implies everything **FORCE** implies, but also does @@ -42,7 +42,7 @@ not uses any cluster authorization in order to failover. A replica receiving `CLUSTER FAILOVER TAKEOVER` will instead: 1. Generate a new `configEpoch` unilaterally, just taking the current greatest epoch available and incrementing it if its local configuration epoch is not already the greatest. -2. Assign itself all the hash slots of its master, and propagate the new configuration to every node which is reachable ASAP, and eventually to every other node. +2. Assign itself all the hash slots of its primary, and propagate the new configuration to every node which is reachable ASAP, and eventually to every other node. Note that **TAKEOVER violates the last-failover-wins principle** of Valkey Cluster, since the configuration epoch generated by the replica violates the normal generation of configuration epochs in several ways: @@ -56,8 +56,8 @@ Because of this the **TAKEOVER** option should be used with care. * `CLUSTER FAILOVER`, unless the **TAKEOVER** option is specified, does not execute a failover synchronously. It only *schedules* a manual failover, bypassing the failure detection stage. * An `OK` reply is no guarantee that the failover will succeed. -* A replica can only be promoted to a master if it is known as a replica by a majority of the masters in the cluster. - If the replica is a new node that has just been added to the cluster (for example after upgrading it), it may not yet be known to all the masters in the cluster. - To check that the masters are aware of a new replica, you can send `CLUSTER NODES` or `CLUSTER REPLICAS` to each of the master nodes and check that it appears as a replica, before sending `CLUSTER FAILOVER` to the replica. +* A replica can only be promoted to a primary if it is known as a replica by a majority of the primaries in the cluster. + If the replica is a new node that has just been added to the cluster (for example after upgrading it), it may not yet be known to all the primaries in the cluster. + To check that the primaries are aware of a new replica, you can send `CLUSTER NODES` or `CLUSTER REPLICAS` to each of the primary nodes and check that it appears as a replica, before sending `CLUSTER FAILOVER` to the replica. * To check that the failover has actually happened you can use `ROLE`, `INFO REPLICATION` (which indicates "role:master" after successful failover), or `CLUSTER NODES` to verify that the state of the cluster has changed sometime after the command was sent. * To check if the failover has failed, check the replica's log for "Manual failover timed out", which is logged if the replica has given up after a few seconds. diff --git a/commands/cluster-forget.md b/commands/cluster-forget.md index 31f7f2f1..fd04c769 100644 --- a/commands/cluster-forget.md +++ b/commands/cluster-forget.md @@ -6,7 +6,7 @@ node receiving the command. 
Because when a given node is part of the cluster, all the other nodes participating in the cluster knows about it, in order for a node to be completely removed from a cluster, the `CLUSTER FORGET` command must be -sent to all the remaining nodes, regardless of the fact they are masters +sent to all the remaining nodes, regardless of the fact they are primaries or replicas. However the command cannot simply drop the node from the internal node @@ -49,5 +49,5 @@ we want to remove a node. The command does not succeed and returns an error in the following cases: 1. The specified node ID is not found in the nodes table. -2. The node receiving the command is a replica, and the specified node ID identifies its current master. +2. The node receiving the command is a replica, and the specified node ID identifies its current primary. 3. The node ID identifies the same node we are sending the command to. diff --git a/commands/cluster-info.md b/commands/cluster-info.md index 905e7afc..8c902d79 100644 --- a/commands/cluster-info.md +++ b/commands/cluster-info.md @@ -16,13 +16,13 @@ cluster_stats_messages_received:1483968 total_cluster_links_buffer_limit_exceeded:0 ``` -* `cluster_state`: State is `ok` if the node is able to receive queries. `fail` if there is at least one hash slot which is unbound (no node associated), in error state (node serving it is flagged with FAIL flag), or if the majority of masters can't be reached by this node. +* `cluster_state`: State is `ok` if the node is able to receive queries. `fail` if there is at least one hash slot which is unbound (no node associated), in error state (node serving it is flagged with FAIL flag), or if the majority of primaries can't be reached by this node. * `cluster_slots_assigned`: Number of slots which are associated to some node (not unbound). This number should be 16384 for the node to work properly, which means that each hash slot should be mapped to a node. * `cluster_slots_ok`: Number of hash slots mapping to a node not in `FAIL` or `PFAIL` state. * `cluster_slots_pfail`: Number of hash slots mapping to a node in `PFAIL` state. Note that those hash slots still work correctly, as long as the `PFAIL` state is not promoted to `FAIL` by the failure detection algorithm. `PFAIL` only means that we are currently not able to talk with the node, but may be just a transient error. * `cluster_slots_fail`: Number of hash slots mapping to a node in `FAIL` state. If this number is not zero the node is not able to serve queries unless `cluster-require-full-coverage` is set to `no` in the configuration. * `cluster_known_nodes`: The total number of known nodes in the cluster, including nodes in `HANDSHAKE` state that may not currently be proper members of the cluster. -* `cluster_size`: The number of master nodes serving at least one hash slot in the cluster. +* `cluster_size`: The number of primary nodes serving at least one hash slot in the cluster. * `cluster_current_epoch`: The local `Current Epoch` variable. This is used in order to create unique increasing version numbers during fail overs. * `cluster_my_epoch`: The `Config Epoch` of the node we are talking with. This is the current configuration version assigned to this node. * `cluster_stats_messages_sent`: Number of messages sent via the cluster node-to-node binary bus. @@ -38,7 +38,7 @@ Here are the explanation of these fields: * `cluster_stats_messages_meet_sent` and `cluster_stats_messages_meet_received`: Handshake message sent to a new node, either through gossip or `CLUSTER MEET`. 
* `cluster_stats_messages_fail_sent` and `cluster_stats_messages_fail_received`: Mark node xxx as failing. * `cluster_stats_messages_publish_sent` and `cluster_stats_messages_publish_received`: Pub/Sub Publish propagation, see [Pubsub](../topics/pubsub.md#pubsub). -* `cluster_stats_messages_auth-req_sent` and `cluster_stats_messages_auth-req_received`: Replica initiated leader election to replace its master. +* `cluster_stats_messages_auth-req_sent` and `cluster_stats_messages_auth-req_received`: Replica initiated leader election to replace its primary. * `cluster_stats_messages_auth-ack_sent` and `cluster_stats_messages_auth-ack_received`: Message indicating a vote during leader election. * `cluster_stats_messages_update_sent` and `cluster_stats_messages_update_received`: Another node slots configuration. * `cluster_stats_messages_mfstart_sent` and `cluster_stats_messages_mfstart_received`: Pause clients for manual failover. diff --git a/commands/cluster-replicas.md b/commands/cluster-replicas.md index 6d0e6370..b2d87a95 100644 --- a/commands/cluster-replicas.md +++ b/commands/cluster-replicas.md @@ -1,11 +1,11 @@ The command provides a list of replica nodes replicating from the specified -master node. The list is provided in the same format used by `CLUSTER NODES` (please refer to its documentation for the specification of the format). +primary node. The list is provided in the same format used by `CLUSTER NODES` (please refer to its documentation for the specification of the format). The command will fail if the specified node is not known or if it is not -a master according to the node table of the node receiving the command. +a primary according to the node table of the node receiving the command. -Note that if a replica is added, moved, or removed from a given master node, +Note that if a replica is added, moved, or removed from a given primary node, and we ask `CLUSTER REPLICAS` to a node that has not yet received the configuration update, it may show stale information. However eventually (in a matter of seconds if there are no network partitions) all the nodes -will agree about the set of nodes associated with a given master. +will agree about the set of nodes associated with a given primary. diff --git a/commands/cluster-replicate.md b/commands/cluster-replicate.md index 9d3c36d2..57726a47 100644 --- a/commands/cluster-replicate.md +++ b/commands/cluster-replicate.md @@ -1,8 +1,8 @@ -The command reconfigures a node as a replica of the specified master. -If the node receiving the command is an *empty master*, as a side effect -of the command, the node role is changed from master to replica. +The command reconfigures a node as a replica of the specified primary. +If the node receiving the command is an *empty primary*, as a side effect +of the command, the node role is changed from primary to replica. -Once a node is turned into the replica of another master node, there is no need +Once a node is turned into the replica of another primary node, there is no need to inform the other cluster nodes about the change: heartbeat packets exchanged between nodes will propagate the new configuration automatically. @@ -10,13 +10,13 @@ A replica will always accept the command, assuming that: 1. The specified node ID exists in its nodes table. 2. The specified node ID does not identify the instance we are sending the command to. -3. The specified node ID is a master. +3. The specified node ID is a primary. 
-If the node receiving the command is not already a replica, but is a master, +If the node receiving the command is not already a replica, but is a primary, the command will only succeed, and the node will be converted into a replica, only if the following additional conditions are met: 1. The node is not serving any hash slots. 2. The node is empty, no keys are stored at all in the key space. -If the command succeeds the new replica will immediately try to contact its master in order to replicate from it. +If the command succeeds the new replica will immediately try to contact its primary in order to replicate from it. diff --git a/commands/cluster-reset.md b/commands/cluster-reset.md index 736dfbe1..620a31c6 100644 --- a/commands/cluster-reset.md +++ b/commands/cluster-reset.md @@ -1,14 +1,14 @@ Reset a Valkey Cluster node, in a more or less drastic way depending on the reset type, that can be **hard** or **soft**. Note that this command -**does not work for masters if they hold one or more keys**, in that case -to completely reset a master node keys must be removed first, e.g. by using `FLUSHALL` first, +**does not work for primaries if they hold one or more keys**, in that case +to completely reset a primary node keys must be removed first, e.g. by using `FLUSHALL` first, and then `CLUSTER RESET`. Effects on the node: 1. All the other nodes in the cluster are forgotten. 2. All the assigned / open slots are reset, so the slots-to-nodes mapping is totally cleared. -3. If the node is a replica it is turned into an (empty) master. Its dataset is flushed, so at the end the node will be an empty master. +3. If the node is a replica it is turned into an (empty) primary. Its dataset is flushed, so at the end the node will be an empty primary. 4. **Hard reset only**: a new Node ID is generated. 5. **Hard reset only**: `currentEpoch` and `configEpoch` vars are set to 0. 6. The new configuration is persisted on disk in the node cluster configuration file. diff --git a/commands/cluster-setslot.md b/commands/cluster-setslot.md index 85a8dda2..4b0b2fa0 100644 --- a/commands/cluster-setslot.md +++ b/commands/cluster-setslot.md @@ -69,7 +69,7 @@ The `CLUSTER SETSLOT` command is an important piece used by Valkey Cluster in or 3. Get keys from the source node with `CLUSTER GETKEYSINSLOT` command and move them into the destination node using the `MIGRATE` command. 4. Send `CLUSTER SETSLOT NODE ` to the destination node. 5. Send `CLUSTER SETSLOT NODE ` to the source node. -6. Send `CLUSTER SETSLOT NODE ` to the other master nodes (optional). +6. Send `CLUSTER SETSLOT NODE ` to the other primary nodes (optional). Notes: diff --git a/commands/cluster-shards.md b/commands/cluster-shards.md index 5042806a..5a674ee2 100644 --- a/commands/cluster-shards.md +++ b/commands/cluster-shards.md @@ -1,6 +1,6 @@ `CLUSTER SHARDS` returns details about the shards of the cluster. A shard is defined as a collection of nodes that serve the same set of slots and that replicate from each other. -A shard may only have a single master at a given time, but may have multiple or no replicas. +A shard may only have a single primary at a given time, but may have multiple or no replicas. It is possible for a shard to not be serving any slots while still having replicas. This command replaces the `CLUSTER SLOTS` command, by providing a more efficient and extensible representation of the cluster. 
diff --git a/commands/cluster-slaves.md b/commands/cluster-slaves.md index 604f971e..67831dd4 100644 --- a/commands/cluster-slaves.md +++ b/commands/cluster-slaves.md @@ -1,13 +1,13 @@ **A note about the word slave used in this man page and command name**: If not for Redis OSS compatibility, the Valkey project does not use the word slave. Please use the new command `CLUSTER REPLICAS`. The command `CLUSTER SLAVES` will continue to work for backward compatibility. The command provides a list of replica nodes replicating from the specified -master node. The list is provided in the same format used by `CLUSTER NODES` (please refer to its documentation for the specification of the format). +primary node. The list is provided in the same format used by `CLUSTER NODES` (please refer to its documentation for the specification of the format). The command will fail if the specified node is not known or if it is not -a master according to the node table of the node receiving the command. +a primary according to the node table of the node receiving the command. -Note that if a replica is added, moved, or removed from a given master node, +Note that if a replica is added, moved, or removed from a given primary node, and we ask `CLUSTER SLAVES` to a node that has not yet received the configuration update, it may show stale information. However eventually (in a matter of seconds if there are no network partitions) all the nodes -will agree about the set of nodes associated with a given master. +will agree about the set of nodes associated with a given primary. diff --git a/commands/cluster-slots.md b/commands/cluster-slots.md index 2babfe2d..dc539871 100644 --- a/commands/cluster-slots.md +++ b/commands/cluster-slots.md @@ -28,20 +28,20 @@ Each nested result is: - Start slot range - End slot range - - Master for slot range represented as nested networking information - - First replica of master for slot range + - Primary for slot range represented as nested networking information + - First replica of primary for slot range - Second replica - - ...continues until all replicas for this master are returned. + - ...continues until all replicas for this primary are returned. -Each result includes all active replicas of the master instance +Each result includes all active replicas of the primary instance for the listed slot range. Failed replicas are not returned. The command response is deterministic across all nodes in a cluster, which means that if two nodes return the same response they have the same view of the cluster. Primaries are ordered by the slots they serve and then replicas are ordered lexicographically by the node-id they were assigned by the cluster. -The third nested reply is guaranteed to be the networking information of the master instance for the slot range. -All networking information after the third nested reply are replicas of the master. +The third nested reply is guaranteed to be the networking information of the primary instance for the slot range. +All networking information after the third nested reply are replicas of the primary. -If a cluster instance has non-contiguous slots (e.g. 1-400,900,1800-6000) then master and replica networking information results will be duplicated for each top-level slot range reply. +If a cluster instance has non-contiguous slots (e.g. 1-400,900,1800-6000) then primary and replica networking information results will be duplicated for each top-level slot range reply. 
``` > CLUSTER SLOTS diff --git a/commands/expire.md b/commands/expire.md index 53759926..69582439 100644 --- a/commands/expire.md +++ b/commands/expire.md @@ -170,11 +170,11 @@ second divided by 4. In order to obtain a correct behavior without sacrificing consistency, when a key expires, a `DEL` operation is synthesized in both the AOF file and gains all the attached replicas nodes. -This way the expiration process is centralized in the master instance, and there +This way the expiration process is centralized in the primary instance, and there is no chance of consistency errors. -However while the replicas connected to a master will not expire keys -independently (but will wait for the `DEL` coming from the master), they'll +However while the replicas connected to a primary will not expire keys +independently (but will wait for the `DEL` coming from the primary), they'll still take the full state of the expires existing in the dataset, so when a -replica is elected to master it will be able to expire the keys independently, -fully acting as a master. +replica is elected to primary it will be able to expire the keys independently, +fully acting as a primary. diff --git a/commands/failover.md b/commands/failover.md index 0824ba67..ae378efb 100644 --- a/commands/failover.md +++ b/commands/failover.md @@ -1,50 +1,50 @@ -This command will start a coordinated failover between the currently-connected-to master and one of its replicas. +This command will start a coordinated failover between the currently-connected-to primary and one of its replicas. The failover is not synchronous, instead a background task will handle coordinating the failover. It is designed to limit data loss and unavailability of the cluster during the failover. This command is analogous to the `CLUSTER FAILOVER` command for non-clustered Valkey and is similar to the failover support provided by sentinel. The specific details of the default failover flow are as follows: -1. The master will internally start a `CLIENT PAUSE WRITE`, which will pause incoming writes and prevent the accumulation of new data in the replication stream. -2. The master will monitor its replicas, waiting for a replica to indicate that it has fully consumed the replication stream. If the master has multiple replicas, it will only wait for the first replica to catch up. -3. The master will then demote itself to a replica. This is done to prevent any dual master scenarios. NOTE: The master will not discard its data, so it will be able to rollback if the replica rejects the failover request in the next step. -4. The previous master will send a special PSYNC request to the target replica, `PSYNC FAILOVER`, instructing the target replica to become a master. -5. Once the previous master receives acknowledgement the `PSYNC FAILOVER` was accepted it will unpause its clients. If the PSYNC request is rejected, the master will abort the failover and return to normal. +1. The primary will internally start a `CLIENT PAUSE WRITE`, which will pause incoming writes and prevent the accumulation of new data in the replication stream. +2. The primary will monitor its replicas, waiting for a replica to indicate that it has fully consumed the replication stream. If the primary has multiple replicas, it will only wait for the first replica to catch up. +3. The primary will then demote itself to a replica. This is done to prevent any dual primary scenarios. 
NOTE: The primary will not discard its data, so it will be able to rollback if the replica rejects the failover request in the next step. +4. The previous primary will send a special PSYNC request to the target replica, `PSYNC FAILOVER`, instructing the target replica to become a primary. +5. Once the previous primary receives acknowledgement the `PSYNC FAILOVER` was accepted it will unpause its clients. If the PSYNC request is rejected, the primary will abort the failover and return to normal. The field `master_failover_state` in `INFO replication` can be used to track the current state of the failover, which has the following values: * `no-failover`: There is no ongoing coordinated failover. -* `waiting-for-sync`: The master is waiting for the replica to catch up to its replication offset. -* `failover-in-progress`: The master has demoted itself, and is attempting to hand off ownership to a target replica. +* `waiting-for-sync`: The primary is waiting for the replica to catch up to its replication offset. +* `failover-in-progress`: The primary has demoted itself, and is attempting to hand off ownership to a target replica. NOTE: -During the `failover-in-progress` phase, the master first demotes itself to a replica and then notifies the replica to promote itself to master. +During the `failover-in-progress` phase, the primary first demotes itself to a replica and then notifies the replica to promote itself to primary. These two steps are an asynchronous process, which may result in the simultaneous existence of two nodes as replicas. -In this scenario, for clients that support REDIRECT (explicitly execute [CLIENT CAPA REDIRECT](client-capa.md)), the redirection result may bounce back and forth between the two replicas until the target replica completes the process of promoting itself to master. +In this scenario, for clients that support REDIRECT (explicitly execute [CLIENT CAPA REDIRECT](client-capa.md)), the redirection result may bounce back and forth between the two replicas until the target replica completes the process of promoting itself to primary. To avoid this situation, during the `failover-in-progress` phase, we temporarily suspend the clients that need to be redirected until the replica truly becomes the primary, and then resume the execution. -If the previous master had additional replicas attached to it, they will continue replicating from it as chained replicas. You will need to manually execute a `REPLICAOF` on these replicas to start replicating directly from the new master. +If the previous primary had additional replicas attached to it, they will continue replicating from it as chained replicas. You will need to manually execute a `REPLICAOF` on these replicas to start replicating directly from the new primary. ## Optional arguments The following optional arguments exist to modify the behavior of the failover flow: -* `TIMEOUT` *milliseconds* -- This option allows specifying a maximum time a master will wait in the `waiting-for-sync` state before aborting the failover attempt and rolling back. +* `TIMEOUT` *milliseconds* -- This option allows specifying a maximum time a primary will wait in the `waiting-for-sync` state before aborting the failover attempt and rolling back. This is intended to set an upper bound on the write outage the Valkey cluster can experience. Failovers typically happen in less than a second, but could take longer if there is a large amount of write traffic or the replica is already behind in consuming the replication stream. 
If this value is not specified, the timeout can be considered to be "infinite". -* `TO` *HOST* *PORT* -- This option allows designating a specific replica, by its host and port, to failover to. The master will wait specifically for this replica to catch up to its replication offset, and then failover to it. +* `TO` *HOST* *PORT* -- This option allows designating a specific replica, by its host and port, to failover to. The primary will wait specifically for this replica to catch up to its replication offset, and then failover to it. -* `FORCE` -- If both the `TIMEOUT` and `TO` options are set, the force flag can also be used to designate that that once the timeout has elapsed, the master should failover to the target replica instead of rolling back. +* `FORCE` -- If both the `TIMEOUT` and `TO` options are set, the force flag can also be used to designate that once the timeout has elapsed, the primary should failover to the target replica instead of rolling back. This can be used for a best-effort attempt at a failover without data loss, but limiting write outage. -NOTE: The master will always rollback if the `PSYNC FAILOVER` request is rejected by the target replica. +NOTE: The primary will always rollback if the `PSYNC FAILOVER` request is rejected by the target replica. ## Failover abort The failover command is intended to be safe from data loss and corruption, but can encounter some scenarios it can not automatically remediate from and may get stuck. -For this purpose, the `FAILOVER ABORT` command exists, which will abort an ongoing failover and return the master to its normal state. -The command has no side effects if issued in the `waiting-for-sync` state but can introduce multi-master scenarios in the `failover-in-progress` state. -If a multi-master scenario is encountered, you will need to manually identify which master has the latest data and designate it as the master and have the other replicas. +For this purpose, the `FAILOVER ABORT` command exists, which will abort an ongoing failover and return the primary to its normal state. +The command has no side effects if issued in the `waiting-for-sync` state but can introduce multi-primary scenarios in the `failover-in-progress` state. +If a multi-primary scenario is encountered, you will need to manually identify which primary has the latest data and designate it as the primary and have the other nodes replicate from it. NOTE: `REPLICAOF` is disabled while a failover is in progress, this is to prevent unintended interactions with the failover that might cause data loss. diff --git a/commands/georadius.md index 9628f157..143a80de 100644 --- a/commands/georadius.md +++ b/commands/georadius.md @@ -35,7 +35,7 @@ By default the command returns the items to the client. It is possible to store ## Read-only variants -Since `GEORADIUS` and `GEORADIUSBYMEMBER` have a `STORE` and `STOREDIST` option they are technically flagged as writing commands in the Valkey command table. For this reason read-only replicas will flag them, and Valkey Cluster replicas will redirect them to the master instance even if the connection is in read-only mode (see the `READONLY` command of Valkey Cluster).
For this reason read-only replicas will flag them, and Valkey Cluster replicas will redirect them to the primary instance even if the connection is in read-only mode (see the `READONLY` command of Valkey Cluster). Two read-only variants of the commands were added. They are exactly like the original commands but refuse the `STORE` and `STOREDIST` options. The two variants are called `GEORADIUS_RO` and `GEORADIUSBYMEMBER_RO`, and can safely be used in replicas. diff --git a/commands/info.md b/commands/info.md index 470446c1..8f234de8 100644 --- a/commands/info.md +++ b/commands/info.md @@ -8,7 +8,7 @@ The optional parameter can be used to select a specific section of information: * `memory`: Memory consumption related information * `persistence`: RDB and AOF related information * `stats`: General statistics -* `replication`: Master/replica replication information +* `replication`: primary/replica replication information * `cpu`: CPU consumption statistics * `commandstats`: Valkey command statistics * `latencystats`: Valkey command latency percentile distribution statistics @@ -557,8 +557,8 @@ Here is the meaning of all fields in the **stats** section: Here is the meaning of all fields in the **replication** section: -* `role`: Value is "master" if the instance is replica of no one, or "slave" if the instance is a replica of some master instance. - Note that a replica can be master of another replica (chained replication). +* `role`: Value is "master" if the instance is replica of no one, or "slave" if the instance is a replica of some primary instance. + Note that a replica can be primary of another replica (chained replication). * `master_failover_state`: The state of an ongoing failover, if any. * `master_replid`: The replication ID of the Valkey server. * `master_replid2`: The secondary replication ID, used for PSYNC after a failover. @@ -566,19 +566,19 @@ Here is the meaning of all fields in the **replication** section: * `second_repl_offset`: The offset up to which replication IDs are accepted * `repl_backlog_active`: Flag indicating replication backlog is active * `repl_backlog_size`: Total size in bytes of the replication backlog buffer -* `repl_backlog_first_byte_offset`: The master offset of the replication +* `repl_backlog_first_byte_offset`: The primary offset of the replication backlog buffer * `repl_backlog_histlen`: Size in bytes of the data in the replication backlog buffer If the instance is a replica, these additional fields are provided: -* `master_host`: Host or IP address of the master -* `master_port`: Master listening TCP port +* `master_host`: Host or IP address of the primary +* `master_port`: Primary listening TCP port * `master_link_status`: Status of the link (up/down) * `master_last_io_seconds_ago`: Number of seconds since the last interaction - with master -* `master_sync_in_progress`: Indicate the master is syncing to the replica + with primary +* `master_sync_in_progress`: Indicate the primary is syncing to the replica * `slave_read_repl_offset`: The read replication offset of the replica instance. 
* `slave_repl_offset`: The replication offset of the replica instance * `slave_priority`: The priority of the instance as a candidate for failover @@ -599,7 +599,7 @@ If a SYNC operation is on-going, these additional fields are provided: * `master_sync_last_io_seconds_ago`: Number of seconds since last transfer I/O during a SYNC operation -If the link between master and replica is down, an additional field is provided: +If the link between primary and replica is down, an additional field is provided: * `master_link_down_since_seconds`: Number of seconds since the link is down @@ -656,7 +656,7 @@ For each error type, the following line is added: The **sentinel** section is only available in Valkey Sentinel instances. It consists of the following fields: -* `sentinel_masters`: Number of Valkey masters monitored by this Sentinel instance +* `sentinel_masters`: Number of Valkey primaries monitored by this Sentinel instance * `sentinel_tilt`: A value of 1 means this sentinel is in TILT mode * `sentinel_tilt_since_seconds`: Duration in seconds of current TILT, or -1 if not TILTed. * `sentinel_running_scripts`: The number of scripts this Sentinel is currently executing diff --git a/commands/psync.md b/commands/psync.md index 6dd92ad6..10da773a 100644 --- a/commands/psync.md +++ b/commands/psync.md @@ -1,7 +1,7 @@ -Initiates a replication stream from the master. +Initiates a replication stream from the primary. The `PSYNC` command is called by Valkey replicas for initiating a replication -stream from the master. +stream from the primary. For more information about replication in Valkey please check the [replication page][tr]. diff --git a/commands/replconf.md b/commands/replconf.md index 2dd5c27a..61e6d222 100644 --- a/commands/replconf.md +++ b/commands/replconf.md @@ -1,2 +1,2 @@ The `REPLCONF` command is an internal command. -It is used by a Valkey master to configure a connected replica. \ No newline at end of file +It is used by a Valkey primary to configure a connected replica. \ No newline at end of file diff --git a/commands/replicaof.md b/commands/replicaof.md index f23bc506..ae7b295e 100644 --- a/commands/replicaof.md +++ b/commands/replicaof.md @@ -1,10 +1,10 @@ The `REPLICAOF` command can change the replication settings of a replica on the fly. -If a Valkey server is already acting as replica, the command `REPLICAOF` NO ONE will turn off the replication, turning the Valkey server into a MASTER. In the proper form `REPLICAOF` hostname port will make the server a replica of another server listening at the specified hostname and port. +If a Valkey server is already acting as replica, the command `REPLICAOF` NO ONE will turn off the replication, turning the Valkey server into a PRIMARY. In the proper form `REPLICAOF` hostname port will make the server a replica of another server listening at the specified hostname and port. -If a server is already a replica of some master, `REPLICAOF` hostname port will stop the replication against the old server and start the synchronization against the new one, discarding the old dataset. +If a server is already a replica of some primary, `REPLICAOF` hostname port will stop the replication against the old server and start the synchronization against the new one, discarding the old dataset. -The form `REPLICAOF` NO ONE will stop replication, turning the server into a MASTER, but will not discard the replication. 
So, if the old master stops working, it is possible to turn the replica into a master and set the application to use this new master in read/write. Later when the other Valkey server is fixed, it can be reconfigured to work as a replica. +The form `REPLICAOF` NO ONE will stop replication, turning the server into a PRIMARY, but will not discard the replication. So, if the old primary stops working, it is possible to turn the replica into a primary and set the application to use this new primary in read/write. Later when the other Valkey server is fixed, it can be reconfigured to work as a replica. ## Examples diff --git a/commands/restore-asking.md index d11a825c..3e974ff4 100644 --- a/commands/restore-asking.md +++ b/commands/restore-asking.md @@ -1,2 +1,2 @@ The `RESTORE-ASKING` command is an internal command. -It is used by a Valkey cluster master during slot migration. \ No newline at end of file +It is used by a Valkey cluster primary during slot migration. \ No newline at end of file diff --git a/commands/role.md index 6dddd7a2..8e16ba62 100644 --- a/commands/role.md +++ b/commands/role.md @@ -1,4 +1,4 @@ -Provide information on the role of a Valkey instance in the context of replication, by returning if the instance is currently a `master`, `slave`, or `sentinel`. The command also returns additional information about the state of the replication (if the role is master or slave) or the list of monitored master names (if the role is sentinel). +Provide information on the role of a Valkey instance in the context of replication, by returning if the instance is currently a `master`, `slave`, or `sentinel`. The command also returns additional information about the state of the replication (if the role is primary or replica) or the list of monitored primary names (if the role is sentinel). ## Output format @@ -11,9 +11,9 @@ the instance, as one of the following three strings: The additional elements of the array depends on the role. -## Master output +## Primary output -An example of output when `ROLE` is called in a master instance: +An example of output when `ROLE` is called in a primary instance: ``` 1) "master" @@ -26,10 +26,10 @@ An example of output when `ROLE` is called in a master instance: 3) "3129543" ``` -The master output is composed of the following parts: +The primary output is composed of the following parts: 1. The string `master`. -2. The current master replication offset, which is an offset that masters and replicas share to understand, in partial resynchronizations, the part of the replication stream the replicas needs to fetch to continue. +2. The current primary replication offset, which is an offset that primaries and replicas share to understand, in partial resynchronizations, the part of the replication stream the replicas need to fetch to continue. 3. An array composed of three elements array representing the connected replicas. Every sub-array contains the replica IP, port, and the last acknowledged replication offset. ## Output of the command on replicas @@ -47,10 +47,10 @@ An example of output when `ROLE` is called in a replica instance: The replica output is composed of the following parts: 1. The string `slave`, because of Redis OSS compatibility (see note at the end of this page). -2. The IP of the master. -3. The port number of the master. -4. 
The state of the replication from the point of view of the master, that can be `connect` (the instance needs to connect to its master), `connecting` (the master-replica connection is in progress), `sync` (the master and replica are trying to perform the synchronization), `connected` (the replica is online). -5. The amount of data received from the replica so far in terms of master replication offset. +2. The IP of the primary. +3. The port number of the primary. +4. The state of the replication from the point of view of the primary, that can be `connect` (the instance needs to connect to its primary), `connecting` (the primary-replica connection is in progress), `sync` (the primary and replica are trying to perform the synchronization), `connected` (the replica is online). +5. The amount of data received from the replica so far in terms of primary replication offset. ## Sentinel output @@ -67,7 +67,7 @@ An example of Sentinel output: The sentinel output is composed of the following parts: 1. The string `sentinel`. -2. An array of master names monitored by this Sentinel instance. +2. An array of primary names monitored by this Sentinel instance. ## Examples diff --git a/commands/shutdown.md b/commands/shutdown.md index b681d1c9..fab84222 100644 --- a/commands/shutdown.md +++ b/commands/shutdown.md @@ -47,7 +47,7 @@ However, there are situations where it is unsafe to do so and, unless the **FORC This happens in the following situations: * The user just turned on AOF, and the server triggered the first AOF rewrite in order to create the initial AOF file. In this context, stopping will result in losing the dataset at all: once restarted, the server will potentially have AOF enabled without having any AOF file at all. -* A replica with AOF enabled, reconnected with its master, performed a full resynchronization, and restarted the AOF file, triggering the initial AOF creation process. In this case not completing the AOF rewrite is dangerous because the latest dataset received from the master would be lost. The new master can actually be even a different instance (if the **REPLICAOF** or **SLAVEOF** command was used in order to reconfigure the replica), so it is important to finish the AOF rewrite and start with the correct data set representing the data set in memory when the server was terminated. +* A replica with AOF enabled, reconnected with its primary, performed a full resynchronization, and restarted the AOF file, triggering the initial AOF creation process. In this case not completing the AOF rewrite is dangerous because the latest dataset received from the primary would be lost. The new primary can actually be even a different instance (if the **REPLICAOF** or **SLAVEOF** command was used in order to reconfigure the replica), so it is important to finish the AOF rewrite and start with the correct data set representing the data set in memory when the server was terminated. There are situations when we want just to terminate a Valkey instance ASAP, regardless of what its content is. In such a case, the command **SHUTDOWN NOW NOSAVE FORCE** can be used. @@ -59,5 +59,5 @@ The second command will not have any problem to execute since the AOF is no long The server waits for lagging replicas up to a configurable `shutdown-timeout`, by default 10 seconds, before shutting down. This provides a best effort minimizing the risk of data loss in a situation where no save points are configured and AOF is disabled. 
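As a sketch, the grace period could be raised before a planned shutdown of a busy primary, assuming `shutdown-timeout` is adjustable at runtime on your version (the value is in seconds):

```
> CONFIG SET shutdown-timeout 30
OK
> SHUTDOWN
```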
-Before version 7.0, shutting down a heavily loaded master node in a diskless setup was more likely to result in data loss. -To minimize the risk of data loss in such setups, it's advised to trigger a manual `FAILOVER` (or `CLUSTER FAILOVER`) to demote the master to a replica and promote one of the replicas to be the new master, before shutting down a master node. +Before version 7.0, shutting down a heavily loaded primary node in a diskless setup was more likely to result in data loss. +To minimize the risk of data loss in such setups, it's advised to trigger a manual `FAILOVER` (or `CLUSTER FAILOVER`) to demote the primary to a replica and promote one of the replicas to be the new primary, before shutting down a primary node. diff --git a/commands/slaveof.md b/commands/slaveof.md index 52a2e7c4..6455c214 100644 --- a/commands/slaveof.md +++ b/commands/slaveof.md @@ -6,13 +6,13 @@ turn off the replication, turning the Valkey server into a MASTER. In the proper form `SLAVEOF` hostname port will make the server a replica of another server listening at the specified hostname and port. -If a server is already a replica of some master, `SLAVEOF` hostname port will stop +If a server is already a replica of some primary, `SLAVEOF` hostname port will stop the replication against the old server and start the synchronization against the new one, discarding the old dataset. The form `SLAVEOF` NO ONE will stop replication, turning the server into a MASTER, but will not discard the replication. -So, if the old master stops working, it is possible to turn the replica into a -master and set the application to use this new master in read/write. +So, if the old primary stops working, it is possible to turn the replica into a +primary and set the application to use this new primary in read/write. Later when the other Valkey server is fixed, it can be reconfigured to work as a replica. diff --git a/commands/sort_ro.md b/commands/sort_ro.md index c6957ab8..ae9d081f 100644 --- a/commands/sort_ro.md +++ b/commands/sort_ro.md @@ -1,6 +1,6 @@ Read-only variant of the `SORT` command. It is exactly like the original `SORT` but refuses the `STORE` option and can safely be used in read-only replicas. -Since the original `SORT` has a `STORE` option it is technically flagged as a writing command in the Valkey command table. For this reason read-only replicas in a Valkey Cluster will redirect it to the master instance even if the connection is in read-only mode (see the `READONLY` command of Valkey Cluster). +Since the original `SORT` has a `STORE` option it is technically flagged as a writing command in the Valkey command table. For this reason read-only replicas in a Valkey Cluster will redirect it to the primary instance even if the connection is in read-only mode (see the `READONLY` command of Valkey Cluster). The `SORT_RO` variant was introduced in order to allow `SORT` behavior in read-only replicas without breaking compatibility on command flags. diff --git a/commands/sync.md b/commands/sync.md index 805c6fb1..2ba66542 100644 --- a/commands/sync.md +++ b/commands/sync.md @@ -1,7 +1,7 @@ -Initiates a replication stream from the master. +Initiates a replication stream from the primary. The `SYNC` command is called by Valkey replicas for initiating a replication -stream from the master. It has been replaced in newer versions of Valkey by +stream from the primary. It has been replaced in newer versions of Valkey by `PSYNC`. 
For more information about replication in Valkey please check the diff --git a/commands/wait.md b/commands/wait.md index ac59755a..39b01a24 100644 --- a/commands/wait.md +++ b/commands/wait.md @@ -19,14 +19,14 @@ Consistency and WAIT Note that `WAIT` does not make Valkey a strongly consistent store: while synchronous replication is part of a replicated state machine, it is not the only thing needed. However in the context of Sentinel or Valkey Cluster failover, `WAIT` improves the real world data safety. -Specifically if a given write is transferred to one or more replicas, it is more likely (but not guaranteed) that if the master fails, we'll be able to promote, during a failover, a replica that received the write: both Sentinel and Valkey Cluster will do a best-effort attempt to promote the best replica among the set of available replicas. +Specifically if a given write is transferred to one or more replicas, it is more likely (but not guaranteed) that if the primary fails, we'll be able to promote, during a failover, a replica that received the write: both Sentinel and Valkey Cluster will do a best-effort attempt to promote the best replica among the set of available replicas. However this is just a best-effort attempt so it is possible to still lose a write synchronously replicated to multiple replicas. Implementation details --- -Since the introduction of partial resynchronization with replicas (PSYNC feature) Valkey replicas asynchronously ping their master with the offset they already processed in the replication stream. This is used in multiple ways: +Since the introduction of partial resynchronization with replicas (PSYNC feature) Valkey replicas asynchronously ping their primary with the offset they already processed in the replication stream. This is used in multiple ways: 1. Detect timed out replicas. 2. Perform a partial resynchronization after a disconnection. diff --git a/commands/waitaof.md b/commands/waitaof.md index 0119c989..b9a86094 100644 --- a/commands/waitaof.md +++ b/commands/waitaof.md @@ -6,12 +6,12 @@ The value 0 disables this check. If the timeout, specified in milliseconds, is reached, the command returns even if the specified number of acknowledgments has not been met. -The command **will always return** the number of masters and replicas that have fsynced all write commands sent by the current client before the `WAITAOF` command, both in the case where the specified thresholds were met, and when the timeout is reached. +The command **will always return** the number of primaries and replicas that have fsynced all write commands sent by the current client before the `WAITAOF` command, both in the case where the specified thresholds were met, and when the timeout is reached. A few remarks: -1. When `WAITAOF` returns, all the previous write commands sent in the context of the current connection are guaranteed to be fsynced to the AOF of at least the number of masters and replicas returned by `WAITAOF`. -2. If the command is sent as part of a `MULTI` transaction (or any other context that does not allow blocking, such as inside scripts), the command does not block but instead returns immediately the number of masters and replicas that fsynced all previous write commands. +1. When `WAITAOF` returns, all the previous write commands sent in the context of the current connection are guaranteed to be fsynced to the AOF of at least the number of primaries and replicas returned by `WAITAOF`. +2. 
If the command is sent as part of a `MULTI` transaction (or any other context that does not allow blocking, such as inside scripts), the command does not block but instead returns immediately the number of primaries and replicas that fsynced all previous write commands. 3. A timeout of 0 means to block forever. 4. Since `WAITAOF` returns the number of fsyncs completed both in case of success and timeout, the client should check that the returned values are equal or greater than the persistence level required. 5. `WAITAOF` cannot be used on replica instances, and the `numlocal` argument cannot be non-zero if the local Valkey does not have AOF enabled. @@ -35,7 +35,7 @@ Implementation details Valkey tracks and increments the replication offset even when no replicas are configured (as long as AOF exists). -In addition, Valkey replicas asynchronously ping their master with two replication offsets: the offset they have processed in the replication stream, and the offset they have fsynced to their AOF. +In addition, Valkey replicas asynchronously ping their primary with two replication offsets: the offset they have processed in the replication stream, and the offset they have fsynced to their AOF. Valkey remembers, for each client, the replication offset of the produced replication stream when the last write command was executed in the context of that client. When `WAITAOF` is called, Valkey checks if the local Valkey and/or the specified number of replicas have confirmed fsyncing this offset or a greater one to their AOF. diff --git a/commands/xadd.md b/commands/xadd.md index 93957acc..e58f7312 100644 --- a/commands/xadd.md +++ b/commands/xadd.md @@ -45,7 +45,7 @@ are totally ordered inside a stream. In order to guarantee this property, if the current top ID in the stream has a time greater than the current local time of the instance, the top entry time will be used instead, and the sequence part of the ID incremented. This may happen when, for instance, -the local clock jumps backward, or if after a failover the new master has +the local clock jumps backward, or if after a failover the new primary has a different absolute time. When a user specified an explicit ID to `XADD`, the minimum valid ID is diff --git a/commands/xsetid.md b/commands/xsetid.md index 06ade90e..ff289d68 100644 --- a/commands/xsetid.md +++ b/commands/xsetid.md @@ -1,2 +1,2 @@ The `XSETID` command is an internal command. -It is used by a Valkey master to replicate the last delivered ID of streams. \ No newline at end of file +It is used by a Valkey primary to replicate the last delivered ID of streams. \ No newline at end of file diff --git a/topics/acl.md b/topics/acl.md index 80dceddc..59664321 100644 --- a/topics/acl.md +++ b/topics/acl.md @@ -532,7 +532,7 @@ In case you don't want to provide Valkey replicas and Valkey Sentinel instances full access to your Valkey instances, the following is the set of commands that must be allowed in order for everything to work correctly. -For Sentinel, allow the user to access the following commands both in the master and replica instances: +For Sentinel, allow the user to access the following commands both in the primary and replica instances: * AUTH, CLIENT, SUBSCRIBE, SCRIPT, PUBLISH, PING, INFO, MULTI, SLAVEOF, CONFIG, CLIENT, EXEC. 
@@ -540,7 +540,7 @@ Sentinel does not need to access any key in the database but does use Pub/Sub, s ACL SETUSER sentinel-user on >somepassword allchannels +multi +slaveof +ping +exec +subscribe +config|rewrite +role +publish +info +client|setname +client|kill +script|kill -Valkey replicas require the following commands to be allowed on the master instance: +Valkey replicas require the following commands to be allowed on the primary instance: * PSYNC, REPLCONF, PING @@ -548,4 +548,4 @@ No keys need to be accessed, so this translates to the following rules: ACL setuser replica-user on >somepassword +psync +replconf +ping -Note that you don't need to configure the replicas to allow the master to be able to execute any set of commands. The master is always authenticated as the root user from the point of view of replicas. +Note that you don't need to configure the replicas to allow the primary to be able to execute any set of commands. The primary is always authenticated as the root user from the point of view of replicas. diff --git a/topics/admin.md b/topics/admin.md index b7790a57..0733b11f 100644 --- a/topics/admin.md +++ b/topics/admin.md @@ -30,11 +30,11 @@ description: Advice for configuring and managing Valkey in production ### Replication -* Set up a non-trivial replication backlog in proportion to the amount of memory Valkey is using. The backlog allows replicas to sync with the primary (master) instance much more easily. +* Set up a non-trivial replication backlog in proportion to the amount of memory Valkey is using. The backlog allows replicas to sync with the primary instance much more easily. -* If you use replication, Valkey performs RDB saves even if persistence is disabled. (This does not apply to diskless replication.) If you don't have disk usage on the master, enable diskless replication. +* If you use replication, Valkey performs RDB saves even if persistence is disabled. (This does not apply to diskless replication.) If you don't have disk usage on the primary, enable diskless replication. -* If you are using replication, ensure that either your master has persistence enabled, or that it does not automatically restart on crashes. Replicas will try to maintain an exact copy of the master, so if a master restarts with an empty data set, replicas will be wiped as well. +* If you are using replication, ensure that either your primary has persistence enabled, or that it does not automatically restart on crashes. Replicas will try to maintain an exact copy of the primary, so if a primary restarts with an empty data set, replicas will be wiped as well. ### Security @@ -45,7 +45,7 @@ description: Advice for configuring and managing Valkey in production * Use HVM based instances, not PV based instances. * Do not use old instance families. For example, use m3.medium with HVM instead of m1.medium with PV. * The use of Valkey persistence with EC2 EBS volumes needs to be handled with care because sometimes EBS volumes have high latency characteristics. -* You may want to try the new diskless replication if you have issues when replicas are synchronizing with the master. +* You may want to try the new diskless replication if you have issues when replicas are synchronizing with the primary. ## Upgrading or restarting a Valkey instance without downtime @@ -57,16 +57,16 @@ Follow these steps to avoid downtime. * Set up your new Valkey instance as a replica for your current Valkey instance. 
In order to do so, you need a different server, or a server that has enough RAM to keep two instances of Valkey running at the same time. -* If you use a single server, ensure that the replica is started on a different port than the master instance, otherwise the replica cannot start. +* If you use a single server, ensure that the replica is started on a different port than the primary instance, otherwise the replica cannot start. * Wait for the replication initial synchronization to complete. Check the replica's log file. -* Using `INFO`, ensure the master and replica have the same number of keys. Use `valkey-cli` to check that the replica is working as expected and is replying to your commands. +* Using `INFO`, ensure the primary and replica have the same number of keys. Use `valkey-cli` to check that the replica is working as expected and is replying to your commands. -* Allow writes to the replica using `CONFIG SET slave-read-only no`. +* Allow writes to the replica using `CONFIG SET replica-read-only no`. -* Configure all your clients to use the new instance (the replica). Note that you may want to use the `CLIENT PAUSE` command to ensure that no client can write to the old master during the switch. +* Configure all your clients to use the new instance (the replica). Note that you may want to use the `CLIENT PAUSE` command to ensure that no client can write to the old primary during the switch. -* Once you confirm that the master is no longer receiving any queries (you can check this using the `MONITOR` command), elect the replica to master using the `REPLICAOF NO ONE` command, and then shut down your master. +* Once you confirm that the primary is no longer receiving any queries (you can check this using the `MONITOR` command), elect the replica to primary using the `REPLICAOF NO ONE` command, and then shut down your primary. -If you are using [Valkey Sentinel](sentinel.md) or [Valkey Cluster](cluster-tutorial.md), the simplest way to upgrade to newer versions is to upgrade one replica after the other. Then you can perform a manual failover to promote one of the upgraded replicas to master, and finally promote the last replica. +If you are using [Valkey Sentinel](sentinel.md) or [Valkey Cluster](cluster-tutorial.md), the simplest way to upgrade to newer versions is to upgrade one replica after the other. Then you can perform a manual failover to promote one of the upgraded replicas to primary, and finally promote the last replica. diff --git a/topics/cli.md b/topics/cli.md index 8a03c23f..bef95a3b 100644 --- a/topics/cli.md +++ b/topics/cli.md @@ -184,7 +184,7 @@ This topic covers the different aspects of `valkey-cli`, starting from the simpl : Simulate a cache workload with an 80-20 distribution. **`--replica`** -: Simulate a replica showing commands received from the master. +: Simulate a replica showing commands received from the primary. **`--rdb`** _filename_ : Transfer an RDB dump from remote server to local file. @@ -295,7 +295,7 @@ For management of [Valkey Cluster](cluster-tutorial.md), the following syntax is --cluster-threshold --cluster-replace add-node new_host:new_port existing_host:existing_port - --cluster-slave + --cluster-replica --cluster-master-id del-node host:port node_id call host:port command arg arg .. 
arg diff --git a/topics/clients.md b/topics/clients.md index 2f0c3af7..a94356b3 100644 --- a/topics/clients.md +++ b/topics/clients.md @@ -115,7 +115,7 @@ It disconnects the minimal number of clients needed to return below the `maxmemo `maxmemory-clients` defines the maximum aggregate memory usage of all clients connected to Valkey. The aggregation takes into account all the memory used by the client connections: the [query buffer](#query-buffer-hard-limit), the output buffer, and other intermediate buffers. -Note that replica and master connections aren't affected by the client eviction mechanism. Therefore, such connections are never evicted. +Note that replica and primary connections aren't affected by the client eviction mechanism. Therefore, such connections are never evicted. `maxmemory-clients` can be set permanently in the configuration file (`valkey.conf`) or via the `CONFIG SET` command. This setting can either be 0 (meaning no limit), a size in bytes (possibly with `mb`/`gb` suffix), diff --git a/topics/cluster-spec.md b/topics/cluster-spec.md index 96af83b9..2dcacf8d 100644 --- a/topics/cluster-spec.md +++ b/topics/cluster-spec.md @@ -17,8 +17,8 @@ of Valkey. Valkey Cluster is a distributed implementation of Valkey with the following goals in order of importance in the design: * High performance and linear scalability up to 1000 nodes. There are no proxies, asynchronous replication is used, and no merge operations are performed on values. -* Acceptable degree of write safety: the system tries (in a best-effort way) to retain all the writes originating from clients connected with the majority of the master nodes. Usually there are small windows where acknowledged writes can be lost. Windows to lose acknowledged writes are larger when clients are in a minority partition. -* Availability: Valkey Cluster is able to survive partitions where the majority of the master nodes are reachable and there is at least one reachable replica for every master node that is no longer reachable. Moreover using *replicas migration*, masters no longer replicated by any replica will receive one from a master which is covered by multiple replicas. +* Acceptable degree of write safety: the system tries (in a best-effort way) to retain all the writes originating from clients connected with the majority of the primary nodes. Usually there are small windows where acknowledged writes can be lost. Windows to lose acknowledged writes are larger when clients are in a minority partition. +* Availability: Valkey Cluster is able to survive partitions where the majority of the primary nodes are reachable and there is at least one reachable replica for every primary node that is no longer reachable. Moreover using *replicas migration*, primaries no longer replicated by any replica will receive one from a primary which is covered by multiple replicas. ### Implemented subset @@ -40,7 +40,7 @@ of Valkey. We only support database `0`; the `SELECT` command is not allowed. In Valkey Cluster, nodes are responsible for holding the data, and taking the state of the cluster, including mapping keys to the right nodes. Cluster nodes are also able to auto-discover other nodes, detect non-working -nodes, and promote replica nodes to master when needed in order +nodes, and promote replica nodes to primary when needed in order to continue to operate when a failure occurs. To perform their tasks all the cluster nodes are connected using a @@ -64,40 +64,40 @@ keys and nodes can improve the performance in a sensible way. 
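A minimal sketch of the `maxmemory-clients` setting described in `topics/clients.md` above; the `1gb` value is purely illustrative:

```
# valkey.conf — cap the aggregate memory used by client connections
maxmemory-clients 1gb

# the same limit can be applied at runtime
$ valkey-cli CONFIG SET maxmemory-clients 1gb
OK
```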
### Write safety -Valkey Cluster uses asynchronous replication between nodes, and **last failover wins** implicit merge function. This means that the last elected master dataset eventually replaces all the other replicas. There is always a window of time when it is possible to lose writes during partitions. However these windows are very different in the case of a client that is connected to the majority of masters, and a client that is connected to the minority of masters. +Valkey Cluster uses asynchronous replication between nodes, and **last failover wins** implicit merge function. This means that the last elected primary dataset eventually replaces all the other replicas. There is always a window of time when it is possible to lose writes during partitions. However these windows are very different in the case of a client that is connected to the majority of primaries, and a client that is connected to the minority of primaries. -Valkey Cluster tries harder to retain writes that are performed by clients connected to the majority of masters, compared to writes performed in the minority side. +Valkey Cluster tries harder to retain writes that are performed by clients connected to the majority of primaries, compared to writes performed in the minority side. The following are examples of scenarios that lead to loss of acknowledged writes received in the majority partitions during failures: -1. A write may reach a master, but while the master may be able to reply to the client, the write may not be propagated to replicas via the asynchronous replication used between master and replica nodes. If the master dies without the write reaching the replicas, the write is lost forever if the master is unreachable for a long enough period that one of its replicas is promoted. This is usually hard to observe in the case of a total, sudden failure of a master node since masters try to reply to clients (with the acknowledge of the write) and replicas (propagating the write) at about the same time. However it is a real world failure mode. +1. A write may reach a primary, but while the primary may be able to reply to the client, the write may not be propagated to replicas via the asynchronous replication used between primary and replica nodes. If the primary dies without the write reaching the replicas, the write is lost forever if the primary is unreachable for a long enough period that one of its replicas is promoted. This is usually hard to observe in the case of a total, sudden failure of a primary node since primaries try to reply to clients (with the acknowledge of the write) and replicas (propagating the write) at about the same time. However it is a real world failure mode. 2. Another theoretically possible failure mode where writes are lost is the following: -* A master is unreachable because of a partition. +* A primary is unreachable because of a partition. * It gets failed over by one of its replicas. * After some time it may be reachable again. -* A client with an out-of-date routing table may write to the old master before it is converted into a replica (of the new master) by the cluster. +* A client with an out-of-date routing table may write to the old primary before it is converted into a replica (of the new primary) by the cluster. 
-The second failure mode is unlikely to happen because master nodes unable to communicate with the majority of the other masters for enough time to be failed over will no longer accept writes, and when the partition is fixed writes are still refused for a small amount of time to allow other nodes to inform about configuration changes. This failure mode also requires that the client's routing table has not yet been updated.
+The second failure mode is unlikely to happen because primary nodes unable to communicate with the majority of the other primaries for enough time to be failed over will no longer accept writes, and when the partition is fixed writes are still refused for a small amount of time to allow other nodes to inform about configuration changes. This failure mode also requires that the client's routing table has not yet been updated.

-Writes targeting the minority side of a partition have a larger window in which to get lost. For example, Valkey Cluster loses a non-trivial number of writes on partitions where there is a minority of masters and at least one or more clients, since all the writes sent to the masters may potentially get lost if the masters are failed over in the majority side.
+Writes targeting the minority side of a partition have a larger window in which to get lost. For example, Valkey Cluster loses a non-trivial number of writes on partitions where there is a minority of primaries and at least one or more clients, since all the writes sent to the primaries may potentially get lost if the primaries are failed over in the majority side.

-Specifically, for a master to be failed over it must be unreachable by the majority of masters for at least `NODE_TIMEOUT`, so if the partition is fixed before that time, no writes are lost. When the partition lasts for more than `NODE_TIMEOUT`, all the writes performed in the minority side up to that point may be lost. However the minority side of a Valkey Cluster will start refusing writes as soon as `NODE_TIMEOUT` time has elapsed without contact with the majority, so there is a maximum window after which the minority becomes no longer available. Hence, no writes are accepted or lost after that time.
+Specifically, for a primary to be failed over it must be unreachable by the majority of primaries for at least `NODE_TIMEOUT`, so if the partition is fixed before that time, no writes are lost. When the partition lasts for more than `NODE_TIMEOUT`, all the writes performed in the minority side up to that point may be lost. However the minority side of a Valkey Cluster will start refusing writes as soon as `NODE_TIMEOUT` time has elapsed without contact with the majority, so there is a maximum window after which the minority becomes no longer available. Hence, no writes are accepted or lost after that time.

 ### Availability

-Valkey Cluster is not available in the minority side of the partition. In the majority side of the partition assuming that there are at least the majority of masters and a replica for every unreachable master, the cluster becomes available again after `NODE_TIMEOUT` time plus a few more seconds required for a replica to get elected and failover its master (failovers are usually executed in a matter of 1 or 2 seconds).
+Valkey Cluster is not available in the minority side of the partition.
In the majority side of the partition assuming that there are at least the majority of primaries and a replica for every unreachable primary, the cluster becomes available again after `NODE_TIMEOUT` time plus a few more seconds required for a replica to get elected and failover its primary (failovers are usually executed in a matter of 1 or 2 seconds). This means that Valkey Cluster is designed to survive failures of a few nodes in the cluster, but it is not a suitable solution for applications that require availability in the event of large net splits. -In the example of a cluster composed of N master nodes where every node has a single replica, the majority side of the cluster will remain available as long as a single node is partitioned away, and will remain available with a probability of `1-(1/(N*2-1))` when two nodes are partitioned away (after the first node fails we are left with `N*2-1` nodes in total, and the probability of the only master without a replica to fail is `1/(N*2-1))`. +In the example of a cluster composed of N primary nodes where every node has a single replica, the majority side of the cluster will remain available as long as a single node is partitioned away, and will remain available with a probability of `1-(1/(N*2-1))` when two nodes are partitioned away (after the first node fails we are left with `N*2-1` nodes in total, and the probability of the only primary without a replica to fail is `1/(N*2-1))`. For example, in a cluster with 5 nodes and a single replica per node, there is a `1/(5*2-1) = 11.11%` probability that after two nodes are partitioned away from the majority, the cluster will no longer be available. Thanks to a Valkey Cluster feature called **replicas migration** the Cluster availability is improved in many real world scenarios by the fact that -replicas migrate to orphaned masters (masters no longer having replicas). +replicas migrate to orphaned primaries (primaries no longer having replicas). So at every successful failure event, the cluster may reconfigure the replicas layout in order to better resist the next failure. @@ -111,7 +111,7 @@ Because of the use of asynchronous replication, nodes do not wait for other node Also, because multi-key commands are only limited to *near* keys, data is never moved between nodes except when resharding. -Normal operations are handled exactly as in the case of a single Valkey instance. This means that in a Valkey Cluster with N master nodes you can expect the same performance as a single Valkey instance multiplied by N as the design scales linearly. At the same time the query is usually performed in a single round trip, since clients usually retain persistent connections with the nodes, so latency figures are also the same as the single standalone Valkey node case. +Normal operations are handled exactly as in the case of a single Valkey instance. This means that in a Valkey Cluster with N primary nodes you can expect the same performance as a single Valkey instance multiplied by N as the design scales linearly. At the same time the query is usually performed in a single round trip, since clients usually retain persistent connections with the nodes, so latency figures are also the same as the single standalone Valkey node case. Very high performance and scalability while preserving weak but reasonable forms of data safety and availability is the main goal of @@ -132,10 +132,10 @@ non-clustered Valkey deployment. 
### Key distribution model The cluster's key space is split into 16384 slots, effectively setting an upper limit -for the cluster size of 16384 master nodes (however, the suggested max size of +for the cluster size of 16384 primary nodes (however, the suggested max size of nodes is on the order of ~ 1000 nodes). -Each master node in a cluster handles a subset of the 16384 hash slots. +Each primary node in a cluster handles a subset of the 16384 hash slots. The cluster is **stable** when there is no cluster reconfiguration in progress (i.e. where hash slots are being moved from one node to another). When the cluster is stable, a single hash slot will be served by a single node @@ -275,7 +275,7 @@ a node was pinged, is instead local to each node. Every node maintains the following information about other nodes that it is aware of in the cluster: The node ID, IP and port of the node, a set of -flags, what is the master of the node if it is flagged as `replica`, last time +flags, what is the primary of the node if it is flagged as `replica`, last time the node was pinged and the last time the pong was received, the current *configuration epoch* of the node (explained later in this specification), the link state and finally the set of hash slots served. @@ -284,7 +284,7 @@ A detailed [explanation of all the node fields](../commands/cluster-nodes.md) is The `CLUSTER NODES` command can be sent to any node in the cluster and provides the state of the cluster and the information for each node according to the local view the queried node has of the cluster. -The following is sample output of the `CLUSTER NODES` command sent to a master +The following is sample output of the `CLUSTER NODES` command sent to a primary node in a small cluster of three nodes. $ valkey-cli cluster nodes @@ -436,7 +436,7 @@ The following subcommands are available (among others not useful in this case): The first four commands, `ADDSLOTS`, `DELSLOTS`, `ADDSLOTSRANGE` and `DELSLOTSRANGE`, are simply used to assign (or remove) slots to a Valkey node. Assigning a slot means to tell a given -master node that it will be in charge of storing and serving content for +primary node that it will be in charge of storing and serving content for the specified hash slot. After the hash slots are assigned they will propagate across the cluster @@ -444,7 +444,7 @@ using the gossip protocol, as specified later in the *configuration propagation* section. The `ADDSLOTS` and `ADDSLOTSRANGE` commands are usually used when a new cluster is created -from scratch to assign each master node a subset of all the 16384 hash +from scratch to assign each primary node a subset of all the 16384 hash slots available. The `DELSLOTS` and `DELSLOTSRANGE` are mainly used for manual modification of a cluster configuration @@ -466,7 +466,7 @@ by the client, the query is redirected to the real hash slot owner via a `-MOVED` redirection error, as would happen normally. Let's make this clearer with an example of hash slot migration. -Assume that we have two Valkey master nodes, called A and B. +Assume that we have two Valkey primary nodes, called A and B. We want to move hash slot 8 from A to B, so we issue commands like this: * We send B: CLUSTER SETSLOT 8 IMPORTING A @@ -606,12 +606,12 @@ addresses in two different situations: Note that a client may handle the `MOVED` redirection by updating just the moved slot in its table; however this is usually not efficient because often the configuration of multiple slots will be modified at once. 
For example, if a -replica is promoted to master, all of the slots served by the old master will +replica is promoted to primary, all of the slots served by the old primary will be remapped). It is much simpler to react to a `MOVED` redirection by fetching the full map of slots to nodes from scratch. Client can issue a `CLUSTER SLOTS` command to retrieve an array of slot -ranges and the associated master and replica nodes serving the specified ranges. +ranges and the associated primary and replica nodes serving the specified ranges. The following is an example of output of `CLUSTER SLOTS`: @@ -639,7 +639,7 @@ The following is an example of output of `CLUSTER SLOTS`: The first two sub-elements of every element of the returned array are the start and end slots of the range. The additional elements represent address-port -pairs. The first address-port pair is the master serving the slot, and the +pairs. The first address-port pair is the primary serving the slot, and the additional address-port pairs are the replicas serving the same slot. Replicas will be listed only when not in an error condition (i.e., when their FAIL flag is not set). @@ -680,7 +680,7 @@ multi-key operations are available again for that hash slot. ### Scaling reads using replica nodes -Normally replica nodes will redirect clients to the authoritative master for +Normally replica nodes will redirect clients to the authoritative primary for the hash slot involved in a given command, however clients can use replicas in order to scale reads using the `READONLY` command. @@ -689,9 +689,9 @@ possibly stale data and is not interested in running write queries. When the connection is in readonly mode, the cluster will send a redirection to the client only if the operation involves keys not served -by the replica's master node. This may happen because: +by the replica's primary node. This may happen because: -1. The client sent a command about hash slots never served by the master of this replica. +1. The client sent a command about hash slots never served by the primary of this replica. 2. The cluster was reconfigured (for example resharded) and the replica is no longer able to serve commands for a given hash slot. When this happens the client should update its hash slot map as explained in @@ -729,13 +729,13 @@ Ping and pong packets contain a header that is common to all types of packets (f The common header has the following information: * Node ID, a 160 bit pseudorandom string that is assigned the first time a node is created and remains the same for all the life of a Valkey Cluster node. -* The `currentEpoch` and `configEpoch` fields of the sending node that are used to mount the distributed algorithms used by Valkey Cluster (this is explained in detail in the next sections). If the node is a replica the `configEpoch` is the last known `configEpoch` of its master. -* The node flags, indicating if the node is a replica, a master, and other single-bit node information. -* A bitmap of the hash slots served by the sending node, or if the node is a replica, a bitmap of the slots served by its master. +* The `currentEpoch` and `configEpoch` fields of the sending node that are used to mount the distributed algorithms used by Valkey Cluster (this is explained in detail in the next sections). If the node is a replica the `configEpoch` is the last known `configEpoch` of its primary. +* The node flags, indicating if the node is a replica, a primary, and other single-bit node information. 
+* A bitmap of the hash slots served by the sending node, or if the node is a replica, a bitmap of the slots served by its primary. * The sender TCP base port that is the port used by Valkey to accept client commands. * The cluster port that is the port used by Valkey for node-to-node communication. * The state of the cluster from the point of view of the sender (down or ok). -* The master node ID of the sending node, if it is a replica. +* The primary node ID of the sending node, if it is a replica. Ping and pong packets also contain a gossip section. This section offers to the receiver a view of what the sender node thinks about other nodes in the cluster. The gossip section only contains information about a few random nodes among the set of nodes known to the sender. The number of nodes mentioned in a gossip section is proportional to the cluster size. @@ -749,13 +749,13 @@ Gossip sections allow receiving nodes to get information about the state of othe ### Failure detection -Valkey Cluster failure detection is used to recognize when a master or replica node is no longer reachable by the majority of nodes and then respond by promoting a replica to the role of master. When replica promotion is not possible the cluster is put in an error state to stop receiving queries from clients. +Valkey Cluster failure detection is used to recognize when a primary or replica node is no longer reachable by the majority of nodes and then respond by promoting a replica to the role of primary. When replica promotion is not possible the cluster is put in an error state to stop receiving queries from clients. -As already mentioned, every node takes a list of flags associated with other known nodes. There are two flags that are used for failure detection that are called `PFAIL` and `FAIL`. `PFAIL` means *Possible failure*, and is a non-acknowledged failure type. `FAIL` means that a node is failing and that this condition was confirmed by a majority of masters within a fixed amount of time. +As already mentioned, every node takes a list of flags associated with other known nodes. There are two flags that are used for failure detection that are called `PFAIL` and `FAIL`. `PFAIL` means *Possible failure*, and is a non-acknowledged failure type. `FAIL` means that a node is failing and that this condition was confirmed by a majority of primaries within a fixed amount of time. **PFAIL flag:** -A node flags another node with the `PFAIL` flag when the node is not reachable for more than `NODE_TIMEOUT` time. Both master and replica nodes can flag another node as `PFAIL`, regardless of its type. +A node flags another node with the `PFAIL` flag when the node is not reachable for more than `NODE_TIMEOUT` time. Both primary and replica nodes can flag another node as `PFAIL`, regardless of its type. The concept of non-reachability for a Valkey Cluster node is that we have an **active ping** (a ping that we sent for which we have yet to get a reply) pending for longer than `NODE_TIMEOUT`. For this mechanism to work the `NODE_TIMEOUT` must be large compared to the network round trip time. In order to add reliability during normal operations, nodes will try to reconnect with other nodes in the cluster as soon as half of the `NODE_TIMEOUT` has elapsed without a reply to a ping. This mechanism ensures that connections are kept alive so broken connections usually won't result in false failure reports between nodes. 
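In a deployed cluster, the `NODE_TIMEOUT` used for `PFAIL`/`FAIL` detection corresponds to the `cluster-node-timeout` setting, expressed in milliseconds; a brief sketch with an illustrative value and port:

```
# valkey.conf — how long a node may stay unreachable before being flagged PFAIL
cluster-node-timeout 15000

# inspect the effective value at runtime
$ valkey-cli -p 7000 CONFIG GET cluster-node-timeout
1) "cluster-node-timeout"
2) "15000"
```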
@@ -768,8 +768,8 @@ As outlined in the node heartbeats section of this document, every node sends go A `PFAIL` condition is escalated to a `FAIL` condition when the following set of conditions are met: * Some node, that we'll call A, has another node B flagged as `PFAIL`. -* Node A collected, via gossip sections, information about the state of B from the point of view of the majority of masters in the cluster. -* The majority of masters signaled the `PFAIL` or `FAIL` condition within `NODE_TIMEOUT * FAIL_REPORT_VALIDITY_MULT` time. (The validity factor is set to 2 in the current implementation, so this is just two times the `NODE_TIMEOUT` time). +* Node A collected, via gossip sections, information about the state of B from the point of view of the majority of primaries in the cluster. +* The majority of primaries signaled the `PFAIL` or `FAIL` condition within `NODE_TIMEOUT * FAIL_REPORT_VALIDITY_MULT` time. (The validity factor is set to 2 in the current implementation, so this is just two times the `NODE_TIMEOUT` time). If all the above conditions are true, Node A will: @@ -781,21 +781,21 @@ The `FAIL` message will force every receiving node to mark the node in `FAIL` st Note that *the FAIL flag is mostly one way*. That is, a node can go from `PFAIL` to `FAIL`, but a `FAIL` flag can only be cleared in the following situations: * The node is already reachable and is a replica. In this case the `FAIL` flag can be cleared as replicas are not failed over. -* The node is already reachable and is a master not serving any slot. In this case the `FAIL` flag can be cleared as masters without slots do not really participate in the cluster and are waiting to be configured in order to join the cluster. -* The node is already reachable and is a master, but a long time (N times the `NODE_TIMEOUT`) has elapsed without any detectable replica promotion. It's better for it to rejoin the cluster and continue in this case. +* The node is already reachable and is a primary not serving any slot. In this case the `FAIL` flag can be cleared as primaries without slots do not really participate in the cluster and are waiting to be configured in order to join the cluster. +* The node is already reachable and is a primary, but a long time (N times the `NODE_TIMEOUT`) has elapsed without any detectable replica promotion. It's better for it to rejoin the cluster and continue in this case. It is useful to note that while the `PFAIL` -> `FAIL` transition uses a form of agreement, the agreement used is weak: -1. Nodes collect views of other nodes over some time period, so even if the majority of master nodes need to "agree", actually this is just state that we collected from different nodes at different times and we are not sure, nor we require, that at a given moment the majority of masters agreed. However we discard failure reports which are old, so the failure was signaled by the majority of masters within a window of time. +1. Nodes collect views of other nodes over some time period, so even if the majority of primary nodes need to "agree", actually this is just state that we collected from different nodes at different times and we are not sure, nor we require, that at a given moment the majority of primaries agreed. However we discard failure reports which are old, so the failure was signaled by the majority of primaries within a window of time. 2. 
While every node detecting the `FAIL` condition will force that condition on other nodes in the cluster using the `FAIL` message, there is no way to ensure the message will reach all the nodes. For instance a node may detect the `FAIL` condition and because of a partition will not be able to reach any other node. However the Valkey Cluster failure detection has a liveness requirement: eventually all the nodes should agree about the state of a given node. There are two cases that can originate from split brain conditions. Either some minority of nodes believe the node is in `FAIL` state, or a minority of nodes believe the node is not in `FAIL` state. In both the cases eventually the cluster will have a single view of the state of a given node: -**Case 1**: If a majority of masters have flagged a node as `FAIL`, because of failure detection and the *chain effect* it generates, every other node will eventually flag the master as `FAIL`, since in the specified window of time enough failures will be reported. +**Case 1**: If a majority of primaries have flagged a node as `FAIL`, because of failure detection and the *chain effect* it generates, every other node will eventually flag the primary as `FAIL`, since in the specified window of time enough failures will be reported. -**Case 2**: When only a minority of masters have flagged a node as `FAIL`, the replica promotion will not happen (as it uses a more formal algorithm that makes sure everybody knows about the promotion eventually) and every node will clear the `FAIL` state as per the `FAIL` state clearing rules above (i.e. no promotion after N times the `NODE_TIMEOUT` has elapsed). +**Case 2**: When only a minority of primaries have flagged a node as `FAIL`, the replica promotion will not happen (as it uses a more formal algorithm that makes sure everybody knows about the promotion eventually) and every node will clear the `FAIL` state as per the `FAIL` state clearing rules above (i.e. no promotion after N times the `NODE_TIMEOUT` has elapsed). -**The `FAIL` flag is only used as a trigger to run the safe part of the algorithm** for the replica promotion. In theory a replica may act independently and start a replica promotion when its master is not reachable, and wait for the masters to refuse to provide the acknowledgment if the master is actually reachable by the majority. However the added complexity of the `PFAIL -> FAIL` state, the weak agreement, and the `FAIL` message forcing the propagation of the state in the shortest amount of time in the reachable part of the cluster, have practical advantages. Because of these mechanisms, usually all the nodes will stop accepting writes at about the same time if the cluster is in an error state. This is a desirable feature from the point of view of applications using Valkey Cluster. Also erroneous election attempts initiated by replicas that can't reach its master due to local problems (the master is otherwise reachable by the majority of other master nodes) are avoided. +**The `FAIL` flag is only used as a trigger to run the safe part of the algorithm** for the replica promotion. In theory a replica may act independently and start a replica promotion when its primary is not reachable, and wait for the primaries to refuse to provide the acknowledgment if the primary is actually reachable by the majority. 
However the added complexity of the `PFAIL -> FAIL` state, the weak agreement, and the `FAIL` message forcing the propagation of the state in the shortest amount of time in the reachable part of the cluster, have practical advantages. Because of these mechanisms, usually all the nodes will stop accepting writes at about the same time if the cluster is in an error state. This is a desirable feature from the point of view of applications using Valkey Cluster. Also erroneous election attempts initiated by replicas that can't reach its primary due to local problems (the primary is otherwise reachable by the majority of other primary nodes) are avoided. ## Configuration handling, propagation, and failovers @@ -805,7 +805,7 @@ Valkey Cluster uses a concept similar to the Raft algorithm "term". In Valkey Cl The `currentEpoch` is a 64 bit unsigned number. -At node creation every Valkey Cluster node, both replicas and master nodes, set the `currentEpoch` to 0. +At node creation every Valkey Cluster node, both replicas and primary nodes, set the `currentEpoch` to 0. Every time a packet is received from another node, if the epoch of the sender (part of the cluster bus messages header) is greater than the local node epoch, the `currentEpoch` is updated to the sender epoch. @@ -817,18 +817,18 @@ Currently this happens only during replica promotion, as described in the next s ### Configuration epoch -Every master always advertises its `configEpoch` in ping and pong packets along with a bitmap advertising the set of slots it serves. +Every primary always advertises its `configEpoch` in ping and pong packets along with a bitmap advertising the set of slots it serves. -The `configEpoch` is set to zero in masters when a new node is created. +The `configEpoch` is set to zero in primaries when a new node is created. -A new `configEpoch` is created during replica election. replicas trying to replace -failing masters increment their epoch and try to get authorization from -a majority of masters. When a replica is authorized, a new unique `configEpoch` -is created and the replica turns into a master using the new `configEpoch`. +A new `configEpoch` is created during replica election. Replicas trying to replace +failing primaries increment their epoch and try to get authorization from +a majority of primaries. When a replica is authorized, a new unique `configEpoch` +is created and the replica turns into a primary using the new `configEpoch`. As explained in the next sections the `configEpoch` helps to resolve conflicts when different nodes claim divergent configurations (a condition that may happen because of network partitions and node failures). -replica nodes also advertise the `configEpoch` field in ping and pong packets, but in the case of replicas the field represents the `configEpoch` of its master as of the last time they exchanged packets. This allows other instances to detect when a replica has an old configuration that needs to be updated (master nodes will not grant votes to replicas with an old configuration). +Replica nodes also advertise the `configEpoch` field in ping and pong packets, but in the case of replicas the field represents the `configEpoch` of its primary as of the last time they exchanged packets. This allows other instances to detect when a replica has an old configuration that needs to be updated (primary nodes will not grant votes to replicas with an old configuration). 
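The two epochs discussed above can be observed on a live node; a minimal sketch, assuming the `CLUSTER INFO` field names `cluster_current_epoch` and `cluster_my_epoch` (output truncated to the relevant lines, values illustrative):

```
$ valkey-cli -p 7000 CLUSTER INFO | grep epoch
cluster_current_epoch:6
cluster_my_epoch:1
```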
Every time the `configEpoch` changes for some known node, it is permanently stored in the nodes.conf file by all the nodes that receive this information. The same also happens for the `currentEpoch` value. These two variables are guaranteed to be saved and `fsync-ed` to disk when updated before a node continues its operations. @@ -837,93 +837,93 @@ are guaranteed to be new, incremental, and unique. ### Replica election and promotion -Replica election and promotion is handled by replica nodes, with the help of master nodes that vote for the replica to promote. -A replica election happens when a master is in `FAIL` state from the point of view of at least one of its replicas that has the prerequisites in order to become a master. +Replica election and promotion is handled by replica nodes, with the help of the primary nodes that vote for the replica to promote. +A replica election happens when a primary is in `FAIL` state from the point of view of at least one of its replicas that has the prerequisites in order to become a primary. -In order for a replica to promote itself to master, it needs to start an election and win it. All the replicas for a given master can start an election if the master is in `FAIL` state, however only one replica will win the election and promote itself to master. +In order for a replica to promote itself to primary, it needs to start an election and win it. All the replicas for a given primary can start an election if the primary is in `FAIL` state, however only one replica will win the election and promote itself to primary. A replica starts an election when the following conditions are met: -* The replica's master is in `FAIL` state. -* The master was serving a non-zero number of slots. -* The replica replication link was disconnected from the master for no longer than a given amount of time, in order to ensure the promoted replica's data is reasonably fresh. This time is user configurable. +* The replica's primary is in `FAIL` state. +* The primary was serving a non-zero number of slots. +* The replica replication link was disconnected from the primary for no longer than a given amount of time, in order to ensure the promoted replica's data is reasonably fresh. This time is user configurable. -In order to be elected, the first step for a replica is to increment its `currentEpoch` counter, and request votes from master instances. +In order to be elected, the first step for a replica is to increment its `currentEpoch` counter, and request votes from primary instances. -Votes are requested by the replica by broadcasting a `FAILOVER_AUTH_REQUEST` packet to every master node of the cluster. Then it waits for a maximum time of two times the `NODE_TIMEOUT` for replies to arrive (but always for at least 2 seconds). +Votes are requested by the replica by broadcasting a `FAILOVER_AUTH_REQUEST` packet to every primary node of the cluster. Then it waits for a maximum time of two times the `NODE_TIMEOUT` for replies to arrive (but always for at least 2 seconds). -Once a master has voted for a given replica, replying positively with a `FAILOVER_AUTH_ACK`, it can no longer vote for another replica of the same master for a period of `NODE_TIMEOUT * 2`. In this period it will not be able to reply to other authorization requests for the same master. This is not needed to guarantee safety, but useful for preventing multiple replicas from getting elected (even if with a different `configEpoch`) at around the same time, which is usually not wanted. 
+Once a primary has voted for a given replica, replying positively with a `FAILOVER_AUTH_ACK`, it can no longer vote for another replica of the same primary for a period of `NODE_TIMEOUT * 2`. In this period it will not be able to reply to other authorization requests for the same primary. This is not needed to guarantee safety, but useful for preventing multiple replicas from getting elected (even if with a different `configEpoch`) at around the same time, which is usually not wanted. A replica discards any `AUTH_ACK` replies with an epoch that is less than the `currentEpoch` at the time the vote request was sent. This ensures it doesn't count votes intended for a previous election. -Once the replica receives ACKs from the majority of masters, it wins the election. +Once the replica receives ACKs from the majority of primaries, it wins the election. Otherwise if the majority is not reached within the period of two times `NODE_TIMEOUT` (but always at least 2 seconds), the election is aborted and a new one will be tried again after `NODE_TIMEOUT * 4` (and always at least 4 seconds). ### Replica rank -As soon as a master is in `FAIL` state, a replica waits a short period of time before trying to get elected. That delay is computed as follows: +As soon as a primary is in `FAIL` state, a replica waits a short period of time before trying to get elected. That delay is computed as follows: DELAY = 500 milliseconds + random delay between 0 and 500 milliseconds + REPLICA_RANK * 1000 milliseconds. -The fixed delay ensures that we wait for the `FAIL` state to propagate across the cluster, otherwise the replica may try to get elected while the masters are still unaware of the `FAIL` state, refusing to grant their vote. +The fixed delay ensures that we wait for the `FAIL` state to propagate across the cluster, otherwise the replica may try to get elected while the primaries are still unaware of the `FAIL` state, refusing to grant their vote. The random delay is used to desynchronize replicas so they're unlikely to start an election at the same time. -The `REPLICA_RANK` is the rank of this replica regarding the amount of replication data it has processed from the master. -Replicas exchange messages when the master is failing in order to establish a (best effort) rank: +The `REPLICA_RANK` is the rank of this replica regarding the amount of replication data it has processed from the primary. +Replicas exchange messages when the primary is failing in order to establish a (best effort) rank: the replica with the most updated replication offset is at rank 0, the second most updated at rank 1, and so forth. In this way the most updated replicas try to get elected before others. Rank order is not strictly enforced; if a replica of higher rank fails to be elected, the others will try shortly. -Once a replica wins the election, it obtains a new unique and incremental `configEpoch` which is higher than that of any other existing master. It starts advertising itself as master in ping and pong packets, providing the set of served slots with a `configEpoch` that will win over the past ones. +Once a replica wins the election, it obtains a new unique and incremental `configEpoch` which is higher than that of any other existing primary. It starts advertising itself as primary in ping and pong packets, providing the set of served slots with a `configEpoch` that will win over the past ones. In order to speedup the reconfiguration of other nodes, a pong packet is broadcast to all the nodes of the cluster. 
Currently unreachable nodes will eventually be reconfigured when they receive a ping or pong packet from another node or will receive an `UPDATE` packet from another node if the information it publishes via heartbeat packets are detected to be out of date. -The other nodes will detect that there is a new master serving the same slots served by the old master but with a greater `configEpoch`, and will upgrade their configuration. Replicas of the old master (or the failed over master if it rejoins the cluster) will not just upgrade the configuration but will also reconfigure to replicate from the new master. How nodes rejoining the cluster are configured is explained in the next sections. +The other nodes will detect that there is a new primary serving the same slots served by the old primary but with a greater `configEpoch`, and will upgrade their configuration. Replicas of the old primary (or the failed over primary if it rejoins the cluster) will not just upgrade the configuration but will also reconfigure to replicate from the new primary. How nodes rejoining the cluster are configured is explained in the next sections. ### Masters reply to replica vote request -In the previous section, we discussed how replicas try to get elected. This section explains what happens from the point of view of a master that is requested to vote for a given replica. +In the previous section, we discussed how replicas try to get elected. This section explains what happens from the point of view of a primary that is requested to vote for a given replica. Masters receive requests for votes in form of `FAILOVER_AUTH_REQUEST` requests from replicas. For a vote to be granted the following conditions need to be met: -1. A master only votes a single time for a given epoch, and refuses to vote for older epochs: every master has a lastVoteEpoch field and will refuse to vote again as long as the `currentEpoch` in the auth request packet is not greater than the lastVoteEpoch. When a master replies positively to a vote request, the lastVoteEpoch is updated accordingly, and safely stored on disk. -2. A master votes for a replica only if the replica's master is flagged as `FAIL`. -3. Auth requests with a `currentEpoch` that is less than the master `currentEpoch` are ignored. Because of this the master reply will always have the same `currentEpoch` as the auth request. If the same replica asks again to be voted, incrementing the `currentEpoch`, it is guaranteed that an old delayed reply from the master can not be accepted for the new vote. +1. A primary only votes a single time for a given epoch, and refuses to vote for older epochs: every primary has a lastVoteEpoch field and will refuse to vote again as long as the `currentEpoch` in the auth request packet is not greater than the lastVoteEpoch. When a primary replies positively to a vote request, the lastVoteEpoch is updated accordingly, and safely stored on disk. +2. A primary votes for a replica only if the replica's primary is flagged as `FAIL`. +3. Auth requests with a `currentEpoch` that is less than the primary `currentEpoch` are ignored. Because of this the primary reply will always have the same `currentEpoch` as the auth request. If the same replica asks again to be voted, incrementing the `currentEpoch`, it is guaranteed that an old delayed reply from the primary can not be accepted for the new vote. 
Example of the issue caused by not using rule number 3: -Master `currentEpoch` is 5, lastVoteEpoch is 1 (this may happen after a few failed elections) +Primary `currentEpoch` is 5, lastVoteEpoch is 1 (this may happen after a few failed elections) * Replica `currentEpoch` is 3. -* Replica tries to be elected with epoch 4 (3+1), master replies with an ok with `currentEpoch` 5, however the reply is delayed. +* Replica tries to be elected with epoch 4 (3+1), primary replies with an ok with `currentEpoch` 5, however the reply is delayed. * Replica will try to be elected again, at a later time, with epoch 5 (4+1), the delayed reply reaches the replica with `currentEpoch` 5, and is accepted as valid. -4. Masters don't vote for a replica of the same master before `NODE_TIMEOUT * 2` has elapsed if a replica of that master was already voted for. This is not strictly required as it is not possible for two replicas to win the election in the same epoch. However, in practical terms it ensures that when a replica is elected it has plenty of time to inform the other replicas and avoid the possibility that another replica will win a new election, performing an unnecessary second failover. -5. Masters make no effort to select the best replica in any way. If the replica's master is in `FAIL` state and the master did not vote in the current term, a positive vote is granted. The best replica is the most likely to start an election and win it before the other replicas, since it will usually be able to start the voting process earlier because of its *higher rank* as explained in the previous section. -6. When a master refuses to vote for a given replica there is no negative response, the request is simply ignored. -7. Masters don't vote for replicas sending a `configEpoch` that is less than any `configEpoch` in the master table for the slots claimed by the replica. Remember that the replica sends the `configEpoch` of its master, and the bitmap of the slots served by its master. This means that the replica requesting the vote must have a configuration for the slots it wants to failover that is newer or equal the one of the master granting the vote. +4. Primaries don't vote for a replica of the same primary before `NODE_TIMEOUT * 2` has elapsed if a replica of that primary was already voted for. This is not strictly required as it is not possible for two replicas to win the election in the same epoch. However, in practical terms it ensures that when a replica is elected it has plenty of time to inform the other replicas and avoid the possibility that another replica will win a new election, performing an unnecessary second failover. +5. Primaries make no effort to select the best replica in any way. If the replica's primary is in `FAIL` state and the primary did not vote in the current term, a positive vote is granted. The best replica is the most likely to start an election and win it before the other replicas, since it will usually be able to start the voting process earlier because of its *higher rank* as explained in the previous section. +6. When a primary refuses to vote for a given replica there is no negative response, the request is simply ignored. +7. Primaries don't vote for replicas sending a `configEpoch` that is less than any `configEpoch` in the primary table for the slots claimed by the replica. Remember that the replica sends the `configEpoch` of its primary, and the bitmap of the slots served by its primary. 
This means that the replica requesting the vote must have a configuration for the slots it wants to failover that is newer or equal the one of the primary granting the vote. ### Practical example of configuration epoch usefulness during partitions This section illustrates how the epoch concept is used to make the replica promotion process more resistant to partitions. -* A master is no longer reachable indefinitely. The master has three replicas A, B, C. -* Replica A wins the election and is promoted to master. +* A primary is no longer reachable indefinitely. The primary has three replicas A, B, C. +* Replica A wins the election and is promoted to primary. * A network partition makes A not available for the majority of the cluster. -* Replica B wins the election and is promoted as master. +* Replica B wins the election and is promoted as primary. * A partition makes B not available for the majority of the cluster. * The previous partition is fixed, and A is available again. -At this point B is down and A is available again with a role of master (actually `UPDATE` messages would reconfigure it promptly, but here we assume all `UPDATE` messages were lost). At the same time, replica C will try to get elected in order to fail over B. This is what happens: +At this point B is down and A is available again with a role of primary (actually `UPDATE` messages would reconfigure it promptly, but here we assume all `UPDATE` messages were lost). At the same time, replica C will try to get elected in order to fail over B. This is what happens: -1. C will try to get elected and will succeed, since for the majority of masters its master is actually down. It will obtain a new incremental `configEpoch`. -2. A will not be able to claim to be the master for its hash slots, because the other nodes already have the same hash slots associated with a higher configuration epoch (the one of B) compared to the one published by A. +1. C will try to get elected and will succeed, since for the majority of primaries its primary is actually down. It will obtain a new incremental `configEpoch`. +2. A will not be able to claim to be the primary for its hash slots, because the other nodes already have the same hash slots associated with a higher configuration epoch (the one of B) compared to the one published by A. 3. So, all the nodes will upgrade their table to assign the hash slots to C, and the cluster will continue its operations. As you'll see in the next sections, a stale node rejoining a cluster @@ -933,14 +933,14 @@ has stale information and will send an `UPDATE` message. ### Hash slots configuration propagation -An important part of Valkey Cluster is the mechanism used to propagate the information about which cluster node is serving a given set of hash slots. This is vital to both the startup of a fresh cluster and the ability to upgrade the configuration after a replica was promoted to serve the slots of its failing master. +An important part of Valkey Cluster is the mechanism used to propagate the information about which cluster node is serving a given set of hash slots. This is vital to both the startup of a fresh cluster and the ability to upgrade the configuration after a replica was promoted to serve the slots of its failing primary. The same mechanism allows nodes partitioned away for an indefinite amount of time to rejoin the cluster in a sensible way. There are two ways hash slot configurations are propagated: -1. Heartbeat messages. 
The sender of a ping or pong packet always adds information about the set of hash slots it (or its master, if it is a replica) serves. +1. Heartbeat messages. The sender of a ping or pong packet always adds information about the set of hash slots it (or its primary, if it is a replica) serves. 2. `UPDATE` messages. Since in every heartbeat packet there is information about the sender `configEpoch` and set of hash slots served, if a receiver of a heartbeat packet finds the sender information is stale, it will send a packet with new information, forcing the stale node to update its info. The receiver of a heartbeat or `UPDATE` message uses certain simple rules in @@ -968,23 +968,23 @@ So if we receive a heartbeat from node A claiming to serve hash slots 1 and 2 wi 16383 -> NULL ``` -When a new cluster is created, a system administrator needs to manually assign (using the `CLUSTER ADDSLOTS` command, via the valkey-cli command line tool, or by any other means) the slots served by each master node only to the node itself, and the information will rapidly propagate across the cluster. +When a new cluster is created, a system administrator needs to manually assign (using the `CLUSTER ADDSLOTS` command, via the valkey-cli command line tool, or by any other means) the slots served by each primary node only to the node itself, and the information will rapidly propagate across the cluster. However this rule is not enough. We know that hash slot mapping can change during two events: -1. A replica replaces its master during a failover. +1. A replica replaces its primary during a failover. 2. A slot is resharded from a node to a different one. -For now let's focus on failovers. When a replica fails over its master, it obtains +For now let's focus on failovers. When a replica fails over its primary, it obtains a configuration epoch which is guaranteed to be greater than the one of its -master (and more generally greater than any other configuration epoch +primary (and more generally greater than any other configuration epoch generated previously). For example node B, which is a replica of A, may failover A with configuration epoch of 4. It will start to send heartbeat packets (the first time mass-broadcasting cluster-wide) and because of the following second rule, receivers will update their hash slot tables: -**Rule 2**: If a hash slot is already assigned, and a known node is advertising it using a `configEpoch` that is greater than the `configEpoch` of the master currently associated with the slot, I'll rebind the hash slot to the new node. +**Rule 2**: If a hash slot is already assigned, and a known node is advertising it using a `configEpoch` that is greater than the `configEpoch` of the primary currently associated with the slot, it'll rebind the hash slot to the new node. So after receiving messages from B that claim to serve hash slots 1 and 2 with configuration epoch of 4, the receivers will update their table in the following way: @@ -1021,57 +1021,57 @@ The same basic mechanism is used when a node rejoins a cluster. Continuing with the example above, node A will be notified that hash slots 1 and 2 are now served by B. Assuming that these two were the only hash slots served by A, the count of hash slots served by A will -drop to 0! So A will **reconfigure to be a replica of the new master**. +drop to 0! So A will **reconfigure to be a replica of the new primary**. The actual rule followed is a bit more complex than this. 
In general it may happen that A rejoins after a lot of time, in the meantime it may happen that hash slots originally served by A are served by multiple nodes, for example hash slot 1 may be served by B, and hash slot 2 by C. -So the actual *Valkey Cluster node role switch rule* is: **A master node will change its configuration to replicate (be a replica of) the node that stole its last hash slot**. +So the actual *Valkey Cluster node role switch rule* is: **A primary node will change its configuration to replicate (be a replica of) the node that stole its last hash slot**. -During reconfiguration, eventually the number of served hash slots will drop to zero, and the node will reconfigure accordingly. Note that in the base case this just means that the old master will be a replica of the replica that replaced it after a failover. However in the general form the rule covers all possible cases. +During reconfiguration, eventually the number of served hash slots will drop to zero, and the node will reconfigure accordingly. Note that in the base case this just means that the old primary will be a replica of the replica that replaced it after a failover. However in the general form the rule covers all possible cases. Replicas do exactly the same: they reconfigure to replicate the node that -stole the last hash slot of its former master. +stole the last hash slot of its former primary. ### Replica migration Valkey Cluster implements a concept called *replica migration* in order to improve the availability of the system. The idea is that in a cluster with -a master-replica setup, if the map between replicas and masters is fixed +a primary-replica setup, if the map between replicas and primaries is fixed availability is limited over time if multiple independent failures of single nodes happen. -For example in a cluster where every master has a single replica, the cluster -can continue operations as long as either the master or the replica fail, but not +For example in a cluster where every primary has a single replica, the cluster +can continue operations as long as either the primary or the replica fail, but not if both fail the same time. However there is a class of failures that are the independent failures of single nodes caused by hardware or software issues that can accumulate over time. For example: * Master A has a single replica A1. -* Master A fails. A1 is promoted as new master. +* Master A fails. A1 is promoted as new primary. * Three hours later A1 fails in an independent manner (unrelated to the failure of A). No other replica is available for promotion since node A is still down. The cluster cannot continue normal operations. -If the map between masters and replicas is fixed, the only way to make the cluster -more resistant to the above scenario is to add replicas to every master, however +If the map between primaries and replicas is fixed, the only way to make the cluster +more resistant to the above scenario is to add replicas to every primary, however this is costly as it requires more instances of Valkey to be executed, more memory, and so forth. An alternative is to create an asymmetry in the cluster, and let the cluster layout automatically change over time. For example the cluster may have three -masters A, B, C. A and B have a single replica each, A1 and B1. However the master +primaries A, B, C. A and B have a single replica each, A1 and B1. However, the primary C is different and has two replicas: C1 and C2. 
Replica migration is the process of automatic reconfiguration of a replica -in order to *migrate* to a master that has no longer coverage (no working +in order to *migrate* to a primary that has no longer coverage (no working replicas). With replica migration the scenario mentioned above turns into the following: * Master A fails. A1 is promoted. * C2 migrates as replica of A1, that is otherwise not backed by any replica. * Three hours later A1 fails as well. -* C2 is promoted as new master to replace A1. +* C2 is promoted as a new primary to replace A1. * The cluster can continue the operations. ### Replica migration algorithm @@ -1079,43 +1079,43 @@ following: The migration algorithm does not use any form of agreement since the replica layout in a Valkey Cluster is not part of the cluster configuration that needs to be consistent and/or versioned with config epochs. Instead it uses an -algorithm to avoid mass-migration of replicas when a master is not backed. +algorithm to avoid mass-migration of replicas when a primary is not backed. The algorithm guarantees that eventually (once the cluster configuration is -stable) every master will be backed by at least one replica. +stable) every primary will be backed by at least one replica. This is how the algorithm works. To start we need to define what is a *good replica* in this context: a good replica is a replica not in `FAIL` state from the point of view of a given node. The execution of the algorithm is triggered in every replica that detects that -there is at least a single master without good replicas. However among all the +there is at least a single primary without good replicas. However among all the replicas detecting this condition, only a subset should act. This subset is actually often a single replica unless different replicas have in a given moment a slightly different view of the failure state of other nodes. -The *acting replica* is the replica among the masters with the maximum number +The *acting replica* is the replica among the primaries with the maximum number of attached replicas, that is not in FAIL state and has the smallest node ID. -So for example if there are 10 masters with 1 replica each, and 2 masters with -5 replicas each, the replica that will try to migrate is - among the 2 masters +So for example if there are 10 primaries with 1 replica each, and 2 primaries with +5 replicas each, the replica that will try to migrate is - among the 2 primaries having 5 replicas - the one with the lowest node ID. Given that no agreement is used, it is possible that when the cluster configuration is not stable, a race condition occurs where multiple replicas believe themselves to be the non-failing replica with the lower node ID (it is unlikely for this to happen in practice). If this happens, the result is multiple replicas migrating to the -same master, which is harmless. If the race happens in a way that will leave -the ceding master without replicas, as soon as the cluster is stable again +same primary, which is harmless. If the race happens in a way that will leave +the ceding primary without replicas, as soon as the cluster is stable again the algorithm will be re-executed again and will migrate a replica back to -the original master. +the original primary. -Eventually every master will be backed by at least one replica. However, -the normal behavior is that a single replica migrates from a master with -multiple replicas to an orphaned master. +Eventually every primary will be backed by at least one replica. 
However, +the normal behavior is that a single replica migrates from a primary with +multiple replicas to an orphaned primary. The algorithm is controlled by a user-configurable parameter called -`cluster-migration-barrier`: the number of good replicas a master +`cluster-migration-barrier`: the number of good replicas a primary must be left with before a replica can migrate away. For example, if this -parameter is set to 2, a replica can try to migrate only if its master remains +parameter is set to 2, a replica can try to migrate only if its primary remains with two working replicas. ### configEpoch conflicts resolution algorithm @@ -1128,7 +1128,7 @@ created in an unsafe way, just incrementing the local `currentEpoch` of the local node and hoping there are no conflicts at the same time. Both the events are system-administrator triggered: -1. `CLUSTER FAILOVER` command with `TAKEOVER` option is able to manually promote a replica node into a master *without the majority of masters being available*. This is useful, for example, in multi data center setups. +1. `CLUSTER FAILOVER` command with `TAKEOVER` option is able to manually promote a replica node into a primary *without the majority of primaries being available*. This is useful, for example, in multi data center setups. 2. Migration of slots for cluster rebalancing also generates new configuration epochs inside the local node without agreement for performance reasons. Specifically, during manual resharding, when a hash slot is migrated from @@ -1153,19 +1153,19 @@ they are not propagated fast enough. Moreover, software bugs and filesystem corruptions can also contribute to multiple nodes having the same configuration epoch. -When masters serving different hash slots have the same `configEpoch`, there -are no issues. It is more important that replicas failing over a master have +When primaries serving different hash slots have the same `configEpoch`, there +are no issues. It is more important that replicas failing over a primary have unique configuration epochs. That said, manual interventions or resharding may change the cluster configuration in different ways. The Valkey Cluster main liveness property requires that slot configurations always converge, so under every circumstance -we really want all the master nodes to have a different `configEpoch`. +we really want all the primary nodes to have a different `configEpoch`. In order to enforce this, **a conflict resolution algorithm** is used in the event that two nodes end up with the same `configEpoch`. -* IF a master node detects another master node is advertising itself with +* IF a primary node detects another primary node is advertising itself with the same `configEpoch`. * AND IF the node has a lexicographically smaller Node ID compared to the other node claiming the same `configEpoch`. * THEN it increments its `currentEpoch` by 1, and uses it as the new `configEpoch`. @@ -1197,7 +1197,7 @@ provided, a soft reset is performed. The following is a list of operations performed by a reset: -1. Soft and hard reset: If the node is a replica, it is turned into a master, and its dataset is discarded. If the node is a master and contains keys the reset operation is aborted. +1. Soft and hard reset: If the node is a replica, it is turned into a primary, and its dataset is discarded. If the node is a primary and contains keys the reset operation is aborted. 2. Soft and hard reset: All the slots are released, and the manual failover state is reset. 3. 
Soft and hard reset: All the other nodes in the nodes table are removed, so the node no longer knows any other node. 4. Hard reset only: `currentEpoch`, `configEpoch`, and `lastVoteEpoch` are set to 0. @@ -1208,7 +1208,7 @@ Master nodes with non-empty data sets can't be reset (since normally you want to ### Removing nodes from a cluster It is possible to practically remove a node from an existing cluster by -resharding all its data to other nodes (if it is a master node) and +resharding all its data to other nodes (if it is a primary node) and shutting it down. However, the other nodes will still remember its node ID and address, and will attempt to connect with it. @@ -1236,7 +1236,7 @@ It will simply broadcast each published message to all other nodes. Redis OSS 7.0 and later features sharded pub/sub, in which shard channels are assigned to slots by the same algorithm used to assign keys to slots. A shard message must be sent to a node that owns the slot the shard channel is hashed to. -The cluster makes sure the published shard messages are forwarded to all nodes in the shard, so clients can subscribe to a shard channel by connecting to either the master responsible for the slot, or to any of its replicas. +The cluster makes sure the published shard messages are forwarded to all nodes in the shard, so clients can subscribe to a shard channel by connecting to either the primary responsible for the slot, or to any of its replicas. ## Appendix diff --git a/topics/cluster-tutorial.md b/topics/cluster-tutorial.md index 016adb7f..88c4a7fc 100644 --- a/topics/cluster-tutorial.md +++ b/topics/cluster-tutorial.md @@ -14,7 +14,7 @@ If you plan to run a production Valkey Cluster deployment or want to understand Valkey Cluster provides a way to run a Valkey installation where data is automatically sharded across multiple Valkey nodes. Valkey Cluster also provides some degree of availability during partitions—in practical terms, the ability to continue operations when some nodes fail or are unable to communicate. -However, the cluster will become unavailable in the event of larger failures (for example, when the majority of masters are unavailable). +However, the cluster will become unavailable in the event of larger failures (for example, when the majority of primaries are unavailable). So, with Valkey Cluster, you get the ability to: @@ -83,11 +83,11 @@ Hash tags are documented in the Valkey Cluster specification, but the gist is that if there is a substring between {} brackets in a key, only what is inside the string is hashed. For example, the keys `user:{123}:profile` and `user:{123}:account` are guaranteed to be in the same hash slot because they share the same hash tag. As a result, you can operate on these two keys in the same multi-key operation. -#### Valkey Cluster master-replica model +#### Valkey Cluster primary-replica model -To remain available when a subset of master nodes are failing or are +To remain available when a subset of primary nodes are failing or are not able to communicate with the majority of nodes, Valkey Cluster uses a -master-replica model where every hash slot has from 1 (the master itself) to N +primary-replica model where every hash slot has from 1 (the primary itself) to N replicas (N-1 additional replica nodes). In our example cluster with nodes A, B, C, if node B fails the cluster is not @@ -95,12 +95,12 @@ able to continue, since we no longer have a way to serve hash slots in the range 5501-11000. 
However, when the cluster is created (or at a later time), we add a replica -node to every master, so that the final cluster is composed of A, B, C -that are master nodes, and A1, B1, C1 that are replica nodes. +node to every primary, so that the final cluster is composed of A, B, C +that are primary nodes, and A1, B1, C1 that are replica nodes. This way, the system can continue if node B fails. Node B1 replicates B, and B fails, the cluster will promote node B1 as the new -master and will continue to operate correctly. +primary and will continue to operate correctly. However, note that if nodes B and B1 fail at the same time, Valkey Cluster will not be able to continue to operate. @@ -114,15 +114,15 @@ The first reason why Valkey Cluster can lose writes is because it uses asynchronous replication. This means that during writes the following happens: -* Your client writes to the master B. -* The master B replies OK to your client. -* The master B propagates the write to its replicas B1, B2 and B3. +* Your client writes to the primary B. +* The primary B replies OK to your client. +* The primary B propagates the write to its replicas B1, B2 and B3. As you can see, B does not wait for an acknowledgement from B1, B2, B3 before replying to the client, since this would be a prohibitive latency penalty for Valkey, so if your client writes something, B acknowledges the write, but crashes before being able to send the write to its replicas, one of the -replicas (that did not receive the write) can be promoted to master, losing +replicas (that did not receive the write) can be promoted to primary, losing the write forever. This is very similar to what happens with most databases that are @@ -141,37 +141,37 @@ implemented via the `WAIT` command. This makes losing writes a lot less likely. However, note that Valkey Cluster does not implement strong consistency even when synchronous replication is used: it is always possible, under more complex failure scenarios, that a replica that was not able to receive the write -will be elected as master. +will be elected as primary. There is another notable scenario where Valkey Cluster will lose writes, that happens during a network partition where a client is isolated with a minority -of instances including at least a master. +of instances including at least a primary. Take as an example our 6 nodes cluster composed of A, B, C, A1, B1, C1, -with 3 masters and 3 replicas. There is also a client, that we will call Z1. +with 3 primaries and 3 replicas. There is also a client, that we will call Z1. After a partition occurs, it is possible that in one side of the partition we have A, C, A1, B1, C1, and in the other side we have B and Z1. Z1 is still able to write to B, which will accept its writes. If the partition heals in a very short time, the cluster will continue normally. -However, if the partition lasts enough time for B1 to be promoted to master +However, if the partition lasts enough time for B1 to be promoted to primary on the majority side of the partition, the writes that Z1 has sent to B in the meantime will be lost. **Note:** There is a **maximum window** to the amount of writes Z1 will be able to send to B: if enough time has elapsed for the majority side of the -partition to elect a replica as master, every master node in the minority +partition to elect a replica as primary, every primary node in the minority side will have stopped accepting writes. 
This amount of time is a very important configuration directive of Valkey Cluster, and is called the **node timeout**. -After node timeout has elapsed, a master node is considered to be failing, +After node timeout has elapsed, a primary node is considered to be failing, and can be replaced by one of its replicas. -Similarly, after node timeout has elapsed without a master node to be able -to sense the majority of the other master nodes, it enters an error state +Similarly, after node timeout has elapsed without a primary node to be able +to sense the majority of the other primary nodes, it enters an error state and stops accepting writes. ## Valkey Cluster configuration parameters @@ -182,11 +182,11 @@ in the `valkey.conf` file. * **cluster-enabled ``**: If yes, enables Valkey Cluster support in a specific Valkey instance. Otherwise the instance starts as a standalone instance as usual. * **cluster-config-file ``**: Note that despite the name of this option, this is not a user editable configuration file, but the file where a Valkey Cluster node automatically persists the cluster configuration (the state, basically) every time there is a change, in order to be able to re-read it at startup. The file lists things like the other nodes in the cluster, their state, persistent variables, and so forth. Often this file is rewritten and flushed on disk as a result of some message reception. -* **cluster-node-timeout ``**: The maximum amount of time a Valkey Cluster node can be unavailable, without it being considered as failing. If a master node is not reachable for more than the specified amount of time, it will be failed over by its replicas. This parameter controls other important things in Valkey Cluster. Notably, every node that can't reach the majority of master nodes for the specified amount of time, will stop accepting queries. -* **cluster-slave-validity-factor ``**: If set to zero, a replica will always consider itself valid, and will therefore always try to failover a master, regardless of the amount of time the link between the master and the replica remained disconnected. If the value is positive, a maximum disconnection time is calculated as the *node timeout* value multiplied by the factor provided with this option, and if the node is a replica, it will not try to start a failover if the master link was disconnected for more than the specified amount of time. For example, if the node timeout is set to 5 seconds and the validity factor is set to 10, a replica disconnected from the master for more than 50 seconds will not try to failover its master. Note that any value different than zero may result in Valkey Cluster being unavailable after a master failure if there is no replica that is able to failover it. In that case the cluster will return to being available only when the original master rejoins the cluster. -* **cluster-migration-barrier ``**: Minimum number of replicas a master will remain connected with, for another replica to migrate to a master which is no longer covered by any replica. See the appropriate section about replica migration in this tutorial for more information. +* **cluster-node-timeout ``**: The maximum amount of time a Valkey Cluster node can be unavailable, without it being considered as failing. If a primary node is not reachable for more than the specified amount of time, it will be failed over by its replicas. This parameter controls other important things in Valkey Cluster. 
Notably, every node that can't reach the majority of primary nodes for the specified amount of time, will stop accepting queries. +* **cluster-replica-validity-factor ``**: If set to zero, a replica will always consider itself valid, and will therefore always try to failover a primary, regardless of the amount of time the link between the primary and the replica remained disconnected. If the value is positive, a maximum disconnection time is calculated as the *node timeout* value multiplied by the factor provided with this option, and if the node is a replica, it will not try to start a failover if the primary link was disconnected for more than the specified amount of time. For example, if the node timeout is set to 5 seconds and the validity factor is set to 10, a replica disconnected from the primary for more than 50 seconds will not try to failover its primary. Note that any value different than zero may result in Valkey Cluster being unavailable after a primary failure if there is no replica that is able to failover it. In that case the cluster will return to being available only when the original primary rejoins the cluster. +* **cluster-migration-barrier ``**: Minimum number of replicas a primary will remain connected with, for another replica to migrate to a primary which is no longer covered by any replica. See the appropriate section about replica migration in this tutorial for more information. * **cluster-require-full-coverage ``**: If this is set to yes, as it is by default, the cluster stops accepting writes if some percentage of the key space is not covered by any node. If the option is set to no, the cluster will still serve queries even if only requests about a subset of keys can be processed. -* **cluster-allow-reads-when-down ``**: If this is set to no, as it is by default, a node in a Valkey Cluster will stop serving all traffic when the cluster is marked as failed, either when a node can't reach a quorum of masters or when full coverage is not met. This prevents reading potentially inconsistent data from a node that is unaware of changes in the cluster. This option can be set to yes to allow reads from a node during the fail state, which is useful for applications that want to prioritize read availability but still want to prevent inconsistent writes. It can also be used for when using Valkey Cluster with only one or two shards, as it allows the nodes to continue serving writes when a master fails but automatic failover is impossible. +* **cluster-allow-reads-when-down ``**: If this is set to no, as it is by default, a node in a Valkey Cluster will stop serving all traffic when the cluster is marked as failed, either when a node can't reach a quorum of primaries or when full coverage is not met. This prevents reading potentially inconsistent data from a node that is unaware of changes in the cluster. This option can be set to yes to allow reads from a node during the fail state, which is useful for applications that want to prioritize read availability but still want to prevent inconsistent writes. It can also be used for when using Valkey Cluster with only one or two shards, as it allows the nodes to continue serving writes when a primary fails but automatic failover is impossible. ## Create and use a Valkey Cluster @@ -228,8 +228,8 @@ This file is never touched by humans; it is simply generated at startup by the Valkey Cluster instances, and updated every time it is needed. 
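As a rough illustration of the directives listed above, the cluster-related block of a node's `valkey.conf` might look like the following. This is only a sketch with placeholder values (the file name and the numbers are assumptions, not recommendations); check the example `valkey.conf` shipped with your Valkey version for the authoritative defaults.

```
# Illustrative values only; tune them for your deployment.
cluster-enabled yes
cluster-config-file nodes-7000.conf   # auto-managed by the node, never edit by hand
cluster-node-timeout 5000             # in milliseconds
cluster-replica-validity-factor 10
cluster-migration-barrier 1
cluster-require-full-coverage yes
cluster-allow-reads-when-down no
```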
Note that the **minimal cluster** that works as expected must contain -at least three master nodes. For deployment, we strongly recommend -a six-node cluster, with three masters and three replicas. +at least three primary nodes. For deployment, we strongly recommend +a six-node cluster, with three primaries and three replicas. You can test this locally by creating the following directories named after the port number of the instance you'll run inside any given directory. @@ -278,7 +278,7 @@ To create the cluster, run: --cluster-replicas 1 The command used here is **create**, since we want to create a new cluster. -The option `--cluster-replicas 1` means that we want a replica for every master created. +The option `--cluster-replicas 1` means that we want a replica for every primary created. The other arguments are the list of addresses of the instances I want to use to create the new cluster. @@ -289,7 +289,7 @@ bootstrapped into talking with each other. Finally, if everything has gone well, [OK] All 16384 slots covered -This means that there is at least one master instance serving each of the +This means that there is at least one primary instance serving each of the 16384 available slots. If you don't want to create a Valkey Cluster by configuring and executing @@ -299,7 +299,7 @@ system (but you'll not learn the same amount of operational details). Find the `utils/create-cluster` directory in the Valkey distribution. There is a script called `create-cluster` inside (same name as the directory it is contained into), it's a simple bash script. In order to start -a 6 nodes cluster with 3 masters and 3 replicas just type the following +a 6 nodes cluster with 3 primaries and 3 replicas just type the following commands: 1. `create-cluster start` @@ -506,7 +506,7 @@ call. Then valkey-cli needs to know what is the target of the resharding, that is, the node that will receive the hash slots. -I'll use the first master node, that is, 127.0.0.1:7000, but I need +I'll use the first primary node, that is, 127.0.0.1:7000, but I need to specify the Node ID of the instance. This was already printed in a list by valkey-cli, but I can always find the ID of a node with the following command if I need: @@ -520,7 +520,7 @@ Ok so my target node is 97a3a64667477371c4479320d683e4c8db5858b1. Now you'll get asked from what nodes you want to take those keys. I'll just type `all` in order to take a bit of hash slots from all the -other master nodes. +other primary nodes. After the final confirmation you'll see a message for every slot that valkey-cli is going to move from a node to another, and a dot will be printed @@ -535,7 +535,7 @@ the following command: valkey-cli --cluster check 127.0.0.1:7000 -All the slots will be covered as usual, but this time the master at +All the slots will be covered as usual, but this time the primary at 127.0.0.1:7000 will have more hash slots, something around 6461. Resharding can be performed automatically without the need to manually @@ -623,13 +623,13 @@ to test the Valkey Cluster failover. To trigger the failover, the simplest thing we can do (that is also the semantically simplest failure that can occur in a distributed system) -is to crash a single process, in our case a single master. +is to crash a single process, in our case a single primary. **Note:** During this test, you should take a tab open with the consistency test application running. 
-We can identify a master and crash it with the following command: +We can identify a primary and crash it with the following command: ``` $ valkey-cli -p 7000 cluster nodes | grep master @@ -638,7 +638,7 @@ $ valkey-cli -p 7000 cluster nodes | grep master 97a3a64667477371c4479320d683e4c8db5858b1 :0 myself,master - 0 0 0 connected 0-5959 10922-11422 ``` -Ok, so 7000, 7001, and 7002 are masters. Let's crash node 7002 with the +Ok, so 7000, 7001, and 7002 are primaries. Let's crash node 7002 with the **DEBUG SEGFAULT** command: ``` @@ -685,8 +685,8 @@ a211e242fc6b22a9427fed61285e85892fa04e08 127.0.0.1:7003 slave 97a3a64667477371c4 2938205e12de373867bf38f1ca29d31d0ddb3e46 127.0.0.1:7002 slave 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e 0 1385503418016 3 connected ``` -Now the masters are running on ports 7000, 7001 and 7005. What was previously -a master, that is the Valkey instance running on port 7002, is now a replica of +Now the primaries are running on ports 7000, 7001 and 7005. What was previously +a primary, that is the Valkey instance running on port 7002, is now a replica of 7005. The output of the `CLUSTER NODES` command may look intimidating, but it is actually pretty simple, and is composed of the following tokens: @@ -704,48 +704,48 @@ The output of the `CLUSTER NODES` command may look intimidating, but it is actua #### Manual failover Sometimes it is useful to force a failover without actually causing any problem -on a master. For example, to upgrade the Valkey process of one of the -master nodes it is a good idea to failover it to turn it into a replica +on a primary. For example, to upgrade the Valkey process of one of the +primary nodes it is a good idea to failover it to turn it into a replica with minimal impact on availability. Manual failovers are supported by Valkey Cluster using the `CLUSTER FAILOVER` -command, that must be executed in one of the replicas of the master you want +command, that must be executed in one of the replicas of the primary you want to failover. Manual failovers are special and are safer compared to failovers resulting from -actual master failures. They occur in a way that avoids data loss in the -process, by switching clients from the original master to the new master only -when the system is sure that the new master processed all the replication stream +actual primary failures. They occur in a way that avoids data loss in the +process, by switching clients from the original primary to the new primary only +when the system is sure that the new primary processed all the replication stream from the old one. This is what you see in the replica log when you perform a manual failover: # Manual failover user request accepted. - # Received replication offset for paused master manual failover: 347540 - # All master replication stream processed, manual failover can start. + # Received replication offset for paused primary manual failover: 347540 + # All primary replication stream processed, manual failover can start. # Start of election delayed for 0 milliseconds (rank #0, offset 347540). # Starting a failover election for epoch 7545. - # Failover election won: I'm the new master. + # Failover election won: I'm the new primary. -Basically clients connected to the master we are failing over are stopped. -At the same time the master sends its replication offset to the replica, that +Basically clients connected to the primary we are failing over are stopped. 
+At the same time the primary sends its replication offset to the replica, that waits to reach the offset on its side. When the replication offset is reached, -the failover starts, and the old master is informed about the configuration -switch. When the clients are unblocked on the old master, they are redirected -to the new master. +the failover starts, and the old primary is informed about the configuration +switch. When the clients are unblocked on the old primary, they are redirected +to the new primary. **Note:** -To promote a replica to master, it must first be known as a replica by a majority of the masters in the cluster. +To promote a replica to primary, it must first be known as a replica by a majority of the primaries in the cluster. Otherwise, it cannot win the failover election. - If the replica has just been added to the cluster (see [Add a new node as a replica](#add-a-new-node-as-a-replica)), you may need to wait a while before sending the `CLUSTER FAILOVER` command, to make sure the masters in cluster are aware of the new replica. + If the replica has just been added to the cluster (see [Add a new node as a replica](#add-a-new-node-as-a-replica)), you may need to wait a while before sending the `CLUSTER FAILOVER` command, to make sure the primaries in cluster are aware of the new replica. #### Add a new node Adding a new node is basically the process of adding an empty node and then -moving some data into it, in case it is a new master, or telling it to +moving some data into it, in case it is a new primary, or telling it to setup as a replica of a known node, in case it is a replica. -We'll show both, starting with the addition of a new master instance. +We'll show both, starting with the addition of a new primary instance. In both cases the first step to perform is **adding an empty node**. @@ -792,10 +792,10 @@ a211e242fc6b22a9427fed61285e85892fa04e08 127.0.0.1:7003 slave 97a3a64667477371c4 Note that since this node is already connected to the cluster it is already able to redirect client queries correctly and is generally speaking part of -the cluster. However it has two peculiarities compared to the other masters: +the cluster. However it has two peculiarities compared to the other primaries: * It holds no data as it has no assigned hash slots. -* Because it is a master without assigned slots, it does not participate in the election process when a replica wants to become a master. +* Because it is a primary without assigned slots, it does not participate in the election process when a replica wants to become a primary. Now it is possible to assign hash slots to this node using the resharding feature of `valkey-cli`. @@ -806,31 +806,31 @@ having as a target the empty node. ##### Add a new node as a replica Adding a new replica can be performed in two ways. The obvious one is to -use valkey-cli again, but with the --cluster-slave option, like this: +use valkey-cli again, but with the --cluster-replica option, like this: - valkey-cli --cluster add-node 127.0.0.1:7006 127.0.0.1:7000 --cluster-slave + valkey-cli --cluster add-node 127.0.0.1:7006 127.0.0.1:7000 --cluster-replica Note that the command line here is exactly like the one we used to add -a new master, so we are not specifying to which master we want to add +a new primary, so we are not specifying to which primary we want to add the replica. In this case, what happens is that valkey-cli will add the new -node as replica of a random master among the masters with fewer replicas. 
+node as replica of a random primary among the primaries with fewer replicas. -However you can specify exactly what master you want to target with your +However you can specify exactly what primary you want to target with your new replica with the following command line: - valkey-cli --cluster add-node 127.0.0.1:7006 127.0.0.1:7000 --cluster-slave --cluster-master-id 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e + valkey-cli --cluster add-node 127.0.0.1:7006 127.0.0.1:7000 --cluster-replica --cluster-master-id 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e -This way we assign the new replica to a specific master. +This way we assign the new replica to a specific primary. -A more manual way to add a replica to a specific master is to add the new -node as an empty master, and then turn it into a replica using the +A more manual way to add a replica to a specific primary is to add the new +node as an empty primary, and then turn it into a replica using the `CLUSTER REPLICATE` command. This also works if the node was added as a replica -but you want to move it as a replica of a different master. +but you want to move it as a replica of a different primary. For example in order to add a replica for the node 127.0.0.1:7005 that is currently serving hash slots in the range 11423-16383, that has a Node ID 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e, all I need to do is to connect -with the new node (already added as empty master) and send the command: +with the new node (already added as empty primary) and send the command: valkey 127.0.0.1:7006> cluster replicate 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e @@ -840,8 +840,8 @@ update their config). We can verify with the following command: ``` $ valkey-cli -p 7000 cluster nodes | grep slave | grep 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e -f093c80dde814da99c5cf72a7dd01590792b783b 127.0.0.1:7006 slave 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e 0 1385543617702 3 connected -2938205e12de373867bf38f1ca29d31d0ddb3e46 127.0.0.1:7002 slave 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e 0 1385543617198 3 connected +f093c80dde814da99c5cf72a7dd01590792b783b 127.0.0.1:7006 replica 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e 0 1385543617702 3 connected +2938205e12de373867bf38f1ca29d31d0ddb3e46 127.0.0.1:7002 replica 3c3a0c74aae0b56170ccb03a76b60cfe7dc1912e 0 1385543617198 3 connected ``` The node 3c3a0c... now has two replicas, running on ports 7002 (the existing one) and 7006 (the new one). @@ -855,14 +855,14 @@ To remove a replica node just use the `del-node` command of valkey-cli: The first argument is just a random node in the cluster, the second argument is the ID of the node you want to remove. -You can remove a master node in the same way as well, **however in order to -remove a master node it must be empty**. If the master is not empty you need -to reshard data away from it to all the other master nodes before. +You can remove a primary node in the same way as well, **however in order to +remove a primary node it must be empty**. If the primary is not empty you need +to reshard data away from it to all the other primary nodes before. -An alternative to remove a master node is to perform a manual failover of it +An alternative to remove a primary node is to perform a manual failover of it over one of its replicas and remove the node after it turned into a replica of the -new master. Obviously this does not help when you want to reduce the actual -number of masters in your cluster, in that case, a resharding is needed. +new primary. 
Obviously this does not help when you want to reduce the actual +number of primaries in your cluster, in that case, a resharding is needed. There is a special scenario where you want to remove a failed node. You should not use the `del-node` command because it tries to connect to all nodes and you will encounter a "connection refused" error. @@ -875,12 +875,12 @@ This command will execute `CLUSTER FORGET` command on every node. #### Replica migration In Valkey Cluster, you can reconfigure a replica to replicate with a -different master at any time just using this command: +different primary at any time just using this command: CLUSTER REPLICATE However there is a special scenario where you want replicas to move from one -master to another one automatically, without the help of the system administrator. +primary to another one automatically, without the help of the system administrator. The automatic reconfiguration of replicas is called *replicas migration* and is able to improve the reliability of a Valkey Cluster. @@ -888,38 +888,38 @@ able to improve the reliability of a Valkey Cluster. You can read the details of replicas migration in the [Valkey Cluster Specification](cluster-spec.md), here we'll only provide some information about the general idea and what you should do in order to benefit from it. -The reason why you may want to let your cluster replicas to move from one master +The reason why you may want to let your cluster replicas to move from one primary to another under certain condition, is that usually the Valkey Cluster is as -resistant to failures as the number of replicas attached to a given master. +resistant to failures as the number of replicas attached to a given primary. -For example a cluster where every master has a single replica can't continue -operations if the master and its replica fail at the same time, simply because -there is no other instance to have a copy of the hash slots the master was +For example a cluster where every primary has a single replica can't continue +operations if the primary and its replica fail at the same time, simply because +there is no other instance to have a copy of the hash slots the primary was serving. However while net-splits are likely to isolate a number of nodes at the same time, many other kind of failures, like hardware or software failures local to a single node, are a very notable class of failures that are unlikely to happen at the same time, so it is possible that in your cluster where -every master has a replica, the replica is killed at 4am, and the master is killed +every primary has a replica, the replica is killed at 4am, and the primary is killed at 6am. This still will result in a cluster that can no longer operate. To improve reliability of the system we have the option to add additional -replicas to every master, but this is expensive. Replica migration allows to -add more replicas to just a few masters. So you have 10 masters with 1 replica +replicas to every primary, but this is expensive. Replica migration allows to +add more replicas to just a few primaries. So you have 10 primaries with 1 replica each, for a total of 20 instances. However you add, for example, 3 instances -more as replicas of some of your masters, so certain masters will have more +more as replicas of some of your primaries, so certain primaries will have more than a single replica. 
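If you want to see how replicas are currently spread across your primaries, and then add a spare one for migration to work with, something along these lines can be used (the addresses and the extra port 7010 are only an example, and the exact `--cluster` options available depend on your valkey-cli version):

```
# Inspect the current layout: how many replicas each primary has.
valkey-cli --cluster check 127.0.0.1:7000

# Attach one extra replica to any primary; replica migration can later move it
# to whichever primary is left without working replicas.
valkey-cli --cluster add-node 127.0.0.1:7010 127.0.0.1:7000 --cluster-replica
```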
-With replicas migration what happens is that if a master is left without -replicas, a replica from a master that has multiple replicas will migrate to -the *orphaned* master. So after your replica goes down at 4am as in the example -we made above, another replica will take its place, and when the master +With replicas migration what happens is that if a primary is left without +replicas, a replica from a primary that has multiple replicas will migrate to +the *orphaned* primary. So after your replica goes down at 4am as in the example +we made above, another replica will take its place, and when the primary will fail as well at 5am, there is still a replica that can be elected so that the cluster can continue to operate. So what you should know about replicas migration in short? -* The cluster will try to migrate a replica from the master that has the greatest number of replicas in a given moment. -* To benefit from replica migration you have just to add a few more replicas to a single master in your cluster, it does not matter what master. +* The cluster will try to migrate a replica from the primary that has the greatest number of replicas in a given moment. +* To benefit from replica migration you have just to add a few more replicas to a single primary in your cluster, it does not matter what primary. * There is a configuration parameter that controls the replica migration feature that is called `cluster-migration-barrier`: you can read more about it in the example `valkey.conf` file provided with Valkey Cluster. #### Upgrade nodes in a Valkey Cluster @@ -929,20 +929,20 @@ it with an updated version of Valkey. If there are clients scaling reads using replica nodes, they should be able to reconnect to a different replica if a given one is not available. -Upgrading masters is a bit more complex, and the suggested procedure is: +Upgrading primaries is a bit more complex, and the suggested procedure is: -1. Use `CLUSTER FAILOVER` to trigger a manual failover of the master to one of its replicas. +1. Use `CLUSTER FAILOVER` to trigger a manual failover of the primary to one of its replicas. (See the [Manual failover](#manual-failover) in this topic.) -2. Wait for the master to turn into a replica. +2. Wait for the primary to turn into a replica. 3. Finally upgrade the node as you do for replicas. -4. If you want the master to be the node you just upgraded, trigger a new manual failover in order to turn back the upgraded node into a master. +4. If you want the primary to be the node you just upgraded, trigger a new manual failover in order to turn back the upgraded node into a primary. Following this procedure you should upgrade one node after the other until all the nodes are upgraded. #### Migrate to Valkey Cluster -Users willing to migrate to Valkey Cluster may have just a single master, or +Users willing to migrate to Valkey Cluster may have just a single primary, or may already using a preexisting sharding setup, where keys are split among N nodes, using some in-house algorithm or a sharding algorithm implemented by their client library or Valkey proxy. @@ -962,14 +962,14 @@ the context of the same hash tag. Case 1 and 2 are covered, so we'll focus on those two cases, that are handled in the same way, so no distinction will be made in the documentation. 
-Assuming you have your preexisting data set split into N masters, where +Assuming you have your preexisting data set split into N primaries, where N=1 if you have no preexisting sharding, the following steps are needed in order to migrate your data set to Valkey Cluster: 1. Stop your clients. No automatic live-migration to Valkey Cluster is currently possible. You may be able to do it orchestrating a live migration in the context of your application / environment. -2. Generate an append only file for all of your N masters using the `BGREWRITEAOF` command, and waiting for the AOF file to be completely generated. +2. Generate an append only file for all of your N primaries using the `BGREWRITEAOF` command, and waiting for the AOF file to be completely generated. 3. Save your AOF files from aof-1 to aof-N somewhere. At this point you can stop your old instances if you wish (this is useful since in non-virtualized deployments you often need to reuse the same computers). -4. Create a Valkey Cluster composed of N masters and zero replicas. You'll add replicas later. Make sure all your nodes are using the append only file for persistence. +4. Create a Valkey Cluster composed of N primaries and zero replicas. You'll add replicas later. Make sure all your nodes are using the append only file for persistence. 5. Stop all the cluster nodes, substitute their append only file with your pre-existing append only files, aof-1 for the first node, aof-2 for the second node, up to aof-N. 6. Restart your Valkey Cluster nodes with the new AOF files. They'll complain that there are keys that should not be there according to their configuration. 7. Use `valkey-cli --cluster fix` command in order to fix the cluster so that keys will be migrated according to the hash slots each node is authoritative or not. @@ -983,7 +983,7 @@ The command moves all the keys of a running instance (deleting the keys from the source instance) to the specified pre-existing Valkey Cluster. **Note:** -If not for backward compatibility, the Valkey project no longer uses the word slave. Unfortunately in this command the word slave is part of the protocol, so we'll be able to remove such occurrences only when this API will be naturally deprecated. +If not for backward compatibility, the Valkey project no longer uses the words "master" and "slave". Unfortunately in this command these words are part of the protocol, so we'll be able to remove such occurrences only when this API will be naturally deprecated. ## Learn more diff --git a/topics/command-tips.md b/topics/command-tips.md index 65ad4c06..c632a3bc 100644 --- a/topics/command-tips.md +++ b/topics/command-tips.md @@ -39,11 +39,11 @@ The default behavior a client should implement for commands without the _request In cases where the client should adopt a behavior different than the default, the _request_policy_ tip can be one of: -- **all_nodes:** the client should execute the command on all nodes - masters and replicas alike. +- **all_nodes:** the client should execute the command on all nodes - primaries and replicas alike. An example is the `CONFIG SET` command. This tip is in-use by commands that don't accept key name arguments. The command operates atomically per shard. -* **all_shards:** the client should execute the command on all master shards (e.g., the `DBSIZE` command). +* **all_shards:** the client should execute the command on all primary shards (e.g., the `DBSIZE` command). This tip is in-use by commands that don't accept key name arguments. 
The command operates atomically per shard. - **multi_shard:** the client should execute the command on several shards. diff --git a/topics/distlock.md b/topics/distlock.md index 68bb9be7..47f66bf9 100644 --- a/topics/distlock.md +++ b/topics/distlock.md @@ -60,14 +60,15 @@ To understand what we want to improve, let’s analyze the current state of affa The simplest way to use Valkey to lock a resource is to create a key in an instance. The key is usually created with a limited time to live, using the Valkey expires feature, so that eventually it will get released (property 2 in our list). When the client needs to release the resource, it deletes the key. -Superficially this works well, but there is a problem: this is a single point of failure in our architecture. What happens if the Valkey master goes down? -Well, let’s add a replica! And use it if the master is unavailable. This is unfortunately not viable. By doing so we can’t implement our safety property of mutual exclusion, because Valkey replication is asynchronous. +Superficially this works well, but there is a problem: this is a single point of failure in our architecture. What happens if the Valkey primary goes down? + +Well, let’s add a replica! And use it if the primary is unavailable. This is unfortunately not viable. By doing so we can’t implement our safety property of mutual exclusion, because Valkey replication is asynchronous. There is a race condition with this model: -1. Client A acquires the lock in the master. -2. The master crashes before the write to the key is transmitted to the replica. -3. The replica gets promoted to master. +1. Client A acquires the lock in the primary. +2. The primary crashes before the write to the key is transmitted to the replica. +3. The replica gets promoted to primary. 4. Client B acquires the lock to the same resource A already holds a lock for. **SAFETY VIOLATION!** Sometimes it is perfectly fine that, under special circumstances, for example during a failure, multiple clients can hold the lock at the same time. @@ -105,7 +106,7 @@ So now we have a good way to acquire and release the lock. With this system, rea ## The Redlock Algorithm -In the distributed version of the algorithm we assume we have N Valkey masters. Those nodes are totally independent, so we don’t use replication or any other implicit coordination system. We already described how to acquire and release the lock safely in a single instance. We take for granted that the algorithm will use this method to acquire and release the lock in a single instance. In our examples we set N=5, which is a reasonable value, so we need to run 5 Valkey masters on different computers or virtual machines in order to ensure that they’ll fail in a mostly independent way. +In the distributed version of the algorithm we assume we have N Valkey primaries. Those nodes are totally independent, so we don’t use replication or any other implicit coordination system. We already described how to acquire and release the lock safely in a single instance. We take for granted that the algorithm will use this method to acquire and release the lock in a single instance. In our examples we set N=5, which is a reasonable value, so we need to run 5 Valkey primaries on different computers or virtual machines in order to ensure that they’ll fail in a mostly independent way. 
In order to acquire the lock, the client performs the following operations: diff --git a/topics/encryption.md b/topics/encryption.md index d0886b0f..1f7b770c 100644 --- a/topics/encryption.md +++ b/topics/encryption.md @@ -87,12 +87,12 @@ You may use `tls-auth-clients no` to disable client authentication. ### Replication -A Valkey master server handles connecting clients and replica servers in the same +A Valkey primary server handles connecting clients and replica servers in the same way, so the above `tls-port` and `tls-auth-clients` directives apply to replication links as well. On the replica server side, it is necessary to specify `tls-replication yes` to -use TLS for outgoing connections to the master. +use TLS for outgoing connections to the primary. ### Cluster @@ -104,7 +104,7 @@ cluster bus and cross-node connections. Sentinel inherits its networking configuration from the common Valkey configuration, so all of the above applies to Sentinel as well. -When connecting to master servers, Sentinel will use the `tls-replication` +When connecting to primary servers, Sentinel will use the `tls-replication` directive to determine if a TLS or non-TLS connection is required. In addition, the very same `tls-replication` directive will determine whether Sentinel's diff --git a/topics/eval-intro.md b/topics/eval-intro.md index 659ba87f..b387406f 100644 --- a/topics/eval-intro.md +++ b/topics/eval-intro.md @@ -160,7 +160,7 @@ Here's an example of loading and then executing a cached script: The Valkey script cache is **always volatile**. It isn't considered as a part of the database and is **not persisted**. -The cache may be cleared when the server restarts, during fail-over when a replica assumes the master role, or explicitly by `SCRIPT FLUSH`. +The cache may be cleared when the server restarts, during fail-over when a replica assumes the primary role, or explicitly by `SCRIPT FLUSH`. That means that cached scripts are ephemeral, and the cache's contents can be lost at any time. Applications that use scripts should always call `EVALSHA` to execute them. @@ -229,17 +229,17 @@ These are: ## Script replication -In standalone deployments, a single Valkey instance called _master_ manages the entire database. -A [clustered deployment](cluster-tutorial.md) has at least three masters managing the sharded database. -Valkey uses [replication](replication.md) to maintain one or more replicas, or exact copies, for any given master. +In standalone deployments, a single Valkey instance called _primary_ manages the entire database. +A [clustered deployment](cluster-tutorial.md) has at least three primaries managing the sharded database. +Valkey uses [replication](replication.md) to maintain one or more replicas, or exact copies, for any given primary. Because scripts can modify the data, Valkey ensures all write operations performed by a script are also sent to replicas to maintain consistency. There are two conceptual approaches when it comes to script replication: -1. Verbatim replication: the master sends the script's source code to the replicas. +1. Verbatim replication: the primary sends the script's source code to the replicas. Replicas then execute the script and apply the write effects. This mode can save on replication bandwidth in cases where short scripts generate many commands (for example, a _for_ loop). - However, this replication mode means that replicas redo the same work done by the master, which is wasteful. 
+ However, this replication mode means that replicas redo the same work done by the primary, which is wasteful. More importantly, it also requires [all write scripts to be deterministic](#scripts-with-deterministic-writes). 1. Effects replication: only the script's data-modifying commands are replicated. Replicas then run the commands without executing any scripts. @@ -295,7 +295,7 @@ Scripts executed in a Valkey instance are, by default until version 5.0, propaga Since the script will be re-run on the remote host (or when reloading the AOF file), its changes to the database must be reproducible. The reason for sending the script is that it is often much faster than sending the multiple commands that the script generates. -If the client is sending many scripts to the master, converting the scripts into individual commands for the replica / AOF would result in too much bandwidth for the replication link or the Append Only File (and also too much CPU since dispatching a command received via the network is a lot more work for Valkey compared to dispatching a command invoked by Lua scripts). +If the client is sending many scripts to the primary, converting the scripts into individual commands for the replica / AOF would result in too much bandwidth for the replication link or the Append Only File (and also too much CPU since dispatching a command received via the network is a lot more work for Valkey compared to dispatching a command invoked by Lua scripts). Normally replicating scripts instead of the effects of the scripts makes sense, however not in all the cases. So starting with Redis OSS 3.2, the scripting engine is able to, alternatively, replicate the sequence of write commands resulting from the script execution, instead of replication the script itself. diff --git a/topics/faq.md b/topics/faq.md index 458f5e15..6cba300c 100644 --- a/topics/faq.md +++ b/topics/faq.md @@ -126,7 +126,7 @@ Every hash, list, set, and sorted set, can hold 2^32 elements. In other words your limit is likely the available memory in your system. -## Why does my replica have a different number of keys its master instance? +## Why does my replica have a different number of keys than its primary instance? If you use keys with limited time to live (Valkey expires) this is normal behavior. This is what happens: diff --git a/topics/functions-intro.md b/topics/functions-intro.md index 55e3bc64..5173695a 100644 --- a/topics/functions-intro.md +++ b/topics/functions-intro.md @@ -65,7 +65,7 @@ This allows calling functions from other functions within the same library, or s Functions are intended to better support the use case of maintaining a consistent view for data entities through a logical schema, as mentioned above. As such, functions are stored alongside the data itself. -Functions are also persisted to the AOF file and replicated from master to replicas, so they are as durable as the data itself. +Functions are also persisted to the AOF file and replicated from primary to replicas, so they are as durable as the data itself. When Valkey is used as an ephemeral cache, additional mechanisms (described below) are required to make functions more durable. Like all other operations in Valkey, the execution of a function is atomic. @@ -375,7 +375,7 @@ And your Valkey log file should have lines in it that are similar to: As noted above, Valkey automatically handles propagation of loaded functions to replicas. In a Valkey Cluster, it is also necessary to load functions to all cluster nodes. 
This is not handled automatically by Valkey Cluster, and needs to be handled by the cluster administrator (like module loading, configuration setting, etc.). -As one of the goals of functions is to live separately from the client application, this should not be part of the Valkey client library responsibilities. Instead, `valkey-cli --cluster-only-masters --cluster call host:port FUNCTION LOAD ...` can be used to execute the load command on all master nodes. +As one of the goals of functions is to live separately from the client application, this should not be part of the Valkey client library responsibilities. Instead, `valkey-cli --cluster-only-masters --cluster call host:port FUNCTION LOAD ...` can be used to execute the load command on all primary nodes. Also, note that `valkey-cli --cluster add-node` automatically takes care to propagate the loaded functions from one of the existing nodes to the new node. diff --git a/topics/migration.md b/topics/migration.md index aaff7a43..59c0779c 100644 --- a/topics/migration.md +++ b/topics/migration.md @@ -282,7 +282,7 @@ For this scenario, we assume that you have Redis Cluster consisting of 3 primary ``` $ docker exec -it bash - $ redis-cli --cluster add-node :6379 :6379 --cluster-slave + $ redis-cli --cluster add-node :6379 :6379 --cluster-replica ``` 6. Check the cluster status @@ -326,7 +326,7 @@ For this scenario, we assume that you have Redis Cluster consisting of 3 primary b. Add a new node to a specific primary: ``` - $ valkey-cli --cluster add-node 172.22.0.10:6379 172.22.0.2:6379 --cluster-slave --cluster-master-id + $ valkey-cli --cluster add-node 172.22.0.10:6379 172.22.0.2:6379 --cluster-replica --cluster-master-id ``` 12. Remove Redis nodes: diff --git a/topics/modules-api-ref.md b/topics/modules-api-ref.md index b53d9e08..70edbd4c 100644 --- a/topics/modules-api-ref.md +++ b/topics/modules-api-ref.md @@ -339,7 +339,7 @@ example "write deny-oom". The set of flags are: Starting from Redis OSS 7.0 this flag has been deprecated. Declaring a command as "random" can be done using command tips, see https://valkey.io/topics/command-tips. -* **"allow-stale"**: The command is allowed to run on slaves that don't +* **"allow-stale"**: The command is allowed to run on replicas that don't serve stale data. Don't use if you don't know what this means. * **"no-monitor"**: Don't propagate the command on monitor. Use this if @@ -1899,7 +1899,7 @@ The function always returns `VALKEYMODULE_OK`. **Available since:** 4.0.0 -Replicate the specified command and arguments to slaves and AOF, as effect +Replicate the specified command and arguments to replicas and AOF, as effect of execution of the calling command implementation. The replicated commands are always wrapped into the MULTI/EXEC that @@ -1951,7 +1951,7 @@ a MULTI/EXEC stanza, so it should not be mixed with other replication commands. Basically this form of replication is useful when you want to propagate -the command to the slaves and AOF file exactly as it was called, since +the command to the replicas and AOF file exactly as it was called, since the command can just be re-executed to deterministically re-create the new state starting from the old one. @@ -2168,16 +2168,16 @@ Available flags and their meaning: * `VALKEYMODULE_CTX_FLAGS_LOADING`: Server is loading RDB/AOF - * `VALKEYMODULE_CTX_FLAGS_REPLICA_IS_STALE`: No active link with the master. + * `VALKEYMODULE_CTX_FLAGS_REPLICA_IS_STALE`: No active link with the primary. 
* `VALKEYMODULE_CTX_FLAGS_REPLICA_IS_CONNECTING`: The replica is trying to - connect with the master. + connect with the primary. * `VALKEYMODULE_CTX_FLAGS_REPLICA_IS_TRANSFERRING`: Master -> Replica RDB transfer is in progress. * `VALKEYMODULE_CTX_FLAGS_REPLICA_IS_ONLINE`: The replica has an active link - with its master. This is the + with its primary. This is the contrary of STALE state. * `VALKEYMODULE_CTX_FLAGS_ACTIVE_CHILD`: There is currently some background @@ -2204,16 +2204,16 @@ Available flags and their meaning: Returns true if a client sent the CLIENT PAUSE command to the server or if the Cluster does a manual failover, pausing the clients. -This is needed when we have a master with replicas, and want to write, +This is needed when we have a primary with replicas, and want to write, without adding further data to the replication channel, that the replicas -replication offset, match the one of the master. When this happens, it is -safe to failover the master without data loss. +replication offset, match the one of the primary. When this happens, it is +safe to failover the primary without data loss. However modules may generate traffic by calling [`ValkeyModule_Call()`](#ValkeyModule_Call) with the "!" flag, or by calling [`ValkeyModule_Replicate()`](#ValkeyModule_Replicate), in a context outside commands execution, for instance in timeout callbacks, threads safe contexts, and so forth. When modules will generate too much traffic, it -will be hard for the master and replicas offset to match, because there +will be hard for the primary and replicas offset to match, because there is more data to send in the replication channel. So modules may want to try to avoid very heavy background work that has @@ -5218,7 +5218,7 @@ is returned. The arguments `ip`, `master_id`, `port` and `flags` can be NULL in case we don't need to populate back certain info. If an `ip` and `master_id` (only populated -if the instance is a slave) are specified, they point to buffers holding +if the instance is a replica) are specified, they point to buffers holding at least `VALKEYMODULE_NODE_ID_LEN` bytes. The strings written back as `ip` and `master_id` are not null terminated. @@ -5229,7 +5229,7 @@ The list of flags reported is the following: * `VALKEYMODULE_NODE_REPLICA`: The node is a replica * `VALKEYMODULE_NODE_PFAIL`: We see the node as failing * `VALKEYMODULE_NODE_FAIL`: The cluster agrees the node is failing -* `VALKEYMODULE_NODE_NOFAILOVER`: The slave is configured to never failover +* `VALKEYMODULE_NODE_NOFAILOVER`: The replica is configured to never failover @@ -5250,7 +5250,7 @@ message bus. Flags that can be set: With the following effects: -* `NO_FAILOVER`: prevent Cluster slaves from failing over a dead master. +* `NO_FAILOVER`: prevent Cluster replicas from failing over a dead primary. Also disables the replica migration feature. * `NO_REDIRECTION`: Every node will accept any key, without trying to perform @@ -6458,7 +6458,7 @@ filter applies in all execution paths including: 1. Invocation by a client. 2. Invocation through [`ValkeyModule_Call()`](#ValkeyModule_Call) by any module. 3. Invocation through Lua `redis.call()`. -4. Replication of a command from a master. +4. Replication of a command from a primary. The filter executes in a special filter context, which is different and more limited than a `ValkeyModuleCtx`. 
Because the filter affects any command, it @@ -6911,10 +6911,10 @@ Here is a list of events you can use as 'eid' and related sub events: * `ValkeyModuleEvent_ReplicationRoleChanged`: - This event is called when the instance switches from master + This event is called when the instance switches from primary to replica or the other way around, however the event is also called when the replica remains a replica but starts to - replicate with a different master. + replicate with a different primary. The following sub events are available: @@ -6924,9 +6924,9 @@ Here is a list of events you can use as 'eid' and related sub events: The 'data' field can be casted by the callback to a `ValkeyModuleReplicationInfo` structure with the following fields: - int master; // true if master, false if replica - char *masterhost; // master instance hostname for NOW_REPLICA - int masterport; // master instance port for NOW_REPLICA + int master; // true if primary, false if replica + char *masterhost; // primary instance hostname for NOW_REPLICA + int masterport; // primary instance port for NOW_REPLICA char *replid1; // Main replication ID char *replid2; // Secondary replication ID uint64_t repl1_offset; // Main replication offset @@ -6983,7 +6983,7 @@ Here is a list of events you can use as 'eid' and related sub events: Called on loading operations: at startup when the server is started, but also after a first synchronization when the - replica is loading the RDB file from the master. + replica is loading the RDB file from the primary. The following sub events are available: * `VALKEYMODULE_SUBEVENT_LOADING_RDB_START` @@ -7012,7 +7012,7 @@ Here is a list of events you can use as 'eid' and related sub events: * `ValkeyModuleEvent_ReplicaChange` This event is called when the instance (that can be both a - master or a replica) get a new online replica, or lose a + primary or a replica) get a new online replica, or lose a replica since it gets disconnected. The following sub events are available: @@ -7040,9 +7040,9 @@ Here is a list of events you can use as 'eid' and related sub events: * `ValkeyModuleEvent_PrimaryLinkChange` This is called for replicas in order to notify when the - replication link becomes functional (up) with our master, + replication link becomes functional (up) with our primary, or when it goes down. Note that the link is not considered - up when we just connected to the master, but only if the + up when we just connected to the primary, but only if the replication is happening correctly. The following sub events are available: @@ -7110,7 +7110,7 @@ Here is a list of events you can use as 'eid' and related sub events: * `ValkeyModuleEvent_ReplAsyncLoad` - Called when repl-diskless-load config is set to swapdb and a replication with a master of same + Called when repl-diskless-load config is set to swapdb and a replication with a primary of the same data set history (matching replication ID) occurs. In which case the server serves current data set while loading new database in memory from socket. Modules must have declared they support this mechanism in order to activate it, through diff --git a/topics/protocol.md b/topics/protocol.md index c92611d3..444d2554 100644 --- a/topics/protocol.md +++ b/topics/protocol.md @@ -574,7 +574,7 @@ In Valkey' RESP3 implementation, the following fields are also emitted: * **id**: the connection's identifier (ID). * **mode**: "standalone", "sentinel" or "cluster". -* **role**: "master" or "replica". +* **role**: "primary" or "replica". 
* **modules**: list of loaded modules as an Array of Bulk Strings. ## Sending commands to a Valkey server diff --git a/topics/pubsub.md b/topics/pubsub.md index 79c17567..4b64f22a 100644 --- a/topics/pubsub.md +++ b/topics/pubsub.md @@ -177,7 +177,7 @@ So the client will exit the Pub/Sub state only when this count drops to zero as From Redis OSS 7.0, sharded Pub/Sub is introduced in which shard channels are assigned to slots by the same algorithm used to assign keys to slots. A shard message must be sent to a node that owns the slot the shard channel is hashed to. -The cluster makes sure the published shard messages are forwarded to all nodes in the shard, so clients can subscribe to a shard channel by connecting to either the master responsible for the slot, or to any of its replicas. +The cluster makes sure the published shard messages are forwarded to all nodes in the shard, so clients can subscribe to a shard channel by connecting to either the primary responsible for the slot, or to any of its replicas. `SSUBSCRIBE`, `SUNSUBSCRIBE` and `SPUBLISH` are used to implement sharded Pub/Sub. Sharded Pub/Sub helps to scale the usage of Pub/Sub in cluster mode. diff --git a/topics/replication.md b/topics/replication.md index 7fa01b16..2236b9ad 100644 --- a/topics/replication.md +++ b/topics/replication.md @@ -4,18 +4,18 @@ linkTitle: Replication description: How Valkey supports high availability and failover with replication --- -At the base of Valkey replication (excluding the high availability features provided as an additional layer by Valkey Cluster or Valkey Sentinel) there is a *leader follower* (master-replica) replication that is simple to use and configure. It allows replica Valkey instances to be exact copies of master instances. The replica will automatically reconnect to the master every time the link breaks, and will attempt to be an exact copy of it *regardless* of what happens to the master. +At the base of Valkey replication (excluding the high availability features provided as an additional layer by Valkey Cluster or Valkey Sentinel) there is a *leader follower* (primary-replica) replication that is simple to use and configure. It allows replica Valkey instances to be exact copies of primary instances. The replica will automatically reconnect to the primary every time the link breaks, and will attempt to be an exact copy of it *regardless* of what happens to the primary. This system works using three main mechanisms: -1. When a master and a replica instances are well-connected, the master keeps the replica updated by sending a stream of commands to the replica to replicate the effects on the dataset happening in the master side due to: client writes, keys expired or evicted, any other action changing the master dataset. -2. When the link between the master and the replica breaks, for network issues or because a timeout is sensed in the master or the replica, the replica reconnects and attempts to proceed with a partial resynchronization: it means that it will try to just obtain the part of the stream of commands it missed during the disconnection. -3. When a partial resynchronization is not possible, the replica will ask for a full resynchronization. This will involve a more complex process in which the master needs to create a snapshot of all its data, send it to the replica, and then continue sending the stream of commands as the dataset changes. +1. 
When a primary and a replica instance are well-connected, the primary keeps the replica updated by sending a stream of commands to the replica to replicate the effects on the dataset happening on the primary side due to client writes, keys expired or evicted, or any other action changing the primary dataset.
+2. When the link between the primary and the replica breaks, for network issues or because a timeout is sensed in the primary or the replica, the replica reconnects and attempts to proceed with a partial resynchronization: it means that it will try to just obtain the part of the stream of commands it missed during the disconnection.
+3. When a partial resynchronization is not possible, the replica will ask for a full resynchronization. This will involve a more complex process in which the primary needs to create a snapshot of all its data, send it to the replica, and then continue sending the stream of commands as the dataset changes.

Valkey uses by default asynchronous replication, which being low latency and
high performance, is the natural replication mode for the vast majority of
Valkey use cases. However, Valkey replicas asynchronously acknowledge the amount of data
-they received periodically with the master. So the master does not wait every time
+they received periodically with the primary. So the primary does not wait every time
for a command to be processed by the replicas, however it knows, if needed, what
replica already processed what command. This allows having optional synchronous
replication.

@@ -33,38 +33,38 @@ about high availability and failover. The rest of this document mainly describes

### Important facts about Valkey replication

-* Valkey uses asynchronous replication, with asynchronous replica-to-master acknowledges of the amount of data processed.
-* A master can have multiple replicas.
-* Replicas are able to accept connections from other replicas. Aside from connecting a number of replicas to the same master, replicas can also be connected to other replicas in a cascading-like structure. All the sub-replicas will receive exactly the same replication stream from the master.
-* Valkey replication is non-blocking on the master side. This means that the master will continue to handle queries when one or more replicas perform the initial synchronization or a partial resynchronization.
+* Valkey uses asynchronous replication, with asynchronous replica-to-primary acknowledgements of the amount of data processed.
+* A primary can have multiple replicas.
+* Replicas are able to accept connections from other replicas. Aside from connecting a number of replicas to the same primary, replicas can also be connected to other replicas in a cascading-like structure. All the sub-replicas will receive exactly the same replication stream from the primary.
+* Valkey replication is non-blocking on the primary side. This means that the primary will continue to handle queries when one or more replicas perform the initial synchronization or a partial resynchronization.
* Replication is also largely non-blocking on the replica side. While the replica is performing the initial synchronization, it can handle queries using the old version of the dataset, assuming you configured Valkey to do so in valkey.conf. Otherwise, you can configure Valkey replicas to return an error to clients if the replication stream is down. However, after the initial sync, the old dataset must be deleted and the new one must be loaded.
The replica will block incoming connections during this brief window (that can be as long as many seconds for very large datasets). You can configure Valkey so that the deletion of the old data set happens in a different thread, however loading the new initial dataset will still happen in the main thread and block the replica.
* Replication can be used both for scalability, to have multiple replicas for read-only queries (for example, slow O(N) operations can be offloaded to replicas), or simply for improving data safety and high availability.
-* You can use replication to avoid the cost of having the master writing the full dataset to disk: a typical technique involves configuring your master `valkey.conf` to avoid persisting to disk at all, then connect a replica configured to save from time to time, or with AOF enabled. However, this setup must be handled with care, since a restarting master will start with an empty dataset: if the replica tries to sync with it, the replica will be emptied as well.
+* You can use replication to avoid the cost of having the primary write the full dataset to disk: a typical technique involves configuring your primary's `valkey.conf` to avoid persisting to disk at all, then connecting a replica configured to save from time to time, or with AOF enabled. However, this setup must be handled with care, since a restarting primary will start with an empty dataset: if the replica tries to sync with it, the replica will be emptied as well.

-## Safety of replication when master has persistence turned off
+## Safety of replication when primary has persistence turned off

In setups where Valkey replication is used, it is strongly advised to have
-persistence turned on in the master and in the replicas. When this is not possible,
+persistence turned on in the primary and in the replicas. When this is not possible,
for example because of latency concerns due to very slow disks, instances should
be configured to **avoid restarting automatically** after a reboot.

-To better understand why masters with persistence turned off configured to
+To better understand why primaries with persistence turned off and configured to
auto restart are dangerous, check the following failure mode where data
-is wiped from the master and all its replicas:
+is wiped from the primary and all its replicas:

-1. We have a setup with node A acting as master, with persistence turned down, and nodes B and C replicating from node A.
+1. We have a setup with node A acting as primary, with persistence turned off, and nodes B and C replicating from node A.
2. Node A crashes, however it has some auto-restart system, that restarts the process. However since persistence is turned off, the node restarts with an empty data set.
3. Nodes B and C will replicate from node A, which is empty, so they'll effectively destroy their copy of the data.

When Valkey Sentinel is used for high availability, also turning off persistence
-on the master, together with auto restart of the process, is dangerous. For example the master can restart fast enough for Sentinel to not detect a failure, so that the failure mode described above happens.
+on the primary, together with auto restart of the process, is dangerous. For example, the primary can restart fast enough for Sentinel to not detect a failure, so that the failure mode described above happens.

-Every time data safety is important, and replication is used with master configured without persistence, auto restart of instances should be disabled.
+Every time data safety is important, and replication is used with primary configured without persistence, auto restart of instances should be disabled. ## How Valkey replication works -Every Valkey master has a replication ID: it is a large pseudo random string -that marks a given story of the dataset. Each master also takes an offset that +Every Valkey primary has a replication ID: it is a large pseudo random string +that marks a given story of the dataset. Each primary also takes an offset that increments for every byte of replication stream that it is produced to be sent to replicas, to update the state of the replicas with the new changes modifying the dataset. The replication offset is incremented even if no replica @@ -72,27 +72,27 @@ is actually connected, so basically every given pair of: Replication ID, offset -Identifies an exact version of the dataset of a master. +Identifies an exact version of the dataset of a primary. -When replicas connect to masters, they use the `PSYNC` command to send -their old master replication ID and the offsets they processed so far. This way -the master can send just the incremental part needed. However if there is not -enough *backlog* in the master buffers, or if the replica is referring to an +When replicas connect to primaries, they use the `PSYNC` command to send +their old primary replication ID and the offsets they processed so far. This way +the primary can send just the incremental part needed. However if there is not +enough *backlog* in the primary buffers, or if the replica is referring to a history (replication ID) which is no longer known, then a full resynchronization happens: in this case the replica will get a full copy of the dataset, from scratch. This is how a full synchronization works in more details: -The master starts a background saving process to produce an RDB file. At the same time it starts to buffer all new write commands received from the clients. When the background saving is complete, the master transfers the database file to the replica, which saves it on disk, and then loads it into memory. The master will then send all buffered commands to the replica. This is done as a stream of commands and is in the same format of the Valkey protocol itself. +The primary starts a background saving process to produce an RDB file. At the same time it starts to buffer all new write commands received from the clients. When the background saving is complete, the primary transfers the database file to the replica, which saves it on disk, and then loads it into memory. The primary will then send all buffered commands to the replica. This is done as a stream of commands and is in the same format of the Valkey protocol itself. You can try it yourself via telnet. Connect to the Valkey port while the server is doing some work and issue the `SYNC` command. You'll see a bulk -transfer and then every command received by the master will be re-issued +transfer and then every command received by the primary will be re-issued in the telnet session. Actually `SYNC` is an old protocol no longer used by newer Valkey instances, but is still there for backward compatibility: it does not allow partial resynchronizations, so now `PSYNC` is used instead. -As already said, replicas are able to automatically reconnect when the master-replica link goes down for some reason. If the master receives multiple concurrent replica synchronization requests, it performs a single background save in to serve all of them. 
+As already said, replicas are able to automatically reconnect when the primary-replica link goes down for some reason. If the primary receives multiple concurrent replica synchronization requests, it performs a single background save in to serve all of them. ## Replication ID explained @@ -102,9 +102,9 @@ to understand what exactly is the replication ID, and why instances have actuall two replication IDs: the main ID and the secondary ID. A replication ID basically marks a given *history* of the data set. Every time -an instance restarts from scratch as a master, or a replica is promoted to master, +an instance restarts from scratch as a primary, or a replica is promoted to primary, a new replication ID is generated for this instance. The replicas connected to -a master will inherit its replication ID after the handshake. So two instances +a primary will inherit its replication ID after the handshake. So two instances with the same ID are related by the fact that they hold the same data, but potentially at a different time. It is the offset that works as a logical time to understand, for a given history (replication ID), who holds the most updated @@ -116,22 +116,22 @@ commands applied to the data set. It also means that A, by applying just a few commands, may reach exactly the same state of B. The reason why Valkey instances have two replication IDs is because of replicas -that are promoted to masters. After a failover, the promoted replica requires +that are promoted to primaries. After a failover, the promoted replica requires to still remember what was its past replication ID, because such replication ID -was the one of the former master. In this way, when other replicas will sync -with the new master, they will try to perform a partial resynchronization using the -old master replication ID. This will work as expected, because when the replica -is promoted to master it sets its secondary ID to its main ID, remembering what +was the one of the former primary. In this way, when other replicas will sync +with the new primary, they will try to perform a partial resynchronization using the +old primary replication ID. This will work as expected, because when the replica +is promoted to primary, it sets its secondary ID to its main ID, remembering what was the offset when this ID switch happened. Later it will select a new random replication ID, because a new history begins. When handling the new replicas -connecting, the master will match their IDs and offsets both with the current +connecting, the primary will match their IDs and offsets both with the current ID and the secondary ID (up to a given offset, for safety). In short this means -that after a failover, replicas connecting to the newly promoted master don't have +that after a failover, replicas connecting to the newly promoted primary don't have to perform a full sync. -In case you wonder why a replica promoted to master needs to change its -replication ID after a failover: it is possible that the old master is still -working as a master because of some network partition: retaining the same +In case you wonder why a replica promoted to primary needs to change its +replication ID after a failover: it is possible that the old primary is still +working as a primary because of some network partition: retaining the same replication ID would violate the fact that the same ID and same offset of any two random instances mean they have the same data set. @@ -140,7 +140,7 @@ two random instances mean they have the same data set. 
Normally a full resynchronization requires creating an RDB file on disk,
then reloading the same RDB from disk to feed the replicas with the data.

-With slow disks this can be a very stressing operation for the master.
+With slow disks this can be a very stressful operation for the primary.
Valkey has support for diskless replication. In this setup the child process
directly sends the RDB over the wire to replicas, without using the disk as
intermediate storage.

@@ -151,12 +151,12 @@ To configure basic Valkey replication is trivial: just add the following line to

    replicaof 192.168.1.1 6379

-Of course you need to replace 192.168.1.1 6379 with your master IP address (or
+Of course you need to replace 192.168.1.1 6379 with your primary IP address (or
hostname) and port. Alternatively, you can call the `REPLICAOF` command and the
-master host will start a sync with the replica.
+primary host will start a sync with the replica.

There are also a few parameters for tuning the replication backlog taken
-in memory by the master to perform the partial resynchronization. See the example
+in memory by the primary to perform the partial resynchronization. See the example
`valkey.conf` shipped with the Valkey distribution for more information.

Diskless replication can be enabled using the `repl-diskless-sync` configuration
@@ -175,19 +175,19 @@ Read-only replicas will reject all write commands, so that it is not possible to

You may wonder why it is possible to revert the read-only setting
and have replica instances that can be targeted by write operations.
The answer is that writable replicas exist only for historical reasons.
-Using writable replicas can result in inconsistency between the master and the replica, so it is not recommended to use writable replicas.
+Using writable replicas can result in inconsistency between the primary and the replica, so it is not recommended to use writable replicas.
To understand in which situations this can be a problem, we need to understand how replication works.

-Changes on the master is replicated by propagating regular Valkey commands to the replica.
-When a key expires on the master, this is propagated as a DEL command.
-If a key which exists on the master but is deleted, expired or has a different type on the replica compared to the master will react differently to commands like DEL, INCR or RPOP propagated from the master than intended.
+Changes on the primary are replicated by propagating regular Valkey commands to the replica.
+When a key expires on the primary, this is propagated as a DEL command.
+A key which exists on the primary but is deleted, expired, or has a different type on the replica compared to the primary will react differently than intended to commands like DEL, INCR or RPOP propagated from the primary.
The propagated command may fail on the replica or result in a different outcome.

To minimize the risks (if you insist on using writable replicas) we suggest you follow these recommendations:

-* Don't write to keys in a writable replica that are also used on the master.
-  (This can be hard to guarantee if you don't have control over all the clients that write to the master.)
+* Don't write to keys in a writable replica that are also used on the primary.
+  (This can be hard to guarantee if you don't have control over all the clients that write to the primary.)
* Don't configure an instance as a writable replica as an intermediary step when upgrading a set of instances in a running system.
-  In general, don't configure an instance as a writable replica if it can ever be promoted to a master if you want to guarantee data consistency.
+  In general, if you want to guarantee data consistency, don't configure an instance as a writable replica if it can ever be promoted to a primary.

Historically, there were some use cases that were considered legitimate for writable replicas.
As of version 7.0, these use cases are now all obsolete and the same can be achieved by other means.
@@ -202,37 +202,37 @@ For example:

* Using `EVAL` and `EVALSHA` are also not considered read-only commands, because the Lua script may call write commands. Instead, use `EVAL_RO` and `EVALSHA_RO` where the Lua script can only call read-only commands.

-While writes to a replica will be discarded if the replica and the master resync or if the replica is restarted, there is no guarantee that they will sync automatically.
+While writes to a replica will be discarded if the replica and the primary resync or if the replica is restarted, there is no guarantee that they will sync automatically.

Before version 4.0, writable replicas were incapable of expiring keys with a time to live set.
This means that if you use `EXPIRE` or other commands that set a maximum TTL for a key, the key will leak, and while you may no longer see it while accessing it with read commands, you will see it in the count of keys and it will still use memory.
-Valkey is able to evict keys with TTL as masters do, with the exceptions of keys written in DB numbers greater than 63 (but by default Valkey instances only have 16 databases).
-Note though that even in versions greater than 4.0, using `EXPIRE` on a key that could ever exists on the master can cause inconsistency between the replica and the master.
+Valkey is able to evict keys with TTL as primaries do, with the exception of keys written in DB numbers greater than 63 (but by default Valkey instances only have 16 databases).
+Note though that even in versions greater than 4.0, using `EXPIRE` on a key that could ever exist on the primary can cause inconsistency between the replica and the primary.

-Also note that replica writes are only local, and are not propagated to sub-replicas attached to the instance. Sub-replicas instead will always receive the replication stream identical to the one sent by the top-level master to the intermediate replicas. So for example in the following setup:
+Also note that replica writes are only local, and are not propagated to sub-replicas attached to the instance. Sub-replicas instead will always receive the replication stream identical to the one sent by the top-level primary to the intermediate replicas. So for example in the following setup:

    A ---> B ---> C

-Even if `B` is writable, C will not see `B` writes and will instead have identical dataset as the master instance `A`.
+Even if `B` is writable, C will not see `B` writes and will instead have a dataset identical to that of the primary instance `A`.

-## Setting a replica to authenticate to a master
+## Setting a replica to authenticate to a primary

-If your master has a password via `requirepass`, it's trivial to configure the
+If your primary has a password via `requirepass`, it's trivial to configure the
replica to use that password in all sync operations.
To do it on a running instance, use `valkey-cli` and type: - config set masterauth + config set primaryauth To set it permanently, add this to your config file: - masterauth + primaryauth ## Allow writes only with N attached replicas -You can configure a Valkey master to +You can configure a Valkey primary to accept write queries only if at least N replicas are currently connected to the -master. +primary. However, because Valkey uses asynchronous replication it is not possible to ensure the replica actually received a given write, so there is always a window for data @@ -240,15 +240,15 @@ loss. This is how the feature works: -* Valkey replicas ping the master every second, acknowledging the amount of replication stream processed. -* Valkey masters will remember the last time it received a ping from every replica. +* Valkey replicas ping the primary every second, acknowledging the amount of replication stream processed. +* Valkey primaries will remember the last time it received a ping from every replica. * The user can configure a minimum number of replicas that have a lag not greater than a maximum number of seconds. If there are at least N replicas, with a lag less than M seconds, then the write will be accepted. You may think of it as a best effort data safety mechanism, where consistency is not ensured for a given write, but at least the time window for data loss is restricted to a given number of seconds. In general bound data loss is better than unbound one. -If the conditions are not met, the master will instead reply with an error and the write will not be accepted. +If the conditions are not met, the primary will instead reply with an error and the write will not be accepted. There are two configuration parameters for this feature: @@ -265,25 +265,25 @@ on the ability of an instance to count the time, however Valkey replicas correct replicate keys with expires, even when such keys are altered using Lua scripts. -To implement such a feature Valkey cannot rely on the ability of the master and +To implement such a feature Valkey cannot rely on the ability of the primary and replica to have synced clocks, since this is a problem that cannot be solved and would result in race conditions and diverging data sets, so Valkey uses three main techniques to make the replication of expired keys able to work: -1. Replicas don't expire keys, instead they wait for masters to expire the keys. When a master expires a key (or evict it because of LRU), it synthesizes a `DEL` command which is transmitted to all the replicas. -2. However because of master-driven expire, sometimes replicas may still have in memory keys that are already logically expired, since the master was not able to provide the `DEL` command in time. To deal with that the replica uses its logical clock to report that a key does not exist **only for read operations** that don't violate the consistency of the data set (as new commands from the master will arrive). In this way replicas avoid reporting logically expired keys that are still existing. In practical terms, an HTML fragments cache that uses replicas to scale will avoid returning items that are already older than the desired time to live. -3. During Lua scripts executions no key expiries are performed. As a Lua script runs, conceptually the time in the master is frozen, so that a given key will either exist or not for all the time the script runs. 
This prevents keys expiring in the middle of a script, and is needed to send the same script to the replica in a way that is guaranteed to have the same effects in the data set. +1. Replicas don't expire keys, instead they wait for primaries to expire the keys. When a primary expires a key (or evicts it because of LRU), it synthesizes a `DEL` command which is transmitted to all the replicas. +2. However because of primary-driven expire, sometimes replicas may still have in memory keys that are already logically expired, since the primary was not able to provide the `DEL` command in time. To deal with that the replica uses its logical clock to report that a key does not exist **only for read operations** that don't violate the consistency of the data set (as new commands from the primary will arrive). In this way replicas avoid reporting logically expired keys that are still existing. In practical terms, an HTML fragments cache that uses replicas to scale will avoid returning items that are already older than the desired time to live. +3. During Lua scripts executions no key expiries are performed. As a Lua script runs, conceptually the time in the primary is frozen, so that a given key will either exist or not for all the time the script runs. This prevents keys expiring in the middle of a script, and is needed to send the same script to the replica in a way that is guaranteed to have the same effects in the data set. -Once a replica is promoted to a master it will start to expire keys independently, and will not require any help from its old master. +Once a replica is promoted to a primary it will start to expire keys independently, and will not require any help from its old primary. ## Configuring replication in Docker and NAT -When Docker, or other types of containers using port forwarding, or Network Address Translation is used, Valkey replication needs some extra care, especially when using Valkey Sentinel or other systems where the master `INFO` or `ROLE` commands output is scanned to discover replicas' addresses. +When Docker, or other types of containers using port forwarding, or Network Address Translation is used, Valkey replication needs some extra care, especially when using Valkey Sentinel or other systems where the primary `INFO` or `ROLE` commands output is scanned to discover replicas' addresses. The problem is that the `ROLE` command, and the replication section of -the `INFO` output, when issued into a master instance, will show replicas -as having the IP address they use to connect to the master, which, in +the `INFO` output, when issued into a primary instance, will show replicas +as having the IP address they use to connect to the primary, which, in environments using NAT may be different compared to the logical address of the replica instance (the one that clients should use to connect to replicas). @@ -292,7 +292,7 @@ into `valkey.conf`, that may be different from the forwarded port in case the port is remapped. To fix both issues, it is possible to force -a replica to announce an arbitrary pair of IP and port to the master. +a replica to announce an arbitrary pair of IP and port to the primary. The two configurations directives to use are: replica-announce-ip 5.5.5.5 @@ -303,30 +303,30 @@ And are documented in the example `valkey.conf` of recent Valkey distributions. ## The INFO and ROLE command There are two Valkey commands that provide a lot of information on the current -replication parameters of master and replica instances. One is `INFO`. 
If the +replication parameters of primary and replica instances. One is `INFO`. If the command is called with the `replication` argument as `INFO replication` only information relevant to the replication are displayed. Another more computer-friendly command is `ROLE`, that provides the replication status of -masters and replicas together with their replication offsets, list of connected +primaries and replicas together with their replication offsets, list of connected replicas and so forth. ## Partial sync after restarts and failovers -When an instance is promoted to master after a failover, +When an instance is promoted to primary after a failover, it will still be able to perform a partial resynchronization with the replicas -of the old master. To do so, the replica remembers the old replication ID and -offset of its former master, so can provide part of the backlog to the connecting +of the old primary. To do so, the replica remembers the old replication ID and +offset of its former primary, so can provide part of the backlog to the connecting replicas even if they ask for the old replication ID. However the new replication ID of the promoted replica will be different, since it -constitutes a different history of the data set. For example, the master can +constitutes a different history of the data set. For example, the primary can return available and can continue accepting writes for some time, so using the same replication ID in the promoted replica would violate the rule that a replication ID and offset pair identifies only a single data set. Moreover, replicas - when powered off gently and restarted - are able to store in the `RDB` file the information needed to resync with their -master. This is useful in case of upgrades. When this is needed, it is better to +primary. This is useful in case of upgrades. When this is needed, it is better to use the `SHUTDOWN` command in order to perform a `save & quit` operation on the replica. @@ -336,14 +336,14 @@ down it, than can be restarted, and finally AOF can be enabled again. ## `Maxmemory` on replicas -By default, a replica will ignore `maxmemory` (unless it is promoted to master after a failover or manually). -It means that the eviction of keys will be handled by the master, sending the DEL commands to the replica as keys evict in the master side. +By default, a replica will ignore `maxmemory` (unless it is promoted to a primary after a failover or manually). +It means that the eviction of keys will be handled by the primary, sending the DEL commands to the replica as keys evict in the primary side. -This behavior ensures that masters and replicas stay consistent, which is usually what you want. +This behavior ensures that primaries and replicas stay consistent, which is usually what you want. However, if your replica is writable, or you want the replica to have a different memory setting, and you are sure all the writes performed to the replica are idempotent, then you may change this default (but be sure to understand what you are doing). Note that since the replica by default does not evict, it may end up using more memory than what is set via `maxmemory` (since there are certain buffers that may be larger on the replica, or data structures may sometimes take more memory and so forth). -Make sure you monitor your replicas, and make sure they have enough memory to never hit a real out-of-memory condition before the master hits the configured `maxmemory` setting. 
+Make sure you monitor your replicas, and make sure they have enough memory to never hit a real out-of-memory condition before the primary hits the configured `maxmemory` setting. To change this behavior, you can allow a replica to not ignore the `maxmemory`. The configuration directives to use is: diff --git a/topics/sentinel-clients.md b/topics/sentinel-clients.md index be54dee7..25b6a040 100644 --- a/topics/sentinel-clients.md +++ b/topics/sentinel-clients.md @@ -5,10 +5,10 @@ description: How to build clients for Valkey Sentinel --- Valkey Sentinel is a monitoring solution for Valkey instances that handles -automatic failover of Valkey masters and service discovery (who is the current -master for a given group of instances?). Since Sentinel is both responsible +automatic failover of Valkey primaries and service discovery (who is the current +primary for a given group of instances?). Since Sentinel is both responsible for reconfiguring instances during failovers, and providing configurations to -clients connecting to Valkey masters or replicas, clients are required to have +clients connecting to Valkey primaries or replicas, clients are required to have explicit support for Valkey Sentinel. This document is targeted at Valkey clients developers that want to support Sentinel in their clients implementation with the following goals: @@ -20,22 +20,22 @@ For details about how Valkey Sentinel works, please check the [Valkey Documentat ## Valkey service discovery via Sentinel -Valkey Sentinel identifies every master with a name like "stats" or "cache". -Every name actually identifies a *group of instances*, composed of a master +Valkey Sentinel identifies every primary with a name like "stats" or "cache". +Every name actually identifies a *group of instances*, composed of a primary and a variable number of replicas. -The address of the Valkey master that is used for a specific purpose inside a network may change after events like an automatic failover, a manually triggered failover (for instance in order to upgrade a Valkey instance), and other reasons. +The address of the Valkey primary that is used for a specific purpose inside a network may change after events like an automatic failover, a manually triggered failover (for instance in order to upgrade a Valkey instance), and other reasons. -Normally Valkey clients have some kind of hard-coded configuration that specifies the address of a Valkey master instance within a network as IP address and port number. However if the master address changes, manual intervention in every client is needed. +Normally Valkey clients have some kind of hard-coded configuration that specifies the address of a Valkey primary instance within a network as IP address and port number. However if the primary address changes, manual intervention in every client is needed. -A Valkey client supporting Sentinel can automatically discover the address of a Valkey master from the master name using Valkey Sentinel. So instead of a hard coded IP address and port, a client supporting Sentinel should optionally be able to take as input: +A Valkey client supporting Sentinel can automatically discover the address of a Valkey primary from the primary name using Valkey Sentinel. So instead of a hard coded IP address and port, a client supporting Sentinel should optionally be able to take as input: * A list of ip:port pairs pointing to known Sentinel instances. * The name of the service, like "cache" or "timelines". 
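For example (the addresses and the service name below are hypothetical), a client configured with a list of Sentinels could ask any reachable one for the current primary of the "cache" service and get back an ip:port pair:

    valkey-cli -h 10.0.0.1 -p 26379 SENTINEL get-master-addr-by-name cache
    1) "10.0.0.7"
    2) "6379"

The procedure below describes how a client should perform this resolution robustly.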
-This is the procedure a client should follow in order to obtain the master address starting from the list of Sentinels and the service name.
+This is the procedure a client should follow in order to obtain the primary address starting from the list of Sentinels and the service name.

-Step 1: connecting to the first Sentinel
+Step 1: connect to the first Sentinel
---

The client should iterate the list of Sentinel addresses. For every address it should try to connect to the Sentinel, using a short timeout (in the order of a few hundreds of milliseconds). On errors or timeouts the next Sentinel address should be tried.

@@ -44,7 +44,7 @@ If all the Sentinel addresses were tried without success, an error should be ret
The first Sentinel replying to the client request should be put at the start of the list, so that at the next reconnection, we'll try first the Sentinel that was reachable in the previous connection attempt, minimizing latency.

-Step 2: ask for master address
+Step 2: ask for primary address
---

Once a connection with a Sentinel is established, the client should retry to execute the following command on the Sentinel:

@@ -56,43 +56,43 @@ Where *master-name* should be replaced with the actual service name specified by
The result from this call can be one of the following two replies:

* An ip:port pair.
-* A null reply. This means Sentinel does not know this master.
+* A null reply. This means Sentinel does not know this primary.

-If an ip:port pair is received, this address should be used to connect to the Valkey master. Otherwise if a null reply is received, the client should try the next Sentinel in the list.
+If an ip:port pair is received, this address should be used to connect to the Valkey primary. Otherwise if a null reply is received, the client should try the next Sentinel in the list.

Step 3: call the ROLE command in the target instance
---

-Once the client discovered the address of the master instance, it should
-attempt a connection with the master, and call the `ROLE` command in order
-to verify the role of the instance is actually a master.
+Once the client discovered the address of the primary instance, it should
+attempt a connection with the primary, and call the `ROLE` command in order
+to verify that the role of the instance is actually primary.

-If the instance is not a master as expected, the client should wait a short amount of time (a few hundreds of milliseconds) and should try again starting from Step 1.
+If the instance is not a primary as expected, the client should wait a short amount of time (a few hundreds of milliseconds) and should try again starting from Step 1.

Handling reconnections
===

-Once the service name is resolved into the master address and a connection is established with the Valkey master instance, every time a reconnection is needed, the client should resolve again the address using Sentinels restarting from Step 1. For instance Sentinel should contacted again the following cases:
+Once the service name is resolved into the primary address and a connection is established with the Valkey primary instance, every time a reconnection is needed, the client should resolve the address again using Sentinels, restarting from Step 1. For instance, Sentinel should be contacted again in the following cases:

* If the client reconnects after a timeout or socket error.
* If the client reconnects because it was explicitly closed or reconnected by the user.
-In the above cases and any other case where the client lost the connection with the Valkey server, the client should resolve the master address again. +In the above cases and any other case where the client lost the connection with the Valkey server, the client should resolve the primary address again. Sentinel failover disconnection === When Valkey Sentinel changes the configuration of -an instance, for example promoting a replica to a master, demoting a master to -replicate to the new master after a failover, or simply changing the master +an instance, for example promoting a replica to a primary, demoting a primary to +replicate to the new primary after a failover, or simply changing the primary address of a stale replica instance, it sends a `CLIENT KILL type normal` command to the instance in order to make sure all the clients are disconnected -from the reconfigured instance. This will force clients to resolve the master +from the reconfigured instance. This will force clients to resolve the primary address again. If the client will contact a Sentinel with yet not updated information, the verification of the Valkey instance role via the `ROLE` command will fail, allowing the client to detect that the contacted Sentinel provided stale information, and will try again. -Note: it is possible that a stale master returns online at the same time a client contacts a stale Sentinel instance, so the client may connect with a stale master, and yet the ROLE output will match. However when the master is back again Sentinel will try to demote it to replica, triggering a new disconnection. The same reasoning applies to connecting to stale replicas that will get reconfigured to replicate with a different master. +Note: it is possible that a stale primary returns online at the same time a client contacts a stale Sentinel instance, so the client may connect with a stale primary, and yet the ROLE output will match. However when the primary is back again Sentinel will try to demote it to replica, triggering a new disconnection. The same reasoning applies to connecting to stale replicas that will get reconfigured to replicate with a different primary. Connecting to replicas === @@ -103,18 +103,18 @@ Sometimes clients are interested to connect to replicas, for example in order to The clients should call instead: - SENTINEL replicas master-name + SENTINEL replicas primary-name In order to retrieve a list of replica instances. Symmetrically the client should verify with the `ROLE` command that the instance is actually a replica, in order to avoid scaling read queries with -the master. +the primary. Connection pools === -For clients implementing connection pools, on reconnection of a single connection, the Sentinel should be contacted again, and in case of a master address change all the existing connections should be closed and connected to the new address. +For clients implementing connection pools, on reconnection of a single connection, the Sentinel should be contacted again, and in case of a primary address change all the existing connections should be closed and connected to the new address. Error reporting === @@ -122,14 +122,14 @@ Error reporting The client should correctly return the information to the user in case of errors. Specifically: * If no Sentinel can be contacted (so that the client was never able to get the reply to `SENTINEL get-master-addr-by-name`), an error that clearly states that Valkey Sentinel is unreachable should be returned. 
-* If all the Sentinels in the pool replied with a null reply, the user should be informed with an error that Sentinels don't know this master name. +* If all the Sentinels in the pool replied with a null reply, the user should be informed with an error that Sentinels don't know this primary name. Sentinels list automatic refresh === Optionally once a successful reply to `get-master-addr-by-name` is received, a client may update its internal list of Sentinel nodes following this procedure: -* Obtain a list of other Sentinels for this master using the command `SENTINEL sentinels `. +* Obtain a list of other Sentinels for this primary using the command `SENTINEL sentinels `. * Add every ip:port pair not already existing in our list at the end of the list. It is not needed for a client to be able to make the list persistent updating its own configuration. The ability to upgrade the in-memory representation of the list of Sentinels can be already useful to improve reliability. @@ -144,8 +144,10 @@ Valkey instances configurations. This mechanism can be used in order to speedup the reconfiguration of clients, that is, clients may listen to Pub/Sub in order to know when a configuration change happened in order to run the three steps protocol explained in this -document in order to resolve the new Valkey master (or replica) address. +document in order to resolve the new Valkey primary (or replica) address. However update messages received via Pub/Sub should not substitute the above procedure, since there is no guarantee that a client is able to receive all the update messages. + +> NOTE: If not for backward compatibility, the Valkey project no longer uses the words "master" and "slave". Unfortunately in the given commands these words are part of the protocol, so we’ll be able to remove such occurrences only when this API will be naturally deprecated. diff --git a/topics/sentinel.md b/topics/sentinel.md index 4bcab42d..f09d43cf 100644 --- a/topics/sentinel.md +++ b/topics/sentinel.md @@ -19,10 +19,10 @@ notifications and acts as a configuration provider for clients. This is the full list of Sentinel capabilities at a macroscopic level (i.e. the *big picture*): -* **Monitoring**. Sentinel constantly checks if your master and replica instances are working as expected. -* **Notification**. Sentinel can notify the system administrator, or other computer programs, via an API, that something is wrong with one of the monitored Valkey instances. -* **Automatic failover**. If a master is not working as expected, Sentinel can start a failover process where a replica is promoted to master, the other additional replicas are reconfigured to use the new master, and the applications using the Valkey server are informed about the new address to use when connecting. -* **Configuration provider**. Sentinel acts as a source of authority for clients service discovery: clients connect to Sentinels in order to ask for the address of the current Valkey master responsible for a given service. If a failover occurs, Sentinels will report the new address. +* **Monitoring**. Sentinel constantly checks if your primary and replica instances are working as expected. +* **Notification**. Sentinel can notify the system administrator, or other computer programs, via the API, that something is wrong with one of the monitored Valkey instances. +* **Automatic failover**. 
If a primary is not working as expected, Sentinel can start a failover process where a replica is promoted to primary, the other additional replicas are reconfigured to use the new primary, and the applications using the Valkey server are informed about the new address to use when connecting. +* **Configuration provider**. Sentinel acts as a source of authority for clients service discovery: clients connect to Sentinels in order to ask for the address of the current Valkey primary responsible for a given service. If a failover occurs, Sentinels will report the new address. ## Sentinel as a distributed system @@ -30,10 +30,10 @@ Valkey Sentinel is a distributed system: Sentinel itself is designed to run in a configuration where there are multiple Sentinel processes cooperating together. The advantage of having multiple Sentinel processes cooperating are the following: -1. Failure detection is performed when multiple Sentinels agree about the fact a given master is no longer available. This lowers the probability of false positives. +1. Failure detection is performed when multiple Sentinels agree about the fact a given primary is no longer available. This lowers the probability of false positives. 2. Sentinel works even if not all the Sentinel processes are working, making the system robust against failures. There is no fun in having a failover system which is itself a single point of failure, after all. -The sum of Sentinels, Valkey instances (masters and replicas) and clients +The sum of Sentinels, Valkey instances (primaries and replicas) and clients connecting to Sentinel and Valkey, are also a larger distributed system with specific properties. In this document concepts will be introduced gradually starting from basic information needed in order to understand the basic @@ -73,8 +73,8 @@ will never be performed. 2. The three Sentinel instances should be placed into computers or virtual machines that are believed to fail in an independent way. So for example different physical servers or Virtual Machines executed on different availability zones. 3. Sentinel + Valkey distributed system does not guarantee that acknowledged writes are retained during failures, since Valkey uses asynchronous replication. However there are ways to deploy Sentinel that make the window to lose writes limited to certain moments, while there are other less secure ways to deploy it. 4. You need Sentinel support in your clients. Popular client libraries have Sentinel support, but not all. -5. There is no HA setup which is safe if you don't test from time to time in development environments, or even better if you can, in production environments, if they work. You may have a misconfiguration that will become apparent only when it's too late (at 3am when your master stops working). -6. **Sentinel, Docker, or other forms of Network Address Translation or Port Mapping should be mixed with care**: Docker performs port remapping, breaking Sentinel auto discovery of other Sentinel processes and the list of replicas for a master. Check the [section about _Sentinel and Docker_](#sentinel-docker-nat-and-possible-issues) later in this document for more information. +5. There is no HA setup which is safe if you don't test from time to time in development environments, or even better, if you can in production environments, if they work. You may have a misconfiguration that will become apparent only when it's too late (at 3am when your primary stops working). +6. 
**Sentinel, Docker, or other forms of Network Address Translation or Port Mapping should be mixed with care**: Docker performs port remapping, breaking Sentinel auto discovery of other Sentinel processes and the list of replicas for a primary. Check the [section about _Sentinel and Docker_](#sentinel-docker-nat-and-possible-issues) later in this document for more information. ### Configuring Sentinel @@ -93,36 +93,36 @@ following: sentinel failover-timeout resque 180000 sentinel parallel-syncs resque 5 -You only need to specify the masters to monitor, giving to each separated -master (that may have any number of replicas) a different name. There is no +You only need to specify the primaries to monitor, giving to each separated +primary (that may have any number of replicas) a different name. There is no need to specify replicas, which are auto-discovered. Sentinel will update the configuration automatically with additional information about replicas (in order to retain the information in case of restart). The configuration is -also rewritten every time a replica is promoted to master during a failover +also rewritten every time a replica is promoted to primary during a failover and every time a new Sentinel is discovered. The example configuration above basically monitors two sets of Valkey -instances, each composed of a master and an undefined number of replicas. +instances, each composed of a primary and an undefined number of replicas. One set of instances is called `mymaster`, and the other `resque`. The meaning of the arguments of `sentinel monitor` statements is the following: - sentinel monitor + sentinel monitor For the sake of clarity, let's check line by line what the configuration options mean: -The first line is used to tell Valkey to monitor a master called *mymaster*, +The first line is used to tell Valkey to monitor a primary called *mymaster*, that is at address 127.0.0.1 and port 6379, with a quorum of 2. Everything is pretty obvious but the **quorum** argument: -* The **quorum** is the number of Sentinels that need to agree about the fact the master is not reachable, in order to really mark the master as failing, and eventually start a failover procedure if possible. +* The **quorum** is the number of Sentinels that need to agree about the fact the primary is not reachable, in order to really mark the primary as failing, and eventually start a failover procedure if possible. * However **the quorum is only used to detect the failure**. In order to actually perform a failover, one of the Sentinels need to be elected leader for the failover and be authorized to proceed. This only happens with the vote of the **majority of the Sentinel processes**. So for example if you have 5 Sentinel processes, and the quorum for a given -master set to the value of 2, this is what happens: +primary is set to the value of 2, this is what happens: -* If two Sentinels agree at the same time about the master being unreachable, one of the two will try to start a failover. +* If two Sentinels agree at the same time about the primary being unreachable, one of the two will try to start a failover. * If there are at least a total of three Sentinels reachable, the failover will be authorized and will actually start. In practical terms this means during failures **Sentinel never starts a failover if the majority of Sentinel processes are unable to talk** (aka no failover in the minority partition). 
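To make the difference between the quorum and the majority concrete, the arithmetic for the example above (5 Sentinel processes, quorum set to 2) can be sketched as follows; this is only an illustration of the rules described in this section, not how Sentinel is implemented:

    # Failure detection uses the configured quorum; authorizing the actual
    # failover always requires the majority of all Sentinel processes.
    total_sentinels = 5
    quorum = 2
    majority = total_sentinels // 2 + 1      # 3

    sentinels_flagging_primary_down = 2      # Sentinels reporting the primary unreachable
    reachable_sentinels = 3                  # Sentinels that can still talk to each other

    failure_detected = sentinels_flagging_primary_down >= quorum                 # True
    failover_authorized = failure_detected and reachable_sentinels >= majority   # True

If only two Sentinels were reachable, the second condition would be false and no failover would start, which is exactly the "no failover in the minority partition" behavior described above.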
@@ -131,7 +131,7 @@ In practical terms this means during failures **Sentinel never starts a failover The other options are almost always in the form: - sentinel + sentinel And are used for the following purposes: @@ -139,12 +139,12 @@ And are used for the following purposes: be reachable (either does not reply to our PINGs or it is replying with an error) for a Sentinel starting to think it is down. * `parallel-syncs` sets the number of replicas that can be reconfigured to use -the new master after a failover at the same time. The lower the number, the +the new primary after a failover at the same time. The lower the number, the more time it will take for the failover process to complete, however if the replicas are configured to serve old data, you may not want all the replicas to -re-synchronize with the master at the same time. While the replication +re-synchronize with the primary at the same time. While the replication process is mostly non blocking for a replica, there is a moment when it stops to -load the bulk data from the master. You may want to make sure only one replica +load the bulk data from the primary. You may want to make sure only one replica at a time is not reachable by setting this option to the value of 1. Additional options are described in the rest of this document and @@ -177,7 +177,7 @@ format, this is what the different symbols means: We write inside the boxes what they are running: +--------------------+ - | Valkey master M1 | + | Valkey primary M1 | | Valkey Sentinel S1 | +--------------------+ @@ -195,11 +195,11 @@ Network partitions are shown as interrupted lines using slashes: Also note that: -* Masters are called M1, M2, M3, ..., Mn. +* Primaries are called M1, M2, M3, ..., Mn. * Replicas are called R1, R2, R3, ..., Rn (R stands for *replica*). * Sentinels are called S1, S2, S3, ..., Sn. * Clients are called C1, C2, C3, ..., Cn. -* When an instance changes role because of Sentinel actions, we put it inside square brackets, so [M1] means an instance that is now a master because of Sentinel intervention. +* When an instance changes role because of Sentinel actions, we put it inside square brackets, so [M1] means an instance that is now a primary because of Sentinel intervention. Note that we will never show **setups where just two Sentinels are used**, since Sentinels always need **to talk with the majority** in order to start a @@ -214,7 +214,7 @@ failover. Configuration: quorum = 1 -* In this setup, if the master M1 fails, R1 will be promoted since the two Sentinels can reach agreement about the failure (obviously with quorum set to 1) and can also authorize a failover because the majority is two. So apparently it could superficially work, however check the next points to see why this setup is broken. +* In this setup, if the primary M1 fails, R1 will be promoted since the two Sentinels can reach agreement about the failure (obviously with quorum set to 1) and can also authorize a failover because the majority is two. So apparently it could superficially work, however check the next points to see why this setup is broken. * If the box where M1 is running stops working, also S1 stops working. The Sentinel running in the other box S2 will not be able to authorize a failover, so the system will become not available. Note that a majority is needed in order to order different failovers, and later propagate the latest configuration to all the Sentinels. 
Also note that the ability to failover in a single side of the above setup, without any agreement, would be very dangerous: @@ -224,7 +224,7 @@ Note that a majority is needed in order to order different failovers, and later | S1 | | S2 | +----+ +------+ -In the above configuration we created two masters (assuming S2 could failover +In the above configuration we created two primaries (assuming S2 could failover without authorization) in a perfectly symmetrical way. Clients may write indefinitely to both sides, and there is no way to understand when the partition heals what configuration is the right one, in order to prevent @@ -251,14 +251,14 @@ a Valkey process and a Sentinel process. Configuration: quorum = 2 -If the master M1 fails, S2 and S3 will agree about the failure and will +If the primary M1 fails, S2 and S3 will agree about the failure and will be able to authorize a failover, making clients able to continue. In every Sentinel setup, as Valkey uses asynchronous replication, there is always the risk of losing some writes because a given acknowledged write -may not be able to reach the replica which is promoted to master. However in +may not be able to reach the replica which is promoted to primary. However in the above setup there is a higher risk due to clients being partitioned away -with an old master, like in the following picture: +with an old primary, like in the following picture: +----+ | M1 | @@ -272,30 +272,30 @@ with an old master, like in the following picture: | S2 | | S3 | +------+ +----+ -In this case a network partition isolated the old master M1, so the -replica R2 is promoted to master. However clients, like C1, that are -in the same partition as the old master, may continue to write data -to the old master. This data will be lost forever since when the partition -will heal, the master will be reconfigured as a replica of the new master, +In this case a network partition isolated the old primary M1, so the +replica R2 is promoted to primary. However clients, like C1, that are +in the same partition as the old primary, may continue to write data +to the old primary. This data will be lost forever since when the partition +will heal, the primary will be reconfigured as a replica of the new primary, discarding its data set. This problem can be mitigated using the following Valkey replication -feature, that allows to stop accepting writes if a master detects that +feature, that allows to stop accepting writes if a primary detects that it is no longer able to transfer its writes to the specified number of replicas. min-replicas-to-write 1 min-replicas-max-lag 10 -With the above configuration (please see the self-commented `valkey.conf` example in the Valkey distribution for more information) a Valkey instance, when acting as a master, will stop accepting writes if it can't write to at least 1 replica. Since replication is asynchronous *not being able to write* actually means that the replica is either disconnected, or is not sending us asynchronous acknowledges for more than the specified `max-lag` number of seconds. +With the above configuration (please see the self-commented `valkey.conf` example in the Valkey distribution for more information) a Valkey instance, when acting as a primary, will stop accepting writes if it can't write to at least 1 replica. 
Since replication is asynchronous *not being able to write* actually means that the replica is either disconnected, or is not sending us asynchronous acknowledges for more than the specified `max-lag` number of seconds. -Using this configuration, the old Valkey master M1 in the above example, will become unavailable after 10 seconds. When the partition heals, the Sentinel configuration will converge to the new one, the client C1 will be able to fetch a valid configuration and will continue with the new master. +Using this configuration, the old Valkey primary M1 in the above example, will become unavailable after 10 seconds. When the partition heals, the Sentinel configuration will converge to the new one, the client C1 will be able to fetch a valid configuration and will continue with the new primary. However there is no free lunch. With this refinement, if the two replicas are -down, the master will stop accepting writes. It's a trade off. +down, the primary will stop accepting writes. It's a trade off. #### Example 3: Sentinel in the client boxes -Sometimes we have only two Valkey boxes available, one for the master and +Sometimes we have only two Valkey boxes available, one for the primary and one for the replica. The configuration in the example 2 is not viable in that case, so we can resort to the following, where Sentinels are placed where clients are: @@ -316,7 +316,7 @@ where clients are: Configuration: quorum = 2 In this setup, the point of view Sentinels is the same as the clients: if -a master is reachable by the majority of the clients, it is fine. +a primary is reachable by the majority of the clients, it is fine. C1, C2, C3 here are generic clients, it does not mean that C1 identifies a single client connected to Valkey. It is more likely something like an application server, a Rails app, or something like that. @@ -325,20 +325,20 @@ If the box where M1 and S1 are running fails, the failover will happen without issues, however it is easy to see that different network partitions will result in different behaviors. For example Sentinel will not be able to setup if the network between the clients and the Valkey servers is -disconnected, since the Valkey master and replica will both be unavailable. +disconnected, since the Valkey primary and replica will both be unavailable. Note that if C3 gets partitioned with M1 (hardly possible with the network described above, but more likely possible with different layouts, or because of failures at the software layer), we have a similar issue as described in Example 2, with the difference that here we have -no way to break the symmetry, since there is just a replica and master, so -the master can't stop accepting queries when it is disconnected from its replica, -otherwise the master would never be available during replica failures. +no way to break the symmetry, since there is just a replica and a primary, so +the primary can't stop accepting queries when it is disconnected from its replica, +otherwise the primary would never be available during replica failures. So this is a valid setup but the setup in the Example 2 has advantages such as the HA system of Valkey running in the same boxes as Valkey itself which may be simpler to manage, and the ability to put a bound on the amount -of time a master in the minority partition can receive writes. +of time a primary in the minority partition can receive writes. 
#### Example 4: Sentinel client side with less than three clients @@ -362,7 +362,7 @@ case we need to resort to a mixed setup like the following: Configuration: quorum = 3 This is similar to the setup in Example 3, but here we run four Sentinels -in the four boxes we have available. If the master M1 becomes unavailable +in the four boxes we have available. If the primary M1 becomes unavailable the other three Sentinels will perform the failover. In theory this setup works removing the box where C2 and S4 are running, and @@ -383,13 +383,13 @@ not ports but also IP addresses. Remapping ports and addresses creates issues with Sentinel in two ways: 1. Sentinel auto-discovery of other Sentinels no longer works, since it is based on *hello* messages where each Sentinel announce at which port and IP address they are listening for connection. However Sentinels have no way to understand that an address or port is remapped, so it is announcing an information that is not correct for other Sentinels to connect. -2. Replicas are listed in the `INFO` output of a Valkey master in a similar way: the address is detected by the master checking the remote peer of the TCP connection, while the port is advertised by the replica itself during the handshake, however the port may be wrong for the same reason as exposed in point 1. +2. Replicas are listed in the `INFO` output of a Valkey primary in a similar way: the address is detected by the primary checking the remote peer of the TCP connection, while the port is advertised by the replica itself during the handshake, however the port may be wrong for the same reason as exposed in point 1. -Since Sentinels auto detect replicas using masters `INFO` output information, +Since Sentinels auto detect replicas using primaries `INFO` output information, the detected replicas will not be reachable, and Sentinel will never be able to -failover the master, since there are no good replicas from the point of view of +failover the primary, since there are no good replicas from the point of view of the system, so there is currently no way to monitor with Sentinel a set of -master and replica instances deployed with Docker, **unless you instruct Docker +primary and replica instances deployed with Docker, **unless you instruct Docker to map the port 1:1**. For the first problem, in case you want to run a set of Sentinel @@ -433,7 +433,7 @@ that want to play with the system ASAP, this section is a tutorial that shows how to configure and interact with 3 Sentinel instances. Here we assume that the instances are executed at port 5000, 5001, 5002. -We also assume that you have a running Valkey master at port 6379 with a +We also assume that you have a running Valkey primary at port 6379 with a replica running at port 6380. We will use the IPv4 loopback address 127.0.0.1 everywhere during the tutorial, assuming you are running the simulation on your personal computer. @@ -451,9 +451,9 @@ as port numbers. A few things to note about the above configuration: -* The master set is called `mymaster`. It identifies the master and its replicas. Since each *master set* has a different name, Sentinel can monitor different sets of masters and replicas at the same time. +* The primary set is called `mymaster`. It identifies the primary and its replicas. Since each *primary set* has a different name, Sentinel can monitor different sets of primaries and replicas at the same time. * The quorum was set to the value of 2 (last argument of `sentinel monitor` configuration directive). 
-* The `down-after-milliseconds` value is 5000 milliseconds, that is 5 seconds, so masters will be detected as failing as soon as we don't receive any reply from our pings within this amount of time. +* The `down-after-milliseconds` value is 5000 milliseconds, that is 5 seconds, so primaries will be detected as failing as soon as we don't receive any reply from our pings within this amount of time. Once you start the three Sentinels, you'll see a few messages they log, like: @@ -465,11 +465,11 @@ if you `SUBSCRIBE` to the event name as specified later in [_Pubsub Messages_ se Sentinel generates and logs different events during failure detection and failover. -Asking Sentinel about the state of a master +Asking Sentinel about the state of a primary --- The most obvious thing to do with Sentinel to get started, is check if the -master it is monitoring is doing well: +primary it is monitoring is doing well: $ valkey-cli -p 5000 127.0.0.1:5000> sentinel master mymaster @@ -514,12 +514,12 @@ master it is monitoring is doing well: 39) "parallel-syncs" 40) "1" -As you can see, it prints a number of information about the master. There are +As you can see, it prints a number of information about the primary. There are a few that are of particular interest for us: -1. `num-other-sentinels` is 2, so we know the Sentinel already detected two more Sentinels for this master. If you check the logs you'll see the `+sentinel` events generated. -2. `flags` is just `master`. If the master was down we could expect to see `s_down` or `o_down` flag as well here. -3. `num-slaves` is correctly set to 1, so Sentinel also detected that there is an attached replica to our master. +1. `num-other-sentinels` is 2, so we know the Sentinel already detected two more Sentinels for this primary. If you check the logs you'll see the `+sentinel` events generated. +2. `flags` is just `master`. If the primary was down we could expect to see `s_down` or `o_down` flag as well here. +3. `num-slaves` is correctly set to 1, so Sentinel also detected that there is an attached replica to our primary. In order to explore more about this instance, you may want to try the following two commands: @@ -528,15 +528,15 @@ two commands: SENTINEL sentinels mymaster The first will provide similar information about the replicas connected to the -master, and the second about the other Sentinels. +primary, and the second about the other Sentinels. -Obtaining the address of the current master +Obtaining the address of the current primary --- As we already specified, Sentinel also acts as a configuration provider for -clients that want to connect to a set of master and replicas. Because of +clients that want to connect to a set of primary and replicas. Because of possible failovers or reconfigurations, clients have no idea about who is -the currently active master for a given set of instances, so Sentinel exports +the currently active primary for a given set of instances, so Sentinel exports an API to ask this question: 127.0.0.1:5000> SENTINEL get-master-addr-by-name mymaster @@ -546,22 +546,22 @@ an API to ask this question: ### Testing the failover At this point our toy Sentinel deployment is ready to be tested. We can -just kill our master and check if the configuration changes. To do so +just kill our primary and check if the configuration changes. To do so we can just do: valkey-cli -p 6379 DEBUG sleep 30 -This command will make our master no longer reachable, sleeping for 30 seconds. -It basically simulates a master hanging for some reason. 
+This command will make our primary no longer reachable, sleeping for 30 seconds. +It basically simulates a primary hanging for some reason. If you check the Sentinel logs, you should be able to see a lot of action: -1. Each Sentinel detects the master is down with an `+sdown` event. -2. This event is later escalated to `+odown`, which means that multiple Sentinels agree about the fact the master is not reachable. +1. Each Sentinel detects the primary is down with an `+sdown` event. +2. This event is later escalated to `+odown`, which means that multiple Sentinels agree about the fact the primary is not reachable. 3. Sentinels vote a Sentinel that will start the first failover attempt. 4. The failover happens. -If you ask again what is the current master address for `mymaster`, eventually +If you ask again what is the current primary address for `mymaster`, eventually we should get a different reply this time: 127.0.0.1:5000> SENTINEL get-master-addr-by-name mymaster @@ -574,7 +574,7 @@ or can read more to understand all the Sentinel commands and internals. ## Sentinel API Sentinel provides an API in order to inspect its state, check the health -of monitored masters and replicas, subscribe in order to receive specific +of monitored primaries and replicas, subscribe in order to receive specific notifications, and change the Sentinel configuration at run time. By default Sentinel runs using TCP port 26379 (note that 6379 is the normal @@ -595,23 +595,23 @@ The `SENTINEL` command is the main API for Sentinel. The following is the list o * **SENTINEL CONFIG GET ``** (`>= 6.2`) Get the current value of a global Sentinel configuration parameter. The specified name may be a wildcard, similar to the Valkey `CONFIG GET` command. * **SENTINEL CONFIG SET `` ``** (`>= 6.2`) Set the value of a global Sentinel configuration parameter. -* **SENTINEL CKQUORUM ``** Check if the current Sentinel configuration is able to reach the quorum needed to failover a master, and the majority needed to authorize the failover. This command should be used in monitoring systems to check if a Sentinel deployment is ok. +* **SENTINEL CKQUORUM ``** Check if the current Sentinel configuration is able to reach the quorum needed to failover a primary, and the majority needed to authorize the failover. This command should be used in monitoring systems to check if a Sentinel deployment is ok. * **SENTINEL FLUSHCONFIG** Force Sentinel to rewrite its configuration on disk, including the current Sentinel state. Normally Sentinel rewrites the configuration every time something changes in its state (in the context of the subset of the state which is persisted on disk across restart). However sometimes it is possible that the configuration file is lost because of operation errors, disk failures, package upgrade scripts or configuration managers. In those cases a way to force Sentinel to rewrite the configuration file is handy. This command works even if the previous configuration file is completely missing. -* **SENTINEL FAILOVER ``** Force a failover as if the master was not reachable, and without asking for agreement to other Sentinels (however a new version of the configuration will be published so that the other Sentinels will update their configurations). -* **SENTINEL GET-MASTER-ADDR-BY-NAME ``** Return the ip and port number of the master with that name. If a failover is in progress or terminated successfully for this master it returns the address and port of the promoted replica. 
-* **SENTINEL INFO-CACHE** Return cached `INFO` output from masters and replicas. -* **SENTINEL IS-MASTER-DOWN-BY-ADDR ** Check if the master specified by ip:port is down from current Sentinel's point of view. This command is mostly for internal use. -* **SENTINEL MASTER ``** Show the state and info of the specified master. -* **SENTINEL MASTERS** Show a list of monitored masters and their state. +* **SENTINEL FAILOVER ``** Force a failover as if the primary was not reachable, and without asking for agreement to other Sentinels (however a new version of the configuration will be published so that the other Sentinels will update their configurations). +* **SENTINEL GET-MASTER-ADDR-BY-NAME ``** Return the ip and port number of the primary with that name. If a failover is in progress or terminated successfully for this primary it returns the address and port of the promoted replica. +* **SENTINEL INFO-CACHE** Return cached `INFO` output from primaries and replicas. +* **SENTINEL IS-MASTER-DOWN-BY-ADDR ** Check if the primary specified by ip:port is down from current Sentinel's point of view. This command is mostly for internal use. +* **SENTINEL MASTER ``** Show the state and info of the specified primary. +* **SENTINEL MASTERS** Show a list of monitored primaries and their state. * **SENTINEL MONITOR** Start Sentinel's monitoring. Refer to the [_Reconfiguring Sentinel at Runtime_ section](#reconfiguring-sentinel-at-runtime) for more information. * **SENTINEL MYID** (`>= 6.2`) Return the ID of the Sentinel instance. * **SENTINEL PENDING-SCRIPTS** This command returns information about pending scripts. * **SENTINEL REMOVE** Stop Sentinel's monitoring. Refer to the [_Reconfiguring Sentinel at Runtime_ section](#reconfiguring-sentinel-at-runtime) for more information. -* **SENTINEL REPLICAS ``** Show a list of replicas for this master, and their state. -* **SENTINEL SENTINELS ``** Show a list of sentinel instances for this master, and their state. +* **SENTINEL REPLICAS ``** Show a list of replicas for this primary, and their state. +* **SENTINEL SENTINELS ``** Show a list of sentinel instances for this primary, and their state. * **SENTINEL SET** Set Sentinel's monitoring configuration. Refer to the [_Reconfiguring Sentinel at Runtime_ section](#reconfiguring-sentinel-at-runtime) for more information. * **SENTINEL SIMULATE-FAILURE (crash-after-election|crash-after-promotion|help)** This command simulates different Sentinel crash scenarios. -* **SENTINEL RESET ``** This command will reset all the masters with matching name. The pattern argument is a glob-style pattern. The reset process clears any previous state in a master (including a failover in progress), and removes every replica and sentinel already discovered and associated with the master. +* **SENTINEL RESET ``** This command will reset all the primaries with matching name. The pattern argument is a glob-style pattern. The reset process clears any previous state in a primary (including a failover in progress), and removes every replica and sentinel already discovered and associated with the primary. For connection management and administration purposes, Sentinel supports the following subset of Valkey's commands: @@ -622,26 +622,26 @@ For connection management and administration purposes, Sentinel supports the fol * **HELLO** (`>= 6.0`) Switch the connection's protocol. For more information refer to the `HELLO` command. * **INFO** Return information and statistics about the Sentinel server. For more information see the `INFO` command. 
* **PING** This command simply returns PONG. -* **ROLE** This command returns the string "sentinel" and a list of monitored masters. For more information refer to the `ROLE` command. +* **ROLE** This command returns the string "sentinel" and a list of monitored primaries. For more information refer to the `ROLE` command. * **SHUTDOWN** Shut down the Sentinel instance. Lastly, Sentinel also supports the `SUBSCRIBE`, `UNSUBSCRIBE`, `PSUBSCRIBE` and `PUNSUBSCRIBE` commands. Refer to the [_Pub/Sub Messages_ section](#pubsub-messages) for more details. ### Reconfiguring Sentinel at Runtime -Sentinel provides an API in order to add, remove, or change the configuration of a given master. Note that if you have multiple sentinels you should apply the changes to all to your instances for Valkey Sentinel to work properly. This means that changing the configuration of a single Sentinel does not automatically propagate the changes to the other Sentinels in the network. +Sentinel provides an API in order to add, remove, or change the configuration of a given primary. Note that if you have multiple Sentinels you should apply the changes to all of your instances for Valkey Sentinel to work properly. This means that changing the configuration of a single Sentinel does not automatically propagate the changes to the other Sentinels in the network. The following is a list of `SENTINEL` subcommands used in order to update the configuration of a Sentinel instance. -* **SENTINEL MONITOR `` `` `` ``** This command tells the Sentinel to start monitoring a new master with the specified name, ip, port, and quorum. It is identical to the `sentinel monitor` configuration directive in `sentinel.conf` configuration file, with the difference that you can't use a hostname in as `ip`, but you need to provide an IPv4 or IPv6 address. -* **SENTINEL REMOVE ``** is used in order to remove the specified master: the master will no longer be monitored, and will totally be removed from the internal state of the Sentinel, so it will no longer listed by `SENTINEL masters` and so forth. -* **SENTINEL SET `` [`