
Added logic to update the slot map based on MOVED errors #186

Merged

Conversation

@barshaul commented Aug 28, 2024

Added logic to update the slot map based on MOVED errors. Now, when a MOVED error is received, the client pushes a new future before retrying the request, attempting to update the slot map based on the new primary node, while still spawning a background task to refresh slots. This optimization allows quicker rerouting of requests: the client no longer has to wait for the background slot refresh, which may be skipped entirely by the rate limiter.

The updated logic handles several scenarios:

  1. No Change: If the new primary is already the current slot owner, no changes are required.
  2. Failover: If the new primary is a replica in the same shard, the client promotes the replica to primary by updating the shard's addresses.
  3. Slot Migration: If the new primary is the existing primary of a different shard, the client updates the slot map to point to the new shard's addresses.
  4. Replica Moved to a Different Shard: If the new primary is a replica in a different shard, it is either promoted to primary of its shard or moved to a new shard. The replica is removed from its original shard, and the slot map is updated accordingly.
  5. New Node: If the new primary is an unknown node, the client adds it as a new primary node in a new shard, possibly indicating a scale-out.

If the slot map update fails, the request is retried with the node from the MOVED redirect, and the background slot refresh task will correct the map asynchronously.
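For illustration, the five scenarios can be condensed into a minimal sketch. The names below (SlotMap, ShardAddrs, update_on_moved) are simplified stand-ins rather than the crate's actual types, and a real slot map tracks slot ranges behind shared pointers rather than single slots owned by value:

use std::collections::HashMap;

#[derive(Debug, Clone)]
struct ShardAddrs {
    primary: String,
    replicas: Vec<String>,
}

#[derive(Debug, Default)]
struct SlotMap {
    // slot -> addresses of the shard that owns it
    shards: HashMap<u16, ShardAddrs>,
}

impl SlotMap {
    fn update_on_moved(&mut self, slot: u16, new_primary: &str) {
        // Scenario 1: No change - the new primary already owns the slot.
        if let Some(shard) = self.shards.get(&slot) {
            if shard.primary == new_primary {
                return;
            }
        }

        // Scenario 2: Failover - promote a replica of the same shard.
        if let Some(shard) = self.shards.get_mut(&slot) {
            if let Some(pos) = shard.replicas.iter().position(|r| r == new_primary) {
                let promoted = shard.replicas.remove(pos);
                let demoted = std::mem::replace(&mut shard.primary, promoted);
                shard.replicas.push(demoted);
                return;
            }
        }

        // Scenario 3: Slot migration - the new primary already heads another
        // shard; point the slot at that shard's addresses.
        if let Some(other) = self.shards.values().find(|s| s.primary == new_primary).cloned() {
            self.shards.insert(slot, other);
            return;
        }

        // Scenario 4: Replica moved to a different shard - remove the node
        // from any replica list it appears in before promoting it below.
        for shard in self.shards.values_mut() {
            shard.replicas.retain(|r| r != new_primary);
        }

        // Scenario 5: New node - install it as the primary of a new shard
        // (possibly indicating a scale-out).
        self.shards.insert(
            slot,
            ShardAddrs { primary: new_primary.to_string(), replicas: Vec::new() },
        );
    }
}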

@barshaul force-pushed the update_on_moves branch 4 times, most recently from f41533b to b0a3e5c on September 8, 2024
@barshaul force-pushed the update_on_moves branch 2 times, most recently from a374c59 to e17a802 on September 11, 2024
@barshaul marked this pull request as ready for review on September 11, 2024
@barshaul changed the title from "WORK IN PROGRESS - Update on moves" to "Added logic to update the slot map based on MOVED errors" on Sep 11, 2024
@eifrah-aws left a comment

Please consider changing Arc<RwLock<ShardAddr>> into a clean API, so that ShardAddr manages the locks internally and we can pass Arc<ShardAddr>.
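For illustration, a minimal sketch of the suggested shape, assuming a simplified ShardAddr that holds just a primary and a replica list (the real fields and methods may differ):

use std::sync::{Arc, RwLock};

// Illustrative only: the lock lives inside the struct, so callers share a
// plain Arc<ShardAddr> and never handle the lock directly.
#[derive(Debug, Default)]
pub struct ShardAddr {
    inner: RwLock<Addrs>,
}

#[derive(Debug, Default)]
struct Addrs {
    primary: String,
    replicas: Vec<String>,
}

impl ShardAddr {
    pub fn primary(&self) -> String {
        self.inner.read().unwrap().primary.clone()
    }

    pub fn replicas(&self) -> Vec<String> {
        self.inner.read().unwrap().replicas.clone()
    }

    // Promote `new_primary` from the replica list, demoting the old primary,
    // all under a single internally-held write lock.
    pub fn promote_replica(&self, new_primary: &str) {
        let mut addrs = self.inner.write().unwrap();
        if let Some(pos) = addrs.replicas.iter().position(|r| r == new_primary) {
            let promoted = addrs.replicas.remove(pos);
            let demoted = std::mem::replace(&mut addrs.primary, promoted);
            addrs.replicas.push(demoted);
        }
    }
}

// Callers now pass Arc<ShardAddr> instead of Arc<RwLock<ShardAddr>>.
fn _example(shard: Arc<ShardAddr>) {
    shard.promote_replica("10.0.0.2:6379");
    println!("primary is now {}", shard.primary());
}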

redis/src/cluster_async/mod.rs (outdated; resolved)
// If it fails, proceed by retrying the request with the redirected node,
// and allow the slot refresh task to correct the slot map.
warn!(
    "Failed to update the slot map based on the received MOVED error.\n

@eifrah-aws:
  1. Avoid using \n in the log, as it creates a line without its prefix. Please use two warn! calls.
  2. If it is OK to fail, why use warn!? (This is a good place for a telemetry call, once it's available.)

@barshaul (Author):

ack, changed to info!
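For illustration, the adjusted logging might look like this, assuming the log crate's macros and a hypothetical free-standing helper (not the PR's actual code):

use log::info;

// Two separate calls instead of one message with an embedded \n, so each
// line carries the logger's prefix; downgraded from warn! to info! since
// this failure path is recoverable.
fn log_slot_map_update_failure(err: &str) {
    info!("Failed to update the slot map based on the received MOVED error: {err}");
    info!("Retrying with the redirected node; the slot refresh task will correct the map.");
}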

redis/src/cluster_async/mod.rs (outdated; resolved)

if curr_shard_addrs_read.replicas().contains(&new_primary) {
    // Scenario 2: Failover - The new primary is a replica within the same shard
    drop(curr_shard_addrs_read);

@eifrah-aws:

Question: we are dropping the lock and then re-acquiring it, meaning we lose atomicity. Could this be a problem for us?

If it could, it is better to obtain the write lock to begin with.

@barshaul (Author):

Yes, you're right, I found some potential issues. I've updated it to acquire the write lock at the start.
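For illustration, a minimal before/after sketch of that change, with made-up names standing in for the PR's types:

use std::sync::{Arc, RwLock};

struct Shard {
    primary: String,
    replicas: Vec<String>,
}

fn promote(shard: &Arc<RwLock<Shard>>, new_primary: &str) {
    // Racy version: check under a read guard, drop it, then re-acquire a
    // write guard - another task may change the shard in between.
    //
    // let read = shard.read().unwrap();
    // let is_replica = read.replicas.iter().any(|r| r == new_primary);
    // drop(read); // <-- window for a concurrent update
    // let mut write = shard.write().unwrap();

    // Safer version (what the PR switched to): take the write lock up front
    // so the check and the mutation happen atomically under one guard.
    let mut guard = shard.write().unwrap();
    if let Some(pos) = guard.replicas.iter().position(|r| r == new_primary) {
        let promoted = guard.replicas.remove(pos);
        let demoted = std::mem::replace(&mut guard.primary, promoted);
        guard.replicas.push(demoted);
    }
}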

redis/src/cluster_async/mod.rs (outdated; resolved)
redis/src/cluster_async/mod.rs (outdated; resolved)
}

// Scenario 5: New Node - The new primary is not present in the current slots map, add it as a primary of a new shard.
drop(nodes_iter);

@eifrah-aws:

Why the explicit drop here? Won't it be dropped when leaving the function?

@barshaul (Author):

I have to drop it because it borrows the connection container immutably, and we need to do some mutable borrowing on the next line.
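For illustration, a minimal, self-contained version of that borrow-checker constraint (the names here are made up, not the PR's code): the iterator borrows the container immutably, and that borrow must end before the mutable borrow below.

fn main() {
    let mut connections: Vec<String> = vec!["node-a:6379".into()];

    let mut nodes_iter = connections.iter(); // immutable borrow starts here
    let known = nodes_iter.any(|a| a == "node-b:6379");
    // For a plain Vec iterator, non-lexical lifetimes would end the borrow
    // after the last use on their own; for iterator types that hold a guard
    // (and therefore implement Drop), the explicit drop is what releases
    // the borrow before the mutation.
    drop(nodes_iter);

    if !known {
        connections.push("node-b:6379".into()); // mutable borrow, now allowed
    }
}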

redis/src/cluster_routing.rs (outdated; resolved)
redis/src/cluster_routing.rs (outdated; resolved)
redis/src/cluster_routing.rs (outdated; resolved)