Add support for raft repl dev replace member. #546
Conversation
// Step 2. Add the new member.
return m_msg_mgr.add_member(m_group_id, member_in_uuid)
Add the member, append the log entry, and remove the old member.
RD_LOGI("Raft repl replace_member member_out={} member_in={}", boost::uuids::to_string(member_out),
        boost::uuids::to_string(member_in));

m_listener->replace_member(member_out, member_in);
homeobject or the listener can use this to update their PG metablks, removing the old member and adding the new one.
@@ -134,11 +136,12 @@ class HSReplTestHelper : public HSTestHelper {
    HSReplTestHelper(std::string const& name, std::vector< std::string > const& args, char** argv) :
            name_{name}, args_{args}, argv_{argv} {}

-   void setup() {
+   void setup(uint32_t num_replicas) {
setup now takes the number of replicas to start.
@@ -0,0 +1,629 @@
/*********************************************************************************
Moved common code from test_raft_repl_dev.cpp to raft_test_base.hpp. No change in code. Just moved the classes.
LOGINFOMOD(replication, "[Replica={}] Read logical snapshot callback obj_id={} term={} idx={} num_items={}",
           g_helper->replica_num(), snp_data->offset, s->get_last_log_term(), s->get_last_log_idx(),
           kv_snapshot_data.size());
#include "test_common/raft_repl_test_base.hpp"
Moved this code to the raft_repl_test_base.hpp header file.
@@ -0,0 +1,132 @@
/*********************************************************************************
This test creates a repl dev group with 3 replicas by default and then adds spare replicas. We couldn't share common code because the test_raft_repl_dev test cases assume all replicas are part of the group from the beginning.
@sanebay thanks for the patch!
We need to take leader change into account; please see my comments and correct me if I am wrong.
LOGINFO("Replace member added member={} to group_id={}", member_in, group_id_str());

// Step 3. Append log entry to mark the old member is out and new member is added.
If the leader changes after line 140 and before line 147, then who can send a log entry to mark the old member out and remove member_out from this group?
Should we separate replace_member into two calls: add_member and remove_member?
Yeah, valid scenario. In that case the log entry append will fail at line 157, and CM will retry with the latest leader. That's the flow we have decided with CM. It should be safe to send the same message to a different leader.
> It should be safe to send the same message to a different leader.

If the leader changes after propose_to_raft but before m_msg_mgr.rem_member, then it will return ReplServiceError::RETRY_REQUEST.

Now, if CM resends the request to the new leader, I think add_member will succeed on the new leader since the member_in was already added to the group by the previous leader. The next step is propose_to_raft, which will create another HS_CTRL_REPLACE log and commit it.

So HS_CTRL_REPLACE will be committed twice at the other follower: once from the previous leader and once from the current leader. We should make it safe to commit HS_CTRL_REPLACE twice (or make committing HS_CTRL_REPLACE idempotent) for the same member_out and member_in, so we need to add some checks in m_listener#replace_member(member_out, member_in).

So I think it will basically be safe, but we should add some code to make committing HS_CTRL_REPLACE idempotent.
More for CM: if the new leader is the member_in, we haven't responded to the call yet.

We get the leader (in nuraft_mesg) from raft, which is always up to date, and respond to CM.

Wondering if CM expects a HB from a to-be-added SM, or expects a retry against a to-be-added SM.
Idempotency of HS_CTRL_REPLACE is the same as create pg or shard. Gateway can send the request to the leader; if the leader does the raft round but crashes and restarts before returning success to the gateway, the gateway will send to the same leader again and we might create the pg again. It's a NOP in HO: in HO's add_replace function, if the in and out members are already applied, it's a no-op.
We may need add_member and remove_member if we do expansion/shrink. Our use case is only replace, and we want to update the homeobject PG metablk in a single write atomically. The member out will have its PG metablk with the destroyed flag enabled. GC will use this to delete all blobs for that PG.
Do we need GC for the member out? I think if we remove it from the group, we can just re-format it with homestore, then it will be clean, no?
.via(&folly::InlineExecutor::instance())
.thenValue([this, member_in_uuid, member_out_uuid](auto&& e) -> AsyncReplResult<> {
    // TODO Currently we ignore the cancelled, fix nuraft_mesg to not timeout
    // when adding member. Member is added to cluster config until member syncs fully
Can we change the behavior? This is very odd, mixing member change with data syncing.
This is nuraft behaviour. I had the same doubt. This is what Jungsang shared:
"""
- there are three servers: s1, s2, s3. So quorum size is 2.
- let's say s3 is lagging behind, but still s1 can make a commit, by forming quorum {s1, s2}.
- now new member s4 joins the cluster.
- if s4 joins the cluster immediately without catch-up, it is also lagging behind.
- since it joins the cluster immediately, now quorum size becomes 3.
- now s1 cannot make a commit, no matter which member is in the quorum: {s1, s2, s3}, {s1, s2, s4}.
"""
This scenario makes sense.
Another question: do we have a plan for separate add_member and rem_member? For example, if we have a very heavy read workload for one pg, we may want to add some new members to take over part of the read workload, so we only need add_member.
@JacksonYao287 we don't want that as of now. We don't have the flexibility to control the replication size of each PG in the control plane. Let's revisit if we have a solid use case.
@sanebay that makes sense from the raft point of view, but it makes the cm->sm protocol very tricky. Once the leader SM reboots, what would be the state of the in-flight add_srv? Will it continue till success or be discarded? Who should take the responsibility to respond to CM?
Is there a possibility that we can add the new member as a learner and promote it to participant once it catches up?
Also, for a failed/forgotten add_srv, who should take the responsibility to reclaim space? If CM retries with the same PG to the same SM, I believe baseline resync will erase and re-do. If CM attempts another PG to this SM, I am not sure how we handle this --- linking to the fixed pg size design @JacksonYao287
> @sanebay that makes sense from the raft point of view, but it makes the cm->sm protocol very tricky. Once the leader SM reboots, what would be the state of the in-flight add_srv? Will it continue till success or be discarded? Who should take the responsibility to respond to CM? Is there a possibility that we can add the new member as a learner and promote it to participant once it catches up?

add_srv internally does a cluster config change with a log entry, so it needs a commit; it is treated the same as an append-entry log. If it committed and the leader crashed, CM can send again and we return an error that the server already exists. If it didn't commit, it will be rolled back once another leader is elected. I haven't explored the learner option.

> also for a failed/forgotten add_srv, who should take the responsibility to reclaim space?

No; as mentioned, add_srv should fail with the error SERVER_ALREADY_EXISTS. As per the design, if the new member fails after some time, manual intervention is needed and a new PG move is needed (https://github.com/eBay/NuRaft/blob/master/src/handle_join_leave.cxx#L68).
> also for a failed/forgotten add_srv, who should take the responsibility to reclaim space? If CM retries with the same PG to the same SM, I believe baseline resync will erase and re-do. If CM attempts another PG to this SM, I am not sure how we handle this.

Yes, a failed add_srv will leave some stale data on the disk of member_in; it is caused by the snapshot taken before the config change is committed.
I have gone through the raft code. I will add the solution in the doc.
One SM could have multiple PGs on the same disk. We could move out some PGs for space issues or load balance.
The code generally looks good to me. Please resolve the conflict.
Force-pushed from 0870f72 to e98bb96.
Codecov Report

@@ Coverage Diff @@
## master #546 +/- ##
===========================================
+ Coverage 56.51% 67.27% +10.76%
===========================================
Files 108 109 +1
Lines 10300 10448 +148
Branches 1402 1408 +6
===========================================
+ Hits 5821 7029 +1208
+ Misses 3894 2743 -1151
- Partials 585 676 +91
Force-pushed from e98bb96 to 8186675.
@@ -93,7 +93,10 @@ void RaftReplService::start() {
    .with_hb_interval(HS_DYNAMIC_CONFIG(consensus.heartbeat_period_ms))
    .with_max_append_size(HS_DYNAMIC_CONFIG(consensus.max_append_batch_size))
    .with_log_sync_batch_size(HS_DYNAMIC_CONFIG(consensus.log_sync_batch_size))
    // TODO fix the log_gap thresholds when adding new member.
Shared logs with the nuraft team, but we couldn't find the root cause immediately. Will have to debug with the nuraft team when they are available. Will create a ticket to keep track of this: #561
// Step 3. Append log entry to mark the old member is out and new member is added.
auto rreq = repl_req_ptr_t(new repl_req_ctx{});
replace_members_ctx members;
std::copy(member_in_uuid.begin(), member_in_uuid.end(), members.in.begin());
nit: we can't do member.in = std::move(members_in_uuid), can we?
We need a char array as we are serializing two members; members_in_uuid is a string, so we couldn't use it.
// Check if member_out exists in the group.
std::vector< nuraft::ptr< nuraft::srv_config > > configs_out;
m_msg_mgr.get_srv_config_all(m_group_id, configs_out);
If the out member is not found, shouldn't we just move ahead and add the in member? The reason is that we have retry logic, so we don't need to worry about which attempt actually moved the member out amid crashes in between; if an earlier attempt did manage to move the old member out, we should just ignore that the out member is not in the current group and add the in member.
Also, we should check that the in member is not already in the raft group. Since we have retry logic, the retry needs to be idempotent.
Yeah, we can do that. Instead of doing it here, we can depend on the raft server itself returning nuraft::cmd_result_code::SERVER_ALREADY_EXISTS for add and nuraft::cmd_result_code::SERVER_NOT_FOUND for remove, and ignore those errors.
Force-pushed from 8186675 to 8332bac.
When replacing a member, add the new member, sync the raft log for the replace, and finally remove the old member. Once we add the new member, baseline or incremental resync will start. Removing the old member will cause nuraft_mesg to exit the group, and we periodically GC the destroyed group. Made the repl dev base test common so that both test files can use it. Tests by default create a repl group with num_replicas. Dynamic tests create additional spare replicas which can be added to the test dynamically by calling replace member.
Force-pushed from 8332bac to afb64fc.
Testing