Add support for raft repl dev replace member. #546
Conversation
// Step 2. Add the new member.
return m_msg_mgr.add_member(m_group_id, member_in_uuid)
Add the member, append the log entry, and remove the old member.
RD_LOGI("Raft repl replace_member member_out={} member_in={}", boost::uuids::to_string(member_out),
        boost::uuids::to_string(member_in));

m_listener->replace_member(member_out, member_in);
homeobject or the listener can use this to update their PG metablks, removing the old member and adding the new one.
@@ -134,11 +136,12 @@ class HSReplTestHelper : public HSTestHelper {
    HSReplTestHelper(std::string const& name, std::vector< std::string > const& args, char** argv) :
            name_{name}, args_{args}, argv_{argv} {}

-   void setup() {
+   void setup(uint32_t num_replicas) {
setup now takes the number of replicas to start.
@@ -0,0 +1,629 @@
/*********************************************************************************
Moved common code from test_raft_repl_dev.cpp to raft_test_base.hpp. No change in code. Just moved the classes.
LOGINFOMOD(replication, "[Replica={}] Read logical snapshot callback obj_id={} term={} idx={} num_items={}",
           g_helper->replica_num(), snp_data->offset, s->get_last_log_term(), s->get_last_log_idx(),
           kv_snapshot_data.size());
#include "test_common/raft_repl_test_base.hpp"
Moved this code to the raft_repl_test_base.hpp header file.
@@ -0,0 +1,132 @@
/*********************************************************************************
This test creates a repl dev group with 3 replicas by default and then adds spare replicas. We couldn't share common code because the test_raft_repl_dev test cases assume all replicas are part of the group from the beginning.
@sanebay thanks for the patch!
We need to take leader change into account; please see my comments and correct me if I am wrong.
LOGINFO("Replace member added member={} to group_id={}", member_in, group_id_str());

// Step 3. Append log entry to mark the old member is out and new member is added.
If the leader changes after line 140 and before line 147, then who can send a log entry to mark the old member out and remove member_out from this group?
Should we separate replace_member into two calls: add_member and remove_member?
Yeah, valid scenario. In that case the log entry append will fail at line 157, and CM will retry with the latest leader. That's the flow we have decided with CM. It should be safe to send the same message to a different leader.
> It should be safe to send the same message to a different leader.

If the leader changes after propose_to_raft but before m_msg_mgr.rem_member, then it will return ReplServiceError::RETRY_REQUEST.

Now, if CM resends the request to the new leader, I think add_member will succeed on the new leader since the member_in was already added to the group by the previous leader. The next step is propose_to_raft, which will create another HS_CTRL_REPLACE log and commit it.

So HS_CTRL_REPLACE will be committed twice at the other follower: once from the previous leader and once from the current leader. We should make it safe to commit HS_CTRL_REPLACE twice (or make committing HS_CTRL_REPLACE idempotent) for the same member_out and member_in, so we need to add some checks in m_listener#replace_member(member_out, member_in).

So I think it will basically be safe, but we should add some code to make committing HS_CTRL_REPLACE idempotent.
More for CM: if the new leader is the member_in, we haven't responded to the call yet.

We get the leader (in nuraft_mesg) from raft, which is always up to date, and respond to CM.

Wondering if CM expects a HB from a to-be-added SM, or expects a retry against a to-be-added SM.
Idempotency of HS_CTRL_REPLACE is the same as create pg or shard. Gateway can send the request to the leader; if the leader does the raft round but crashes and restarts before returning success to the gateway, the gateway will send to the same leader again and we might create the pg again. It's a NOP in HO: in HO's add_replace function, if the in and out members are already applied, it's a no-op.
We may need add_member and remove_member if we do expansion/shrink. Our use case is only replace, and we want to update the homeobject PG metablk in a single write atomically. The member out will have its PG metablk with the destroyed flag enabled. GC will use this to delete all blobs for that PG.
Do we need GC for the member out? I think if we remove it from the group, we can just re-format it with homestore, then it will be clean, no?
.via(&folly::InlineExecutor::instance())
.thenValue([this, member_in_uuid, member_out_uuid](auto&& e) -> AsyncReplResult<> {
    // TODO Currently we ignore the cancelled, fix nuraft_mesg to not timeout
    // when adding member. Member is added to cluster config until member syncs fully
Can we change the behavior? This is very odd, mixing member change with data syncing.
This is nuraft behaviour. I had the same doubt. This is what Jungsang shared:
"""
- there are three servers: s1, s2, s3. So quorum size is 2.
- let's say s3 is lagging behind, but still s1 can make a commit, by forming quorum {s1, s2}.
- now new member s4 joins the cluster.
- if s4 joins the cluster immediately without catch-up, it is also lagging behind.
- since it joins the cluster immediately, now quorum size becomes 3.
- now s1 cannot make a commit, no matter which member is in the quorum: {s1, s2, s3}, {s1, s2, s4}.
"""
This scenario makes sense.
Another question: do we have a plan for separate add_member and rem_member? For example, if we have a very heavy read workload for one pg, we may want to add some new members to take over part of the read workload, so we only need add_member.
@JacksonYao287 we don't want that as of now. We don't have the flexibility to control the replication size of each PG in the control plane. Let's revisit if we have a solid use case.
@sanebay that makes sense from the raft point of view, but it makes the cm->sm protocol very tricky. Once the leader SM reboots, what would be the state of the in-flight add_srv? Will it continue till success or be discarded? Who should take the responsibility to respond to CM?
Is there a possibility that we can add the new member as a learner and promote it to participant once it catches up?
Also, for a failed/forgotten add_srv, who should take the responsibility to reclaim space? If CM retries with the same PG to the same SM, I believe baseline resync will erase and re-do. If CM attempts another PG to this SM, I am not sure how we handle this --- linking to the fixed pg size design @JacksonYao287
> @sanebay that makes sense from the raft point of view, but it makes the cm->sm protocol very tricky. Once the leader SM reboots, what would be the state of the in-flight add_srv? Will it continue till success or be discarded? Who should take the responsibility to respond to CM? Is there a possibility that we can add the new member as a learner and promote it to participant once it catches up?

add_srv internally does a cluster config change with a log entry, so it needs a commit; it is treated the same as an append-entry log. If it committed and the leader crashed, CM can send again and we return an error that the server already exists. If it didn't commit, it will be rolled back once another leader is elected. I haven't explored the learner option.

> also for a failed/forgotten add_srv, who should take the responsibility to reclaim space?

No; as mentioned, add_srv should fail with the error SERVER_ALREADY_EXISTS. As per the design, if the new member fails after some time, manual intervention is needed and a new PG move is needed (https://github.com/eBay/NuRaft/blob/master/src/handle_join_leave.cxx#L68).
> also for a failed/forgotten add_srv, who should take the responsibility to reclaim space? If CM retries with the same PG to the same SM, I believe baseline resync will erase and re-do. If CM attempts another PG to this SM, I am not sure how we handle this.

Yes, a failed add_srv will leave some stale data on the disk of member_in; it is caused by the snapshot taken before the config change is committed.
I have gone through the raft code. I will add the solution in the doc.
One SM could have multiple PGs on the same disk. We could move out some PGs for space issues or load balance.
The code generally looks good to me. Please resolve the conflict.
Force-pushed from 0870f72 to e98bb96.
Codecov Report

@@ Coverage Diff @@
## master #546 +/- ##
===========================================
+ Coverage 56.51% 67.27% +10.76%
===========================================
Files 108 109 +1
Lines 10300 10448 +148
Branches 1402 1408 +6
===========================================
+ Hits 5821 7029 +1208
+ Misses 3894 2743 -1151
- Partials 585 676 +91
Force-pushed from e98bb96 to 8186675.
@@ -93,7 +93,10 @@ void RaftReplService::start() {
    .with_hb_interval(HS_DYNAMIC_CONFIG(consensus.heartbeat_period_ms))
    .with_max_append_size(HS_DYNAMIC_CONFIG(consensus.max_append_batch_size))
    .with_log_sync_batch_size(HS_DYNAMIC_CONFIG(consensus.log_sync_batch_size))
    // TODO fix the log_gap thresholds when adding new member.
Shared logs with the nuraft team, but we couldn't find the root cause immediately. Will have to debug with the nuraft team when they are available. Will create a ticket to keep track of this: #561
// Step 3. Append log entry to mark the old member is out and new member is added.
auto rreq = repl_req_ptr_t(new repl_req_ctx{});
replace_members_ctx members;
std::copy(member_in_uuid.begin(), member_in_uuid.end(), members.in.begin());
nit: we can't do member.in = std::move(members_in_uuid), can we?
We need a char array as we are serializing two members; members_in_uuid is a string, so we couldn't use it.
// Check if member_out exists in the group.
std::vector< nuraft::ptr< nuraft::srv_config > > configs_out;
m_msg_mgr.get_srv_config_all(m_group_id, configs_out);
If the out member is not found, shouldn't we just move ahead and add the in member? The reason is that we have retry logic, so we don't need to worry about which attempt actually moved the member out amid crashes in between; if an earlier attempt did manage to move the old member out, we should just ignore that the out member is not in the current group and add the in member.
Also, we should check that the in member is not already in the raft group. Since we have retry logic, the retry needs to be idempotent.
Yeah, we can do that. Instead of doing it here, we can depend on the raft server itself returning nuraft::cmd_result_code::SERVER_ALREADY_EXISTS for add and nuraft::cmd_result_code::SERVER_NOT_FOUND for remove, and ignore those errors.
Force-pushed from 8186675 to 8332bac.
When replacing a member, add the new member, sync the raft log for the replace, and finally remove the old member. Once we add the new member, baseline or incremental resync will start. Removing the old member will cause nuraft_mesg to exit the group, and we periodically GC the destroyed group. Made the repl dev base test common so that both test files can use it. Tests by default create a repl group with num_replicas. Dynamic tests create additional spare replicas which can be added to the test dynamically by calling replace member.
Force-pushed from 8332bac to afb64fc.
Testing