Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue: 257 Replication Fetch Remote Data #290

Merged
merged 16 commits into from
Feb 1, 2024
Merged

Issue: 257 Replication Fetch Remote Data #290

merged 16 commits into from
Feb 1, 2024

Conversation

yamingk
Copy link
Contributor

@yamingk yamingk commented Jan 26, 2024

Changes made:

  1. on follower, register FETCH_DATA to nuraft_mesg and add fetch remote data logic.
  2. on leader side, handle fetch request and read data based on received remote blkid if this leader is the originator.
  3. on follower, write the received data into drive and mark data written and trigger promise to wake up batch_append to moving forward.
  4. introduce flip to simulate data not received from data channel (ingore data written) to trigger a fetch remote data.
  5. added new test case to trigger fetch remote data

Testing:

  1. unit test passed.
  2. regression test passed.

@yamingk yamingk linked an issue Jan 26, 2024 that may be closed by this pull request
Comment on lines +156 to +160
auto const& lsn = req->lsn();
auto const& term = req->raft_term();
auto const& dsn = req->dsn();
auto const& header = req->user_header();
auto const& key = req->user_key();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i do not find where are them used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are for the use of resync from any member purpose which will comes in next

MultiBlkId local_blkid;

// convert remote_blkid serialized data to local blkid
local_blkid.deserialize(sisl::blob{remote_blkid->Data(), remote_blkid->size()}, true /* copy */);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check local_blkid and remote_blkid like below to avoid some abnormal case?

if (local_size > remote_size) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should be strictly same size, since this is a deserialize/serialize which provided by MultiBlkId itself. There is unit test case cover these apis in MultiBlkId.

}

// copy by value to avoid since it is on stack;
rpc_data->set_comp_cb([sgs_vec](boost::intrusive_ptr< sisl::GenericRpcData >&) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if error happens when sending response, we also need to release the buffer, right?

Copy link
Contributor Author

@yamingk yamingk Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if somehow the underlying grpc connection (which is a TCP connection) is stuck, causing comp cb can't be triggered, all connections will be stuck causing either a timeout or a reboot of this replica. Is this what you are concerning? We can discuss in "error handling" strategy as well for this as well

ctx->cv.notify_one();
});
} else {
// TODO: if we are not the originator, we need to fetch based on lsn;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a lsn to multiblkId map? i think this need to be kept in metaservice or index table, since log might be purged when compaction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we discussed, there will be a new api in listener to query index_table from homeObj to return the blkids. We don't rely on logs.

// if in resync mode, fetch data from remote immediately;
check_and_fetch_remote_data(std::move(rreqs));
check_and_fetch_remote_data(rreqs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is line 488 and line 490 the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, there is some problem enabling the non-recursive timer. I will have to work on them and enable them in comming PR.


if (rreqs->size()) {
// Some data not completed yet, let's fetch from remote;
fetch_data_from_remote(rreqs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now, we do check_and_fetch_remote_data when after the log append. why can not we delay this action until we need it or using some periodical check thread to do? maybe the push_data requset is on the way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As indicated above, this the exactly what this PR is trying to do. e.g. we need the timer to wait for a configurable time of seconds before we fetch from remote.

continue;
} else {
// aquire lock here to avoid two threads are trying to do the same thing;
std::unique_lock< std::mutex > lg(rreq->state_mtx);
Copy link
Contributor

@JacksonYao287 JacksonYao287 Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thenValue will be executed in a single thread, right? and after each log batch append, the request are also different.
so , do we really need the lock here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right about here in this context. The lock is to protect from raft log channel and data channel that could also be allocating blks (in case for some reason data arrives after this fetch remote is sent)

@@ -122,6 +122,8 @@ repl_req_ptr_t RaftStateMachine::transform_journal_entry(nuraft::ptr< nuraft::lo
sisl::blob const key = sisl::blob{header.cbytes() + header.size(), jentry->key_size};
DEBUG_ASSERT_GT(jentry->value_size, 0, "Entry marked as large data, but value size is notified as 0");

LOGINFO("Received Raft server_id={}, term={}, dsn={}, journal_entry=[{}] ", jentry->server_id, lentry->get_term(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we need this trace. as we have trace in line 116

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is some additional info in this log, I will merge them in my comming PR>

@codecov-commenter
Copy link

Codecov Report

Attention: 316 lines in your changes are missing coverage. Please review.

Comparison is base (64600e9) 62.60% compared to head (73b223a) 58.93%.
Report is 2 commits behind head on master.

Files Patch % Lines
src/lib/replication/repl_dev/raft_repl_dev.cpp 0.00% 195 Missing ⚠️
src/lib/replication/service/raft_repl_service.cpp 0.00% 35 Missing ⚠️
...rc/lib/replication/repl_dev/raft_state_machine.cpp 0.00% 24 Missing ⚠️
src/lib/replication/service/generic_repl_svc.cpp 0.00% 14 Missing ⚠️
src/include/homestore/superblk_handler.hpp 53.57% 11 Missing and 2 partials ⚠️
src/lib/homestore.cpp 58.82% 4 Missing and 3 partials ⚠️
.../lib/replication/log_store/home_raft_log_store.cpp 0.00% 7 Missing ⚠️
src/lib/replication/repl_dev/solo_repl_dev.cpp 0.00% 7 Missing ⚠️
src/include/homestore/replication/repl_dev.h 0.00% 5 Missing ⚠️
src/lib/logstore/log_store_family.cpp 80.00% 3 Missing and 1 partial ⚠️
... and 4 more

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #290      +/-   ##
==========================================
- Coverage   62.60%   58.93%   -3.67%     
==========================================
  Files         110      110              
  Lines        8990     9254     +264     
  Branches     1156     1191      +35     
==========================================
- Hits         5628     5454     -174     
- Misses       2829     3269     +440     
+ Partials      533      531       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

JacksonYao287
JacksonYao287 previously approved these changes Feb 1, 2024
Copy link
Contributor

@JacksonYao287 JacksonYao287 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. thanks for the work!

@yamingk yamingk merged commit 42cdddb into eBay:master Feb 1, 2024
20 checks passed
@yamingk yamingk deleted the yk_repl_fetch_remote branch February 1, 2024 01:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Resync] Fetch remote data
4 participants