Issue: 257 Replication Fetch Remote Data #290
Conversation
auto const& lsn = req->lsn();
auto const& term = req->raft_term();
auto const& dsn = req->dsn();
auto const& header = req->user_header();
auto const& key = req->user_key();
I don't see where these are used?
These are for the resync-from-any-member use case, which will come in a follow-up PR.
MultiBlkId local_blkid;

// convert remote_blkid serialized data to local blkid
local_blkid.deserialize(sisl::blob{remote_blkid->Data(), remote_blkid->size()}, true /* copy */);
Should we validate local_blkid against remote_blkid, like below, to guard against abnormal cases?

if (local_size > remote_size) {
They should be strictly the same size, since this is the deserialize/serialize provided by MultiBlkId itself. There are unit test cases covering these APIs in MultiBlkId.
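For reference, a defensive check along the lines the reviewer suggests could look like the sketch below. This is a minimal sketch: `max_serialized_size()` is an assumed accessor name for illustration, not a confirmed MultiBlkId API.

```cpp
// Hypothetical defensive check before deserializing. max_serialized_size()
// is an assumed accessor; the real MultiBlkId API may differ.
sisl::blob const remote_buf{remote_blkid->Data(), remote_blkid->size()};
if (remote_buf.size() > MultiBlkId::max_serialized_size()) {
    LOGERROR("remote blkid blob of {} bytes exceeds the expected maximum, rejecting request",
             remote_buf.size());
    return; // or propagate an error to the caller
}
MultiBlkId local_blkid;
local_blkid.deserialize(remote_buf, true /* copy */);
```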
}

// copy by value to avoid a dangling reference, since it is on the stack;
rpc_data->set_comp_cb([sgs_vec](boost::intrusive_ptr< sisl::GenericRpcData >&) {
If an error happens when sending the response, we also need to release the buffer, right?
If somehow the underlying gRPC connection (which is a TCP connection) gets stuck and the completion callback can't be triggered, all connections will be stuck, causing either a timeout or a reboot of this replica. Is this what you are concerned about? We can discuss this as part of the "error handling" strategy as well.
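One way to make the release unconditional is to free the buffers inside the completion callback itself. A minimal sketch, assuming sgs_vec holds sisl::sg_list entries whose iovec bases were allocated through iomgr, and assuming the completion callback is also invoked on failure paths (worth confirming in sisl):

```cpp
// Sketch: release the captured sg buffers exactly once when the rpc
// completes, regardless of whether the response send succeeded.
// Assumes the iovec bases were allocated via iomanager.iobuf_alloc().
rpc_data->set_comp_cb([sgs_vec](boost::intrusive_ptr< sisl::GenericRpcData >&) {
    for (auto const& sgs : sgs_vec) {
        for (auto const& iov : sgs.iovs) {
            iomanager.iobuf_free(reinterpret_cast< uint8_t* >(iov.iov_base));
        }
    }
});
```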
ctx->cv.notify_one();
});
} else {
// TODO: if we are not the originator, we need to fetch based on lsn;
Do we have an LSN-to-MultiBlkId map? I think this needs to be kept in the meta service or the index table, since the log might be purged by compaction.
As we discussed, there will be a new API on the listener to query the index_table from HomeObject and return the blkids. We don't rely on logs.
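A rough shape of that listener hook might look like the following. The method name and signature are assumptions for illustration only, since the actual API is slated for a follow-up PR:

```cpp
// Hypothetical listener extension: resolve local blkids from the
// application's index table instead of (possibly compacted) raft logs.
// Name and signature are illustrative, not the final API.
class ReplDevListener {
public:
    virtual ~ReplDevListener() = default;

    // Given the user key from the journal entry, return the blkids the
    // application (e.g. HomeObject) has recorded in its index table.
    virtual folly::Expected< MultiBlkId, ReplServiceError >
    get_blkid_for_key(sisl::blob const& user_key) = 0;

    // ... existing listener methods elided ...
};
```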
// if in resync mode, fetch data from remote immediately;
check_and_fetch_remote_data(std::move(rreqs));
check_and_fetch_remote_data(rreqs);
Are line 488 and line 490 the same?
Yes, there is some problem enabling the non-recursive timer. I will have to work on that and enable it in a coming PR.
if (rreqs->size()) {
// Some data not completed yet, let's fetch from remote;
fetch_data_from_remote(rreqs);
Right now we do check_and_fetch_remote_data right after the log append. Why can't we delay this action until we actually need the data, or use a periodic check thread to do it? The push_data request may still be on the way.
As indicated above, this is exactly what this PR is trying to do, e.g. we need the timer to wait a configurable number of seconds before we fetch from the remote.
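A minimal sketch of that deferred fetch, using iomgr's schedule_thread_timer(); the config knob data_fetch_wait_ms and the is_data_received() check on the request are illustrative names, not confirmed APIs:

```cpp
// Sketch: after log append, wait a configurable interval before fetching,
// so an in-flight push_data has a chance to land first.
// data_fetch_wait_ms and is_data_received() are illustrative names.
void RaftReplDev::schedule_fetch_if_needed(std::vector< repl_req_ptr_t > rreqs) {
    uint64_t const wait_ns = HS_DYNAMIC_CONFIG(consensus.data_fetch_wait_ms) * 1000ul * 1000ul;
    iomanager.schedule_thread_timer(wait_ns, false /* recurring */, nullptr /* cookie */,
                                    [this, rreqs = std::move(rreqs)](void*) {
        std::vector< repl_req_ptr_t > pending;
        for (auto const& rreq : rreqs) {
            // push_data may have arrived while we were waiting
            if (!rreq->is_data_received()) { pending.push_back(rreq); }
        }
        if (!pending.empty()) { fetch_data_from_remote(std::move(pending)); }
    });
}
```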
continue;
} else {
// acquire lock here to avoid two threads trying to do the same thing;
std::unique_lock< std::mutex > lg(rreq->state_mtx);
thenValue will be executed on a single thread, right? And after each log batch append, the requests are also different.
So do we really need the lock here?
You are right about this context. The lock is to protect against the raft log channel and the data channel both allocating blks for the same request (in case, for some reason, data arrives after this fetch-remote is sent).
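The race the author describes can be sketched as a check-then-act transition under the request's state mutex, inside the per-request loop; the blk_allocated flag and do_alloc_blk() helper below are assumed names for illustration:

```cpp
// Sketch: either the data channel (push_data) or the fetch path may reach
// this point first, so the allocation is made idempotent under state_mtx.
// blk_allocated and do_alloc_blk() are illustrative names.
{
    std::unique_lock< std::mutex > lg(rreq->state_mtx);
    if (rreq->blk_allocated) {
        continue; // the other channel already allocated blks for this rreq
    }
    rreq->local_blkid = do_alloc_blk(rreq->value_size);
    rreq->blk_allocated = true;
}
```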
@@ -122,6 +122,8 @@ repl_req_ptr_t RaftStateMachine::transform_journal_entry(nuraft::ptr< nuraft::lo
sisl::blob const key = sisl::blob{header.cbytes() + header.size(), jentry->key_size};
DEBUG_ASSERT_GT(jentry->value_size, 0, "Entry marked as large data, but value size is notified as 0");

LOGINFO("Received Raft server_id={}, term={}, dsn={}, journal_entry=[{}] ", jentry->server_id, lentry->get_term(),
nit: Do we need this trace? We already have a trace at line 116.
There is some additional info in this log; I will merge them in my coming PR.
Codecov Report

Attention: ❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

@@            Coverage Diff             @@
##           master     #290      +/-   ##
==========================================
- Coverage   62.60%   58.93%   -3.67%
==========================================
  Files         110      110
  Lines        8990     9254     +264
  Branches     1156     1191      +35
==========================================
- Hits         5628     5454     -174
- Misses       2829     3269     +440
+ Partials      533      531       -2

View full report in Codecov by Sentry.
LGTM. thanks for the work!