Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test(resharding): Adjust State mapping check for single shard tracking #12706

Merged
merged 7 commits into from
Jan 10, 2025

Conversation

staffik
Copy link
Contributor

@staffik staffik commented Jan 8, 2025

Unblocks #12691

Changes

  • Adjust check_state_shard_uid_mapping_after_resharding so that it can be run for a client that does not track all shards.
  • Run check_state_shard_uid_mapping_after_resharding for each client.
  • Slightly refactor (simplify) resharding test loop.

@staffik staffik requested a review from a team as a code owner January 8, 2025 23:05
Copy link
Contributor

@marcelo-gonzalez marcelo-gonzalez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to just leave comments instead of approving because I'm not super familiar with the state mapping code, so I would need to check it some more before understanding why the change here is allowed, but I'll come back to it tmr if it hasnt been approved yet

.get_prev_epoch_id_from_prev_block(&tip.prev_block_hash)
.unwrap();
let epoch_config = client.epoch_manager.get_epoch_config(&prev_epoch_id).unwrap();
let epoch_id =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also just use epoch_id() on the block_header variable stored above. Also, do we need this change? I mean I guess it's all the same anyway, but could just keep the old one since it's shorter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed in fa8e29e

client: &Client,
prev_block_hash: &CryptoHash,
) -> Vec<ShardUId> {
let account_id =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could avoid cloning:

diff --git a/integration-tests/src/test_loop/utils/sharding.rs b/integration-tests/src/test_loop/utils/sharding.rs
index 9db148659..5dab61245 100644
--- a/integration-tests/src/test_loop/utils/sharding.rs
+++ b/integration-tests/src/test_loop/utils/sharding.rs
@@ -123,14 +123,14 @@ pub fn get_tracked_shards_from_prev_block(
     client: &Client,
     prev_block_hash: &CryptoHash,
 ) -> Vec<ShardUId> {
-    let account_id =
-        client.validator_signer.get().map(|validator| validator.validator_id().clone());
+    let signer = client.validator_signer.get();
+    let account_id = signer.as_ref().map(|s| s.validator_id());
     let mut tracked_shards = vec![];
     for shard_uid in
         client.epoch_manager.get_shard_layout_from_prev_block(prev_block_hash).unwrap().shard_uids()
     {
         if client.shard_tracker.care_about_shard(
-            account_id.as_ref(),
+            account_id,
             prev_block_hash,
             shard_uid.shard_id(),
             true,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in fa8e29e

.iter()
.any(|child_shard_uid| shards_tracked_after_resharding.contains(child_shard_uid))
{
assert_eq!(shard_uid_mapping.len(), 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this makes me think of another test case to add: chunk producer tracks the parent and a child after reshrding, then gets assigned to an unrelated shard, then in the future state syncs the child again. then the state mapping will still exist but the state application in state sync will just write the child shard uid directly in the DB. Wonder what happens in that case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the child shard is no longer tracked doesn't the mapping get removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

State cleanup will deal with cleaning resharding mapping. For now, once mapping is set it will be used forever for the child.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this makes me think of another test case to add: chunk producer tracks the parent and a child after reshrding, then gets assigned to an unrelated shard, then in the future state syncs the child again. then the state mapping will still exist but the state application in state sync will just write the child shard uid directly in the DB. Wonder what happens in that case

I am pretty sure that was covered by shard shuffling test, but added test_resharding_v3_stop_track_child_for_2_epochs that would be helpful to test state cleanup.

@@ -485,56 +483,47 @@ fn test_resharding_v3_base(params: TestReshardingParameters) {

let client = clients[client_index];
let block_header = client.chain.get_block_header(&tip.last_block_hash).unwrap();
let shard_layout = client.epoch_manager.get_shard_layout(&tip.epoch_id).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objection from me on removing this and the print below, but I wonder if anybody was using these print statements while debugging?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I think those are good to have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't used them myself. We can add them back locally to debug if needed 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will leave on of them, because they duplicate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in fa8e29e

epoch_config.shard_layout.num_shards() as usize
);
}
// If any child shard was tracked after resharding, it means the node had to split the parent shard.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering what happens if the node tracks the parent both not any child

We do resharding for nothing? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, added a test in fa8e29e

Copy link
Contributor

@wacban wacban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -485,56 +483,47 @@ fn test_resharding_v3_base(params: TestReshardingParameters) {

let client = clients[client_index];
let block_header = client.chain.get_block_header(&tip.last_block_hash).unwrap();
let shard_layout = client.epoch_manager.get_shard_layout(&tip.epoch_id).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 I think those are good to have.

for shard_uid in
client.epoch_manager.get_shard_layout_from_prev_block(prev_block_hash).unwrap().shard_uids()
{
if client.shard_tracker.care_about_shard(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a bit of a mismatch between the function name and the implementation here.

The care_about_shard only checks if the client is a producer for given shard in this epoch. The client might also track a shard because it will be the producer in the next epoch or because it is configured to track all shards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

care_about_shard also checks tracked shards, if me is true:

 match self.tracked_config {
            TrackedConfig::AllShards => {
                // Avoid looking up EpochId as a performance optimization.
                true
            }
            _ => self.tracks_shard(shard_id, parent_hash).unwrap_or(false),
        }

@@ -350,17 +355,35 @@ pub fn check_state_shard_uid_mapping_after_resharding(
epoch_config.shard_layout.get_children_shards_uids(parent_shard_uid.shard_id()).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit though not your code:
It's more typical to just get the shard layout directly from the epoch manager instead of going through the epoch config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in fa8e29e

Comment on lines 346 to 347
prev_block_hash: &CryptoHash,
resharding_block_hash: &CryptoHash,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels strange to pass the prev_block_hash and resharding_block_hash, and get the tip from the client on the first line of this method. Mixing "pure"-ish function approach with "stateful" approach is asking for trouble.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in fa8e29e

Copy link

codecov bot commented Jan 9, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.66%. Comparing base (0dc410a) to head (5459d25).
Report is 8 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12706      +/-   ##
==========================================
+ Coverage   70.48%   70.66%   +0.18%     
==========================================
  Files         848      848              
  Lines      173673   173788     +115     
  Branches   173673   173788     +115     
==========================================
+ Hits       122418   122815     +397     
+ Misses      46143    45845     -298     
- Partials     5112     5128      +16     
Flag Coverage Δ
backward-compatibility 0.16% <ø> (-0.01%) ⬇️
db-migration 0.16% <ø> (?)
genesis-check 1.36% <ø> (?)
linux 69.22% <0.00%> (+0.08%) ⬆️
linux-nightly 70.26% <100.00%> (+<0.01%) ⬆️
pytests 1.66% <ø> (+1.49%) ⬆️
sanity-checks 1.47% <ø> (?)
unittests 70.50% <100.00%> (+0.01%) ⬆️
upgradability 0.20% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@Trisfald Trisfald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@staffik staffik enabled auto-merge January 10, 2025 10:19
@staffik staffik force-pushed the stafik/resharding/check-mapping branch from 7f6d7bb to e9a8b26 Compare January 10, 2025 13:14
@staffik staffik force-pushed the stafik/resharding/check-mapping branch from e9a8b26 to 73d28fd Compare January 10, 2025 13:43
}

#[test]
fn test_resharding_v3_stop_track_child_for_2_epochs() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@@ -692,6 +747,8 @@ fn test_resharding_v3_double_sign_resharding_block() {
}

#[test]
// TODO(resharding): fix nearcore and un-ignore this test
#[ignore]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened for test_resharding_v3_shard_shuffling and test_resharding_v3_shard_shuffling_intense, which change made them fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Increasing test duration, looks like there is some bug, will describe soon.

@staffik staffik disabled auto-merge January 10, 2025 14:31
@Trisfald Trisfald self-requested a review January 10, 2025 15:01
@Trisfald
Copy link
Contributor

At this stage of the PR we have new failing tests, and different options how to approach this, for instance:

  1. keep the tests ignored and fix later
  2. somehow use shorter test duration for some tests to avoid failures, and investigate further in another task
  3. revert to use shorter test duration everywhere (it it makes sense)
  4. fix all issues in this PR

Personally the only one I don't like much is 1.

@staffik
Copy link
Contributor Author

staffik commented Jan 10, 2025

At this stage of the PR we have new failing tests, and different options how to approach this, for instance:

  1. keep the tests ignored and fix later
  2. somehow use shorter test duration for some tests to avoid failures, and investigate further in another task
  3. revert to use shorter test duration everywhere (it it makes sense)
  4. fix all issues in this PR

Personally the only one I don't like much is 1.

Reverted ignores, as there are at least 4 failing tests and I do not want all of them not being run in other PRs.
However, needed to ignore the new test test_resharding_v3_stop_track_child_for_2_epochs as it requires longer test duration.
It looks there is a bug unrelated to this PR, that only shows up if we run tests for a few epochs longer. Described on zulip.

@marcelo-gonzalez Do you think the pattern of tracked shards that I posted on zulip could make state sync into trouble?

UPDATE: maybe never mind, test_resharding_v3_storage_operations failed too and it does not have shard shuffling enabled.

@staffik staffik added this pull request to the merge queue Jan 10, 2025
Merged via the queue into master with commit 1885afa Jan 10, 2025
28 checks passed
@staffik staffik deleted the stafik/resharding/check-mapping branch January 10, 2025 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants