Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass instant from aptosdb for calculating latency metric #15678

Merged
merged 1 commit into from
Jan 24, 2025

Conversation

areshand
Copy link
Contributor

@areshand areshand commented Jan 6, 2025

Description

How Has This Been Tested?

Key Areas to Review

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation

Copy link

trunk-io bot commented Jan 6, 2025

⏱️ 2h 10m total CI duration on this PR
Job Cumulative Duration Recent Runs
rust-targeted-unit-tests 48m 🟥
rust-move-tests 14m 🟩
rust-move-tests 14m 🟩
rust-move-tests 13m 🟩
rust-cargo-deny 10m 🟩🟩🟩🟩🟩 (+1 more)
rust-move-tests 9m
check-dynamic-deps 8m 🟩🟩🟩🟩🟩 (+1 more)
rust-move-tests 8m
general-lints 3m 🟩🟩🟩🟩🟩 (+1 more)
semgrep/ci 2m 🟩🟩🟩🟩🟩 (+1 more)
file_change_determinator 1m 🟩🟩🟩🟩🟩 (+1 more)
permission-check 17s 🟩🟩🟩🟩🟩 (+1 more)
permission-check 16s 🟩🟩🟩🟩🟩 (+1 more)

🚨 1 job on the last run was significantly faster/slower than expected

Job Duration vs 7d avg Delta
check-dynamic-deps 2m 1m +107%

settingsfeedbackdocs ⋅ learn more about trunk.io

@areshand areshand force-pushed the pass_instant_latency branch 3 times, most recently from 0988f9c to 5f9b440 Compare January 7, 2025 05:12
@@ -166,31 +166,32 @@ impl InternalIndexerDBService {

pub async fn run(&mut self, node_config: &NodeConfig) -> Result<()> {
let mut start_version = self.get_start_version(node_config).await?;
let mut target_version = self.db_indexer.main_db_reader.ensure_synced_version()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will error out with an empty db (before genesis is put in), is that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, internal indexer is supposed to start after main db bootstrapped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay..

log_grpc_step(
SERVICE_TYPE,
IndexerGrpcStep::InternalIndexerDBProcessed,
Some(start_version as i64),
Some(next_version as i64),
None,
None,
Some(start_time.elapsed().as_secs_f64()),
Some(step_timer.elapsed().as_secs_f64()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue with yesterday, that before it's caught up you already logged the latency, which is not accurate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each loop now is blocked on notification of write to main db. Previously, each loop is only a batch of all updates. so this should reflect the latency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

@areshand areshand force-pushed the pass_instant_latency branch 2 times, most recently from 1db3ecd to cc5d099 Compare January 10, 2025 23:59
@areshand areshand requested a review from msmouse January 13, 2025 18:52
@areshand areshand force-pushed the pass_instant_latency branch 2 times, most recently from 16ad465 to 072b5ff Compare January 13, 2025 22:43
log_grpc_step(
SERVICE_TYPE,
IndexerGrpcStep::InternalIndexerDBProcessed,
Some(start_version as i64),
Some(next_version as i64),
None,
None,
Some(start_time.elapsed().as_secs_f64()),
Some(step_timer.elapsed().as_secs_f64()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

@@ -166,31 +166,32 @@ impl InternalIndexerDBService {

pub async fn run(&mut self, node_config: &NodeConfig) -> Result<()> {
let mut start_version = self.get_start_version(node_config).await?;
let mut target_version = self.db_indexer.main_db_reader.ensure_synced_version()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay..

Comment on lines +178 to +180
Err(e) => {
panic!("Failed to get update from update_receiver: {}", e);
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how this thread knows the main db has quit, right? In that case we should return Ok instead of suicide here?

And I realized even in the previous logic, this thread doesn't have a chance to quit until hitting the target version? It can be an issue when the first time the indexer is enabled? (granted we usually quit the maindb by killing the whole process, it'd be better if we deal with this case gracefully, if not too complicated.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. the DBindexer should have Arc of main_db. I expect the err should never be caused by main_db quit. (The main db quits after db indexer being dropped). Thus, I think it should panic if this unexpected error occurs.
  2. if the process is killed while the indexer is in the middle of processing a large amount of data. the thread is terminated the same way as other components reading main_db eg: API is reading transactions using storage interface and the node is killed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh, in that case the indexer loop never quits? That should be an issue in tests?

Shall we implement graceful quitting (separately)? @grao1991 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't disagree.

Copy link
Contributor Author

@areshand areshand Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a method called "run_with_end_version" for testing

@areshand
Copy link
Contributor Author

@grao1991 please take a look as well ?

@areshand areshand enabled auto-merge (rebase) January 16, 2025 18:44

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@areshand areshand force-pushed the pass_instant_latency branch from 4f6258c to 17b78fb Compare January 21, 2025 16:47

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@areshand areshand force-pushed the pass_instant_latency branch from 17b78fb to ed2b26d Compare January 23, 2025 18:34

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

@areshand areshand force-pushed the pass_instant_latency branch from ed2b26d to 1eeadc7 Compare January 24, 2025 04:22
@areshand areshand disabled auto-merge January 24, 2025 04:22

This comment has been minimized.

This comment has been minimized.

Copy link
Contributor

✅ Forge suite compat success on d15fc969c89551a1461d931d327b8d4dbfb2f814 ==> 1eeadc71037ceb57e31929187640e2b976c72a2a

Compatibility test results for d15fc969c89551a1461d931d327b8d4dbfb2f814 ==> 1eeadc71037ceb57e31929187640e2b976c72a2a (PR)
1. Check liveness of validators at old version: d15fc969c89551a1461d931d327b8d4dbfb2f814
compatibility::simple-validator-upgrade::liveness-check : committed: 12429.07 txn/s, latency: 2595.52 ms, (p50: 2500 ms, p70: 2900, p90: 3300 ms, p99: 4700 ms), latency samples: 407620
2. Upgrading first Validator to new version: 1eeadc71037ceb57e31929187640e2b976c72a2a
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 4298.44 txn/s, latency: 7251.60 ms, (p50: 8100 ms, p70: 8700, p90: 9000 ms, p99: 9100 ms), latency samples: 92480
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 4339.81 txn/s, latency: 7822.41 ms, (p50: 8800 ms, p70: 8900, p90: 9100 ms, p99: 9300 ms), latency samples: 152720
3. Upgrading rest of first batch to new version: 1eeadc71037ceb57e31929187640e2b976c72a2a
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 4513.44 txn/s, latency: 6814.21 ms, (p50: 7600 ms, p70: 8200, p90: 8400 ms, p99: 8600 ms), latency samples: 94640
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 4500.07 txn/s, latency: 7553.69 ms, (p50: 8400 ms, p70: 8500, p90: 8800 ms, p99: 9000 ms), latency samples: 158740
4. upgrading second batch to new version: 1eeadc71037ceb57e31929187640e2b976c72a2a
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 7622.51 txn/s, latency: 3997.73 ms, (p50: 4500 ms, p70: 4900, p90: 5500 ms, p99: 5700 ms), latency samples: 141360
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 7708.76 txn/s, latency: 4464.20 ms, (p50: 4700 ms, p70: 5100, p90: 5400 ms, p99: 5700 ms), latency samples: 256560
5. check swarm health
Compatibility test for d15fc969c89551a1461d931d327b8d4dbfb2f814 ==> 1eeadc71037ceb57e31929187640e2b976c72a2a passed
Test Ok

Copy link
Contributor

✅ Forge suite realistic_env_max_load success on 1eeadc71037ceb57e31929187640e2b976c72a2a

two traffics test: inner traffic : committed: 14483.71 txn/s, latency: 2737.60 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3300 ms), latency samples: 5507040
two traffics test : committed: 99.98 txn/s, latency: 1461.99 ms, (p50: 1400 ms, p70: 1500, p90: 1600 ms, p99: 1600 ms), latency samples: 1780
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.536, avg: 1.471", "ConsensusProposalToOrdered: max: 0.295, avg: 0.294", "ConsensusOrderedToCommit: max: 0.413, avg: 0.399", "ConsensusProposalToCommit: max: 0.707, avg: 0.692"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.73s no progress at version 23450 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.92s no progress at version 6235486 (avg 0.81s) [limit 16].
Test Ok

@areshand areshand merged commit 7ca6db3 into main Jan 24, 2025
46 checks passed
@areshand areshand deleted the pass_instant_latency branch January 24, 2025 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants