
investigate discrepancy in expected number of uploaded block parts per P2P connection #1138

Open

staheri14 opened this issue Nov 15, 2023 · 10 comments

@staheri14 (Contributor) commented Nov 15, 2023

Previous experiments have revealed that the number of uploaded block parts per P2P connection exceeds the expected value of 1, which contradicts our initial analysis. This issue aims to investigate the root cause of this misalignment.

Addressing this issue is a prerequisite for #1134, as clarifying the expected traffic from block part propagation is essential to evaluating the performance optimization proposed in #1134.

The results of those experiments are provided below for convenience.


After examining the InfluxDB data points concerning block part exchanges for a node connected to mocha-4, I have compiled findings from a 1-hour period covering approximately 300 blocks.

[Plots: downloaded block parts per P2P connection; uploaded block parts per P2P connection]

Observations: The plots indicate that the behavior for downloaded block parts aligns with expectations: each block part index appears only once per P2P connection for every block height and round. However, the pattern diverges for uploaded block parts: multiple instances of the same block part index were uploaded during the same height and round over a single P2P connection, contrary to the expected behavior. I will investigate this further and share my findings.

@staheri14 self-assigned this on Nov 15, 2023
@staheri14 changed the title from "uploaded block parts exceed the expected number" to "investigate discrepancy in expected number of uploaded block parts per P2P connection" on Nov 15, 2023
@staheri14 (Contributor Author) commented:

One theory regarding this discrepancy was that the peer state might not have been properly updated during the send and receive operations, resulting in multiple uploads of the same part. That concern appeared to be addressed by adding a deliberate sleep period in a PR from cometbft. However, based on my initial investigation of that PR's changes and description, the added sleep seems intended to allow other update messages (sent over other channels) to arrive and take effect. This differs from the issue at hand, where block parts sent by a node to another peer do not appear to be properly recorded in the peer state that the node maintains locally.
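
To make the suspected failure mode concrete, here is a minimal, self-contained toy sketch (not celestia-core code; all names and types are illustrative) of the send-then-record pattern described above. A part is skipped on later passes only if the local peer state remembers that it was already sent, so any lost or missing bookkeeping would directly produce duplicate uploads over the same connection.

```go
package main

import "fmt"

// peerState is a toy stand-in for the per-peer bookkeeping kept by the
// consensus reactor: hasPart records which block parts we believe the
// peer already has for the current height/round.
type peerState struct {
	height, round int64
	hasPart       map[int]bool
}

func (ps *peerState) setHasPart(index int) { ps.hasPart[index] = true }

// pickPartToSend returns an index of a part we hold that the peer is
// (believed to be) still missing.
func pickPartToSend(ourParts []bool, ps *peerState) (int, bool) {
	for i, have := range ourParts {
		if have && !ps.hasPart[i] {
			return i, true
		}
	}
	return 0, false
}

func main() {
	ourParts := []bool{true, true, true} // we hold parts 0, 1, and 2
	ps := &peerState{height: 10, round: 0, hasPart: map[int]bool{}}

	for {
		idx, ok := pickPartToSend(ourParts, ps)
		if !ok {
			break // the peer has everything we have
		}
		fmt.Println("upload part", idx) // stand-in for sending over the data channel
		ps.setHasPart(idx)              // if this record were lost, idx would be uploaded again
	}
}
```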

@staheri14 (Contributor Author) commented Nov 17, 2023

My initial suspicion is that the peer state is asynchronously reset (due to messages exchanged over the state channel), resulting in the loss of the history of block parts sent to that peer. I am investigating this hypothesis, and below are some of the aspects I have examined so far:

  • Can the peer state be falsely updated by receiving a new round step message? No, the state reset depends solely on changes in height or round within the new round step message (link); see the sketch after this list.
  • Is this issue specific to multi-round heights? No, there is no correlation between duplicate block parts and multi-round heights. I have inspected the InfluxDB data and can confirm that this occurs even with normal (single-round) blocks.
  • Does this problem occur at a specific part index? No, duplicates can be observed at various block part indices, as confirmed by inspecting the InfluxDB data.
  • Does it only affect empty blocks? No, based on the InfluxDB data points, this issue also occurs with larger blocks.
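
To make the first bullet concrete, here is a minimal toy sketch (illustrative only, not the actual reactor code) of the reset behavior it refers to: the per-peer record of sent block parts is cleared only when the incoming new round step message carries a different height or round, so a duplicate message for the same height and round should not wipe the history.

```go
package main

import "fmt"

// Toy per-peer state: which block parts we believe the peer has for the
// current height/round.
type peerState struct {
	height, round int64
	hasPart       map[int]bool
}

// newRoundStep is a toy stand-in for the NewRoundStep message exchanged
// over the state channel.
type newRoundStep struct {
	height, round int64
}

// applyNewRoundStep mirrors the behavior checked above: the block-part
// history is wiped only when the height or round actually changes.
func applyNewRoundStep(ps *peerState, msg newRoundStep) {
	if ps.height != msg.height || ps.round != msg.round {
		ps.hasPart = map[int]bool{}
	}
	ps.height, ps.round = msg.height, msg.round
}

func main() {
	ps := &peerState{height: 10, round: 0, hasPart: map[int]bool{0: true, 1: true}}

	applyNewRoundStep(ps, newRoundStep{height: 10, round: 0}) // same height/round: no reset
	fmt.Println("parts remembered:", len(ps.hasPart)) // prints 2

	applyNewRoundStep(ps, newRoundStep{height: 11, round: 0}) // new height: history cleared
	fmt.Println("parts remembered:", len(ps.hasPart)) // prints 0
}
```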

Currently I am running an e2e test to try a few tweaks that I think might fix this issue. I will keep posting my findings here.

@staheri14 (Contributor Author) commented:

I have been working on a test network using celestia-core with kvstore as the application (a network with 4 fully connected nodes). I was able to confirm that even in a local test there are cases in which the number of downloaded and uploaded parts over a p2p connection exceeds the ideal maximum of one. However, this occurs infrequently, e.g., once in every two to three test runs. This inconsistency contrasts with the more frequent occurrences observed on a live node in mocha-4. Due to this irregularity, it's difficult to accurately assess the effectiveness of any fixes, as it's unclear whether observed improvements are results of the fixes or just random success. To better understand and resolve this issue, I intend to delve deeper using the new testground tooling.

@staheri14 (Contributor Author) commented:

I successfully conducted a test using testground and gathered data. However, despite multiple test executions, I have not yet encountered the discrepancy mentioned in this issue (the uploaded and downloaded block parts per connection are always 1, with no duplicates). I'll continue experimenting (e.g., changing the testing parameters) to replicate the issue and diagnose its root cause.

@staheri14 (Contributor Author) commented:

Given that the network issue is not reproducible in local tests (where the behavior matches expectations), I will use this opportunity to address the other issue, i.e., #1134, which was previously hindered by the network's unpredictable behavior; that no longer seems to be a concern in local environments. For now, I'll move this network issue to the backlog and revisit it later. Meanwhile, I'll continue to experiment in the background and will return to this issue once I can reliably reproduce the problem.

@staheri14 (Contributor Author) commented Dec 6, 2023

I believe I've pinpointed a possible cause for the duplication of block parts in each peer-to-peer connection. My hypothesis is that the InfluxDB client might be resubmitting duplicates of the same data point as a retry mechanism in response to network failures. This hypothesis is consistent with my earlier observation that the problem cannot be reproduced on a local testnet (where the InfluxDB client and server run on the same machine and such network instability is absent).

Here's a detailed explanation for this suspicion:

Upon examining a node connected to Mocha, I observed multiple entries for the same height, round, step, and node_id in the consensus_round_state table. This is an anomaly because, typically, the consensus reactor transitions through a height, round, step sequence only once during its lifecycle. Closer inspection of the timestamps for these duplicate records revealed they are only seconds apart, suggesting a rapid resubmission.
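
To illustrate how such a rapid resubmission could arise, here is a hedged sketch of write-retry configuration with influxdb-client-go/v2. Whether the reporting path actually uses this API, these option values, or timestamp-less points is an assumption on my part, not something confirmed in this thread. The relevant failure mode: if a batch reaches the server but the response is lost, the client retries and resubmits the same data, and points written without an explicit timestamp are stamped again on receipt, showing up as near-identical records a few seconds apart.

```go
package main

import (
	influxdb2 "github.com/influxdata/influxdb-client-go/v2"
)

func main() {
	// The server URL, token, org, and bucket below are placeholders.
	opts := influxdb2.DefaultOptions().
		SetMaxRetries(3).      // resubmit a failed batch up to 3 times
		SetRetryInterval(5000) // milliseconds to wait between attempts
	client := influxdb2.NewClientWithOptions("http://localhost:8086", "my-token", opts)
	defer client.Close()

	// Non-blocking, buffered write API: failed batches are queued and retried.
	writeAPI := client.WriteAPI("my-org", "my-bucket")

	// A point created without SetTime(...) gets its timestamp from the server
	// on receipt, so a retried copy of the same batch can land a few seconds
	// later as a separate record.
	p := influxdb2.NewPointWithMeasurement("consensus_round_state").
		AddTag("node_id", "example-node").
		AddField("height", 123).
		AddField("round", 0).
		AddField("step", "RoundStepPropose")
	writeAPI.WritePoint(p)

	writeAPI.Flush() // push any buffered points before exit
}
```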

I found an example of this issue, as shown in the timestamp (_time) column of the duplicate records in the following screenshot:

[Screenshot (2023-12-06, 10:34 AM): duplicate records and their _time values]

To further investigate and validate this hypothesis, I plan to create a separate issue and consider making necessary code modifications.
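
If it helps the follow-up investigation, here is a hedged sketch of one way to surface these duplicates programmatically: a Flux query, run through the Go client, that counts rows per (node_id, height, round, step) combination in consensus_round_state over the last hour and keeps only combinations seen more than once. The bucket name, org, token, and the assumption that height, round, and step become regular columns after the pivot are placeholders rather than confirmed details of the actual schema.

```go
package main

import (
	"context"
	"fmt"
	"log"

	influxdb2 "github.com/influxdata/influxdb-client-go/v2"
)

// dupQuery groups consensus_round_state rows by the columns that should be
// unique per consensus transition and keeps groups with more than one row.
const dupQuery = `
from(bucket: "example-bucket")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "consensus_round_state")
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> group(columns: ["node_id", "height", "round", "step"])
  |> count(column: "_time")
  |> filter(fn: (r) => r._time > 1)
`

func main() {
	client := influxdb2.NewClient("http://localhost:8086", "my-token")
	defer client.Close()

	result, err := client.QueryAPI("my-org").Query(context.Background(), dupQuery)
	if err != nil {
		log.Fatal(err)
	}
	for result.Next() {
		// Each record is a (node_id, height, round, step) combination that
		// appeared more than once in the last hour.
		fmt.Println(result.Record().Values())
	}
	if result.Err() != nil {
		log.Fatal(result.Err())
	}
}
```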

@evan-forbes (Member) commented:

> entries for the same height, round, step, and node_id

sorry I think I'm missing something simple, it looks like the example is a different round?

@staheri14 (Contributor Author) commented:

> > entries for the same height, round, step, and node_id
>
> sorry I think I'm missing something simple, it looks like the example is a different round?

In the snapshot I shared, I restructured the dataframe so that the timestamps of duplicate records are consolidated under the _time column. Looking at the data more closely, you will notice that each row contains two (or more) timestamps, representing the different times at which a given combination of height, round, and step occurred within a one-hour window.
For example, in the row below, one timestamp is 00:52:16 and the other is 00:52:17.
[Screenshot (2023-12-06, 1:42 PM): a row with duplicate _time values for the same height, round, and step]

@staheri14 (Contributor Author) commented Dec 6, 2023

I am including an additional view from the past hour (which is different from the prior snapshot, yet the pattern is the same) to further clarify the data:
[Screenshot (2023-12-06, 1:48 PM): additional view showing the same duplication pattern]

@evan-forbes (Member) commented Dec 7, 2023

ahhhhh okay I see, thanks for explaining, yikes that's not good
