investigate discrepancy in expected number of uploaded block parts per P2P connection #1138
Comments
One theory regarding this discrepancy was that the peer state might not have been properly updated during the send and receive operations, resulting in multiple uploads of the same part. That issue appeared to be addressed by adding a deliberate sleep period in the following PR from cometbft. However, based on my initial investigation of the PR's changes and description, the sleep added in that PR seems intended to allow other update messages (sent over other channels) to arrive and take effect. This differs from the issue at hand, where the block parts sent by a node to another peer do not appear to be properly recorded in the peer state managed locally by the sending node.
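To make this theory concrete, below is a minimal, self-contained Go sketch of how a reactor could remember which block parts it has already sent to a peer for the current height and round, and why losing or prematurely resetting that record would allow the same part to be uploaded more than once. The type and method names (`PeerPartState`, `ApplyNewRoundStep`, `MarkSentIfNew`) are simplified assumptions for illustration, not the actual celestia-core or cometbft code.

```go
package main

import (
	"fmt"
	"sync"
)

// PeerPartState is a simplified, hypothetical model of the per-peer state a
// reactor keeps to remember which block parts it has already sent for the
// current height/round. It is not the actual celestia-core implementation.
type PeerPartState struct {
	mtx    sync.Mutex
	height int64
	round  int32
	sent   map[int]bool // block part index -> already sent to this peer
}

// ApplyNewRoundStep resets the sent-part record when the peer moves to a new
// height/round. If this reset fires while parts for the current step are still
// being sent (for example, due to a racing state-channel update), the history
// of sent parts is lost and the same part can be uploaded again.
func (ps *PeerPartState) ApplyNewRoundStep(height int64, round int32) {
	ps.mtx.Lock()
	defer ps.mtx.Unlock()
	if height != ps.height || round != ps.round {
		ps.height, ps.round = height, round
		ps.sent = make(map[int]bool)
	}
}

// MarkSentIfNew returns true (and records the part) only if the part has not
// been sent for this height/round yet; the caller should skip sending otherwise.
func (ps *PeerPartState) MarkSentIfNew(height int64, round int32, index int) bool {
	ps.mtx.Lock()
	defer ps.mtx.Unlock()
	if height != ps.height || round != ps.round || ps.sent == nil {
		return false // state out of sync; caller decides what to do
	}
	if ps.sent[index] {
		return false // duplicate upload avoided
	}
	ps.sent[index] = true
	return true
}

func main() {
	ps := &PeerPartState{}
	ps.ApplyNewRoundStep(10, 0)
	fmt.Println(ps.MarkSentIfNew(10, 0, 3)) // true: first send of part 3
	fmt.Println(ps.MarkSentIfNew(10, 0, 3)) // false: duplicate suppressed
	ps.ApplyNewRoundStep(10, 1)             // new round: record wiped
	fmt.Println(ps.MarkSentIfNew(10, 1, 3)) // true again for the new round
}
```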
My initial suspicion is that the peer state is asynchronously reset (due to messages exchanged over the state channel), resulting in the loss of the block part history sent to that peer. I am investigating this hypothesis, and below are some of the aspects I have examined so far:
I am currently running an e2e test to try out a few tweaks that I think might fix this issue. I will keep posting my findings here.
I have been working on a test network using celestia-core with kvstore as the application (a network of 4 fully connected nodes). I was able to confirm that even in a local test there are cases in which the number of downloaded and uploaded parts over a p2p connection exceeds the ideal maximum of one. However, this occurs infrequently, e.g., once in every two to three test runs. This inconsistency contrasts with the more frequent occurrences observed on a live node in mocha-4. Due to this irregularity, it is difficult to accurately assess the effectiveness of any fixes, as it is unclear whether observed improvements result from the fixes or are just random success. To better understand and resolve this issue, I intend to dig deeper using the new testground tooling.
I successfully ran a test using testground and gathered data. However, despite multiple test executions, I have not yet encountered the discrepancy described in this issue (the uploaded and downloaded block parts per connection are always 1, with no duplicates). I will continue experimenting (e.g., changing the test parameters) to replicate the issue and diagnose its root cause.
Given that the network issue is not reproducible in local tests (where everything behaves as expected), I will use this opportunity to address the other issue, #1134, which was previously hindered by the network's unpredictable behavior; that no longer seems to be a concern in local environments. For now, I will move this network issue to the backlog and revisit it later. Meanwhile, I will continue experimenting in the background and will return to this issue once I can reliably reproduce the problem.
I believe I've pinpointed a possible cause for the duplication of block parts in each peer-to-peer connection. My hypothesis is that the InfluxDB client might be resubmitting duplicates of the same data point as a retry mechanism in response to network failures. This hypothesis is consistent with my earlier observation that the problem cannot be reproduced on a local testnet (where the InfluxDB client and server run on the same machine), since such network instability is absent there. Here is a more detailed explanation of this suspicion: upon examining a node connected to Mocha, I observed multiple entries for the same data point. I found an example of this, as shown by the timestamps of the duplicated entries. To further investigate and validate this hypothesis, I plan to create a separate issue and consider making the necessary code modifications.
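To illustrate the suspected mechanism, here is a small, self-contained Go sketch of how a client-side retry could produce duplicate entries: if the first write reaches the server but the acknowledgment is lost, the retried write lands as a second entry, and if the point is re-stamped on each attempt the two entries differ only in their timestamps. The `writePoint` helper, the `block_parts` measurement name, and the field names are hypothetical stand-ins for illustration, not the actual InfluxDB client API or our schema.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// point is a hypothetical stand-in for a metrics data point.
type point struct {
	measurement string
	fields      map[string]interface{}
	timestamp   time.Time
}

// writePoint is a hypothetical stand-in for the metrics client's write call.
// It simulates the failure mode under suspicion: the server stores the point,
// but the acknowledgment is lost on the first attempt, so the caller sees an
// error even though the write succeeded.
func writePoint(server *[]point, p point, attempt int) error {
	*server = append(*server, p) // the server has recorded the point
	if attempt == 0 {
		return errors.New("timeout waiting for acknowledgment")
	}
	return nil
}

// writeWithRetry re-stamps and resubmits the point on failure. A retry after a
// lost acknowledgment therefore produces a second entry for the same logical
// data point, with a slightly later timestamp.
func writeWithRetry(server *[]point, measurement string, fields map[string]interface{}, retries int) {
	for attempt := 0; attempt <= retries; attempt++ {
		p := point{measurement: measurement, fields: fields, timestamp: time.Now()}
		if err := writePoint(server, p, attempt); err != nil {
			continue // retry on error
		}
		return
	}
}

func main() {
	var server []point
	fields := map[string]interface{}{"height": 100, "round": 0, "index": 3, "transfer_type": "upload"}
	writeWithRetry(&server, "block_parts", fields, 3)

	// The same logical data point now appears twice, distinguishable only by timestamp.
	for _, p := range server {
		fmt.Println(p.measurement, p.fields, p.timestamp.Format(time.RFC3339Nano))
	}
}
```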
Sorry, I think I'm missing something simple; it looks like the example is from a different round?
ahhhhh okay I see, thanks for explaining, yikes that's not good |
Previous experiments have revealed that the number of uploaded block parts per P2P connection exceeds the expected value of 1, which contradicts our initial analysis. This issue aims to investigate the root cause of this misalignment.
Addressing this issue is a prerequisite for #1134, as clarifying the expected traffic from block part propagation is essential to evaluating the performance optimization proposed in #1134.
Additionally, the results of the experiments are provided below for your convenience.
After examining the InfluxDB data points concerning block part exchanges for a node connected to mocha-4, I have compiled findings from a 1-hour period covering approximately 300 blocks.
Observations: The provided plots indicate that the behavior for Downloaded block parts aligns with expectations: each block part index appears only once per p2p connection for every block 'height' and 'round'. The pattern diverges for Uploaded block parts, however: multiple instances of the same block part index were uploaded during the same height and round over a single p2p connection, contrary to the expected behavior. I will investigate this further and share my findings.
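For reference, here is a minimal Go sketch of the kind of check used to flag this behavior: group the exported data points by (peer, height, round, part index, direction) and report any combination that appears more than once, since each combination is expected to appear at most once. The `partRecord` struct and its field names are assumptions for illustration, not the actual exported schema.

```go
package main

import "fmt"

// partRecord is a hypothetical, flattened view of one exported data point
// describing a block part transferred over a p2p connection.
type partRecord struct {
	Peer      string
	Height    int64
	Round     int32
	Index     int
	Direction string // "upload" or "download"
}

// duplicateKeys returns every (peer, height, round, index, direction)
// combination that appears more than once, together with its count. Under the
// expected behavior each combination should appear at most once.
func duplicateKeys(records []partRecord) map[partRecord]int {
	counts := make(map[partRecord]int)
	for _, r := range records {
		counts[r]++
	}
	dups := make(map[partRecord]int)
	for k, c := range counts {
		if c > 1 {
			dups[k] = c
		}
	}
	return dups
}

func main() {
	records := []partRecord{
		{Peer: "peerA", Height: 100, Round: 0, Index: 3, Direction: "download"},
		{Peer: "peerA", Height: 100, Round: 0, Index: 3, Direction: "upload"},
		{Peer: "peerA", Height: 100, Round: 0, Index: 3, Direction: "upload"}, // unexpected duplicate
	}
	for k, c := range duplicateKeys(records) {
		fmt.Printf("part %d of height %d round %d was %sed %d times to/from %s\n",
			k.Index, k.Height, k.Round, k.Direction, c, k.Peer)
	}
}
```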