Timeout occurred when synchronizing blocks #5913

Closed
xxo1shine opened this issue Jul 16, 2024 · 9 comments · Fixed by #5921

@xxo1shine
Contributor

System information

OS : Linux
JVM : Oracle Corporation 1.8.0_411 amd64
Version : 4.7.5

Expected Behavior

Blocks should be synchronized and broadcast normally, with no timeouts.

Actual Behavior

I deployed a private chain locally. The two nodes have been synchronizing and broadcasting blocks normally. However, after a while, they suddenly disconnected due to a TIME_OUT. Here is the last interaction log.

23:09:48.561 INFO  [peerClient-23] [net](PeerConnection.java:175) Send peer /192.168.0.15:18888 message type: SYNC_BLOCK_CHAIN
size: 5, start block: Num:61885758,ID:0000000003b04d3e2e00336955e1385188e9b60d8cf10b41dcf7da7477ff51ac, end block Num:61885778,ID:0000000003b04d52c62137c5f12f421309888ed8e3118b7b2a57e1d311dd2b3c
23:09:48.850 INFO  [peerClient-23] [net](P2pEventHandlerImpl.java:168) Receive message from  peer: /192.168.0.15:18888, type: BLOCK_CHAIN_INVENTORY
size: 3, first blockId: Num:61885778,ID:0000000003b04d52c62137c5f12f421309888ed8e3118b7b2a57e1d311dd2b3c, end blockId: Num:61885780,ID:0000000003b04d541a1c5a38f31f3ca8eefa5a77d5f3cc2522a7026e349256f5, remain_num: 0
23:10:19.831 WARN  [peer-status-check] [net](PeerStatusCheck.java:51) Peer /192.168.0.15 not sync for a long time
23:10:19.831 INFO  [peer-status-check] [net](PeerConnection.java:175) Send peer /192.168.0.15:18888 message type: P2P_DISCONNECT
reason: TIME_OUT
@tomatoishealthy
Contributor

Can you provide more logs, preferably detailed logs from all nodes covering the 10 minutes around the disconnection?

In addition, confirm whether the network connection between nodes is stable.

This is a private network, right? Because I observed that your block height is 61885778.

@xxo1shine
Contributor Author

@tomatoishealthy Sorry, I accidentally deleted the log, and I haven't reproduced the problem yet.

Yes, it's a private network, the network is very stable, and the TPS is not very large.

@tomatoishealthy
Contributor

If a node finds that there is a problem connecting to another node, you may need to check the logs of both nodes at the same time.

Unfortunately, the logs no longer exist. If you encounter this problem next time, you can keep the logs to facilitate troubleshooting.

@xxo1shine
Contributor Author

@tomatoishealthy OK, from the logs, there was no communication for about 30 seconds at the end, and then they disconnected.
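
As a quick check of the "about 30 seconds" figure, the gap between the last BLOCK_CHAIN_INVENTORY message and the P2P_DISCONNECT can be computed from the two timestamps in the log posted above. This is only an illustrative sketch; the class and method names (`LogGap`, `gapMillis`) are made up for the example and are not part of java-tron.

```java
import java.time.Duration;
import java.time.LocalTime;

// Illustrative helper (hypothetical name), not part of java-tron.
public class LogGap {

  // Milliseconds elapsed between two HH:mm:ss.SSS timestamps on the same day.
  static long gapMillis(String earlier, String later) {
    return Duration.between(LocalTime.parse(earlier), LocalTime.parse(later)).toMillis();
  }

  public static void main(String[] args) {
    // Timestamps copied from the log in this issue:
    // last BLOCK_CHAIN_INVENTORY received vs. P2P_DISCONNECT sent.
    System.out.println(gapMillis("23:09:48.850", "23:10:19.831") + " ms");
  }
}
```

The result is just under 31 seconds, which matches the observation that the peers were silent for roughly 30 seconds before the TIME_OUT disconnect.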

@tomatoishealthy
Contributor

It sounds like you've already got the relevant logs.

Can you provide them so we can see whether there are more details that would help solve the problem?

@xxo1shine
Contributor Author

> It sounds like you've already got the relevant logs.
>
> Can you provide them so we can see if there are more details to help solve the problem.

The log has been posted in the issue.

@tomatoishealthy
Contributor

There are many possible reasons for a disconnection. The logs provided in this issue may not be enough to locate the cause accurately.

More evidence may be needed, such as whether the JVM was running normally, whether the machine load was too high, whether the network was stable, etc.

If the problem can be reproduced, you can troubleshoot along the lines suggested above. You can also preserve the state of the nodes and share it, and we can work together to solve it.

@zeusoo001
Contributor

zeusoo001 commented Jul 18, 2024

@xxo1shine There is a concurrency problem between the synchronization thread and the broadcast thread. This problem is being fixed. Please refer to PR #5921.

During the block synchronization process, if the broadcast list has not been received, the synchronization may fail. The detailed process is as follows:

  1. After processing the chain inventory message, fetchFlag is set to true.
  2. The scheduler executes the startFetchSyncBlock method to fetch the block and, at this point, sets fetchFlag to false.

     fetchExecutor.scheduleWithFixedDelay(() -> {
       try {
         if (fetchFlag) {
           fetchFlag = false;
           startFetchSyncBlock();
         }
       } catch (Exception e) {
         logger.error("Fetch sync block error", e);
       }
     }, 10, 1, TimeUnit.SECONDS);

  3. Because advInvRequest is not empty at this time, peer.isIdle() returns false, so no block is fetched in this scheduling round. fetchFlag has already been set to false and is not set back to true, so the block cannot be fetched later either.

     private void startFetchSyncBlock() {
       HashMap<PeerConnection, List<BlockId>> send = new HashMap<>();
       tronNetDelegate.getActivePeer().stream()
           .filter(peer -> peer.isNeedSyncFromPeer() && peer.isIdle())
           .filter(peer -> peer.isFetchAble())

  4. Because the block is not fetched, the peer status cannot be updated, the status check fails, and the connection is dropped.
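
The sequence above can be reproduced in a small, single-threaded model. The sketch below is not java-tron code and does not claim to show the change made in PR #5921: `FetchFlagSketch`, `Peer`, `buggyTick`, `retryingTick`, and `simulate` are hypothetical names, and `retryingTick` shows just one possible mitigation (re-arming the flag when no request was sent), assumed here purely for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of the fetch loop; not java-tron code.
public class FetchFlagSketch {

  // Stand-in for a peer: idle is false while a broadcast request is still pending.
  static class Peer {
    boolean idle;

    Peer(boolean idle) {
      this.idle = idle;
    }
  }

  final AtomicBoolean fetchFlag = new AtomicBoolean(false);
  int fetchRequestsSent = 0;

  // Returns true only if a fetch request was actually sent to the peer.
  boolean startFetchSyncBlock(Peer peer) {
    if (!peer.idle) {
      return false; // peer filtered out, nothing is fetched
    }
    fetchRequestsSent++;
    return true;
  }

  // Pattern from the issue: the flag is cleared whether or not a fetch happened.
  void buggyTick(Peer peer) {
    if (fetchFlag.get()) {
      fetchFlag.set(false);
      startFetchSyncBlock(peer);
    }
  }

  // Assumed mitigation: re-arm the flag when nothing was sent, so a later tick retries.
  void retryingTick(Peer peer) {
    if (fetchFlag.compareAndSet(true, false) && !startFetchSyncBlock(peer)) {
      fetchFlag.set(true);
    }
  }

  // Runs three scheduler ticks; the peer becomes idle only after the first tick.
  static int simulate(boolean retry) {
    FetchFlagSketch sketch = new FetchFlagSketch();
    Peer peer = new Peer(false);     // broadcast request still pending
    sketch.fetchFlag.set(true);      // chain inventory message processed
    for (int tick = 0; tick < 3; tick++) {
      if (tick == 1) {
        peer.idle = true;            // pending broadcast request has completed
      }
      if (retry) {
        sketch.retryingTick(peer);
      } else {
        sketch.buggyTick(peer);
      }
    }
    return sketch.fetchRequestsSent;
  }

  public static void main(String[] args) {
    System.out.println("buggy loop, fetch requests sent: " + simulate(false));
    System.out.println("retrying loop, fetch requests sent: " + simulate(true));
  }
}
```

In this model the buggy loop never sends a fetch request even after the peer becomes idle, because the flag was consumed on the tick where the peer was busy. The retrying variant sends one request on the next tick.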

@lvs007
Collaborator

lvs007 commented Jul 23, 2024

PR #5921 is merged into GreatVoyage-v4.7.6.
