Fix sync of beacon nodes on MacOS Holesky host #201

Open
jakubgs opened this issue Sep 24, 2024 · 13 comments


jakubgs commented Sep 24, 2024

Currently two nodes on macm2-01.ih-eu-mda1.nimbus.holesky are failing the BeaconChainNoAggregatesReceived metrics check:

[screenshot: BeaconChainNoAggregatesReceived check failing for two nodes]

And the API does confirm that:

jakubgs@macm2-01.ih-eu-mda1.nimbus.holesky:~ % for port in $(seq 9301 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"979007","sync_distance":"1627664","is_syncing":true,"is_optimistic":true,"el_offline":true}}
{"data":{"head_slot":"1400927","sync_distance":"1205744","is_syncing":true,"is_optimistic":true,"el_offline":false}}
{"data":{"head_slot":"2606671","sync_distance":"0","is_syncing":false,"is_optimistic":false,"el_offline":false}}
jakubgs self-assigned this Sep 24, 2024

jakubgs commented Sep 24, 2024

If we look at the stable node on macm2-01.ih-eu-mda1.nimbus.holesky, we can see that the sync distance just keeps growing:

[screenshot: stable node sync distance growing]

The testing node has a similar pattern:

[screenshot: testing node sync distance growing]

While unstable appears to be fine:

[screenshot: unstable node sync distance, looking fine]


jakubgs commented Sep 24, 2024

The interesting part is that both stable:

[screenshot: stable node peer count below max]

and testing:

[screenshot: testing node peer count below max]

are unable to get to max peers, while unstable gets there without issues:

[screenshot: unstable node at max peers]
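
The dashboard numbers can also be cross-checked against the beacon API directly; a quick sketch using the standard /eth/v1/node/peer_count endpoint (same ports as the syncing check above):

# Query each beacon node's current peer counts.
for port in $(seq 9301 9303); do
  curl -s "http://localhost:${port}/eth/v1/node/peer_count" | jq -c
done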


jakubgs commented Sep 24, 2024

But the LibP2P ports for all 3 nodes are correctly open on the firewall:

 > nmap -Pn -p9001-9003 macm2-01.ih-eu-mda1.nimbus.holesky.status.im
Starting Nmap 7.94 ( https://nmap.org ) at 2024-09-24 15:13 CEST
Nmap scan report for macm2-01.ih-eu-mda1.nimbus.holesky.status.im (185.181.230.76)
Host is up (0.057s latency).
rDNS record for 185.181.230.76: no-rdns.innovahosting.net

PORT     STATE SERVICE
9001/tcp open  tor-orport
9002/tcp open  dynamid
9003/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.12 seconds

So what is different about stable and testing that makes connectivity so bad?


jakubgs commented Sep 24, 2024

What's also interesting is that their respective Geth nodes on holesky-01.ih-eu-mda1.nimbus.geth are in different sync states too:

jakubgs@holesky-01.ih-eu-mda1.nimbus.geth:~ % for dir in /docker/geth-*; do $dir/rpc.sh eth_syncing | jq -c; done
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":false}
{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x0","healedBytecodeBytes":"0x0","healedBytecodes":"0x0","healedTrienodeBytes":"0x0","healedTrienodes":"0x0","healingBytecode":"0x0","healingTrienodes":"0x0","highestBlock":"0x0","startingBlock":"0x0","syncedAccountBytes":"0x0","syncedAccounts":"0x0","syncedBytecodeBytes":"0x0","syncedBytecodes":"0x0","syncedStorage":"0x0","syncedStorageBytes":"0x0","txIndexFinishedBlocks":"0x0","txIndexRemainingBlocks":"0x1"}}

The Geth node for unstable is not synced, while the ones for stable and testing appear to be synced, which the disk usage confirms:

jakubgs@holesky-01.ih-eu-mda1.nimbus.geth:~ % sudo du -hsc /docker/*
131G	/docker/geth-holesky-01
150G	/docker/geth-holesky-02
4.4G	/docker/geth-holesky-03
284G	total

This is very confusing.


jakubgs commented Sep 24, 2024

I found some EL connection failures in the stable node's logs:

{"lvl":"DBG","ts":"2024-09-24 16:30:46.299+03:00","msg":"Sending message to RPC server","topics":"JSONRPC-HTTP-CLIENT","address":{"oResultPrivate":true,"vResultPrivate":{"id":"holesky-01.ih-eu-mda1.nimbus.geth.wg:8551","scheme":"NonSecure","hostname":"holesky-01.ih-eu-mda1.nimbus.geth.wg","port":8551,"path":"","query":"","anchor":"","username":"","password":"","addresses":[{"family":"IPv4","address_v4":[10,15,0,12],"port":8551}]}},"msg_len":342,"name":"engine_forkchoiceUpdatedV1"}
{"lvl":"DBG","ts":"2024-09-24 16:30:46.300+03:00","msg":"Cancelled POST Request with JSON-RPC","topics":"JSONRPC-HTTP-CLIENT","e":"Future operation cancelled!"}

And indeed the port is closed:

 > sudo nmap -Pn -p8551-8553 holesky-01.ih-eu-mda1.nimbus.geth.wg
Starting Nmap 7.95 ( https://nmap.org ) at 2024-09-24 16:31 EEST
Nmap scan report for holesky-01.ih-eu-mda1.nimbus.geth.wg (10.15.0.12)
Host is up (0.00085s latency).

PORT     STATE  SERVICE
8551/tcp closed unknown
8552/tcp open   unknown
8553/tcp open   unknown

Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds

But the one for the testing node is open, so that doesn't explain the low peer count at all.


jakubgs commented Sep 24, 2024

The geth-holesky-01-node logs on holesky-01.ih-eu-mda1.nimbus.geth show a lot of these:

WARN [09-24|13:30:34.987] Beacon client online, but no consensus updates received in a while. Please fix your beacon client to follow the chain!

But that would make sense since that BN is not actually synced.


jakubgs commented Sep 24, 2024

Oh, I see, the port numbering is off by one; it should start at 8552:

 > sudo nmap -Pn -p8552-8554 holesky-01.ih-eu-mda1.nimbus.geth.wg
Starting Nmap 7.95 ( https://nmap.org ) at 2024-09-24 16:36 EEST
Nmap scan report for holesky-01.ih-eu-mda1.nimbus.geth.wg (10.15.0.12)
Host is up (0.00079s latency).

PORT     STATE SERVICE
8552/tcp open  unknown
8553/tcp open  unknown
8554/tcp open  rtsp-alt

Nmap done: 1 IP address (1 host up) scanned in 0.02 seconds

This would also explain why the last Geth node is not synced, since it has no BN connected to it.
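
To make the off-by-one concrete, here's the inferred (not confirmed) mapping between configured EL endpoints and actual Geth authrpc ports:

# Configured BN -> EL endpoint (wrong)   Actual Geth authrpc listener
# stable   -> :8551 (nothing listens)    geth-holesky-01 -> :8552
# testing  -> :8552 (hits geth-01)       geth-holesky-02 -> :8553
# unstable -> :8553 (hits geth-02)       geth-holesky-03 -> :8554 (no BN attached)

This would also match the earlier eth_syncing output: geth-01 and geth-02 receive consensus updates and sync, while geth-03 never does.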


jakubgs commented Sep 24, 2024

Fixed Geth ports:


jakubgs commented Sep 25, 2024

This drop in peers on the stable node is bizarre:

[screenshot: stable node peer count dropping]

It was doing okay, and then it crashed and burned:

[screenshot: stable node peer count collapsing]


jakubgs commented Sep 25, 2024

There's not a single error in the logs, just some warnings:

jakubgs@macm2-01.ih-eu-mda1.nimbus.holesky:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep WRN
{"lvl":"WRN","ts":"2024-09-25 07:31:20.367+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":3,"new_peers":[],"current_peers":80,"wanted_peers":320}
{"lvl":"DBG","ts":"2024-09-25 07:35:24.368+03:00","msg":"Peer connection upgraded","topics":"networking","peer":"16Uiu2HAmRnBjeA9DxxWRNno3kx7bFZuoXaB9QXrpf5kZS32P6LWn","connections":1}
{"lvl":"DBG","ts":"2024-09-25 07:35:25.524+03:00","msg":"Peer disconnected","topics":"networking","peer":"16Uiu2HAmRnBjeA9DxxWRNno3kx7bFZuoXaB9QXrpf5kZS32P6LWn","connections":0}
{"lvl":"WRN","ts":"2024-09-25 07:35:27.286+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":0,"new_peers":[],"current_peers":80,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:35:55.855+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":1,"new_peers":[],"current_peers":79,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:36:07.224+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":0,"new_peers":[],"current_peers":76,"wanted_peers":320}
{"lvl":"WRN","ts":"2024-09-25 07:36:34.411+03:00","msg":"Peer count low, no new peers discovered","topics":"networking","discovered_nodes":5,"new_peers":[],"current_peers":78,"wanted_peers":320}

But the number of disconnects is higher than the number of successful connections:

jakubgs@macm2-01.ih-eu-mda1.nimbus.holesky:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep 'Lost connection to peer' | wc -l
     255

jakubgs@macm2-01.ih-eu-mda1.nimbus.holesky:~/beacon-node-holesky-stable/logs % grep ' 07:3' service.20240925_0747.log | grep 'Peer successfully connected' | wc -l
     133
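
The same comparison in one pass, for reproducibility (a sketch over the same file and time window):

# Count connect vs. disconnect events in the 07:3x window.
log=service.20240925_0747.log
for ev in 'Peer successfully connected' 'Lost connection to peer'; do
  printf '%-35s %s\n' "$ev:" "$(grep ' 07:3' "$log" | grep -c "$ev")"
done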


jakubgs commented Sep 25, 2024

At the same time the unstable node had absolutely no issues with connectivity on the same host:

[screenshot: unstable node peer count steady on the same host]

If network issues were the problem, this node would have them too, but it doesn't.


jakubgs commented Sep 25, 2024

Here's the full log from that time; I can't find anything relevant: service.20240925_0747.log.gz


jakubgs commented Sep 26, 2024

I have purged the testing node, as stable appears to be syncing better than before:

[screenshot: stable node sync progress improving]
