[bug]: LND falls out of sync when Bitcoin Core's IP address changes #9353

Open
kallerosenbaum opened this issue Dec 12, 2024 · 2 comments
Labels
bug Unintended code behaviour P2 should be fixed if one has time

kallerosenbaum commented Dec 12, 2024

Background

We run two LND nodes in Kubernetes. After restarting the backing Bitcoin Core node, we notice that LND falls out of sync with the blockchain.

This happens because, in our Kubernetes environment, the IP address of Bitcoin Core changes when it is restarted. After that, synced_to_chain becomes false and LND receives no new blocks.

Your environment

  • version of lnd: v0.18.2-beta
  • which operating system (uname -a on *Nix):
    Linux lnd-routing-0 6.8.0-1018-aws #19~22.04.1-Ubuntu SMP Wed Oct 9 17:10:38 UTC 2024 aarch64 Linux
    and Linux 9db991b293cb 6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 Linux
  • version of btcd, bitcoind, or other backend: Bitcoin Core 27.0
  • any other relevant environment details: We run our stack in Kubernetes

Steps to reproduce

I'll show how I reproduce it in regtest, but we get the same issue in production (running in kubernetes) too.

  • We run LND with the following config in docker-compose:
        --listen=0.0.0.0:9735
        --externalip=lnd-0
        --rpclisten=0.0.0.0:10009
        --bitcoin.active
        --bitcoin.node=bitcoind
        --bitcoin.regtest
        --bitcoind.rpcuser=test
        --bitcoind.rpcpass=password
        --bitcoind.rpchost=bitcoin:18443
        --bitcoind.zmqpubrawblock=tcp://bitcoin:18501
        --bitcoind.zmqpubrawtx=tcp://bitcoin:18502
        --norest
        --protocol.wumbo-channels

When running this, bitcoin resolves to 172.18.0.2.

  • Build some blocks and make sure LND is in sync by running lncli -network=regtest getinfo and check that synced_to_chain is true.
  • Stop Bitcoin Core and restart it, but this time make sure it gets a new IP address, so that from now on bitcoin resolves to e.g. 172.18.0.6.
  • Build a block
  • Run lncli -network=regtest getinfo. synced_to_chain will be false, but block_height and block_hash will show the most recent block.
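To make the last step easier to check, here is a minimal sketch that classifies the JSON printed by lncli getinfo. It only uses the synced_to_chain and block_hash fields mentioned above. The function name, the state labels, and the sample values are made up for illustration, and the backend tip hash is assumed to come from something like bitcoin-cli getbestblockhash.

```python
import json


def classify_getinfo(getinfo_json: str, backend_tip_hash: str) -> str:
    """Classify `lncli getinfo` output against the backend's best block hash.

    Returns one of three illustrative labels:
      "synced" - LND reports synced_to_chain = true
      "stuck"  - LND knows the backend's tip but synced_to_chain is false
                 (the symptom described in this issue)
      "behind" - LND does not report the backend's tip at all
    """
    info = json.loads(getinfo_json)
    if info.get("synced_to_chain") is True:
        return "synced"
    if info.get("block_hash") == backend_tip_hash:
        return "stuck"
    return "behind"


if __name__ == "__main__":
    # Illustrative values only, not taken from a real node.
    tip = "aa" * 32
    sample = json.dumps(
        {"synced_to_chain": False, "block_height": 105, "block_hash": tip}
    )
    print(classify_getinfo(sample, tip))
```

A "stuck" result after the restart would match what is reported here: the tip is known, yet synced_to_chain stays false.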

After this, LND will not receive any new blocks, but it has apparently reconnected (presumably through RPC) to get the latest block hash. My guess is that ZMQ stops working due to the IP address change.

Expected behaviour

After reconnecting to the node it should eventually show "synced_to_chain": true. Alternatively (if it's a ZMQ connection issue) I'd expect LND to scream pretty loudly in the log.

Actual behaviour

"synced_to_chain": false indefinitely, and no new log lines of this type appear:

[INF] NTFN: New block: height=873198, sha=000000000000000000007b48042479e4f07ce2d6ae9a79c2a3ef5223dc78dd5c

kallerosenbaum added the bug (Unintended code behaviour) and needs triage labels on Dec 12, 2024
Roasbeef (Member) commented Dec 12, 2024

Are you running with the health check system on? It's meant to catch failures like this, then cause a restart of lnd. It seems like you expect that lnd will resolve the bitcoind host again automatically, but atm we do the resolution once, then use the IP from there on.
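To illustrate the difference between resolving once and resolving on every dial, here is a small sketch. It is not lnd's code: both classes and the fake resolver are hypothetical, and the addresses are the ones from the reproduction steps.

```python
from typing import Callable, Dict

# A resolver maps a hostname to an IP address string.
Resolver = Callable[[str], str]


class ResolveOnceDialer:
    """Resolves the host a single time and reuses that IP afterwards."""

    def __init__(self, host: str, resolve: Resolver) -> None:
        self._addr = resolve(host)  # resolved at construction only

    def target(self) -> str:
        return self._addr


class ResolvePerDialDialer:
    """Resolves the host again every time a connection target is needed."""

    def __init__(self, host: str, resolve: Resolver) -> None:
        self._host = host
        self._resolve = resolve

    def target(self) -> str:
        return self._resolve(self._host)


if __name__ == "__main__":
    # Fake DNS table standing in for the docker-compose or cluster DNS.
    dns: Dict[str, str] = {"bitcoin": "172.18.0.2"}
    once = ResolveOnceDialer("bitcoin", dns.__getitem__)
    per_dial = ResolvePerDialDialer("bitcoin", dns.__getitem__)

    dns["bitcoin"] = "172.18.0.6"  # backend restarted with a new IP

    print(once.target())      # still the address cached at startup
    print(per_dial.target())  # the address after the restart
```

A connection built like the first class keeps dialing the stale address until the process restarts, which is why a restart triggered by the health check would fix it.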

Here're the health check params I'm referring to:

; The number of times we should attempt to query our chain backend before
; gracefully shutting down. Set this value to 0 to disable this health check.
; healthcheck.chainbackend.attempts=3

; The amount of time we allow a call to our chain backend to take before we fail
; the attempt. This value must be >= 1s.
; healthcheck.chainbackend.timeout=30s

; The amount of time we should backoff between failed attempts to query chain
; backend. This value must be >= 1s.
; healthcheck.chainbackend.backoff=2m

; The amount of time we should wait between chain backend health checks. This
; value must be >= 1m.
; healthcheck.chainbackend.interval=1m
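
For a sense of what these parameters mean in wall-clock terms, here is a rough back-of-the-envelope sketch. It assumes (this is an assumption, not something confirmed in this thread) that each attempt can take up to the full timeout and that the backoff is applied between failed attempts but not after the last one. The function name is made up for illustration.

```python
from typing import Optional


def worst_case_seconds_before_shutdown(
    attempts: int, timeout_s: float, backoff_s: float
) -> Optional[float]:
    """Rough upper bound on how long a failing backend is tolerated.

    Assumption: every attempt runs for the full timeout, and a backoff
    is inserted between consecutive failed attempts only.
    Returns None when attempts is 0, since that disables the check.
    """
    if attempts < 0:
        raise ValueError("attempts must be >= 0")
    if attempts == 0:
        return None
    return attempts * timeout_s + (attempts - 1) * backoff_s


if __name__ == "__main__":
    # Defaults quoted above: attempts=3, timeout=30s, backoff=2m.
    print(worst_case_seconds_before_shutdown(3, 30, 120))   # 330 seconds
    # A larger attempts value stretches the window considerably.
    print(worst_case_seconds_before_shutdown(30, 30, 120))  # 4380 seconds
```

Under these assumptions the defaults tolerate roughly five and a half minutes of failures, while attempts=30 tolerates over an hour before a shutdown is triggered.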

kallerosenbaum (Author) commented

@Roasbeef yes, it's on, and in production we've set

--healthcheck.chainbackend.attempts=30

And we see the following from healthcheck after restart:


2024-12-04 09:55:59.568 [INF] HLCK: Health check: chain backend, call: 1 failed with: invalid http POST response (nil), method: uptime, id: 1215, last error=Post "http://bitcoin-0.bitcoin.crypto.svc.cluster.local:8332": dial tcp: lookup bitcoin-0.bitcoin.crypto.svc.cluster.local on 169.254.20.10:53: no such host, backing off for: 2m0s
2024-12-04 09:58:22.107 [INF] HLCK: Health check: chain backend, call: 2 failed with: invalid http POST response (nil), method: uptime, id: 1216, last error=Post "http://bitcoin-0.bitcoin.crypto.svc.cluster.local:8332": dial tcp: lookup bitcoin-0.bitcoin.crypto.svc.cluster.local on 169.254.20.10:53: no such host, backing off for: 2m0s
2024-12-04 10:00:44.648 [INF] HLCK: Health check: chain backend, call: 3 failed with: invalid http POST response (nil), method: uptime, id: 1217, last error=Post "http://bitcoin-0.bitcoin.crypto.svc.cluster.local:8332": dial tcp: lookup bitcoin-0.bitcoin.crypto.svc.cluster.local on 169.254.20.10:53: no such host, backing off for: 2m0s

Then it succeeds in connecting to the RPC port (despite the IP address change). So at least RPC can handle an IP address change. My guess is that it's the ZMQ connection that stops working, and the health check doesn't verify that connection. So the health check doesn't help here.
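
To make the gap concrete, here is a sketch of the kind of check that would catch it: compare the tip reported over RPC with the last block announced over ZMQ, and flag the ZMQ side as stale once they have disagreed for longer than a grace period. This is a hypothetical illustration, not an existing lnd health check. The class name, the grace period, and the way the two hashes are obtained are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ZmqStalenessDetector:
    """Flags ZMQ as stale when it lags the RPC tip for too long."""

    grace_seconds: float
    _mismatch_since: Optional[float] = None

    def observe(
        self, rpc_tip: str, last_zmq_block: Optional[str], now: float
    ) -> bool:
        """Record one observation; return True if ZMQ looks stale.

        rpc_tip:        best block hash as reported over RPC
        last_zmq_block: hash of the last block seen over ZMQ, if any
        now:            current time in seconds (any monotonic clock)
        """
        if rpc_tip == last_zmq_block:
            # Both channels agree, so reset the mismatch timer.
            self._mismatch_since = None
            return False
        if self._mismatch_since is None:
            self._mismatch_since = now
        return now - self._mismatch_since >= self.grace_seconds


if __name__ == "__main__":
    detector = ZmqStalenessDetector(grace_seconds=60)
    print(detector.observe("hash-a", "hash-a", now=0))   # False
    print(detector.observe("hash-b", "hash-a", now=10))  # False
    print(detector.observe("hash-b", "hash-a", now=70))  # True
```

The grace period is there because a short lag between RPC and ZMQ is normal right after a block arrives; only a lasting mismatch points at a dead subscription.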

saubyk added this to the 0.20.0 milestone on Dec 19, 2024
saubyk added the P1 (MUST be fixed or reviewed) and P2 (should be fixed if one has time) labels and removed the needs triage and P1 (MUST be fixed or reviewed) labels on Dec 19, 2024