Node stopped synchronizing after usage of the new docker image #74

Open
MirekR opened this issue Jul 30, 2021 · 20 comments

Comments

MirekR commented Jul 30, 2021

Hi @maebeam, I'm back with the server having stopped synchronising.

After yesterday's update of the docker image to "19c446547510f1e0b83d56611e732e3fa6a0b32d" the server stopped syncing; the latest post I can see is from 12h ago. It has also started ignoring super_admins: at the moment I can see new posts but not the server stats (I was able to for some time yesterday).

Docker compose file:

version: "3.7"
services:
  backend:
    container_name: backend
    image: docker.io/bitclout/backend:19c446547510f1e0b83d56611e732e3fa6a0b32d
    command: run
    volumes:
    - db:/db
    - ./:/bitclout/run  
    ports:
    - 17001:17001
    - 17000:17000
    env_file:
    - dev.env
    expose:
    - "17001"
    - "17000"
  frontend:
    container_name: frontend 
    image: docker.io/bitclout/frontend:23d22a586e70b2f6700f01ab4feabe98e53ea991
    ports:
    - 8080:8080
    volumes:
    - ./:/app
    env_file:
    - dev.env
    expose:
    - "8080"
  nginx: 
    container_name: nginx
    image: nginx:latest
    command: "/bin/sh -c 'while :; do sleep 6h & wait $${!}; nginx -s reload; done & nginx -g \"daemon off;\"'"
    volumes:
      - ./nginx.dev:/etc/nginx/nginx.conf
      - ./data/certbot/conf:/etc/letsencrypt
      - ./data/certbot/www:/var/www/certbot
    depends_on: 
      - backend
      - frontend
    ports:
      - 80:80
      - 443:443
  certbot:
    image: certbot/certbot
    entrypoint: "/bin/sh -c 'trap exit TERM; while :; do certbot renew; sleep 12h & wait $${!}; done;'"
    volumes:
      - ./data/certbot/conf:/etc/letsencrypt
      - ./data/certbot/www:/var/www/certbot
volumes:
  db:

Full logs attached.
full.log

tijno (Contributor) commented Jul 30, 2021

@MirekR anything in dmesg like running out of memory, too many open files, or disk space?

Those are the main reasons I have seen nodes crash or stop syncing.
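For reference, a quick host-side check for those conditions might look roughly like this (generic Linux/Docker commands, not specific to this repo):

# look for OOM kills and file-descriptor errors in the kernel log
dmesg -T | grep -iE 'out of memory|oom|too many open files'

# check free disk space on the filesystem holding the /db volume
df -h

# check the open-file limit for the current shell
ulimit -n

# snapshot container memory/CPU usage
docker stats --no-stream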

tijno (Contributor) commented Jul 30, 2021

Ah, looking at your logs, you did not clear the data in the /db volume. As per the changelog on backend, you need to clear it and resync.
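A minimal sketch of the wipe, assuming a docker named volume called <project>_db (the prefix comes from the compose project name; check docker volume ls for the exact name):

# stop the stack, remove the db volume, and start again to force a full resync
docker-compose down
docker volume rm <project>_db
docker-compose up -d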

maebeam (Contributor) commented Jul 30, 2021

Please reopen this if wiping the /db volume does not resolve it.

maebeam closed this as completed Jul 30, 2021

marnimelrose commented Jul 30, 2021

I did clear the db and I am stuck; should we re-open?

maebeam (Contributor) commented Jul 30, 2021

Sure

maebeam reopened this Jul 30, 2021
MirekR (Author) commented Aug 2, 2021

We shouldn't need to wipe the /db volume every time a change comes in; it kills the global feed as well.

tijno (Contributor) commented Aug 2, 2021

@MirekR We haven't had to; before NFTs, mine had not had to resync for 2 months.

Also, the main issue with a resync is not so much the blocks; they happen relatively quickly.

It's the TxIndex, which is getting replaced by a new DB soon; that should make it all much faster and easier.

MirekR (Author) commented Aug 2, 2021

Performance of the re-sync is one question; losing the node's global feed is another, and from the end-user perspective a more serious issue.

tijno (Contributor) commented Aug 2, 2021

You can keep the global feed in two ways (a sketch of both follows the config snippet below):

1 - don't delete the whole db, e.g. make sure you keep the one in /db/badgerdb/globalstate, or

2 - use the config option to load global state from a central node:

# The IP:PORT or DOMAIN:PORT corresponding to a node that can be used to
# set/get global state. When this is not provided, global state is set/fetched
# from a local DB. Global state is used to manage things like user data, e.g.
# emails, that should not be duplicated across multiple nodes.
GLOBAL_STATE_REMOTE_NODE=
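A rough sketch of both options, using the compose setup from this thread; the volume name <project>_db and the remote node address are placeholders:

# Option 1: copy global state out of the db volume before wiping the rest
docker-compose stop backend
docker run --rm -v <project>_db:/db -v "$PWD":/backup alpine \
  cp -r /db/badgerdb/globalstate /backup/globalstate-backup

# Option 2: fetch global state from a central node instead (set in dev.env)
GLOBAL_STATE_REMOTE_NODE=node.example.com:17001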

MirekR (Author) commented Aug 2, 2021

It's the TxIndex which is getting replaced by new DB soon that should make it all much faster and easier.

Do I read this correctly as "we'll need to re-sync again soon and lose global state again"?

tijno (Contributor) commented Aug 2, 2021

@MirekR I'm sure the upgrade can be done in such a way that you keep global state.

And note my reply above about ways to not lose global state.

maebeam (Contributor) commented Aug 2, 2021

We're in the process of migrating the backing store to postgres, which will make these updates much less painful and more efficient. Global state migration tools will be provided.

addanus commented Sep 11, 2021

@maebeam may I ask where we are with this?

And... has anyone considered a postgres docker image for a faster sync - at least at some set block height?

I was having this reorg issue as recently as last weekend (9/4/21) - deso-protocol/core#98

maebeam (Contributor) commented Sep 11, 2021

Postgres is in beta but not ready to replace badger yet. What errors are you seeing? I haven't seen any other reports of this issue recently.

addanus commented Sep 12, 2021

Sorry for the delay... It's taking a long time to get anywhere - after a complete rebuild and resync.

I get this often - but it's understandable... and at least still moving forward.

E0912 07:27:06.327239 1 peer.go:142] AddBitCloutMessage: Not enqueueing message GET_BLOCKS because peer is disconnecting

I0912 07:27:06.327286 1 server.go:1131] Server._handleBlock: Received block ( 11278 / 59719 ) from Peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=2 ]

It's been 7 hours and still only at 11k - but like I said it's moving.

Plus it turns out I was getting the "reorg" error loop at around 9427 - so at least we are beyond that:

I0906 21:14:55.712786 1 server.go:1131] Server._handleBlock: Received block ( 9427 / 58162 ) from Peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=20 ]

E0906 21:14:55.733306 1 server.go:1125] Server._handleBlock: Encountered an error processing block <Header: < 9427, 00000000009c5230c9fbce0c6369633d751a445f9bc35d3448390821ae7eb2dd, 0 >, Signer Key: NONE>. Disconnecting from peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=20 ]: Error while processing block: : ProcessBlock: Problem fetching block (< TstampSecs: 1616574709, Height: 9313, Hash: 00000000003f6fa554fb4d68eddf9b5c55809bf419c7c458ead7625d4e759d2f, ParentHash 00000000001a0d631b5a863482e1857f5d96aeda65250f2486824e73b57af068, Status: HEADER_VALIDATED | BLOCK_PROCESSED | BLOCK_STORED, CumWork: 8311532279091880>) during attach in reorg: Key not found

I'll keep it going and see what develops

maebeam (Contributor) commented Sep 12, 2021

Do you have a fast and reliable internet connection? I've only seen this type of behavior on low bandwidth / spotty internet.

addanus commented Sep 12, 2021

I have a fast connection, but this is also running on Google Cloud Platform:

e2-standard-8 (8 vCPUs, 32 GB memory), Ubuntu 20.04, with a static IP

addanus commented Sep 13, 2021

Following up:

  1. started totally new on a new server instance
  2. readonly set to true + admin and super-admin keys set (similar to before)
  3. the sync speed was improved - got to the 10k mark within 2 hours
  4. ran into the reorg issue at 18k after a reboot (during the sync, of course); it just kept looping over a few blocks with failure... disconnecting...
  5. completely started over using: docker-compose down -f file.yml --remove-orphans --volumes; docker image purge -a; ./run.sh -d (see the corrected command syntax below)
  6. left it alone and now fully synced, in less than 12 hours
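For anyone copying step 5, the standard docker CLI spelling of that reset would be roughly the following (file.yml and run.sh come from the poster's setup; docker has no "purge" subcommand, the equivalent is prune):

# tear down the stack, removing orphaned containers and named volumes (forces a full resync)
docker-compose -f file.yml down --remove-orphans --volumes

# delete unused images so the next start pulls fresh ones
docker image prune -a

# start the node again
./run.sh -d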

maebeam (Contributor) commented Sep 14, 2021

Very weird. You may have faster syncing luck using larger SSD volumes. IOPS are determined by disk size on GCP.
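If anyone wants to try that, a hypothetical resize on GCE would look roughly like this (disk name and zone are placeholders; persistent-disk IOPS scale with provisioned size):

# grow the persistent disk backing the VM
gcloud compute disks resize deso-node-disk --size=500GB --zone=us-central1-a

# then grow the partition and filesystem inside the VM
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1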

addanus commented Sep 14, 2021

lol. I thought 12 hours was fast, but even that is a barrier tbf.

This latest setup was on a 250 GB SSD, but I hadn't considered the IOPS rate. Good point.

One day, when nodes can go from 0 to synced in 5 minutes, devs will be able to focus on building a business instead of maintaining a node. And I'm sure we early adopters will profit by extension 😺
