Node stopped synchronizing after usage of the new docker image #74

Open
MirekR opened this issue Jul 30, 2021 · 20 comments

Comments

MirekR commented Jul 30, 2021

Hi @maebeam, I'm back with the server having stopped synchronising.

After yesterday's update of the docker image to "19c446547510f1e0b83d56611e732e3fa6a0b32d" the server stopped syncing; the latest post I can see is from 12h ago. It has also started ignoring super_admins: at the moment I can see new posts but not the server stats (I was able to for some time yesterday).

Docker compose file:

version: "3.7"
services:
  backend:
    container_name: backend
    image: docker.io/bitclout/backend:19c446547510f1e0b83d56611e732e3fa6a0b32d
    command: run
    volumes:
    - db:/db
    - ./:/bitclout/run  
    ports:
    - 17001:17001
    - 17000:17000
    env_file:
    - dev.env
    expose:
    - "17001"
    - "17000"
  frontend:
    container_name: frontend 
    image: docker.io/bitclout/frontend:23d22a586e70b2f6700f01ab4feabe98e53ea991
    ports:
    - 8080:8080
    volumes:
    - ./:/app
    env_file:
    - dev.env
    expose:
    - "8080"
  nginx: 
    container_name: nginx
    image: nginx:latest
    command: "/bin/sh -c 'while :; do sleep 6h & wait $${!}; nginx -s reload; done & nginx -g \"daemon off;\"'"
    volumes:
      - ./nginx.dev:/etc/nginx/nginx.conf
      - ./data/certbot/conf:/etc/letsencrypt
      - ./data/certbot/www:/var/www/certbot
    depends_on: 
      - backend
      - frontend
    ports:
      - 80:80
      - 443:443
  certbot:
    image: certbot/certbot
    entrypoint: "/bin/sh -c 'trap exit TERM; while :; do certbot renew; sleep 12h & wait $${!}; done;'"
    volumes:
      - ./data/certbot/conf:/etc/letsencrypt
      - ./data/certbot/www:/var/www/certbot
volumes:
  db:

Full logs attached.
full.log

tijno (Contributor) commented Jul 30, 2021

@MirekR anything in dmesg like running out of memory, too many open files, or disk space?

Those are the main reasons I have seen nodes crash or stop syncing.
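For reference, a quick host-side check for those conditions might look roughly like this (generic Linux/Docker commands, not specific to this repo):

# look for OOM kills and file-descriptor errors in the kernel log
dmesg -T | grep -iE 'out of memory|oom|too many open files'

# check free disk space on the filesystem holding the /db volume
df -h

# check the open-file limit for the current shell
ulimit -n

# snapshot container memory/CPU usage
docker stats --no-stream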

tijno (Contributor) commented Jul 30, 2021

Ah, looking at your logs, you did not clear the data in the /db volume. As per the changelog on backend, you need to clear it and resync.
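A minimal sketch of the wipe, assuming a docker named volume called <project>_db (the prefix comes from the compose project name; check docker volume ls for the exact name):

# stop the stack, remove the db volume, and start again to force a full resync
docker-compose down
docker volume rm <project>_db
docker-compose up -d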

maebeam (Contributor) commented Jul 30, 2021

Please reopen this if wiping the /db volume does not resolve it.

maebeam closed this as completed Jul 30, 2021

marnimelrose commented Jul 30, 2021

I did clear the db and I am stuck; should we re-open?

maebeam (Contributor) commented Jul 30, 2021

Sure

maebeam reopened this Jul 30, 2021
MirekR (Author) commented Aug 2, 2021

We shouldn't need to wipe the /db volume every time a change comes in; it kills the global feed as well.

tijno (Contributor) commented Aug 2, 2021

@MirekR We haven't had to; before NFTs, mine had not had to resync for 2 months.

Also, the main issue with a resync is not so much the blocks; they happen relatively quickly.

It's the TxIndex, which is getting replaced by a new DB soon; that should make it all much faster and easier.

MirekR (Author) commented Aug 2, 2021

Performance of the re-sync is one question; losing the node's global feed is another, and from the end-user perspective a more serious issue.

tijno (Contributor) commented Aug 2, 2021

You can keep the global feed in two ways (a sketch of both follows the config snippet below):

1 - don't delete the whole db, e.g. make sure you keep the one in /db/badgerdb/globalstate, or

2 - use the config option to load global state from a central node:

# The IP:PORT or DOMAIN:PORT corresponding to a node that can be used to
# set/get global state. When this is not provided, global state is set/fetched
# from a local DB. Global state is used to manage things like user data, e.g.
# emails, that should not be duplicated across multiple nodes.
GLOBAL_STATE_REMOTE_NODE=
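A rough sketch of both options, using the compose setup from this thread; the volume name <project>_db and the remote node address are placeholders:

# Option 1: copy global state out of the db volume before wiping the rest
docker-compose stop backend
docker run --rm -v <project>_db:/db -v "$PWD":/backup alpine \
  cp -r /db/badgerdb/globalstate /backup/globalstate-backup

# Option 2: fetch global state from a central node instead (set in dev.env)
GLOBAL_STATE_REMOTE_NODE=node.example.com:17001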

MirekR (Author) commented Aug 2, 2021

It's the TxIndex which is getting replaced by new DB soon that should make it all much faster and easier.

Do I read this correctly as "we'll need to re-sync again soon and lose global state again"?

tijno (Contributor) commented Aug 2, 2021

@MirekR I'm sure the upgrade can be done in such a way that you keep global state.

And note my reply above about ways to not lose global state.

maebeam (Contributor) commented Aug 2, 2021

We're in the process of migrating the backing store to postgres, which will make these updates much less painful and more efficient. Global state migration tools will be provided.

addanus commented Sep 11, 2021

@maebeam may I ask where we are with this?

And... has anyone considered a postgres docker image for a faster sync - at least at some set block height?

I was having this reorg issue as recently as last weekend (9/4/21) - deso-protocol/core#98

maebeam (Contributor) commented Sep 11, 2021

Postgres is in beta but not ready to replace badger yet. What errors are you seeing? I haven't seen any other reports of this issue recently.

addanus commented Sep 12, 2021

Sorry for the delay... It's taking a long time to get anywhere - after a complete rebuild and resync.

I get this often - but it's understandable... and at least still moving forward.

E0912 07:27:06.327239 1 peer.go:142] AddBitCloutMessage: Not enqueueing message GET_BLOCKS because peer is disconnecting

I0912 07:27:06.327286 1 server.go:1131] Server._handleBlock: Received block ( 11278 / 59719 ) from Peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=2 ]

It's been 7 hours and still only at 11k - but like I said it's moving.

Plus it turns out I was getting the "reorg" error loop at around 9427 - so at least we are beyond that:

I0906 21:14:55.712786 1 server.go:1131] Server._handleBlock: Received block ( 9427 / 58162 ) from Peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=20 ]

E0906 21:14:55.733306 1 server.go:1125] Server._handleBlock: Encountered an error processing block <Header: < 9427, 00000000009c5230c9fbce0c6369633d751a445f9bc35d3448390821ae7eb2dd, 0 >, Signer Key: NONE>. Disconnecting from peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=20 ]: Error while processing block: : ProcessBlock: Problem fetching block (< TstampSecs: 1616574709, Height: 9313, Hash: 00000000003f6fa554fb4d68eddf9b5c55809bf419c7c458ead7625d4e759d2f, ParentHash 00000000001a0d631b5a863482e1857f5d96aeda65250f2486824e73b57af068, Status: HEADER_VALIDATED | BLOCK_PROCESSED | BLOCK_STORED, CumWork: 8311532279091880>) during attach in reorg: Key not found

I'll keep it going and see what develops

maebeam (Contributor) commented Sep 12, 2021

Do you have a fast and reliable internet connection? I've only seen this type of behavior on low bandwidth / spotty internet.

addanus commented Sep 12, 2021

I have a fast connection, but this is also running on Google Cloud Platform:

e2-standard-8 (8 vCPUs, 32 GB memory), Ubuntu 20.04, with a static IP

addanus commented Sep 13, 2021

Following up:

  1. started totally new on a new server instance
  2. readonly set to true + admin and super-admin keys set (similar to before)
  3. the sync speed was improved - got to the 10k mark within 2 hours
  4. ran into the reorg issue at 18k after a reboot (during the sync, of course); it just kept looping over a few blocks with failure... disconnecting...
  5. completely started over using: docker-compose down -f file.yml --remove-orphans --volumes; docker image purge -a; ./run.sh -d (see the corrected command syntax below)
  6. left it alone and now fully synced, in less than 12 hours
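For anyone copying step 5, the standard docker CLI spelling of that reset would be roughly the following (file.yml and run.sh come from the poster's setup; docker has no "purge" subcommand, the equivalent is prune):

# tear down the stack, removing orphaned containers and named volumes (forces a full resync)
docker-compose -f file.yml down --remove-orphans --volumes

# delete unused images so the next start pulls fresh ones
docker image prune -a

# start the node again
./run.sh -d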

maebeam (Contributor) commented Sep 14, 2021

Very weird. You may have faster syncing luck using larger SSD volumes. IOPS are determined by disk size on GCP.
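If anyone wants to try that, a hypothetical resize on GCE would look roughly like this (disk name and zone are placeholders; persistent-disk IOPS scale with provisioned size):

# grow the persistent disk backing the VM
gcloud compute disks resize deso-node-disk --size=500GB --zone=us-central1-a

# then grow the partition and filesystem inside the VM
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1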

addanus commented Sep 14, 2021

lol. I thought 12 hours was fast, but even that is a barrier tbf.

This latest setup was on a 250 GB SSD, but I hadn't considered the IOPS rate. Good point.

One day, when nodes can go from 0 to synced in 5 minutes, devs will be able to focus on building a business instead of maintaining a node. And I'm sure we early adopters will profit by extension 😺
