
celestia node grabbing excessive RAM #3129

Closed
joroshiba opened this issue Jan 22, 2024 · 5 comments
Assignees
Wondertan

Labels
bug (Something isn't working), external (Issues created by non node team members)

@joroshiba

Celestia Node version

0.12.3

OS

Alpine 3.18.4

Install tools

Using the ghcr docker container in k8s. This is deployed with our helm chart: https://github.com/astriaorg/dev-cluster/tree/main/charts/celestia-node, utilizing an override which provides a PVC for storage.
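For reference, the override is roughly of this shape. The key names and the size below are illustrative only, not the chart's actual values schema (which lives in the linked repo); only the claim name matches the StatefulSet rendered below.

# Hypothetical values override; key names are illustrative, not the chart's real schema.
storage:
  persistentVolumeClaim:
    enabled: true
    claimName: astria-celestia-node-light-mocha-4-storage-pvc
    size: 100Gi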

k8s statefulset file:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: astria-celestia-node-light-mocha-4
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: OrderedReady
  replicas: 1
  selector:
    matchLabels:
      app: astria-celestia-node-light-mocha-4
  serviceName: ''
  template:
    metadata:
      labels:
        app: astria-celestia-node-light-mocha-4
      name: astria-celestia-node-light-mocha-4
    spec:
      containers:
        - command:
            - ./celestia/scripts/start-node.sh
          image: 'ghcr.io/celestiaorg/celestia-node:v0.12.3'
          imagePullPolicy: IfNotPresent
          name: astria-celestia-node-light-mocha-4
          ports:
            - containerPort: 26658
              name: rpc
              protocol: TCP
          resources:
            limits:
              cpu: '2'
              memory: 25Gi
            requests:
              cpu: '1'
              memory: 8Gi
          securityContext:
            runAsGroup: 10001
            runAsUser: 10001
          volumeMounts:
            - mountPath: /celestia/scripts
              name: astria-celestia-node-light-mocha-4-scripts-vol
            - mountPath: /celestia
              name: astria-celestia-node-light-mocha-4-vol
            - mountPath: /celestia/config.toml
              name: astria-celestia-node-light-mocha-4-files-volume
              subPath: config.toml
        - command:
            - /bin/httpd
            - '-v'
            - '-f'
            - '-p'
            - '5353'
            - '-h'
            - /celestia/token-server/
          image: 'busybox:1.35.0-musl'
          imagePullPolicy: IfNotPresent
          name: token-server
          ports:
            - containerPort: 5353
              name: token-svc
              protocol: TCP
          resources: {}
          startupProbe:
            failureThreshold: 30
            httpGet:
              path: /
              port: token-svc
              scheme: HTTP
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /celestia
              name: astria-celestia-node-light-mocha-4-vol
              readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
        - args:
            - '--node.store'
            - /celestia
          command:
            - /bin/celestia
            - light
            - init
          image: 'ghcr.io/celestiaorg/celestia-node:v0.12.3'
          imagePullPolicy: IfNotPresent
          name: init-astria-celestia-node-light-mocha-4
          volumeMounts:
            - mountPath: /celestia
              name: astria-celestia-node-light-mocha-4-vol
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 10001
        fsGroupChangePolicy: OnRootMismatch
        runAsUser: 10001
      terminationGracePeriodSeconds: 30
      volumes:
        - configMap:
            defaultMode: 484
            name: astria-celestia-node-light-mocha-4-scripts-env
          name: astria-celestia-node-light-mocha-4-scripts-vol
        - configMap:
            defaultMode: 420
            name: astria-celestia-node-light-mocha-4-files-env
          name: astria-celestia-node-light-mocha-4-files-volume
        - name: astria-celestia-node-light-mocha-4-vol
          persistentVolumeClaim:
            claimName: astria-celestia-node-light-mocha-4-storage-pvc
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate

Others

We have experimented with the resources allocated to the light node, most recently raising the limit to 25GB of RAM; the node consumes all available memory and is then killed to maintain safety. The init and start commands can be seen in the k8s StatefulSet above.
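A possibly relevant knob, noted here only as a sketch: celestia-node is a Go binary, so a soft memory target can be set with the standard Go runtime variable GOMEMLIMIT (honored by Go 1.19+ unless the program overrides it), placed somewhat below the container limit so the garbage collector returns memory before the kubelet OOM-kills the pod. Added to the node container spec above, it would look roughly like this (the 20GiB value is illustrative):

          env:
            # Soft memory target for the Go runtime (Go 1.19+); kept below the
            # 25Gi container limit so the GC works harder before the limit is hit.
            - name: GOMEMLIMIT
              value: 20GiB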

Steps to reproduce it

Deploy using the helm chart with a PVC configured. The node very consistently over-consumes memory.

Expected result

The node should not consume 25GB of RAM and should ideally be functional within the stated minimum system requirements (500 MB?).

Actual result

A graph of the node's memory usage in GB is attached below, with each restart marked in a different color.
[image: memory usage graph]

Relevant log output

No response

Notes

No response

@joroshiba joroshiba added the bug (Something isn't working) label Jan 22, 2024
@github-actions github-actions bot added the external (Issues created by non node team members) label Jan 22, 2024
@Wondertan Wondertan self-assigned this Jan 22, 2024
@Wondertan
Member

Wondertan commented Jan 22, 2024

We were recently debugging a case reported by @mycodecrafting where a node similarly grabbed a lot of RAM but wasn't actually using it (as per profiles), and the kernel could still reclaim that memory, as we proved in an experiment: in htop we saw it taking 25G, but once we launched another memory-heavy process, the node quickly shrank to around 1G.

As we are not aware of any other leaks, I would like to first exclude the above. The only difference I see is that in your case the node gets killed or OOMed, while in the above case everything was OK. I need more information on how it gets killed, such as logs from k8s. Additionally, I need profiles from the node, which will definitively confirm whether this issue is related to #3107.
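For the k8s side of that question, an OOM kill by the kubelet is visible directly in the pod status; kubectl get pod <pod> -o yaml would show something like the excerpt below (standard Kubernetes pod status fields; the values are illustrative):

status:
  containerStatuses:
    - name: astria-celestia-node-light-mocha-4
      lastState:
        terminated:
          # OOMKilled with exit code 137 (SIGKILL) indicates the container hit its memory limit
          reason: OOMKilled
          exitCode: 137
      restartCount: 3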

@MSevey
Member

MSevey commented Jan 23, 2024

If this is specific to launching new nodes that need to sync, app has a similar open issue that the DevOps team is seeing:

celestiaorg/celestia-app#2935

Just an FYI; I'm not sure whether there are any common issues between the reports.

@Wondertan
Member

@MSevey, are there similar reports for the node or only for app?

@MSevey
Member

MSevey commented Jan 24, 2024

@MSevey, are there similar reports for the node or only for app?

Currently just app, to my knowledge. But I thought it might be useful to touch base with the app team to see if anything they looked into triggers new ideas here.

@ramin
Contributor

ramin commented Mar 8, 2024

Closing, as I believe we identified this as a thundering herd that hit the running node.

@ramin ramin closed this as completed Mar 8, 2024