
celestia node grabbing excessive RAM #3129

Closed
joroshiba opened this issue Jan 22, 2024 · 5 comments
Assignees
Wondertan

Labels
bug (Something isn't working), external (Issues created by non node team members)

@joroshiba

Celestia Node version

0.12.3

OS

Alpine 3.18.4

Install tools

Using the ghcr docker container in k8s. This is deployed with our helm chart: https://github.com/astriaorg/dev-cluster/tree/main/charts/celestia-node, utilizing an override which provides a PVC for storage.
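For reference, the override is roughly of this shape. The key names and the size below are illustrative only, not the chart's actual values schema (which lives in the linked repo); only the claim name matches the StatefulSet rendered below.

# Hypothetical values override; key names are illustrative, not the chart's real schema.
storage:
  persistentVolumeClaim:
    enabled: true
    claimName: astria-celestia-node-light-mocha-4-storage-pvc
    size: 100Gi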

k8s statefulset file:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: astria-celestia-node-light-mocha-4
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: OrderedReady
  replicas: 1
  selector:
    matchLabels:
      app: astria-celestia-node-light-mocha-4
  serviceName: ''
  template:
    metadata:
      labels:
        app: astria-celestia-node-light-mocha-4
      name: astria-celestia-node-light-mocha-4
    spec:
      containers:
        - command:
            - ./celestia/scripts/start-node.sh
          image: 'ghcr.io/celestiaorg/celestia-node:v0.12.3'
          imagePullPolicy: IfNotPresent
          name: astria-celestia-node-light-mocha-4
          ports:
            - containerPort: 26658
              name: rpc
              protocol: TCP
          resources:
            limits:
              cpu: '2'
              memory: 25Gi
            requests:
              cpu: '1'
              memory: 8Gi
          securityContext:
            runAsGroup: 10001
            runAsUser: 10001
          volumeMounts:
            - mountPath: /celestia/scripts
              name: astria-celestia-node-light-mocha-4-scripts-vol
            - mountPath: /celestia
              name: astria-celestia-node-light-mocha-4-vol
            - mountPath: /celestia/config.toml
              name: astria-celestia-node-light-mocha-4-files-volume
              subPath: config.toml
        - command:
            - /bin/httpd
            - '-v'
            - '-f'
            - '-p'
            - '5353'
            - '-h'
            - /celestia/token-server/
          image: 'busybox:1.35.0-musl'
          imagePullPolicy: IfNotPresent
          name: token-server
          ports:
            - containerPort: 5353
              name: token-svc
              protocol: TCP
          resources: {}
          startupProbe:
            failureThreshold: 30
            httpGet:
              path: /
              port: token-svc
              scheme: HTTP
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /celestia
              name: astria-celestia-node-light-mocha-4-vol
              readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
        - args:
            - '--node.store'
            - /celestia
          command:
            - /bin/celestia
            - light
            - init
          image: 'ghcr.io/celestiaorg/celestia-node:v0.12.3'
          imagePullPolicy: IfNotPresent
          name: init-astria-celestia-node-light-mocha-4
          volumeMounts:
            - mountPath: /celestia
              name: astria-celestia-node-light-mocha-4-vol
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 10001
        fsGroupChangePolicy: OnRootMismatch
        runAsUser: 10001
      terminationGracePeriodSeconds: 30
      volumes:
        - configMap:
            defaultMode: 484
            name: astria-celestia-node-light-mocha-4-scripts-env
          name: astria-celestia-node-light-mocha-4-scripts-vol
        - configMap:
            defaultMode: 420
            name: astria-celestia-node-light-mocha-4-files-env
          name: astria-celestia-node-light-mocha-4-files-volume
        - name: astria-celestia-node-light-mocha-4-vol
          persistentVolumeClaim:
            claimName: astria-celestia-node-light-mocha-4-storage-pvc
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate

Others

We have experimented with the resources allocated to the light node, most recently raising the limit to 25GB of RAM; the node consumes all available memory and is then killed to maintain safety. The init and start commands can be seen in the k8s StatefulSet above.
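A possibly relevant knob, noted here only as a sketch: celestia-node is a Go binary, so a soft memory target can be set with the standard Go runtime variable GOMEMLIMIT (honored by Go 1.19+ unless the program overrides it), placed somewhat below the container limit so the garbage collector returns memory before the kubelet OOM-kills the pod. Added to the node container spec above, it would look roughly like this (the 20GiB value is illustrative):

          env:
            # Soft memory target for the Go runtime (Go 1.19+); kept below the
            # 25Gi container limit so the GC works harder before the limit is hit.
            - name: GOMEMLIMIT
              value: 20GiB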

Steps to reproduce it

Deploy using the helm chart with a PVC configured. The node very consistently over-consumes memory.

Expected result

The node should not consume 25GB of RAM and should ideally be functional within the stated minimum system requirements (500 MB?).

Actual result

A graph of the node's memory usage in GB is attached below, with each restart marked in a different color.
[image: memory usage graph]

Relevant log output

No response

Notes

No response

@joroshiba joroshiba added the bug (Something isn't working) label Jan 22, 2024
@github-actions github-actions bot added the external (Issues created by non node team members) label Jan 22, 2024
@Wondertan Wondertan self-assigned this Jan 22, 2024
@Wondertan
Member

Wondertan commented Jan 22, 2024

We were recently debugging a case reported by @mycodecrafting where a node similarly grabbed a lot of RAM but wasn't actually using it (as per profiles), and the kernel could still reclaim that memory, as we proved in an experiment: in htop we saw it taking 25G, but once we launched another memory-heavy process, the node quickly shrank to around 1G.

As we are not aware of any other leaks, I would like to first exclude the above. The only difference I see is that in your case the node gets killed or OOMed, while in the above case everything was OK. I need more information on how it gets killed, such as logs from k8s. Additionally, I need profiles from the node, which will definitively confirm whether this issue is related to #3107.
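For the k8s side of that question, an OOM kill by the kubelet is visible directly in the pod status; kubectl get pod <pod> -o yaml would show something like the excerpt below (standard Kubernetes pod status fields; the values are illustrative):

status:
  containerStatuses:
    - name: astria-celestia-node-light-mocha-4
      lastState:
        terminated:
          # OOMKilled with exit code 137 (SIGKILL) indicates the container hit its memory limit
          reason: OOMKilled
          exitCode: 137
      restartCount: 3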

@MSevey
Member

MSevey commented Jan 23, 2024

If this is specific to launching new nodes that need to sync, app has a similar open issue that the DevOps team is seeing:

celestiaorg/celestia-app#2935

Just an FYI; I'm not sure whether there are any common issues between the reports.

@Wondertan
Member

@MSevey, are there similar reports for the node or only for app?

@MSevey
Member

MSevey commented Jan 24, 2024

@MSevey, are there similar reports for the node or only for app?

Currently just app, to my knowledge. But I thought it might be useful to touch base with the app team to see if anything they looked into triggers new ideas here.

@ramin
Contributor

ramin commented Mar 8, 2024

Closing, as I believe we identified this as a thundering herd that hit the running node.

@ramin ramin closed this as completed Mar 8, 2024