Crashes on self-hosted with panic: runtime error: integer divide by zero #182

Open
orhun opened this issue Aug 9, 2024 · 2 comments
orhun commented Aug 9, 2024

My setup is the following:

  • Anteon Self Hosted running via Docker compose
  • Kubernetes running via microk8s
  • Alaz installed via kubectl, which fails with the following error:
$ kubectl logs -n anteon alaz-daemonset-sskqx

{"level":"info","tag":"v0.11.3","time":1723187890,"message":"alaz tag"}
{"level":"info","time":1723187890,"message":"k8sCollector initializing..."}
{"level":"info","time":1723187890,"message":"Connected successfully to CRI using endpoint unix:///proc/1/root/run/containerd/containerd.sock"}
panic: runtime error: integer divide by zero

goroutine 47 [running]:
github.com/ddosify/alaz/aggregator.(*ClusterInfo).handleSocketMapCreation(0xc0002dc5b0)
	/app/aggregator/cluster.go:89 +0x33d
created by github.com/ddosify/alaz/aggregator.newClusterInfo in goroutine 1
	/app/aggregator/cluster.go:59 +0x1a9

$ kubectl describe pod -n anteon alaz-daemonset-sskqx
Name:             alaz-daemonset-sskqx
Namespace:        anteon
Priority:         0
Service Account:  alaz-serviceaccount
Node:             thinkpad/192.168.1.38
Start Time:       Fri, 09 Aug 2024 10:01:44 +0300
Labels:           app=alaz
                  controller-revision-hash=6f9d87bfc4
                  pod-template-generation=1
Annotations:      cni.projectcalico.org/containerID: 003a6554ea84ff581daee5b353ccf9b6619a8febdb6302ce34a566764f0e45f3
                  cni.projectcalico.org/podIP: 10.1.19.183/32
                  cni.projectcalico.org/podIPs: 10.1.19.183/32
Status:           Running
IP:               10.1.19.183
IPs:
  IP:           10.1.19.183
Controlled By:  DaemonSet/alaz-daemonset
Containers:
  alaz-pod:
    Container ID:  containerd://c6c904add2264b0016798d11550f2ff05e683fe713c681c3f3a415e31de9f07c
    Image:         ddosify/alaz:v0.11.3
    Image ID:      docker.io/ddosify/alaz@sha256:08dbbb8ba337ce340a8ba8800e710ff5a2df9612ea258cdc472867ea0bb97224
    Port:          8181/TCP
    Host Port:     0/TCP
    Args:
      --no-collector.wifi
      --no-collector.hwmon
      --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
      --collector.netclass.ignored-devices=^(veth.*)$
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 09 Aug 2024 10:18:10 +0300
      Finished:     Fri, 09 Aug 2024 10:18:11 +0300
    Ready:          False
    Restart Count:  8
    Limits:
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  400Mi
    Environment:
      TRACING_ENABLED:             true
      METRICS_ENABLED:             true
      LOGS_ENABLED:                false
      BACKEND_HOST:                http://bore.pub:39548/api-alaz
      LOG_LEVEL:                   1
      MONITORING_ID:               7c6a484a-ec47-46a6-946d-4071ff6cf883
      SEND_ALIVE_TCP_CONNECTIONS:  false
      NODE_NAME:                    (v1:spec.nodeName)
    Mounts:
      /sys/kernel/debug from debugfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df6xh (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  debugfs:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/kernel/debug
    HostPathType:  
  kube-api-access-df6xh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  3m54s (x68 over 18m)  kubelet  Back-off restarting failed container alaz-pod in pod alaz-daemonset-sskqx_anteon(a3d74951-574e-4149-8db3-9749a627f5fd)
alaz.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alaz-serviceaccount
  namespace: anteon
---
# For alaz to keep track of changes in cluster
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alaz-role
  namespace: anteon
rules:
- apiGroups:
  - "*"
  resources:
  - pods
  - services
  - endpoints
  - replicasets
  - deployments
  - daemonsets
  - statefulsets
  verbs:
  - "get"
  - "list"
  - "watch"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: alaz-role-binding
  namespace: anteon
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: alaz-role
subjects:
- kind: ServiceAccount
  name: alaz-serviceaccount
  namespace: anteon
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: alaz-daemonset
  namespace: anteon
spec:
  selector:
    matchLabels:
      app: alaz
  template:
    metadata:
      labels:
        app: alaz
    spec:
      hostPID: true
      containers:
      - env:
        - name: TRACING_ENABLED
          value: "true"
        - name: METRICS_ENABLED
          value: "true"
        - name: LOGS_ENABLED
          value: "false"
        - name: BACKEND_HOST
          value: http://bore.pub:39548/api-alaz
        - name: LOG_LEVEL
          value: "1"
        # - name: EXCLUDE_NAMESPACES
        #   value: "^anteon.*"
        - name: MONITORING_ID
          value: 7c6a484a-ec47-46a6-946d-4071ff6cf883
        - name: SEND_ALIVE_TCP_CONNECTIONS  # Send undetected protocol connections (unknown connections)
          value: "false"
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        args:
        - --no-collector.wifi
        - --no-collector.hwmon
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*)$
        image: ddosify/alaz:v0.11.3
        imagePullPolicy: IfNotPresent
        name: alaz-pod
        ports:
        - containerPort: 8181
          protocol: TCP
        resources:
          limits:
            memory: 1Gi
          requests:
            cpu: "1"
            memory: 400Mi
        securityContext:
          privileged: true 
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        # needed for linking ebpf trace programs
        volumeMounts:
        - mountPath: /sys/kernel/debug
          name: debugfs
          readOnly: false
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: alaz-serviceaccount
      serviceAccountName: alaz-serviceaccount
      terminationGracePeriodSeconds: 30
      # needed for linking ebpf trace programs
      volumes:
      - name: debugfs
        hostPath:
          path: /sys/kernel/debug

The only thing I did differently compared to the documentation was using bore.pub instead of ngrok, which I don't think should be a problem.

I'm running Arch Linux with kernel 6.10.1-arch1-1.

orhun commented Aug 12, 2024

I'm getting the same issue when I deploy via the Helm chart as well:

{"level":"info","tag":"v0.12.0","time":1723477886,"message":"alaz tag"}
{"level":"info","time":1723477886,"message":"k8sCollector initializing..."}
{"level":"info","time":1723477886,"message":"Connected successfully to CRI using endpoint unix:///proc/1/root/run/containerd/containerd.sock"}
{"level":"error","time":1723477887,"message":"error creating gpu collector: failed to load nvidia driver: <nil>"}
{"level":"error","time":1723477887,"message":"error exporting gpu metrics: failed to load nvidia driver: <nil>"}
panic: runtime error: integer divide by zero

goroutine 85 [running]:
github.com/ddosify/alaz/aggregator.(*ClusterInfo).handleSocketMapCreation(0xc0002fcd90)
	/app/aggregator/cluster.go:89 +0x33d
created by github.com/ddosify/alaz/aggregator.newClusterInfo in goroutine 1
	/app/aggregator/cluster.go:59 +0x1a9


orhun commented Aug 12, 2024

I guess there is a race condition on this line:

i := (ci.muIndex.Load()) % uint64(len(ci.muArray))
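
If I read that line correctly, the panic happens whenever ci.muArray is still empty when handleSocketMapCreation runs, since uint64(len(ci.muArray)) is then 0 and the modulo divides by zero. A minimal standalone sketch (hypothetical types, not the actual alaz structs) that reproduces this failure mode and shows a possible guard:

// Hypothetical reproduction, not the actual alaz code: if muArray is still
// empty when handleSocketMapCreation runs, uint64(len(ci.muArray)) is 0 and
// the modulo panics with "integer divide by zero".
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type clusterInfo struct {
	muIndex atomic.Uint64
	muArray []*sync.RWMutex // assumed to be filled by a separate init path
}

func (ci *clusterInfo) handleSocketMapCreation() {
	// Possible guard: skip (or wait) until the array is populated.
	if len(ci.muArray) == 0 {
		fmt.Println("muArray not initialized yet, skipping")
		return
	}
	i := ci.muIndex.Load() % uint64(len(ci.muArray)) // panics when len == 0
	fmt.Println("selected mutex index:", i)
}

func main() {
	ci := &clusterInfo{} // muArray never populated, as in the suspected race
	ci.handleSocketMapCreation()
}

With the guard removed and muArray left empty, this panics exactly like the stack trace above, so either the goroutine is started before muArray is filled, or the array ends up with zero length on this particular setup.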
