My setup is the following:
```
$ kubectl logs -n anteon alaz-daemonset-sskqx
{"level":"info","tag":"v0.11.3","time":1723187890,"message":"alaz tag"}
{"level":"info","time":1723187890,"message":"k8sCollector initializing..."}
{"level":"info","time":1723187890,"message":"Connected successfully to CRI using endpoint unix:///proc/1/root/run/containerd/containerd.sock"}
panic: runtime error: integer divide by zero

goroutine 47 [running]:
github.com/ddosify/alaz/aggregator.(*ClusterInfo).handleSocketMapCreation(0xc0002dc5b0)
	/app/aggregator/cluster.go:89 +0x33d
created by github.com/ddosify/alaz/aggregator.newClusterInfo in goroutine 1
	/app/aggregator/cluster.go:59 +0x1a9
```
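For context on the panic message itself: in Go, an integer division or modulo with a zero denominator is a runtime panic rather than a recoverable error value. A minimal sketch (the `shards` and `pid` names are invented for illustration; this is not alaz's actual code) that produces the same `integer divide by zero` panic:

```go
package main

// Minimal sketch of the panic class seen in the log above: in Go, a
// division or modulo with a zero denominator panics at runtime with
// "integer divide by zero" instead of returning an error.
func main() {
	shards := []int{} // imagine a socket-map shard slice that is still empty
	pid := 1234

	// len(shards) is 0 here, so the modulo below panics with:
	//   panic: runtime error: integer divide by zero
	_ = shards[pid%len(shards)]
}
```

So whatever `cluster.go:89` divides or takes a modulo by is evidently still 0 at the moment `handleSocketMapCreation` runs.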
```
$ kubectl describe pod -n anteon alaz-daemonset-sskqx
Name:             alaz-daemonset-sskqx
Namespace:        anteon
Priority:         0
Service Account:  alaz-serviceaccount
Node:             thinkpad/192.168.1.38
Start Time:       Fri, 09 Aug 2024 10:01:44 +0300
Labels:           app=alaz
                  controller-revision-hash=6f9d87bfc4
                  pod-template-generation=1
Annotations:      cni.projectcalico.org/containerID: 003a6554ea84ff581daee5b353ccf9b6619a8febdb6302ce34a566764f0e45f3
                  cni.projectcalico.org/podIP: 10.1.19.183/32
                  cni.projectcalico.org/podIPs: 10.1.19.183/32
Status:           Running
IP:               10.1.19.183
IPs:
  IP:  10.1.19.183
Controlled By:  DaemonSet/alaz-daemonset
Containers:
  alaz-pod:
    Container ID:  containerd://c6c904add2264b0016798d11550f2ff05e683fe713c681c3f3a415e31de9f07c
    Image:         ddosify/alaz:v0.11.3
    Image ID:      docker.io/ddosify/alaz@sha256:08dbbb8ba337ce340a8ba8800e710ff5a2df9612ea258cdc472867ea0bb97224
    Port:          8181/TCP
    Host Port:     0/TCP
    Args:
      --no-collector.wifi
      --no-collector.hwmon
      --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
      --collector.netclass.ignored-devices=^(veth.*)$
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 09 Aug 2024 10:18:10 +0300
      Finished:     Fri, 09 Aug 2024 10:18:11 +0300
    Ready:          False
    Restart Count:  8
    Limits:
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  400Mi
    Environment:
      TRACING_ENABLED:             true
      METRICS_ENABLED:             true
      LOGS_ENABLED:                false
      BACKEND_HOST:                http://bore.pub:39548/api-alaz
      LOG_LEVEL:                   1
      MONITORING_ID:               7c6a484a-ec47-46a6-946d-4071ff6cf883
      SEND_ALIVE_TCP_CONNECTIONS:  false
      NODE_NAME:                   (v1:spec.nodeName)
    Mounts:
      /sys/kernel/debug from debugfs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df6xh (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  debugfs:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/kernel/debug
    HostPathType:
  kube-api-access-df6xh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  3m54s (x68 over 18m)  kubelet  Back-off restarting failed container alaz-pod in pod alaz-daemonset-sskqx_anteon(a3d74951-574e-4149-8db3-9749a627f5fd)
```
alaz.yaml:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alaz-serviceaccount
  namespace: anteon
---
# For alaz to keep track of changes in cluster
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alaz-role
  namespace: anteon
rules:
  - apiGroups:
      - "*"
    resources:
      - pods
      - services
      - endpoints
      - replicasets
      - deployments
      - daemonsets
      - statefulsets
    verbs:
      - "get"
      - "list"
      - "watch"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: alaz-role-binding
  namespace: anteon
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: alaz-role
subjects:
  - kind: ServiceAccount
    name: alaz-serviceaccount
    namespace: anteon
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: alaz-daemonset
  namespace: anteon
spec:
  selector:
    matchLabels:
      app: alaz
  template:
    metadata:
      labels:
        app: alaz
    spec:
      hostPID: true
      containers:
        - env:
            - name: TRACING_ENABLED
              value: "true"
            - name: METRICS_ENABLED
              value: "true"
            - name: LOGS_ENABLED
              value: "false"
            - name: BACKEND_HOST
              value: http://bore.pub:39548/api-alaz
            - name: LOG_LEVEL
              value: "1"
            # - name: EXCLUDE_NAMESPACES
            #   value: "^anteon.*"
            - name: MONITORING_ID
              value: 7c6a484a-ec47-46a6-946d-4071ff6cf883
            - name: SEND_ALIVE_TCP_CONNECTIONS  # Send undetected protocol connections (unknown connections)
              value: "false"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
          args:
            - --no-collector.wifi
            - --no-collector.hwmon
            - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
            - --collector.netclass.ignored-devices=^(veth.*)$
          image: ddosify/alaz:v0.11.3
          imagePullPolicy: IfNotPresent
          name: alaz-pod
          ports:
            - containerPort: 8181
              protocol: TCP
          resources:
            limits:
              memory: 1Gi
            requests:
              cpu: "1"
              memory: 400Mi
          securityContext:
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          # needed for linking ebpf trace programs
          volumeMounts:
            - mountPath: /sys/kernel/debug
              name: debugfs
              readOnly: false
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: alaz-serviceaccount
      serviceAccountName: alaz-serviceaccount
      terminationGracePeriodSeconds: 30
      # needed for linking ebpf trace programs
      volumes:
        - name: debugfs
          hostPath:
            path: /sys/kernel/debug
```
The only thing I did differently compared to the documentation was using bore.pub instead of ngrok, which I don't think should be a problem.
I'm running Arch Linux with kernel 6.10.1-arch1-1.
I'm getting the same issue when I deploy via the Helm chart as well:
{"level":"info","tag":"v0.12.0","time":1723477886,"message":"alaz tag"} {"level":"info","time":1723477886,"message":"k8sCollector initializing..."} {"level":"info","time":1723477886,"message":"Connected successfully to CRI using endpoint unix:///proc/1/root/run/containerd/containerd.sock"} {"level":"error","time":1723477887,"message":"error creating gpu collector: failed to load nvidia driver: <nil>"} {"level":"error","time":1723477887,"message":"error exporting gpu metrics: failed to load nvidia driver: <nil>"} panic: runtime error: integer divide by zero goroutine 85 [running]: github.com/ddosify/alaz/aggregator.(*ClusterInfo).handleSocketMapCreation(0xc0002fcd90) /app/aggregator/cluster.go:89 +0x33d created by github.com/ddosify/alaz/aggregator.newClusterInfo in goroutine 1 /app/aggregator/cluster.go:59 +0x1a9
I guess there is a race condition on this line: alaz/aggregator/cluster.go, line 89 (commit 2f383f1).
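If that guess is right, the denominator at that line is being read while it is still at its zero value. Below is a minimal sketch of one possible guard, assuming a pattern like `pid % shardCount`; all names here are hypothetical and not alaz's real fields:

```go
package main

import "fmt"

// Hypothetical guard for the suspected race. The assumption (invented
// for illustration, not taken from alaz's source) is that cluster.go:89
// computes something like `pid % shardCount`, and that shardCount can
// still be 0 when handleSocketMapCreation's goroutine starts running.
// Checking the denominator first turns the panic into an error.
type clusterInfo struct {
	shardCount int // set during initialization; 0 until then
}

func (c *clusterInfo) shardFor(pid int) (int, error) {
	if c.shardCount == 0 {
		return 0, fmt.Errorf("shard count not initialized yet")
	}
	return pid % c.shardCount, nil
}

func main() {
	c := &clusterInfo{} // shardCount is still 0, as in the crash scenario
	if _, err := c.shardFor(1234); err != nil {
		fmt.Println("skipping socket map creation:", err) // no panic
	}
}
```

Synchronizing the initialization in `newClusterInfo` so the denominator is set before the goroutine starts would be the other obvious fix direction.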