
Pod unready rule #40

Merged: 1 commit into target:master, Feb 11, 2020
Conversation

@jdharmon (Contributor) commented Feb 6, 2020

Rule flags pods for reaping based on the time a pod has been unready.
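As a minimal sketch of the idea (not necessarily the PR's exact implementation), a rule like this can key off the pod's Ready condition and its LastTransitionTime in the Kubernetes API; the package, function, and names below are illustrative:

// A minimal sketch, assuming the rule inspects the pod's Ready condition;
// not necessarily the PR's exact code.
package rules

import (
    "time"

    v1 "k8s.io/api/core/v1"
)

// shouldReap reports whether the pod's Ready condition has been false for
// at least maxUnready.
func shouldReap(pod *v1.Pod, maxUnready time.Duration) bool {
    for _, cond := range pod.Status.Conditions {
        if cond.Type == v1.PodReady && cond.Status == v1.ConditionFalse {
            // LastTransitionTime marks when the pod last left the Ready state.
            return time.Since(cond.LastTransitionTime.Time) >= maxUnready
        }
    }
    return false
}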

@brianberzins (Collaborator)

@jdharmon

Hey! Awesome of you to open up a PR!

I'm curious if the use case you're thinking of is the use case that came to my mind when I saw the PR. I was thinking of this as being something like a globally defined liveness probe for all pods on a cluster. It seems like it does something very similar to the liveness probes of a pod spec, but pod-reaper can have significantly more scope, so I was wondering if that's what you're thinking of!

@jdharmon (Contributor, Author) commented Feb 6, 2020

@brianberzins Yes, this acts like a liveness check: it kills pods whose containers are passing the built-in liveness probe but failing the readiness probe for an extended period of time.

We encountered an issue in one of our applications where the running container would get disconnected from the database and be unable to reconnect. When the readiness probe fails, Kubernetes stops routing traffic to the pod; if the pod remains in that state for 10 minutes, we have pod-reaper kill it.

@brianberzins (Collaborator)

Looking to test this out this afternoon/evening!

@brianberzins (Collaborator)

This looks great!

I really feel like I should set up an automated way of testing this against a live cluster. Until then, I'm pretty paranoid about making sure this thing works exactly as I expect, because of what it can do. So here's what I did to test it:

  • A service account with admin access (ONLY for testing in a local cluster)
  • A deployment with pod-reaper and the unready rule set to 1 minute
  • A deployment with some sleeper containers and a readinessProbe that will never succeed

I deployed all of this and made sure the reaper found and killed them!

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-service-account
  namespace: kube-system
---
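# binds cluster-admin to the test service account: for local-cluster testing ONLY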
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: test-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: test-service-account
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-reaper
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pod-reaper
  template:
    metadata:
      labels:
        app: pod-reaper
    spec:
      serviceAccount: test-service-account
      containers:
      - name: unready
        image: brianberzins/pod-reaper:alpha-02072020
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 30m
            memory: 30Mi
          requests:
            cpu: 20m
            memory: 20Mi
        env:
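          # run the reap loop every 15 seconds (cron-style schedule)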
          - name: SCHEDULE
            value: "@every 15s"
          - name: MAX_UNREADY
            value: 1m
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleeper
  namespace: kube-system
spec:
  replicas: 5
  selector:
    matchLabels:
      app: sleeper
  template:
    metadata:
      labels:
        app: sleeper
    spec:
      containers:
      - name: sleeper
        image: ubuntu
        command:
          - sleep
          - infinity
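        # readiness probe that can never succeed: the file does-not-exist is never created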
        readinessProbe:
          exec:
            command:
            - cat
            - does-not-exist

With logs:

{"level":"info","msg":"loaded rule: maximum unready 1m","time":"2020-02-07T21:42:08Z"}
{"level":"info","msg":"reaping pod","pod":"sleeper-5f86dcfc-2rq7f","reasons":["has been unready for 1m11.383988234s"],"time":"2020-02-07T21:43:08Z"}
{"level":"info","msg":"reaping pod","pod":"sleeper-5f86dcfc-4mh55","reasons":["has been unready for 1m10.388264631s"],"time":"2020-02-07T21:43:08Z"}
{"level":"info","msg":"reaping pod","pod":"sleeper-5f86dcfc-fsc46","reasons":["has been unready for 1m11.39190452s"],"time":"2020-02-07T21:43:08Z"}
{"level":"info","msg":"reaping pod","pod":"sleeper-5f86dcfc-hhtbb","reasons":["has been unready for 1m11.481532598s"],"time":"2020-02-07T21:43:08Z"}
{"level":"info","msg":"reaping pod","pod":"sleeper-5f86dcfc-x8b4w","reasons":["has been unready for 1m11.48549029s"],"time":"2020-02-07T21:43:08Z"}=

@brianberzins (Collaborator)

@slushpupie
Looks Good To Me!

@brianberzins (Collaborator) left a comment

I built and tested this out locally (I'm on the paranoid side with this thing!)
Everything looks good!
Appreciate the style match, even if my Go code isn't amazing!

@brianberzins merged commit b1e357e into target:master Feb 11, 2020
@jdharmon deleted the unready branch February 11, 2020 18:14
@brianberzins (Collaborator)

@jdharmon

Thanks again for the pull request!
Are there any other things you'd be interested in seeing?
I rarely get feedback on pod-reaper and generally assume that no complaints means people are happy with it, but I honestly don't know!

@jdharmon (Contributor, Author)

@brianberzins No problem. Thank you for pod-reaper. It saved me from having to write something from scratch.

It would be nice to be able to override pod-reaper settings with labels. For example, if MAX_DURATION=1d but a pod had the label pod-reaper/maxduration: 12h, then that pod would be reaped after 12 hours.
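A rough sketch of how that override could be read (the label key and function name here are hypothetical; note that Go's time.ParseDuration accepts "12h" but has no "1d" day unit):

package rules

import (
    "time"

    v1 "k8s.io/api/core/v1"
)

// effectiveMaxDuration returns the duration from a hypothetical
// pod-reaper/maxduration label when present and parseable, otherwise
// the reaper-wide default.
func effectiveMaxDuration(pod *v1.Pod, defaultMax time.Duration) time.Duration {
    if raw, ok := pod.Labels["pod-reaper/maxduration"]; ok {
        if override, err := time.ParseDuration(raw); err == nil {
            return override
        }
    }
    return defaultMax
}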

@brianberzins (Collaborator)

@jdharmon Huh! That's something I haven't thought about before. I just went down a rabbit hole in my head! If you're up for it, I'll make another issue for this sometime soon and bounce some possible implementation ideas your way to see if any of them make sense to you!

@JordanSussman (Contributor)

> @brianberzins No problem. Thank you for pod-reaper. It saved me from having to write something from scratch.
>
> It would be nice to be able to override pod-reaper settings with labels. For example, if MAX_DURATION=1d but a pod had the label pod-reaper/maxduration: 12h, then that pod would be reaped after 12 hours.

Would you propose that the pod-defined label(s) should always override the pod-reaper-defined behavior? It feels like, at a minimum, the default behavior should be to ignore pod-defined labels unless a configuration option has been set on pod-reaper. I suppose you could also go as granular as defining a "whitelist" of options that pods are allowed to override.
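As one sketch of that opt-in, pod-reaper could read a whitelist of overridable option names from its own configuration; LABEL_OVERRIDES below is an invented name for illustration:

package rules

import (
    "os"
    "strings"
)

// allowedOverrides parses a hypothetical LABEL_OVERRIDES env var (e.g.
// "maxduration,maxunready") into the set of options pods may override.
// Unset or empty means no label overrides are honored, the safe default.
func allowedOverrides() map[string]bool {
    allowed := map[string]bool{}
    for _, opt := range strings.Split(os.Getenv("LABEL_OVERRIDES"), ",") {
        if opt = strings.TrimSpace(opt); opt != "" {
            allowed[opt] = true
        }
    }
    return allowed
}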

@jdharmon (Contributor, Author)

Created issue #44 so the discussion is not lost in this PR.
