
Pod unready rule #40

Merged: 1 commit into target:master, Feb 11, 2020
Conversation

@jdharmon (Contributor) commented Feb 6, 2020

Rule flags pods for reaping based on the time a pod has been unready.
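As a minimal sketch of the idea (not necessarily the PR's exact implementation), a rule like this can key off the pod's Ready condition and its LastTransitionTime in the Kubernetes API; the package, function, and names below are illustrative:

// A minimal sketch, assuming the rule inspects the pod's Ready condition;
// not necessarily the PR's exact code.
package rules

import (
    "time"

    v1 "k8s.io/api/core/v1"
)

// shouldReap reports whether the pod's Ready condition has been false for
// at least maxUnready.
func shouldReap(pod *v1.Pod, maxUnready time.Duration) bool {
    for _, cond := range pod.Status.Conditions {
        if cond.Type == v1.PodReady && cond.Status == v1.ConditionFalse {
            // LastTransitionTime marks when the pod last left the Ready state.
            return time.Since(cond.LastTransitionTime.Time) >= maxUnready
        }
    }
    return false
}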

@brianberzins (Collaborator)

@jdharmon

Hey! Awesome of you to open up a PR!

I'm curious if the use case you're thinking of is the use case that came to my mind when I saw the PR. I was thinking of this as being something like a globally defined liveness probe for all pods on a cluster. It seems like it does something very similar to the liveness probes of a pod spec, but pod-reaper can have significantly more scope, so I was wondering if that's what you're thinking of!

@jdharmon (Contributor, Author) commented Feb 6, 2020

@brianberzins Yes, this acts like a liveness check: it kills pods whose containers are passing the built-in liveness probe but failing the readiness probe for an extended period of time.

We encountered an issue in one of our applications where the running container would get disconnected from the database and be unable to reconnect. When the readiness probe fails, Kubernetes stops routing traffic to the pod; if the pod remains in that state for 10 minutes, we have pod-reaper kill it.

@brianberzins (Collaborator)

Looking to test this out this afternoon/evening!

@brianberzins (Collaborator)

This looks great!

I really feel like I should set up an automated way of testing this against a live cluster. Until then, I'm pretty paranoid about making sure this thing works exactly as I expect, because of what it can do. So here's what I did to test it:

  • A service account with admin access (ONLY for testing in a local cluster)
  • A deployment with pod-reaper and the unready rule set to 1 minute
  • A deployment with some sleeper containers and a readinessProbe that will never succeed

I deployed all of this and made sure the reaper found and killed them!

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-service-account
  namespace: kube-system
---
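# binds cluster-admin to the test service account: for local-cluster testing ONLY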
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: test-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: test-service-account
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-reaper
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pod-reaper
  template:
    metadata:
      labels:
        app: pod-reaper
    spec:
      serviceAccount: test-service-account
      containers:
      - name: unready
        image: brianberzins/pod-reaper:alpha-02072020
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 30m
            memory: 30Mi
          requests:
            cpu: 20m
            memory: 20Mi
        env:
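          # run the reap loop every 15 seconds (cron-style schedule)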
          - name: SCHEDULE
            value: "@every 15s"
          - name: MAX_UNREADY
            value: 1m
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleeper
  namespace: kube-system
spec:
  replicas: 5
  selector:
    matchLabels:
      app: sleeper
  template:
    metadata:
      labels:
        app: sleeper
    spec:
      containers:
      - name: sleeper
        image: ubuntu
        command:
          - sleep
          - infinity
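        # readiness probe that can never succeed: the file does-not-exist is never created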
        readinessProbe:
          exec:
            command:
            - cat
            - does-not-exist

With logs:

{"level":"info","msg":"loaded rule: maximum unready 1m","time":"2020-02-07T21:42:08Z"}
{"level":"info","msg":"reaping pod","pod":"sleeper-5f86dcfc-2rq7f","reasons":["has been unready for 1m11.383988234s"],"time":"2020-02-07T21:43:08Z"}
{"level":"info","msg":"reaping pod","pod":"sleeper-5f86dcfc-4mh55","reasons":["has been unready for 1m10.388264631s"],"time":"2020-02-07T21:43:08Z"}
{"level":"info","msg":"reaping pod","pod":"sleeper-5f86dcfc-fsc46","reasons":["has been unready for 1m11.39190452s"],"time":"2020-02-07T21:43:08Z"}
{"level":"info","msg":"reaping pod","pod":"sleeper-5f86dcfc-hhtbb","reasons":["has been unready for 1m11.481532598s"],"time":"2020-02-07T21:43:08Z"}
{"level":"info","msg":"reaping pod","pod":"sleeper-5f86dcfc-x8b4w","reasons":["has been unready for 1m11.48549029s"],"time":"2020-02-07T21:43:08Z"}=

@brianberzins (Collaborator)

@slushpupie
Looks Good To Me!

@brianberzins (Collaborator) left a comment

I built and tested this out locally (I'm on the paranoid side with this thing!)
Everything looks good!
Appreciate the style match, even if my Go code isn't amazing!

@brianberzins merged commit b1e357e into target:master Feb 11, 2020
@jdharmon deleted the unready branch February 11, 2020 18:14
@brianberzins (Collaborator)

@jdharmon

Thanks again for the pull request!
Are there any other things you'd be interested in seeing?
I rarely get feedback on pod-reaper and generally assume that no complaints means people are happy with it, but I honestly don't know!

@jdharmon (Contributor, Author)

@brianberzins No problem. Thank you for pod-reaper. It saved me from having to write something from scratch.

It would be nice to be able to override pod-reaper settings with labels. For example, if MAX_DURATION=1d but a pod had the label pod-reaper/maxduration: 12h, then that pod would be reaped after 12 hours.
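A rough sketch of how that override could be read (the label key and function name here are hypothetical; note that Go's time.ParseDuration accepts "12h" but has no "1d" day unit):

package rules

import (
    "time"

    v1 "k8s.io/api/core/v1"
)

// effectiveMaxDuration returns the duration from a hypothetical
// pod-reaper/maxduration label when present and parseable, otherwise
// the reaper-wide default.
func effectiveMaxDuration(pod *v1.Pod, defaultMax time.Duration) time.Duration {
    if raw, ok := pod.Labels["pod-reaper/maxduration"]; ok {
        if override, err := time.ParseDuration(raw); err == nil {
            return override
        }
    }
    return defaultMax
}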

@brianberzins (Collaborator)

@jdharmon Huh! That's something I haven't thought about before. I just went down a rabbit hole in my head! If you're up for it, I'll make another issue for this sometime soon and bounce some possible implementation ideas your way to see if any of them make sense to you!

@JordanSussman (Contributor)

> @brianberzins No problem. Thank you for pod-reaper. It saved me from having to write something from scratch.
>
> It would be nice to be able to override pod-reaper settings with labels. For example, if MAX_DURATION=1d but a pod had the label pod-reaper/maxduration: 12h, then that pod would be reaped after 12 hours.

Would you propose that the pod-defined label(s) should always override the pod-reaper-defined behavior? It feels like, at a minimum, the default behavior should be to ignore pod-defined labels unless a configuration option has been set on pod-reaper. I suppose you could also go as granular as defining a "whitelist" of options that pods are allowed to override.
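As one sketch of that opt-in, pod-reaper could read a whitelist of overridable option names from its own configuration; LABEL_OVERRIDES below is an invented name for illustration:

package rules

import (
    "os"
    "strings"
)

// allowedOverrides parses a hypothetical LABEL_OVERRIDES env var (e.g.
// "maxduration,maxunready") into the set of options pods may override.
// Unset or empty means no label overrides are honored, the safe default.
func allowedOverrides() map[string]bool {
    allowed := map[string]bool{}
    for _, opt := range strings.Split(os.Getenv("LABEL_OVERRIDES"), ",") {
        if opt = strings.TrimSpace(opt); opt != "" {
            allowed[opt] = true
        }
    }
    return allowed
}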

@jdharmon (Contributor, Author)

Created issue #44 so the discussion is not lost in this PR.
