Kubernetes on call rotations

Quinton Hoole edited this page Oct 7, 2015 · 8 revisions

Kubernetes "first responder" rotations

Kubernetes has generated a lot of public traffic: email, pull requests, bugs, etc. So much traffic that it's becoming impossible to keep up with it all! This is a fantastic problem to have. To ensure that SOMEONE, but not EVERYONE, on the team is paying attention to public traffic, we have instituted two "first responder" rotations:

  1. Github and Build Cop Rotation
  2. User Support Rotation

Please read the notes on OSS collaboration, particularly the bits about hours. Specifically, each rotation is expected to be active primarily during work hours, less so off hours.

During the regular workday hours of your shift, your primary responsibility is to monitor the traffic sources detailed below. You can check traffic in the evenings if you feel so inclined, but you are not expected to be as focused as during work hours. On weekends, check traffic occasionally (e.g. once or twice a day); again, you are not expected to be as focused as on workdays. Over time, everyone will get both weekday and weekend shifts, so the workload will balance out.

If you cannot serve your shift, and you know this ahead of time, it is your responsibility to find someone to cover and to change the rotation. If you have an emergency, your responsibilities fall on the secondary rotation. If you need help to cover all of the tasks, ask the secondary oncall and/or partners with oncall rotations (e.g., Red Hat).

If you are not on duty you DO NOT need to do these things. You are free to focus on "real work".

Note that Kubernetes will occasionally enter code slush/freeze, prior to milestones. When it does, there might be changes in the instructions (assigning milestones, for instance).

Prerequisites

Traffic sources and responsibilities

  • GitHub https://github.com/kubernetes/kubernetes/issues and https://github.com/kubernetes/kubernetes/pulls: Your job is to be the first responder to all new issues and PRs. If you are not equipped to do this (which is fine!), it is your job to seek guidance!
    • All incoming issues should be tagged with the right labels:
      • a team label (team/X)
        • For issues that overlap teams, you can use multiple team labels
        • Current teams and team members also have corresponding github teams, with descriptions
      • a priority label (priority/pX) is optional. If it is not applied, the owner of that team will manage priorities.
        • if the issue is reporting broken builds, broken e2e tests, or other obvious P0 issues, label the issue with priority/P0 and assign it to someone
      • non-P0 issues do not need a reviewer assigned initially
    • All incoming PRs should be assigned a reviewer.
      • unless it is a WIP, RFC, or design proposal.
      • An auto-assigner is in progress
      • When in doubt, choose a TL or team maintainer of the most relevant team; they can delegate
    • Keep in mind that you can @ mention people in an issue/PR to bring it to their attention without assigning it to them. You can also @ mention github teams, such as @kubernetes/goog-ux or @kubernetes/kubectl
    • If you need help triaging an issue or PR, consult with (or assign it to) @brendandburns, @thockin, or @bgrant0607.
    • Be fair to the next person in rotation: try to ensure that every issue that gets filed while you are on duty is handled. It's useful to query all PRs/issues without the appropriate labels (using -label:foo search filters).
  • Stack Overflow: Respond to any thread that has no responses and is more than 6 hours old (over time we will lengthen this timeout to allow community responses). If you are not equipped to respond, it is your job to redirect to someone who can.
  • IRC (irc.freenode.net #google-containers): Your job is to be on IRC, watching for questions and answering or redirecting as needed. Also check out the IRC logs.
  • Email/Groups: Respond to any thread that has no responses and is more than 6 hours old (over time we will lengthen this timeout to allow community responses). If you are not equipped to respond, it is your job to redirect to someone who can.
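
The "-label:foo" search filters mentioned above can be assembled mechanically. The following is a minimal sketch, not an official tool: it builds a GitHub search query that excludes every label in a list and turns it into a link to the repository's issue or PR page. The label names in TEAM_LABELS are hypothetical placeholders, and the helper names untriaged_query and untriaged_url are invented for this illustration.

```python
from urllib.parse import urlencode

REPO_URL = "https://github.com/kubernetes/kubernetes"

# Hypothetical team labels; replace them with the labels actually in use.
TEAM_LABELS = ["team/api", "team/cluster", "team/ux"]


def untriaged_query(kind, labels):
    """Return a search query matching open items that carry none of `labels`.

    `kind` is "issue" or "pr".
    """
    if kind not in ("issue", "pr"):
        raise ValueError("kind must be 'issue' or 'pr'")
    terms = [f"is:{kind}", "is:open"]
    for label in labels:
        # Quote labels containing spaces so each stays a single search term.
        quoted = f'"{label}"' if " " in label else label
        terms.append(f"-label:{quoted}")
    return " ".join(terms)


def untriaged_url(kind, labels):
    """Return a link to the issue or PR list filtered by the query above."""
    page = "issues" if kind == "issue" else "pulls"
    return f"{REPO_URL}/{page}?" + urlencode({"q": untriaged_query(kind, labels)})


if __name__ == "__main__":
    print(untriaged_query("issue", TEAM_LABELS))
    print(untriaged_url("pr", TEAM_LABELS))
```

Running the query at the end of a shift gives a quick list of items that still lack a team label before handing over to the next person in the rotation.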

In general, try to direct support questions to:

  1. Documentation, such as the user guide and troubleshooting guide
  2. Stack Overflow

If you see questions on a forum other than Stack Overflow, try to redirect them to Stack Overflow. Example response:

Please re-post your question to [stackoverflow](http://stackoverflow.com/questions/tagged/kubernetes). 

We are trying to consolidate the channels to which questions for help/support are posted so that we can improve our efficiency in responding to your requests, and to make it easier for you to find answers to frequently asked questions and how to address common use cases. 

We regularly see messages posted in multiple forums, with the full response thread only in one place or, worse, spread across multiple forums. Also, the large volume of support issues on github is making it difficult for us to use issues to identify real bugs.

The Kubernetes team scans stackoverflow on a regular basis, and will try to ensure your questions don't go unanswered.

Before posting a new question, please search stackoverflow for answers to similar questions, and also familiarize yourself with:
  * [the user guide](http://kubernetes.io/v1.0/)
  * [the troubleshooting guide](http://kubernetes.io/v1.0/docs/troubleshooting.html)

Again, thanks for using Kubernetes.

The Kubernetes Team

If you answer a question (in any of the above forums) that you think might be useful for someone else in the future, please add it to one of the FAQs: User FAQ, Developer FAQ, Debugging FAQ. Getting it into the FAQ is more important than polish. Please indicate the date it was added, so people can judge the likelihood that it is out-of-date (and please correct any FAQ entries that you see contain out-of-date information).

Build-copping

  • If you are a weekday oncall, merge any PRs (including ones that were submitted the previous night or, in the case of the Monday oncall, over the weekend) that:
    • Have been LGTMd.
    • Pass Travis and Shippable.
    • Come from an author who has signed the CLA, if applicable.
  • If you are a weekend oncall, never merge PRs. Instead, add the label "lgtm" to PRs once they have been LGTMd and have passed Travis and Shippable; this makes them easy for the next oncall to find and merge.
    • When the build is broken, roll back the PRs responsible ASAP
    • When E2E tests are unstable, a "merge freeze" may be instituted. During a merge freeze:
      • Whoever a PR is assigned to for review should only label it "lgtm" and not merge it.
      • Oncall should slowly merge LGTMd changes throughout the day while monitoring E2E to ensure stability.
      • Ideally the E2E run should be green, but some tests are flaky and can fail randomly (not as a result of a particular change).
        • If a large number of tests fail, or tests that normally pass fail, that is an indication that one or more of the PRs in that build might be problematic (and should be reverted).
        • Use the Test Results Analyzer to see individual test history over time.
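
The weekday and weekend rules above can be read as a small decision table. The sketch below is only an illustration of that reading, not an existing tool: the PullRequest fields, the Action names, and the next_action helper are all hypothetical, and merge freezes and rollbacks are deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    MERGE = "merge"
    LABEL_LGTM = "add the lgtm label"
    WAIT = "wait"


@dataclass
class PullRequest:
    # Hypothetical summary of a PR's state, filled in by hand or by a script.
    lgtm: bool              # a reviewer has LGTMd the PR
    travis_passed: bool
    shippable_passed: bool
    cla_required: bool      # "if applicable" in the rules above
    cla_signed: bool


def next_action(pr, weekend_shift):
    """Return what the oncall should do with `pr` outside a merge freeze."""
    if not (pr.lgtm and pr.travis_passed and pr.shippable_passed):
        return Action.WAIT
    if weekend_shift:
        # Weekend oncalls never merge; they label so the next oncall can.
        return Action.LABEL_LGTM
    if pr.cla_required and not pr.cla_signed:
        return Action.WAIT
    return Action.MERGE


if __name__ == "__main__":
    pr = PullRequest(lgtm=True, travis_passed=True, shippable_passed=True,
                     cla_required=True, cla_signed=True)
    print(next_action(pr, weekend_shift=False).value)
    print(next_action(pr, weekend_shift=True).value)
```

The CLA check is applied only on the merge path because the weekend rule names only LGTM, Travis, and Shippable as conditions for labeling.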

Contact information

@k8s-oncall will reach the current person on call.