Define support ticket urgency levels and practices for how to deal with them #1154

Closed
Tracked by #1068
choldgraf opened this issue Mar 29, 2022 · 8 comments · Fixed by 2i2c-org/team-compass#422

Comments

@choldgraf
Member

choldgraf commented Mar 29, 2022

Context

An important part of the support process is understanding how "urgent" a ticket is. For example, some tickets are general requests that we can finish in a few weeks, while others are immediate fires that we need to fix ASAP. Having a system to categorize these tickets will help us make better decisions about our operational time and reduce the stress of not knowing whether we need to drop everything to fix something.

Proposal

We should define:

  • a few levels of "urgency" for our support tickets
  • criteria for how to categorize tickets into these levels
  • processes that we follow to resolve different kinds of levels

As a result of this, we may need to do further development work to improve our support practices, such as setting up a PagerDuty-like system, but we'll understand that better once we write out the high-level structure.

Implementation guide

See the parent issue for a lot of references to support processes with urgency levels:

@choldgraf choldgraf added Enhancement An improvement to something or creating something new. 🏷️ team-process labels Mar 29, 2022
@yuvipanda
Member

I think an important first step is to define an incident in an objective way that's independent of how urgent the user thinks the problem is. An incident is when one of the following is true:

  1. Users can't log in to the hub
  2. Users can't start a server
  3. (For dask-gateway) Users can't start dask workers

I think when any of these are true, we should 'declare an incident'. https://sre.google/workbook/incident-response/ has some good ideas on what to do when that happens, inspired by actual fire incidents in the literal wild.
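
As a minimal sketch of what "objective" could look like, the three conditions above can be written as a predicate over a structured ticket. The `TicketReport` fields and the `is_incident` name are hypothetical and not part of any existing 2i2c tooling; the user's own urgency rating is recorded but deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class TicketReport:
    """Hypothetical structured view of a support ticket (not a real schema)."""

    login_failing: bool = False
    server_start_failing: bool = False
    uses_dask_gateway: bool = False
    dask_workers_failing: bool = False
    # Recorded for context, but never used to classify the ticket.
    user_stated_urgency: str = "normal"


def is_incident(report: TicketReport) -> bool:
    """Return True if any of the three objective criteria holds."""
    return (
        report.login_failing
        or report.server_start_failing
        # The dask criterion only applies to hubs that run dask-gateway.
        or (report.uses_dask_gateway and report.dask_workers_failing)
    )
```

With this shape, a ticket the user marks as "critical" but that matches none of the criteria is still handled as a regular support request.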

@yuvipanda
Member

Taking a page out of https://sre.google/workbook/incident-response/#putting-best-practices-into-practice, here's a very specific but first-draft process recommendation for an incident management workflow.

When a ticket comes in, we perform the following test:

  1. Is it a report about users not being able to log in?
  2. Is it a report about users not being able to start their server?
  3. (For dask-gateway) Is it a report about users not being able to start dask workers?

If any of these criteria are met, the support steward declares an incident, by doing the following:

  1. Opening an issue in this repo, using https://github.com/2i2c-org/infrastructure/issues/new?assignees=&labels=type%3A+Hub+Incident%2Csupport&template=3_incident-report.md&title=%5BIncident%5D+%7B%7B+TITLE+%7D%7D (we will refine this too)

  2. Assign an Incident commander for this particular incident. I like the definition in https://response.pagerduty.com/before/different_roles/, which says they:

    a. Commands and coordinates the incident response, delegating roles as needed. By default, the IC assumes all roles that have not been delegated yet.
    b. Communicates effectively.
    c. Stays in control of the incident response.
    d. Works with other responders to resolve the incident.

    The expectation should be that this is a different person than the support steward, to reduce the load on the support steward and recognize they are not responsible for resolving all outages. We should define a process for figuring out who gets to be incident commander separately from this, but this comment just needs to acknowledge that this is a separate role from what the support steward is doing.

  3. Respond to the support ticket by acknowledging the incident, tagging in the incident commander.

The incident commander is responsible for investigating the issue, pulling in people if necessary, and keeping the customer informed via the support ticket. They can also tag out when it is no longer working hours for them - we should engineer a process that makes this viable.
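
The declaration steps above can be sketched in a few lines of Python. This is only an illustration: the query parameters are the ones from the issue-template link in step 1, while the function names, the `available` list, and the rule of "first available person who is not the support steward" are assumptions, since the process for choosing an incident commander is not defined yet.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode

ISSUE_BASE = "https://github.com/2i2c-org/infrastructure/issues/new"


def incident_issue_url(title: str) -> str:
    """Build the prefilled 'new incident' issue link for a given title."""
    params = {
        "assignees": "",
        "labels": "type: Hub Incident,support",
        "template": "3_incident-report.md",
        "title": f"[Incident] {title}",
    }
    # urlencode percent-encodes values and turns spaces into '+'.
    return f"{ISSUE_BASE}?{urlencode(params)}"


def pick_incident_commander(
    available: Sequence[str], support_steward: str
) -> Optional[str]:
    """Hypothetical rule: first available person who is not the steward."""
    for person in available:
        if person != support_steward:
            return person
    return None  # nobody else is available; needs a human decision
```

For example, `incident_issue_url("Users cannot log in")` gives a link the support steward can open directly, and `pick_incident_commander` returning `None` is a signal that the steward is alone and the hand-off needs to be escalated some other way.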

How does this sound as a start? I can make this into a PR and we can iterate.

@yuvipanda
Member

https://response.pagerduty.com/before/different_roles/ is also a good read.

@yuvipanda
Member

Couple more vague thoughts:

  1. It should deeply respect working-hours rules, so we don't expect people to be 'up all night' (or outside their working hours, whatever they are). Tagging someone else in to this role is to be expected, so the process should be built around this being a role rather than a person.
  2. Incidents should be rare - we should have an appropriate post-mortem process to try to fix these up. If we are spending too much time on this, we could use error budget techniques (https://sre.google/sre-book/embracing-risk/) to reduce that time.
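
To make the error budget idea concrete, here is a small worked example. The 99.9% availability target and the 30-day window are assumptions for illustration only; this thread does not set an SLO, and the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(
    slo: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Budget left after subtracting downtime already spent on incidents."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# Assumed target: 99.9% over 30 days allows roughly 43 minutes of downtime.
budget = error_budget_minutes(0.999)
# A negative remainder means the budget for the window is exhausted.
left = budget_remaining(0.999, downtime_minutes=60)
```

Under these assumed numbers, a single one-hour outage already overspends the monthly budget, which is the kind of signal that would shift time from new work toward reliability.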

@damianavila
Contributor

The expectation should be that this is a different person than the support steward, to reduce the load on the support steward and recognize they are not responsible for resolving all outages. We should define a process for figuring out who gets to be incident commander separately from this, but this comment just needs to acknowledge that this is a separate role from what the support steward is doing.

So, are we thinking about having 2 people in support AND one incident commander from the rest of the team?
Or would one (1) of the two (2) support folks assume the incident commander role?

@yuvipanda
Member

I think this is closed by 2i2c-org/team-compass#422

@damianavila
Contributor

I think there are some additional pieces on this merged PR: 2i2c-org/docs#143

@choldgraf, do you want to keep this open for something else?

@choldgraf
Member Author

Let's say that 2i2c-org/team-compass#422 closes this one, and we can continue iterating in new / more focused issues from there?
