
2021-02-23 - University of Toronto Hub Partial Outage #87

yuvipanda opened this issue Mar 1, 2021 · 2 comments

Comments

@yuvipanda
Contributor

yuvipanda commented Mar 1, 2021

This is a blameless postmortem

Summary

A YAML nesting misconfiguration in the Zero to JupyterHub upgrade (#82) a week prior prevented the user image from being pulled onto nodes where it didn't already exist. Two nodes that predated the upgrade were still active, so the image was already present on them. When there were enough users to fill those nodes, a third node was automatically provisioned - but the user image could not be pulled there due to the YAML nesting misconfiguration! This caused server starts to fail once there were just enough users on the hub to fill the two 'working' nodes.

The YAML nesting misconfiguration was fixed, and a subsequent login-related issue caused by a misdeploy was also fixed.

Timeline

All times in IST (+0530)

2021-02-23 08:14 AM

Reports of students not being able to log in start coming in from instructors in the Jupyter Community of Practice room on Microsoft Teams.

(Screenshot: instructor reports in the Microsoft Teams channel)

08:30 AM

Notice is posted on the UToronto system status page, letting users know service is degraded.

06:11 PM

2i2c engineers notice the alarm in the Microsoft Teams chat, and investigation starts. However, new server starts are working at this point, so a deeper investigation is deferred.

06:49 PM

New server starts are reported broken again.

08:20 PM

Deeper investigation starts again. Since docker image pulls were being denied, this breaking change in z2jh seemed relevant. Looking at the changelog and our current set of config, it looked like there was a YAML nesting error. All nodes that existed at the time of the migration still had working registry authentication, but new nodes did not. This meant the first 3 nodes' worth of users (~200) were able to start servers properly, but any more triggered a new node that couldn't pull the user image due to this issue.

The z2jh PR suggests moving imagePullSecret to the top level. However, we deploy the JupyterHub chart as a helm dependency, so it would have to be nested under a jupyterhub key. This was missed during review and testing of the upgrade PR, since the effect was delayed.
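
To make the nesting concrete, here is a rough sketch of the two shapes. The field names under imagePullSecret follow the z2jh 0.11-era layout and all values are placeholders - treat this as an illustration of the mistake, not a copy of our actual config.

```yaml
# Illustrative only: field names approximate the z2jh 0.11 layout and
# all values are placeholders.

# Where the changelog puts the key when the JupyterHub chart is
# installed directly - at the top level of the chart's values:
imagePullSecret:
  create: true
  registry: <registry-host>
  username: <username>
  password: <password>

# Where it needs to live for us, since we install the JupyterHub chart
# as a helm dependency of our own chart. Left at the top level as above,
# the key never reaches the JupyterHub subchart, so freshly provisioned
# nodes never get registry credentials:
jupyterhub:
  imagePullSecret:
    create: true
    registry: <registry-host>
    username: <username>
    password: <password>
```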

So, yay YAML nesting issues? This was fixed by this PR, and was deployed manually with a local hubploy deploy for expediency.

08:47 PM

Server starts work again, including on new nodes! Yay!

09:23 PM

Reports that new logins are unsuccessful - on clicking login, users are redirected to the same page. Users who are already logged in can start servers.

Upon more investigation, it turns out that the 'expedient' local deploy with hubploy was using a development version of hubploy that was trying to fix this bug, and wasn't fully functional. This left the hub using a combination of the old and new z2jh versions, leading to this strange error.

10:17 PM

All fixed now.

11:19 PM

UofT system status page was updated to mark the incident as resolved.

Things to improve

  1. There should be an automated health checker that informs us of server start failures, reducing reliance on a human chain of reporting to escalate issues to 2i2c.
  2. Streamline escalation communication channels - the notification in Microsoft Teams was missed for several hours; an alert from PagerDuty coupled with (1) would likely not have been missed as easily.
  3. Document the process for posting updates on UofT System Status. Avi was very helpful in posting statuses this time, but we should document how to quickly communicate with our users during an outage.
  4. Hubploy should provide more useful diagnostic status messages as it goes along, to make issues like this easier to spot.
  5. We should try to catch YAML configuration errors - perhaps by testing them against the schema of the z2jh helm chart? A sketch follows after this list.
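
On point (5), one possible shape for such a check is sketched below, assuming the pinned z2jh version ships a values.schema.json for Helm to validate against. The workflow name, chart directory, and values path are placeholders rather than our actual layout.

```yaml
# Hypothetical CI workflow - a sketch of the idea, not our real setup.
name: validate-hub-config
on: pull_request

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Render the hub chart against our config
        # Helm 3 validates the final values - including anything nested
        # under the `jupyterhub:` key that is passed to the subchart -
        # against any values.schema.json the charts ship, so misnested
        # keys fail here in CI instead of silently doing nothing at
        # deploy time.
        run: |
          helm dependency update hub/
          helm template render-test hub/ \
            --values config/utoronto.yaml
```

One caveat: a key stranded at the wrong level (as in this incident) never reaches the subchart at all, so catching that class of error would also need a schema on our own wrapper chart (for example with additionalProperties: false) or a diff of rendered manifests across the upgrade.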

Action items

  1. Figure out how to get messages posted to systemstatus.utoronto.ca
  2. Work on automated alerts delivered via PagerDuty
  3. Use this incident to inform the Zero to JupyterHub upgrade process
  4. Fix 'Fetch helm dependencies only if needed' (berkeley-dsep-infra/hubploy#109)
  5. Discuss expected SLA from 2i2c for communications on Microsoft Teams
  6. Discuss expected SLA from 2i2c for incident response
@yuvipanda
Contributor Author

I hope to spend a day gathering input from other 2i2c folks, and fleshing out the action items some more.

@choldgraf
Contributor

choldgraf commented Mar 1, 2021

Thanks @yuvipanda for this helpful write-up.

A few thoughts:

  • Agreed that we need to improve the reporting mechanisms, particularly automatic ones
  • re: communications: I think one challenge here is that we're using a different communication space for interfacing w/ UoT vs. what we use for the other infrastructure. We should think about whether or how to connect the two more reliably, or how we can sustainably dedicate resources to paying attention to "other" communication spaces than our own.
  • re: Hub Status: I think an easier short-term solution is to have some "hub status" page that is not integrated with a customer's own system status. Better to provide a pattern for any hub first rather than special-case UoT.

Suggestions:

  • Could we set up a CRON job that will run tests or actions on our hubs in order to catch these kinds of bugs before a user action triggers them? We probably don't want to actually request new nodes, but we could do many of the things that often happen when a new node is triggered (like attempt a Docker pull and a new log-in+server start). A rough sketch follows below.
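
One possible shape for this, sketched as a Kubernetes CronJob rather than a crontab entry. The schedule, names, image tag, and hub domain are placeholders, and it assumes the hub exposes JupyterHub's /hub/health endpoint; a fuller version could use a JupyterHub API token to start and stop a throwaway user's server through the REST API, which would also exercise a docker pull of the user image.

```yaml
# Hypothetical health-check CronJob - names, schedule, and URL are
# placeholders. batch/v1beta1 was the CronJob API version at the time.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hub-health-check
spec:
  schedule: "*/15 * * * *"        # every 15 minutes
  jobTemplate:
    spec:
      backoffLimit: 0             # let a single failure surface
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: check
              image: curlimages/curl:7.75.0
              # --fail makes curl exit non-zero on HTTP errors, so a
              # failed Job can feed whatever alerting backs the
              # PagerDuty idea in the 'Things to improve' list above.
              args:
                - "--fail"
                - "--silent"
                - "--show-error"
                - "https://<hub-domain>/hub/health"
```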
