
2021-02-23 - University of Toronto Hub Partial Outage #87

yuvipanda opened this issue Mar 1, 2021 · 2 comments

Comments

@yuvipanda
Contributor

yuvipanda commented Mar 1, 2021

This is a blameless postmortem

Summary

A YAML nesting misconfiguration in the Zero to JupyterHub upgrade (#82) a week prior prevented the user image from being pulled onto nodes where it didn't already exist. Two nodes that predated the upgrade were still active, so the image was already present on them. When there were enough users to fill those nodes, a third node was automatically provisioned - but the user image could not be pulled there due to the YAML nesting misconfiguration! This caused server starts to fail once there were just enough users on the hub to fill the two 'working' nodes.

The YAML nesting misconfiguration was fixed, and a subsequent login-related issue caused by a misdeploy was also fixed.

Timeline

All times in IST (+0530)

2021-02-23 08:14 AM

Reports of students not being able to log in start coming in from instructors in the Jupyter Community of Practice room on Microsoft Teams.

(Screenshot: instructor reports in the Microsoft Teams channel)

08:30 AM

Notice is posted on the UToronto system status page, letting users know service is degraded.

06:11 PM

2i2c engineers notice the alarm in the Microsoft Teams chat, and investigation starts. However, new server starts are working at this point, so a deeper investigation is deferred.

06:49 PM

New server starts are reported broken again.

08:20 PM

Deeper investigation starts again. Since docker image pulls were being denied, this breaking change in z2jh seemed relevant. Looking at the changelog and our current set of config, it looked like there was a YAML nesting error. All nodes that existed at the time of the migration still had working registry authentication, but new nodes did not. This meant the first 3 nodes' worth of users (~200) were able to start servers properly, but any more triggered a new node that couldn't pull the user image due to this issue.

The z2jh PR suggests moving imagePullSecret to the top level. However, we deploy the JupyterHub chart as a helm dependency, so it would have to be nested under a jupyterhub key. This was missed during review and testing of the upgrade PR, since the effect was delayed.
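
To make the nesting concrete, here is a rough sketch of the two shapes. The field names under imagePullSecret follow the z2jh 0.11-era layout and all values are placeholders - treat this as an illustration of the mistake, not a copy of our actual config.

```yaml
# Illustrative only: field names approximate the z2jh 0.11 layout and
# all values are placeholders.

# Where the changelog puts the key when the JupyterHub chart is
# installed directly - at the top level of the chart's values:
imagePullSecret:
  create: true
  registry: <registry-host>
  username: <username>
  password: <password>

# Where it needs to live for us, since we install the JupyterHub chart
# as a helm dependency of our own chart. Left at the top level as above,
# the key never reaches the JupyterHub subchart, so freshly provisioned
# nodes never get registry credentials:
jupyterhub:
  imagePullSecret:
    create: true
    registry: <registry-host>
    username: <username>
    password: <password>
```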

So, yay YAML nesting issues? This was fixed by this PR, and was deployed manually with a local hubploy deploy for expediency.

08:47 PM

Server starts work again, including on new nodes! Yay!

09:23 PM

Reports that new logins are unsuccessful - on clicking login, users are redirected to the same page. Users who are already logged in can start servers.

Upon more investigation, it turns out that the 'expedient' local deploy with hubploy was using a development version of hubploy that was trying to fix this bug, and wasn't fully functional. This left the hub using a combination of the old and new z2jh versions, leading to this strange error.

10:17 PM

All fixed now.

11:19 PM

UofT system status page was updated to mark the incident as resolved.

Things to improve

  1. There should be an automated health checker that informs us of server start failures, reducing reliance on a human chain of reporting to escalate issues to 2i2c.
  2. Streamline escalation communication channels - the notification in Microsoft Teams was missed for several hours; an alert from PagerDuty coupled with (1) would likely not have been missed as easily.
  3. Document the process for posting updates on UofT System Status. Avi was very helpful in posting statuses this time, but we should document how to quickly communicate with our users during an outage.
  4. Hubploy should provide more useful diagnostic status messages as it goes along, to make issues like this easier to spot.
  5. We should try to catch YAML configuration errors - perhaps by testing them against the schema of the z2jh helm chart? A sketch follows after this list.
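
On point (5), one possible shape for such a check is sketched below, assuming the pinned z2jh version ships a values.schema.json for Helm to validate against. The workflow name, chart directory, and values path are placeholders rather than our actual layout.

```yaml
# Hypothetical CI workflow - a sketch of the idea, not our real setup.
name: validate-hub-config
on: pull_request

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Render the hub chart against our config
        # Helm 3 validates the final values - including anything nested
        # under the `jupyterhub:` key that is passed to the subchart -
        # against any values.schema.json the charts ship, so misnested
        # keys fail here in CI instead of silently doing nothing at
        # deploy time.
        run: |
          helm dependency update hub/
          helm template render-test hub/ \
            --values config/utoronto.yaml
```

One caveat: a key stranded at the wrong level (as in this incident) never reaches the subchart at all, so catching that class of error would also need a schema on our own wrapper chart (for example with additionalProperties: false) or a diff of rendered manifests across the upgrade.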

Action items

  1. Figure out how to get messages posted to systemstatus.utoronto.ca
  2. Work on automated alerts delivered via PagerDuty
  3. Use this incident to inform the Zero to JupyterHub upgrade process
  4. Fix 'Fetch helm dependencies only if needed' (berkeley-dsep-infra/hubploy#109)
  5. Discuss expected SLA from 2i2c for communications on Microsoft Teams
  6. Discuss expected SLA from 2i2c for incident response
@yuvipanda
Contributor Author

I hope to spend a day gathering input from other 2i2c folks, and fleshing out the action items some more.

@choldgraf
Contributor

choldgraf commented Mar 1, 2021

Thanks @yuvipanda for this helpful write-up.

A few thoughts:

  • Agreed that we need to improve the reporting mechanisms, particularly automatic ones
  • re: communications: I think one challenge here is that we're using a different communication space for interfacing w/ UoT vs. what we use for the other infrastructure. We should think about whether or how to connect the two more reliably, or how we can sustainably dedicate resources to paying attention to "other" communication spaces than our own.
  • re: Hub Status: I think an easier short-term solution is to have some "hub status" page that is not integrated with a customer's own system status. Better to provide a pattern for any hub first rather than special-case UoT.

Suggestions:

  • Could we set up a CRON job that will run tests or actions on our hubs in order to catch these kinds of bugs before a user action triggers them? We probably don't want to actually request new nodes, but we could do many of the things that often happen when a new node is triggered (like attempt a Docker pull and a new log-in+server start). A rough sketch follows below.
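
One possible shape for this, sketched as a Kubernetes CronJob rather than a crontab entry. The schedule, names, image tag, and hub domain are placeholders, and it assumes the hub exposes JupyterHub's /hub/health endpoint; a fuller version could use a JupyterHub API token to start and stop a throwaway user's server through the REST API, which would also exercise a docker pull of the user image.

```yaml
# Hypothetical health-check CronJob - names, schedule, and URL are
# placeholders. batch/v1beta1 was the CronJob API version at the time.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hub-health-check
spec:
  schedule: "*/15 * * * *"        # every 15 minutes
  jobTemplate:
    spec:
      backoffLimit: 0             # let a single failure surface
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: check
              image: curlimages/curl:7.75.0
              # --fail makes curl exit non-zero on HTTP errors, so a
              # failed Job can feed whatever alerting backs the
              # PagerDuty idea in the 'Things to improve' list above.
              args:
                - "--fail"
                - "--silent"
                - "--show-error"
                - "https://<hub-domain>/hub/health"
```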
