This repository has been archived by the owner on Jan 12, 2022. It is now read-only.
This is a blameless postmortem
Summary
A YAML nesting misconfiguration in the Zero to JupyterHub upgrade (#82) a week prior prevented the user image from being pulled onto nodes where it didn't already exist. Two nodes were active from before the upgrade, so the image was already present on them. When there were enough users to fill those nodes, a third node was automatically provisioned - but the user image could not be pulled there due to the YAML nesting misconfiguration! This caused server starts to fail once there were just enough users on the hub to fill the two 'working' nodes.
The YAML nesting misconfiguration was fixed, and a subsequent login-related issue caused by a misdeploy was also fixed.
Timeline
All times in IST (+0530)
2021-02-23 08:14 AM
Reports of students not being able to log in start coming in from instructors on the Jupyter Community of Practice room on Microsoft Teams.
08:30 AM
Notice is posted on the UToronto system status page, letting users know service is degraded.
06:11 PM
2i2c engineers notice the alarm in the Microsoft Teams chat, and investigation starts. However, new server starts are working at this point, so a deeper investigation is deferred.
06:49 PM
New server starts are reported broken again.
08:20 PM
Deeper investigation starts again. Since docker image pulls were being denied, this breaking change in z2jh seemed relevant. Comparing the changelog against our current set of config, it looked like there was a YAML nesting error: all nodes that existed at the time of the migration had working image pull authentication, but new nodes did not. This meant the first 3 nodes' worth of users (~200) were able to start properly, but any more triggered a new node which couldn't pull the user image due to this issue.
The z2jh PR suggests moving imagePullSecret to the top level. However, we deploy the JupyterHub chart as a Helm dependency, so it has to be nested under a jupyterhub key instead. This was missed during review and testing of the upgrade PR, since the effect was delayed.
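As an illustration of the nesting difference (keys as in the z2jh imagePullSecret config; registry and credential values are placeholders), a standalone z2jh deployment puts the secret at the top level, while our wrapper chart needs everything one level down:

```yaml
# What a standalone z2jh deployment expects after the upgrade:
imagePullSecret:
  create: true
  registry: example.azurecr.io   # placeholder registry
  username: example-user
  password: "..."

# What our wrapper chart needs instead - nested under the dependency's key:
jupyterhub:
  imagePullSecret:
    create: true
    registry: example.azurecr.io
    username: example-user
    password: "..."
```

With the first form in our repo, Helm silently ignored the unrecognized top-level key, which is why nothing broke until a new node tried to pull the image.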
So, yay YAML nesting issues? This was fixed by this PR, and the fix was deployed manually with a local hubploy deploy for expediency.
08:47 PM
Server starts work again, including on new nodes! Yay!
09:23 PM
Reports that new logins are unsuccessful - on clicking login, users are redirected to the same page. Users who are already logged in can start servers.
Upon more investigation, it turns out that the 'expedient' local deploy with hubploy was using a development version of hubploy that was trying to fix this bug, and wasn't fully functional. This left the hub using a combination of the old and new z2jh versions, leading to this strange error.
10:17 PM
All fixed now.
11:19 PM
UofT system status page was updated to mark the incident as resolved.
Things to improve
There should be an automated health checker that informs us of server start failures. This reduces reliance on a human chain of reporting that eventually reaches 2i2c.
Streamline escalation communication channels - the notification from Microsoft Teams was missed for several hours, but perhaps something from pagerduty coupled with (1) would not have been missed as easily.
Documented process around posting updates on UofT System Status. Avi was very helpful posting statuses this time, but we should document how to quickly communicate to our users during an outage.
Hubploy should provide more useful diagnostic status messages as it goes along, to make issues like this easier to spot.
We should try to catch YAML configuration errors - perhaps by validating our config against the schema of the z2jh helm chart?
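As a minimal sketch of the idea (this is a hand-written check, not the real z2jh schema - a real setup would validate against the chart's own values schema), even a tiny pre-deploy lint would have caught a top-level key that should have been nested under jupyterhub:

```python
# Hedged sketch: a pre-deploy check that all values in our wrapper chart's
# config are nested under the top-level "jupyterhub" key, catching the kind
# of nesting mistake behind this incident.  "jupyterhub" as the only allowed
# top-level key is an assumption about the wrapper chart's layout.
ALLOWED_TOP_LEVEL = {"jupyterhub"}

def check_nesting(values: dict) -> list:
    """Return an error message for each key that should be nested under 'jupyterhub'."""
    return [
        f"key {key!r} must be nested under 'jupyterhub'"
        for key in values
        if key not in ALLOWED_TOP_LEVEL
    ]

# The misconfiguration from this incident: imagePullSecret left at the top level.
bad = {"imagePullSecret": {"create": True}, "jupyterhub": {}}
good = {"jupyterhub": {"imagePullSecret": {"create": True}}}

assert check_nesting(bad) == ["key 'imagePullSecret' must be nested under 'jupyterhub'"]
assert check_nesting(good) == []
```

Run against the repo's config in CI, this would fail the build instead of silently deploying a value the chart ignores.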
Action items
Figure out how to get messages posted on to systemstatus.utoronto.ca
Agreed that we need to improve the reporting mechanisms, particularly automatic ones
re: communications: I think one challenge here is that we're using a different communication space for interfacing w/ UoT vs. what we use for the other infrastructure. We should think about whether or how to connect the two more reliably, or how we can sustainably dedicate resources to paying attention to "other" communication spaces than our own.
re: Hub Status: I think an easier short-term solution is to have some "hub status" page that is not integrated with a customer's own system status. Better to provide a pattern for any hub first rather than special-case UoT.
Suggestions:
Could we set up a cron job that runs tests or actions on our hubs in order to catch these kinds of bugs before a user action triggers them? We probably don't want to actually request new nodes, but we could do many of the things that often happen when a new node comes up (like attempting a Docker pull and a new log-in + server start).
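A sketch of what such a periodic check could look like (the endpoint path is from the JupyterHub REST API; the hub URL, token, and test account name are placeholders): ask the hub to spawn a server for a dedicated test user and fail loudly if it doesn't start.

```python
# Hedged sketch of a periodic server-start health check, assuming a dedicated
# test account and an API token with permission to start its server.
import urllib.request

HUB_URL = "https://example-hub.utoronto.ca"  # placeholder hub URL
API_TOKEN = "REPLACE_ME"                     # placeholder API token
TEST_USER = "health-check-bot"               # hypothetical dedicated test account

def api_request(path: str, method: str = "GET") -> urllib.request.Request:
    """Build an authenticated request against the JupyterHub REST API."""
    return urllib.request.Request(
        f"{HUB_URL}/hub/api{path}",
        method=method,
        headers={"Authorization": f"token {API_TOKEN}"},
    )

def start_test_server() -> None:
    # POST /users/{name}/server asks the hub to spawn that user's server;
    # the hub answers 201 (started) or 202 (spawn pending).
    with urllib.request.urlopen(api_request(f"/users/{TEST_USER}/server", "POST")) as resp:
        if resp.status not in (201, 202):
            raise RuntimeError(f"server start failed: HTTP {resp.status}")
```

Run from cron, a failure here would have paged us hours before instructors had to report it - though on its own it only exercises existing nodes, so it would need to be paired with something that mimics new-node behavior like a fresh image pull.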