[Incident] ANU hub unable to spawn instances #1135

Closed
5 tasks
yuvipanda opened this issue Mar 19, 2022 · 5 comments

@yuvipanda
Member

Summary

As reported in #1124 (comment), the ANU hub stopped being able to spawn instances.

At 2022-03-18 03:56:42 UTC, the first reflector error shows up in the hub log. Less than a minute later, at 03:57:33, the message `Events reflector failed, halting Hub.` is emitted. However, we run into #1103 at that point, and the hub is left in an indeterminate state instead of recovering. The ANU class is at 05:00 UTC, about one hour after the failure.
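
To make the timeline concrete, here is a minimal sketch of the interval arithmetic, using only the timestamps quoted above. The variable names are illustrative, and it assumes the class started at exactly 05:00:00 UTC on the same day as the log entries.

```python
from datetime import datetime, timezone

# Timestamps quoted in the summary (all UTC).
first_reflector_error = datetime(2022, 3, 18, 3, 56, 42, tzinfo=timezone.utc)
hub_halt_message = datetime(2022, 3, 18, 3, 57, 33, tzinfo=timezone.utc)

# Assumption: the class time in the summary means exactly 05:00:00 UTC
# on the same day as the log entries.
class_start = datetime(2022, 3, 18, 5, 0, 0, tzinfo=timezone.utc)

# Time from the first reflector error to the "halting Hub" message.
error_to_halt = hub_halt_message - first_reflector_error

# Time the hub sat stuck before the class was due to start.
halt_to_class = class_start - hub_halt_message

print(f"first error -> halt message: {error_to_halt.total_seconds():.0f} s")
print(f"halt message -> class start: {halt_to_class}")
```

Under these assumptions, the hub had been stuck for a little over an hour by the time the class began.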

I've performed the steps in #1103 (comment) to get the hubs unstuck, but we should increase the priority of #1103 now.

Impact on users

Users could not spawn instances on the hub, so they were unable to use it for the tutorial.

Important information

Tasks and updates

  • Discuss and address incident, leaving comments below with updates
  • Incident has been dealt with or is over
  • Copy/paste the after-action report below and fill in relevant sections
  • Incident title is discoverable and accurate
  • All actionable items in report have linked GitHub Issues
After-action report template
# After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should make.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in {{ most convenient timezone }}.

- {{ yyyy-mm-dd }} - [Summary of first update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of another update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of final update](link to comment)


## What went wrong

_Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Follow-up actions._

- Thing one
- Thing two

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

- Thing one
- Thing two

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`; they can be in other repositories._

### Process improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Documentation improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Technical improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]
@yuvipanda
Member Author

#1137 should prevent this from happening again. More details at #1103. Thanks for reporting this, @mmcky!

@mmcky

mmcky commented Mar 23, 2022

thanks @yuvipanda -- greatly appreciated.

@yuvipanda
Member Author

yw! closing this as #1137 was merged

@mmcky

mmcky commented Apr 2, 2022

We recently had an assignment due, so ~120 students used the platform for their assignment work, and from all reports the JupyterHub service worked well. Thank you, @yuvipanda.

@yuvipanda
Member Author

Very glad to hear, @mmcky!
