## Context
We often get support tickets along the lines of "can't spin up a user server" that appear to be transient. Inspecting the logs, we see pod reflector errors, which can be caused either by a k8s master API outage or by a race condition in the hub.
I have opened an issue in the jupyterhub/grafana-dashboards repo asking for k8s master API stats to be included, since this will help us debug these types of issues: jupyterhub/grafana-dashboards#34
If the cause was a race condition in the hub, deleting the hub pod and allowing it to be recreated should also help (see the sketch below).
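For reference, a minimal sketch of that recovery step using the Kubernetes Python client, assuming kubeconfig access to the cluster; the `prod` namespace is a placeholder, and `component=hub` is the standard z2jh hub label:

```python
# Minimal sketch, assuming the Kubernetes Python client is installed and the
# active kubectl context points at the right cluster.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "prod"  # placeholder namespace, not this cluster's actual value
hub_pods = v1.list_namespaced_pod(namespace, label_selector="component=hub")
for pod in hub_pods.items:
    # Deleting the pod lets the hub Deployment recreate it from scratch.
    v1.delete_namespaced_pod(pod.metadata.name, namespace)
    print(f"Deleted {pod.metadata.name}")
```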
In particular, the pilot-hubs cluster is zonal, not regional, which means its k8s master API is not highly available and is therefore more prone to these outages. See #1102
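To confirm which kind of control plane a cluster has, here is a quick sketch with the google-cloud-container client; the project ID and location below are placeholders, not our actual values:

```python
# Minimal sketch, assuming the google-cloud-container client and application
# default credentials; the name path uses placeholder project/location values.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
cluster = client.get_cluster(
    name="projects/example-project/locations/us-central1-b/clusters/pilot-hubs"
)

# A regional cluster reports a region here (e.g. "us-central1"); a zonal one
# reports a single zone (e.g. "us-central1-b"), i.e. a control plane with no
# redundancy.
print("control plane location:", cluster.location)
```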
## Actions and updates
- Resolve "Upgrade our hubs to Z2JH 2 / JupyterHub 3.0" (#1055)
- Check whether this solved our problem (which will probably require just noticing whether this behavior pops up over time)