## Context
We often get support tickets along the lines of "can't spin up a user server" that appear to be transient. Inspecting the logs, we see pod reflector errors, which can be caused either by a k8s master API outage or by a race condition in the hub.
I have opened an issue in the jupyterhub/grafana-dashboards repo asking for k8s master API stats to be included, since this will help us debug these types of issues: jupyterhub/grafana-dashboards#34
If the cause was a race condition in the hub, deleting the hub pod and allowing it to be recreated should also help (see the sketch below).
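For reference, a minimal sketch of that recovery step using the Kubernetes Python client, assuming kubeconfig access to the cluster; the `prod` namespace is a placeholder, and `component=hub` is the standard z2jh hub label:

```python
# Minimal sketch, assuming the Kubernetes Python client is installed and the
# active kubectl context points at the right cluster.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "prod"  # placeholder namespace, not this cluster's actual value
hub_pods = v1.list_namespaced_pod(namespace, label_selector="component=hub")
for pod in hub_pods.items:
    # Deleting the pod lets the hub Deployment recreate it from scratch.
    v1.delete_namespaced_pod(pod.metadata.name, namespace)
    print(f"Deleted {pod.metadata.name}")
```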
In particular, the pilot-hubs cluster is zonal, not regional, which means its k8s master API is not highly available and is therefore more prone to these outages. See #1102
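To confirm which kind of control plane a cluster has, here is a quick sketch with the google-cloud-container client; the project ID and location below are placeholders, not our actual values:

```python
# Minimal sketch, assuming the google-cloud-container client and application
# default credentials; the name path uses placeholder project/location values.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()
cluster = client.get_cluster(
    name="projects/example-project/locations/us-central1-b/clusters/pilot-hubs"
)

# A regional cluster reports a region here (e.g. "us-central1"); a zonal one
# reports a single zone (e.g. "us-central1-b"), i.e. a control plane with no
# redundancy.
print("control plane location:", cluster.location)
```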
## Actions and updates
- Resolve "Upgrade our hubs to Z2JH 2 / JupyterHub 3.0" (#1055)
- Check whether this solved our problem (which will probably require just noticing whether this behavior pops up over time)