
Transient issues spinning up servers related to pod reflector errors #1103

Closed · 2 tasks · sgibson91 opened this issue Mar 15, 2022 · 9 comments · Fixed by #1137
Comments

@sgibson91
Member

sgibson91 commented Mar 15, 2022

Context

We often get support tickets along the lines of "can't spin up a user server" that appear to be transient in nature. Upon inspection of the logs, we see pod reflector errors, which can be caused either by a k8s master API outage or by a race condition in the hub.

I have opened an issue in the jupyterhub/grafana-dashboards repo to ask for the k8s master API stats to be included, since this will help us debug these types of issues: jupyterhub/grafana-dashboards#34

Deleting the hub pod and allowing it to be recreated should also help if the cause was a race condition in the hub.

Specifically, the pilot-hubs cluster is zonal, not regional, which means its k8s master API is not highly available and is therefore more prone to these issues. See #1102

Actions and updates

@yuvipanda
Member

I've bounced (aka restarted) all the hub pods on the pilot-hubs cluster with this line:

kubectl get ns | choose 0 | rg -v aup | xargs -L1 kubectl delete pod -l component=hub -n

choose is an awesome alternative to cut or awk, and rg is ripgrep, an awesome alternative to grep. I skipped aup because I had already restarted that hub pod and didn't want to do it again.
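
For anyone without those tools installed, a roughly equivalent one-liner using only awk and grep — a sketch of the same bounce, assuming the usual z2jh component=hub label and that aup should still be skipped — would be:

# restart every hub pod by deleting it and letting its deployment recreate it,
# skipping the aup namespace which was already restarted
kubectl get ns --no-headers | awk '{print $1}' | grep -v aup | xargs -L1 kubectl delete pod -l component=hub -n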

@yuvipanda
Member

In this case, I think restarting the hub was the right call - I think we were just seeing jupyterhub/kubespawner#525 again. I see the following line in the log:

[C 2022-03-15 03:26:39.979 JupyterHub spawner:2222] Pods reflector failed, halting Hub.

But the hub does not halt :)

I think what happened is:

  1. k8s api had some downtime / latency issues
  2. The hub's reflectors failed multiple times during this, causing kubespawner to decide it needed to shut itself down
  3. This triggers jupyterhub/kubespawner#525 (Terminate process correctly from reflector thread), leaving the hub in a zombie state.

Restarting the pods fixes this for now, and upgrading to the latest z2jh (which has jupyterhub/kubespawner#525 in it) will fix this for good.
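
A quick way to check whether a hub is stuck in this zombie state is to look for the reflector failure line in its logs. A minimal sketch, assuming the standard z2jh hub deployment name and substituting the affected hub's namespace for <namespace>:

# search the hub's logs for the kubespawner reflector failure message
kubectl logs deploy/hub -n <namespace> | grep "reflector failed"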

@yuvipanda
Member

I think upgrading z2jh will help fix this.

@choldgraf
Member

Just noting that this issue is tracking upgrading Z2JH / JupyterHub to 2.0 (though maybe you're suggesting we upgrade to a later 1.x branch?)

@yuvipanda
Member

@choldgraf nope, 2.x should fix it.

@choldgraf
Member

@yuvipanda awesome, I've updated the top comment to clarify the next steps here

@choldgraf
Member

Just noting that we have another incident related to this issue:

It looks like we can't solve this for good until z2jh 2.0 is released. Since this has become a fairly common problem, as a stopgap maybe we can share the actions that often resolve this problem in the top comment of this issue (or somewhere else?)

It seems like the easiest thing is to restart the hub pod, and it then works ok when it comes back. Is that right?

@sgibson91
Member Author

It seems like the easiest thing is to restart the hub pod, and it then works ok when it comes back. Is that right?

Yes
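
For reference, a minimal sketch of that stopgap for a single affected hub, assuming the standard z2jh component=hub label and substituting the hub's namespace for <namespace>:

# delete the hub pod; its deployment recreates it with fresh reflectors
kubectl delete pod -l component=hub -n <namespace>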

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 19, 2022
2i2c-org#1103 is happening more frequently now, and it's a hard fail - many users just cannot start up their servers. The z2jh upgrade will involve more work (2i2c-org#1055), so let's just bump up kubespawner in our custom hub image until then.

Fixes 2i2c-org#1103
@yuvipanda
Member

@choldgraf we don't need to wait for z2jh 2.0 (which will involve other work to upgrade) to get this fixed, as there's a released version of kubespawner with the fix. #1137 should fix this.
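
For illustration, bumping kubespawner in a custom hub image amounts to something like the following. This is only a sketch: it assumes the image installs kubespawner from PyPI, and the exact version pin lives in the pilot-hubs image definition rather than here.

# upgrade to a kubespawner release that includes jupyterhub/kubespawner#525
pip install --upgrade jupyterhub-kubespawner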
