GatewayCluster with kubernetes backend fails to start after update to daskhub v4.5.4 #348
Hmmm, that's odd. The command passed has an empty string, which should be forwarded correctly through k8s and still parse correctly (the pod spec takes a list of args).
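The distinction being described — list-form (exec-style) args versus shell word splitting — can be illustrated outside of k8s with a small Python sketch. `subprocess` stands in here for the pod's exec behavior; this is an illustration, not dask-gateway's actual launch path:

```python
import subprocess
import sys

# When a command is given as a list (like a pod spec's `args`), an empty
# string is preserved as its own argv element, so `--host ""` still has
# its argument. Shell-style word splitting, by contrast, would drop it.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1:])", "--host", ""],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # ['--host', '']
```

If some layer in between re-joins and re-splits the command through a shell, the empty argument vanishes and `--host` ends up consuming the next token, which would match the "missing an argument" error.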
If you try things with the default images, do they work fine? That would help figure out whether the images are the culprit or if it's a k8s difference. Switching the … (note: I'm off for the next week, so responses here may be a bit slow)
@jcrist I take it back! I was not changing my cluster options correctly and hadn't actually switched images. When I use the default image, it works! I'll dig into the couple of things you mentioned to figure out why our image is causing this behavior.
In the end, I just decided to go with the default scheduler and use our customized image just on the worker. It seems to work if the important packages are pinned to the same version across these images. I'm sure it wouldn't take too much digging to figure out what was going on. Let me know if you think that would be helpful and I'm happy to give it a bit more effort. Otherwise, I think we can close this.
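The "pinned to the same version across these images" requirement can be checked mechanically at image build time or in CI. A minimal sketch — the `check_pins` helper and the package list are hypothetical, not part of dask-gateway:

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(pins, get_version=version):
    """Return {package: (required, installed)} for every mismatch.

    `pins` maps distribution names to the exact versions that must be
    identical across the scheduler and worker images. A missing package
    is reported with an installed version of None.
    """
    mismatches = {}
    for name, required in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        if installed != required:
            mismatches[name] = (required, installed)
    return mismatches


# Example: packages that must match between scheduler and worker images
# (names and versions here are illustrative, not a recommendation).
print(check_pins({"dask": "2021.8.1", "distributed": "2021.8.1"}))
```

Running this in each image and comparing the results would surface version drift before the scheduler fails to start.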
I'd really like to figure out what's going on here. Can you post the Dockerfile?
This is what we were using for a Dockerfile. The conda env referenced in the file is here. Note that some commands in the `prepare.sh` script run by `tini` fail when run on this scheduler image. This is an artifact of also using that startup script for our worker images, which are much heavier weight and were taking too long to spin up as the scheduler. This did not cause issues with the previous helm chart but could be part of the reason we were getting the issue with the `dask-scheduler` command.
This patch version upgrade of daskhub was an upgrade from dask-gateway 0.8.0 to 0.9.0, and I wonder if JupyterHub also introduced some network policies etc. by default between 0.9.1 and 0.10.X. If that's the case, they suddenly require #352, which hasn't been released yet.
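If the network policies are indeed the trigger, one workaround would be to disable them in the daskhub values. The layout below assumes the zero-to-jupyterhub 0.10.x values schema nested under the `jupyterhub` subchart key; verify it against your actual chart versions before relying on it:

```yaml
# Hypothetical daskhub values.yaml fragment: switch off the network
# policies that zero-to-jupyterhub began enabling by default.
jupyterhub:
  hub:
    networkPolicy:
      enabled: false
  singleuser:
    networkPolicy:
      enabled: false
```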
Was JupyterHub's network policy at …? If so, this is a duplicate of #360, closed by #352, but awaiting #381.
@consideRatio hmm... it's been a while, but I can confirm that something between November 2020 and now has solved this issue, and we are now using our custom scheduler images (with daskhub 2021.8.1). I'm pretty certain you are right: we definitely were not explicitly disabling the …
Thanks for following up @bolliger32!
What happened:
I received the following error:

`GatewayClusterError: Cluster 'adrastea.b4286778ea9b49f4b4264f982f5b278d' failed to start, see logs for more information`

The logs suggest that the `dask-scheduler` command is missing an argument after `--host`, which looks intentional based on this code. Here are the logs:

This occurred upon update from v4.5.3 of the daskhub chart to v4.5.4. Note that several other issues occurred related to JupyterHub. I eventually worked my way through those and ultimately just deleted and recreated our GKE cluster. That fixed the other issues (primarily related to authentication), but this one remains.
What you expected to happen:
A working `GatewayCluster` object to be returned from the `gateway.new_cluster()` call.

Minimal Complete Verifiable Example:

I'd imagine a lot of the reproducibility depends on our specific GKE infrastructure and chart config, but the actual code that raises this bug is just:
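A minimal sketch of that call sequence is below. It assumes a live daskhub deployment with the gateway address configured by the chart, so it is not runnable standalone; adjust the connection details for your own cluster:

```python
from dask_gateway import Gateway

# Connect using the gateway address configured by the daskhub chart
# (defaults assumed here; adjust for your deployment).
gateway = Gateway()

# This is the call that raised GatewayClusterError after the upgrade.
cluster = gateway.new_cluster()
```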
Anything else we need to know?:
Environment: