
GatewayCluster with kubernetes backend fails to start after update to daskhub v4.5.4 #348

Closed
bolliger32 opened this issue Nov 10, 2020 · 9 comments

Comments

@bolliger32

bolliger32 commented Nov 10, 2020

What happened:
I received the following error: GatewayClusterError: Cluster 'adrastea.b4286778ea9b49f4b4264f982f5b278d' failed to start, see logs for more information. The logs suggest that the dask-scheduler command is missing an argument after --host, which looks intentional based on this code. Here are the logs:
[screenshot: scheduler pod logs showing dask-scheduler exiting with a missing-argument error after --host]

This occurred after updating from v4.5.3 of the daskhub chart to v4.5.4. Note that several other issues related to JupyterHub also occurred; I eventually worked through those and ultimately just deleted and recreated our GKE cluster. That fixed the other issues (primarily related to authentication), but this one remains.

What you expected to happen:
A working GatewayCluster object to be returned from the gateway.new_cluster() call

Minimal Complete Verifiable Example:
I'd imagine a lot of the reproducibility depends on our specific GKE infrastructure and chart config, but the actual code that raises this bug is just

import dask_gateway

gateway = dask_gateway.Gateway()
cluster = gateway.new_cluster()

Anything else we need to know?:

Environment:

  • GKE cluster
  • daskhub chart version: 4.5.4
  • image: custom image running the following
    • Dask version: 2.30.0
    • Python version: 3.8.6
    • Operating System: Ubuntu
    • Install method (conda, pip, source): conda
@jcrist
Member

jcrist commented Nov 10, 2020

Hmmm, that's odd. The command passed includes an empty string, which should be forwarded through k8s unchanged and still parse correctly (the pod spec takes a list of args, and "" is a valid arg). We also have tests for this, and our k8s tests passed fine with these changes. This may be an esoteric bug; possible culprits:

  • The shell used in your image differs from the shell in our images, and that matters for some reason?
  • Our default images use tini as an entrypoint, perhaps this matters for some reason?
  • Perhaps a k8s difference/bug? I'm not sure what all touches the container args before things start up.

If you try things with the default images, do they work fine? That would help figure out whether the images are the culprit or if it's a k8s difference.

Switching the --host arg to "0.0.0.0" instead might work, but I'm curious why things pass with our images and fail with yours.
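For illustration, here's a rough sketch (not the exact dask-gateway source; the extra flag and values are hypothetical) of the kind of argument list involved and how re-joining it through a shell could drop the empty value:

# Hypothetical illustration only: the real command is assembled inside
# dask-gateway's Kubernetes backend.
scheduler_command = [
    "dask-scheduler",
    "--host", "",                    # "" is a valid argv element in a pod spec
    "--dashboard-address", ":8787",  # hypothetical extra flag
]

# If an entrypoint re-joins the args into a single shell string, the empty
# element vanishes and --host is left without a value:
print(" ".join(scheduler_command))
# -> dask-scheduler --host  --dashboard-address :8787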

(note - I'm off for the next week, so responses here may be a bit slow)

@bolliger32
Author

Interesting. Thanks for the detailed set of possible causes! Using the default image for the scheduler doesn't seem to change anything, so I doubt that's the cause. I will see if it matters if I switch to the default image for the notebook (it's not currently in our list of options, so it will take a moment to update the helm chart and add it).

The k8s version is 1.16.13 - not sure how that compares to what's being used in your tests.

To add to the confusion, I sometimes don't get that detailed output from the dask-scheduler command showing up in the logs. Most of the time it just returns relatively little info:
[screenshot: scheduler logs showing only minimal output]

@bolliger32
Author

@jcrist I take it back! I was not changing my cluster options correctly and hadn't actually switched images. When I use the default image, it works! I'll dig into the couple of things you mentioned to figure out why our image is causing this behavior.
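For anyone else who trips on this, here's roughly what I should have been doing to actually switch the scheduler image via cluster options (a sketch; the "image" option name and the tag are assumptions that depend on how your gateway is configured):

import dask_gateway

gateway = dask_gateway.Gateway()

# The options exposed here depend on the gateway's configuration;
# "image" and the tag below are placeholders for illustration.
options = gateway.cluster_options()
options.image = "daskgateway/dask-gateway:latest"

cluster = gateway.new_cluster(options)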

@bolliger32
Author

In the end, I just decided to go with the default scheduler image and use our customized image only for the workers. That seems to work as long as the important packages are pinned to the same versions across the images. I'm sure it wouldn't take too much digging to figure out what was going on; let me know if you think that would be helpful and I'm happy to give it a bit more effort. Otherwise, I think we can close this.
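In case it's useful, this is the kind of check I've been using to confirm the pinning lines up across images (a sketch, assuming the cluster starts; get_versions(check=True) raises if the client, scheduler, and workers disagree on key package versions):

from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(2)

# Compare Python / dask / distributed versions across client, scheduler,
# and workers; check=True raises on a mismatch.
client = cluster.get_client()
client.get_versions(check=True)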

@jcrist
Member

jcrist commented Nov 11, 2020

I'd really like to figure out what's going on here. Can you post the Dockerfile for your failing image (or an equivalent one that still fails)? That would be helpful in figuring out what the issue is.

@bolliger32
Author

This is what we were using for a Dockerfile. The conda env referenced in the file is here.

Note that some commands in the prepare.sh script run by tini fail when run on this scheduler image. This is an artifact of also using that startup script for our worker images, which are much heavier and were taking too long to spin up when used as the scheduler image. This did not cause issues with the previous helm chart, but it could be part of the reason we were getting the issue with the dask-scheduler command.

@consideRatio
Collaborator

This patch version upgrade of daskhub bumped dask-gateway from 0.8.0 to 0.9.0, and I wonder if JupyterHub also introduced some network policies by default between 0.9.1 and 0.10.X. If that's the case, they suddenly require #352, which hasn't been released yet.

Chart          JupyterHub   Dask-gateway   Dask     Date
daskhub-4.5.4  0.10.2       0.9.0          2.30.0   06 April 2021
daskhub-4.5.3  0.9.1        0.8.0          2.30.0   06 April 2021

Was JupyterHub's network policy at hub.networkPolicy.enabled explicitly set to false in this deployment? Otherwise it would be set to true by default, and then that is one failure that would happen for sure, even though I'm not sure it would show the same symptoms as this.

If so, this is a duplicate of #360, closed by #352 but awaiting #381.

@bolliger32
Author

@consideRatio hmm... it's been a while, but I can confirm that something between Nov 2020 and now has solved this issue, and we are now using our custom scheduler images (with daskhub 2021.8.1). I'm pretty certain you are right: we definitely were not explicitly disabling the hub.networkPolicy setting. We are doing that now (we've actually implemented something similar to #352 in our values.yml file) and are able to use the custom image fine. I'm not sure if this was the underlying issue, but regardless, this is no longer a problem. I'd be in favor of closing the issue unless anyone would like to leave it open.

@consideRatio
Collaborator

Thanks for following up @bolliger32!
