
Speed up deployment #1

Open
connolly opened this issue Dec 10, 2019 · 1 comment
Labels
important Important to address

Comments

@connolly
Current deployment, from login to a running Spark cluster, takes about 10 minutes (5 minutes to deploy the EC2 VMs and 5 minutes to deploy the pod). The breakdown of the timing from @stevenstetzler is:

Mainly it's the allocation and creation of EC2 virtual machines. The workflow is:

(1) Request resources in Kubernetes.
(2) The Kubernetes scheduler tries to schedule the new pods (Jupyter notebook or Spark executor).
(3) If more nodes are needed to accommodate the pods, the cluster autoscaler asks AWS for new nodes.
(4) AWS creates N more virtual machines to accommodate the request.
(5) Once the virtual machines are up, pods get placed on them, Docker images get pulled, and Docker containers start on those machines.
(6) In either case (Jupyter or Spark), the pod that asked for the new pods to be created pings them to check whether they are ready (in Spark, this is when you see the new executors added in the job timeline, when the executor pod pings back "I'm alive").
(1) and (2) are almost instant, but I imagine they will get slower as the Kubernetes cluster sees more use (probably not by much).
(3) can take some time depending on the load on the cluster autoscaler: sometimes up to a minute, but usually on the order of tens of seconds.
(4) is the main bottleneck. Try creating a new EC2 virtual machine and time it; it takes on the order of minutes.
(5) depends on how large our images are, how remote the Docker repository is (Docker Hub vs. AWS ECR, for example: are the images on-site or not?), and the network speed of the nodes the containers land on. On the order of tens of seconds to a minute.
(6) can take anywhere from a second to much longer, depending on what scripts run at container startup. Right now it is on the order of seconds.
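The breakdown above can be sketched as a small timing model. This is a minimal sketch: the stage names and per-stage numbers are illustrative estimates taken from the ranges quoted in this thread, not measured values.

```python
# Hypothetical per-stage latency estimates (seconds), based on the ranges
# described in the breakdown above; numbers are illustrative, not measured.
STAGE_LATENCY = {
    "1_resource_request": 1,       # near-instant
    "2_scheduling": 1,             # near-instant
    "3_autoscaler_decision": 30,   # tens of seconds, up to a minute
    "4_ec2_provisioning": 240,     # main bottleneck: order of minutes
    "5_image_pull_and_start": 60,  # tens of seconds to a minute
    "6_readiness_ping": 5,         # order of seconds
}

def total_deploy_time(latencies):
    """Estimate end-to-end deploy time as the sum of per-stage latencies."""
    return sum(latencies.values())

def bottleneck(latencies):
    """Return the stage with the largest latency."""
    return max(latencies, key=latencies.get)

if __name__ == "__main__":
    print(total_deploy_time(STAGE_LATENCY))  # 337
    print(bottleneck(STAGE_LATENCY))         # 4_ec2_provisioning
```

Under these assumed numbers, EC2 provisioning accounts for roughly 70% of the end-to-end time, which is why the discussion below focuses on avoiding or hiding that stage.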

@mjuric

mjuric commented Dec 12, 2019

(note: copying my comments from the Slack thread)

It's tough to speed this up without preallocating a few machines (i.e., configuring the autoscaler to always keep a buffer of ~N free machines, immediately available when the next user(s) connect). But that costs money.
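One common way to implement such a buffer is the cluster-autoscaler "overprovisioning" pattern: a low-priority Deployment of placeholder pause pods reserves capacity, keeping warm nodes around, and gets evicted the moment real workloads need the space. A hedged sketch follows; all names, replica counts, and resource sizes are illustrative assumptions, not values from our deployment.

```yaml
# Illustrative overprovisioning sketch; names and sizes are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10               # lower than any real workload, so these pods are evicted first
globalDefault: false
description: "Placeholder pods that keep a buffer of warm nodes"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2            # roughly the buffer of free machines to keep
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: reserve
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"       # size the requests to fill one node's capacity
              memory: 2Gi
```

The cost trade-off noted above still applies: the buffer nodes run (and bill) whether or not anyone connects.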

A workaround is to warn the user that the cluster will take 10 minutes to spin up. They'll be less annoyed if they're aware of this (and will incorporate the delay into the workflow).

One thing to look into: Fargate -- https://aws.amazon.com/fargate/ -- it's supposed to let you run containers without specifying a server on which to run them. I'm not sure whether that means the spinup is faster. The thing to look at is how it interacts with EKS; Amazon just announced the tie-in at re:Invent, but I didn't get a chance to read about it. It does look potentially promising.

@mjuric mjuric transferred this issue from astronomy-commons/genesis-jupyterhub-automator Dec 13, 2019
@mjuric mjuric added the important Important to address label Dec 13, 2019