Set up GPU instances on the jupytearth hub #77
@fperez will do! To avoid adding more complexity than needed, it would be good to have some rough guesses about what kinds of machines would make sense to equip with GPUs to support @espg's relevant workflows. That can be a starting point to generalize from.
@consideRatio @fperez sorry for the delay on this; I wanted to do some empirical profiling before getting back with an answer, but I'm getting gummed up with some of the libraries, so I'll try to scope out the conceptual layout of the pipeline and what I think would be a good start.

The GPUs are just convolving kernels on images, so GPU size (in RAM) is effectively matched to the image size. For now, to start, images are small (~15 MB), but there are quite a lot of them -- in general tens of thousands per 'day', with days running into the hundreds. The convolutions are effectively instant on all the GPUs I've used before, with the largest latency coming from a) loading the images from the internet to disk, b) loading them from disk to memory, and c) loading them to and from the GPU to run the convolution.

I think a good place to start would be the G4dn single-GPU VM instances. I don't have a good idea of how many CPUs are needed to saturate the GPU; we want lots of workers hitting a single GPU... but I'm not sure if 'a lot' is 16 workers or 64. For non-GPU RAM, I expect scaling will follow the number of workers, so any of the G4 instances will work. Is it possible to set up more than one instance type, if they're all in the same 'family' of instance, to check on the scaling? I think the sweet spot is either on the low end around 8 or 16 cores, or on the high end with 64 cores. I'd have to pre-load an image set to a directory and see how many workers it takes to 'saturate' the GPU to know the scaling, though. The software side of this is pretty basic -- just cupy and dependencies (a CUDA-friendly version of numba for JIT'ing is probably in there too).
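As a rough illustration of that per-image step (a sketch only; it assumes nothing beyond cupy being installed, and the image/kernel arrays are placeholders):

import cupy as cp
import numpy as np
from cupyx.scipy.ndimage import convolve

def convolve_on_gpu(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    # Copy a small (~15 MB) image to the GPU, convolve it, copy the result back.
    # The host<->device transfers (steps b and c above) dominate the runtime,
    # so batching many images per transfer is the main thing to tune.
    image_gpu = cp.asarray(image)      # host -> device
    kernel_gpu = cp.asarray(kernel)
    result_gpu = convolve(image_gpu, kernel_gpu, mode="nearest")
    return cp.asnumpy(result_gpu)      # device -> host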
Thanks for the input about this! I've made some progress, but we need to wait for AWS to increase the allowed quota before we can use GPUs. I'm exposing 4 CPU / 16 GB and 16 CPU / 64 GB nodes with a single T4 Tensor Core GPU attached. Note that if dask is to be involved in using this GPU, a LocalCluster should be used because:
*Well, it is, but it's a hack that I think will be unreliable and would advise against us using. Also, when the GPU is used by multiple workers in a local dask cluster, they may need to have self-imposed restrictions so they don't crash each other by hogging too much memory and running out of it. The case opened can be tracked here: https://console.aws.amazon.com/support/home#/case/?displayId=8818400181
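As a sketch of what that could look like (the worker count and memory limit here are guesses for a 16 CPU / 64 GB node, not recommendations):

from dask.distributed import Client, LocalCluster

# Several local workers sharing the node's single T4 GPU.
# n_workers and memory_limit are illustrative; they would need tuning so
# workers don't exhaust GPU or host memory and crash each other.
cluster = LocalCluster(
    n_workers=8,
    threads_per_worker=1,
    memory_limit="7GB",   # self-imposed host-memory cap per worker
)
client = Client(cluster)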
@consideRatio ok, thanks for getting this rolling! I can see the new instance types on the hub, but they aren't able to initiate yet. I'm assuming that's what the case reference you posted is about... Perhaps a basic question, but do you know how to find out my IAM username/password for the hub so I can check out that link? I can see the account number and what looks like a possible username if I run
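For reference, a generic way to check which AWS identity an environment maps to (assuming boto3 and some form of credentials are available in the image; this is a standard STS call, not anything hub-specific):

import boto3

# Prints the AWS account number and the ARN of the current identity.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])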
Shane, I think I need to change things in your AWS privileges for you to see those messages... I get a bit lost there, so we can do this later; happy to give you access, it's just that each time I do it, it takes me some time navigating the AWS console... But the message from AWS just says "we got your request, looking into it, will let you know".
@espg there are three accounts involved:
Nobody besides me, having done some work to set up a k8s cluster and an S3 bucket for example, should need direct AWS access unless there is a need to manage resources like that. Let me know if you want access from a learning perspective or similar, though, and I can grant it. Btw @espg, I've already prepared by installing what's referenced in jupyter-earth/hub.jupytearth.org-image/Dockerfile, lines 217 to 218 in b77bbb9.
@espg I think this is resolved on my end - you should be able to start a server with a GPU attached. Have you tried whether this works for you?
@espg reported it did not work even though it worked for me - this was likely because a configuration I made got reset. I think I know why the configuration was reset, and I have made sure it's configured correctly again; in the future I will make sure to avoid having it reset.

Technical notes:
# the tolerations I manually add via
# kubectl edit daemonset -n kube-system nvidia-device-plugin-daemonset
- effect: NoSchedule
  key: hub.jupyter.org_dedicated
  operator: Equal
  value: user
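A quick smoke test from a user server, assuming cupy is installed in the image as noted earlier:

import cupy as cp

# Confirm the scheduled pod actually sees the attached GPU ...
print("Visible GPUs:", cp.cuda.runtime.getDeviceCount())

# ... and that we can allocate and compute on it.
x = cp.arange(10) ** 2
print("Smoke test sum:", int(x.sum()))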
@consideRatio - I'm not sure if you've looked into the details of setting up GPU instances for our hub yet. We have some workloads that are starting to need GPUs (cc @espg), and it would be great to have a couple of options to play with.
Would you mind taking a look at the options? We can document here our process/choices... Thanks!