Try model training in the cloud #41
The steps in a new instance would be (I guess):
Btw, since all this is happening in the backend, we can deploy the instance in whatever location is cheapest.
After having a look at both AMIs, it seemed at first that the Conda-based AMI was intended for Jupyter-notebook-related work and that for our purposes the Base AMI would do better. In the end, though, we needed to use the Conda-based AMI, as it looks like some additional dependencies would be needed in our setup. So instead of trying to find out which ones were missing, we decided to go with the Conda-based AMI, and we're using the Ubuntu flavour (Deep Learning AMI (Ubuntu 18.04) Version 25.2 - ami-063690c75d69a8f15) instead of the Amazon Linux one (Deep Learning AMI (Amazon Linux 2) Version 25.0 - ami-08e01b26c47f98d6b), as we're used to it. Once the instance is launched, we've followed these steps:
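They boil down to cloning the repo, fetching the corpus and launching the training. As a rough sketch of what that last step looks like (the corpus layout, embeddings and hyperparameters below are illustrative, not our actual configuration):

```python
# Minimal Flair training sketch (0.4.x-style API) -- illustrative only,
# not the project's real configuration.
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# Expects train/dev/test files in FastText classification format under corpus/ (hypothetical path)
corpus = ClassificationCorpus("corpus/")
embeddings = DocumentRNNEmbeddings([WordEmbeddings("es")], hidden_size=256)
classifier = TextClassifier(embeddings, label_dictionary=corpus.make_label_dictionary())

trainer = ModelTrainer(classifier, corpus)
trainer.train("output/", mini_batch_size=32, max_epochs=1)  # a single epoch for the benchmark
```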
This is the output of training the model. It seems like we got some performance improvement here (at least for just one epoch) compared to the runs on our laptops (where it takes around one hour):
My laptop does 11 samples/second, so the ~12.5 samples/second in AWS is not a huge improvement. :/ The one-hour training was with a slightly bigger corpus; this one takes ~40 minutes on my laptop. I actually thought maybe the GPU was not being used in the first test, so I wanted to redo it and see if we could configure something.

I used the same AMI but, for some reason, I couldn't select the g4dn.xlarge instance type, so I went with a p2.xlarge one, which is more expensive ($0.90/hour). (OK, we looked into this: if we try to start the AMI selecting it from the Marketplace, the g4 is not an option. But if we search for "Deep Learning" in the AMI list and select the Ubuntu 18.04 one without opening the list from the Marketplace, then the g4 is an option. 🤷‍♂️)

Got the same errors when installing the dependencies on the p2.xlarge. Performance is almost double, but so is the cost:
While training, I ran
I then tried storing the embeddings in the GPU memory, as suggested in the Flair documentation, but I saw no significant change (after that initial huge samples/sec figure, which I don't think means anything because training hasn't started):
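For context, the knob in question is an argument to Flair's train() call; something along these lines (the parameter name has shifted between Flair releases, so treat this as a sketch rather than the exact call we used):

```python
# Keep the computed embeddings in GPU memory between epochs instead of recomputing them.
# (trainer as set up in the sketch above)
trainer.train(
    "output/",
    mini_batch_size=32,
    max_epochs=1,
    embeddings_storage_mode="gpu",  # other options: "cpu" or "none"
)
```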
Let's do some additional fine-tuning of the p2.xlarge instance. In particular, let's change the batch size. With 256, CUDA runs out of memory. With 128, and embeddings stored in the GPU, the performance is much worse:
Same performance if we store the embeddings in the CPU. By the way,
Performance keeps improving with 64 as batch size:
Batch size 32, which was our original test, performance is as expected:
16 as batch size:
8:
4, this is the best performance, a 2x improvement:
If we decrease the batch size to 1, performance suffers:
I took a look at the GPU utilisation. Once we seem to have found the best batch size for our instance type (4), we can try other parameters. Storing the embeddings in the GPU may have some effect, from ~44 to ~46 samples/second, but it's hard to say for sure. In any case, it's not worse:
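Putting together what seems to work best on this p2.xlarge so far (again just a sketch, with values taken from the tests above):

```python
# Best-performing combination found so far on the p2.xlarge (illustrative call).
trainer.train(
    "output/",
    mini_batch_size=4,               # small batches gave roughly 2x the throughput of 32
    embeddings_storage_mode="gpu",   # ~44 -> ~46 samples/second at most, maybe noise
    max_epochs=1,
)
```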
Setting
We try setting
Finally, we try the fine-tuning suggested by Amazon itself regarding GPU clock rate and autoboost:
And it seems to have an effect, another 5% or so. Why isn't this the default? 🤷‍♂️
We now get 5x performance vs my 2013 MacBook Pro, which is starting to be reasonable, I guess. At a cost of $0.90/hour.
We are going to try training on a Google Cloud instance, just to do a quick comparison of cost/speed. I got the $300 credits for new accounts. For some reason, I can't deploy an instance with a GPU until I raise my "GPU quota", which may take up to two business days. 🤷‍♂️ Request sent. ... We've got the quota now (it took only 1-2 hours), but trying to deploy both in
I was finally successful with
I saw pages describing how to compile Python 3.6 from source, but it sounded long and boring. I tried upgrading to Debian 10, which seemed to work, even if I had to answer lots of scary questions about modified config files, but then ... Second attempt: we install Anaconda by downloading the shell script from its page and create a Python 3 environment:
We then install Verba, using static copies of the repository and the corpus to avoid setting up GitHub tokens; this is just a test:
There was an error when trying to install one of the dependencies. The performance with batch size 32 is not impressive, but note it's not detecting the GPU; the Device is showing "cpu":
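A quick, code-level way to see what PyTorch (and therefore Flair) is actually detecting, independent of our training script:

```python
import torch
import flair

print(torch.cuda.is_available())   # False in this case: PyTorch can't see the GPU
print(flair.device)                # Flair picks "cuda:0" automatically when available
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the Tesla K80
```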
So we reinstall the Nvidia driver using the command that appears when logging in:
An old driver?! We go to the Nvidia site and get the latest one for CUDA 10.1 (important: 10.0 didn't work!) for our Tesla K80:
The installation is quite different from the old driver: it's graphical and asks a few more questions. But finally, it works:
Training performance with batch size 4 is good:
With 32, not so much:
I've moved the training code into a Jupyter notebook to test it under Google Colab. The code doesn't need to be changed, and installing the dependencies is just a one-line pip install (the setup cell is sketched below). Setting the runtime to use a GPU as accelerator, and using the initial batch size of 32, we see similar performance to the small g4 instance, ~11 samples/second:
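For reference, the whole Colab setup is a single install line plus a sanity check that the GPU runtime is active; something like this (not the exact cell contents):

```python
# In a Colab cell; lines starting with "!" are shell commands run by the notebook.
# !pip install flair
import torch

print(torch.cuda.is_available())      # True once the GPU accelerator is enabled
print(torch.cuda.get_device_name(0))  # whatever card Colab assigned (e.g. a K80 or T4)
```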
Reducing the batch size to 4 increases performance, as we saw in the p2 Amazon instance, up to a very good 30 samples/second:
If I switch the runtime to use a TPU instead of a GPU, the performance is much worse. It seems like the TPU is not detected automatically and has to be configured manually, and I don't know how to do that with Flair, so whatever, I don't think it's a priority right now. With batch size 4:
Not hugely different with 32:
In terms of cost and performance, GCE vs AWS, we have:
For reference:
People who train models regularly seem to use instances with more CPU power, e.g. Manuel Garrido mentions the g3.4xlarge (16 vCPUs, 122 GiB RAM, Nvidia M60 GPU), and Victor Peinado on GCE uses the n1-highmem-8 (8 vCPUs, 52 GB RAM and an Nvidia V100 GPU). It's quite probable that we're CPU-bound in our tests: when we reduce the batch size for optimal performance, the GPU is not fully utilised and we can see the Python process taking 100% of CPU. When we increase the batch size, we're probably increasing thrashing and context switching on the CPU, hence the bad performance.
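If we ever want to confirm the CPU-bound hypothesis properly, a small monitoring loop run alongside the training would do. A sketch, assuming psutil is installed and nvidia-smi is on the PATH (neither is part of the setup described above):

```python
import subprocess

import psutil  # assumption: installed separately with pip

# Print CPU and GPU utilisation every ~5 seconds while the training runs in another shell.
for _ in range(12):
    gpu = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"],
        stdout=subprocess.PIPE, universal_newlines=True,
    ).stdout.strip()
    cpu = psutil.cpu_percent(interval=5)  # averaged over the 5-second window
    print("CPU: {:5.1f}%   GPU: {}".format(cpu, gpu))
```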
We've learnt quite a lot here, enough for now to train our models.
As the corpus of #12 grows, and we start using the full article text instead of just the title, the training gets really slow, and a laptop is not the ideal place for it. How much faster is the training with a proper GPU? Probably a lot.
AWS offers many different types of servers, of which the G4 is probably the most suitable for us: it has an Nvidia T4 card, with specialized tensor units blah blah. The g4dn.xlarge is not too expensive, $0.526/hour, and is a good way of testing whether Flair/PyTorch actually leverages the card.
In order to run our training, Amazon offers a series of pre-built images (AMIs) with all the relevant deep learning packages already installed. Some more detailed instructions here. We should try deploying one of these AMIs, cloning our repo, fetching the corpus and running the training for a few epochs, just to get a first rough estimate of cost/speed.