
Try model training in the cloud #41

Closed
dcabo opened this issue Nov 5, 2019 · 10 comments
dcabo commented Nov 5, 2019

As the corpus from #12 grows and we start using the full article text instead of just its title, training gets really slow, and a laptop is not the ideal place to run it. How much faster is training with a proper GPU? Probably a lot.

AWS offers many different instance types, of which the G4 family is probably the most suitable for us: it has an Nvidia T4 card with specialized tensor cores for deep learning workloads. The g4dn.xlarge is not too expensive, $0.526/hour, and is a good way of testing whether Flair/PyTorch actually leverages the card.

To run our training, Amazon offers a series of pre-built images (Deep Learning AMIs) with all the relevant deep learning packages already installed. Some more detailed instructions here. We should try deploying one of these AMIs, cloning our repo, fetching the corpus and running the training for a few epochs, just to get a first rough estimate of cost/speed.


dcabo commented Nov 5, 2019

The steps in a new instance would be (I guess):

  • Start the Amazon Deep Learning AMI with Conda installed. This one?
  • Activate the PyTorch / Python 3 environment.
  • Clone the civio/telelediario repo.
  • Install the Python dependencies, pip install -r requirements.txt.
  • Copy the corpus from here and uncompress under test-classification/corpus.
  • Run python3 train.py in the test-classification folder. (Or maybe it's python, depending on how the AMI is set up.)
  • The results will be under resources/taggers/rtve_topics. We don't care much now about the huge model files, best-model.pt and final-model.pt, since we're just performance-testing this thing. (A rough sketch of what the training script does follows below.)
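
For reference, this is roughly what train.py does. It's a minimal sketch only: the exact embedding stack and loading code live in the repo and may differ, but the hyperparameters below are the ones that show up in the training logs further down this thread.

from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# Expects corpus/train.txt, corpus/dev.txt and corpus/test.txt (FastText format)
corpus = ClassificationCorpus('corpus')
label_dict = corpus.make_label_dictionary()

# Illustrative embedding stack; the repo may use a different one
document_embeddings = DocumentRNNEmbeddings([WordEmbeddings('es')], hidden_size=256)
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/taggers/rtve_topics',
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=1,        # a single epoch is enough for the timing tests
              checkpoint=True)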


dcabo commented Nov 5, 2019

Btw, since all this is happening in the backend, we can deploy the instance in whatever location is cheapest.

@esebastian

After having a look at both AMIs, it seemed at first that the Conda-based AMI was geared towards Jupyter-notebook work and that the Base AMI would suit our purposes better. In the end, though, we went with the Conda-based AMI: it looks like some additional dependencies that would otherwise have to be added to our requirements.txt are already present in the Conda-based AMI but not in the Base AMI.

So instead of trying to find out which ones were missing, we decided to go with the Conda-based AMI, and we're using the Ubuntu flavour (Deep Learning AMI (Ubuntu 18.04) Version 25.2 - ami-063690c75d69a8f15) instead of the Amazon Linux one (Deep Learning AMI (Amazon Linux 2) Version 25.0 - ami-08e01b26c47f98d6b), as we're used to it.

Once the instance is launched, we've followed these steps:

  • Copy the id_rsa file with the private key we're using for the civio-bot ([email protected]) GitHub user

    > scp -i us-east-1.pem id_rsa [email protected]:.ssh
    
  • Connect with the instance via SSH (the IP will change once we stop the instance):

    > ssh -i us-east-1.pem [email protected]
    
  • Clone the civio/telelediario repo in the home folder:

    $ git clone [email protected]:civio/telelediario.git
    
  • Ensure all Python dependencies are installed:

    $ cd telelediario
    $ source activate pytorch_p36
    (pytorch_p36)$ pip install -r requirements.txt
    

    While installing the dependencies, a couple of errors about transitive dependencies come up, but they don't seem to have any further consequences:

    fastai 1.0.58 requires nvidia-ml-py3, which is not installed.
    sparkmagic 0.12.5 has requirement ipython<7,>=4.0.2, but you'll have ipython 7.6.1 which is incompatible.
    jupyter-console 5.2.0 has requirement prompt_toolkit<2.0.0,>=1.0.0, but you'll have prompt-toolkit 2.0.10 which is incompatible.
    
  • Download the corpus and uncompress the data:

    (pytorch_p36)$ cd test-classification
    (pytorch_p36)$ wget https://www.dropbox.com/s/hdd34pebte2cf77/corpus.zip?dl=0  -O corpus.zip
    (pytorch_p36)$ unzip corpus.zip
    
  • Train the model:

    (pytorch_p36)$ python train.py
    
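
Before kicking off a long run, a quick sanity check that PyTorch in this environment actually sees the card (not part of the repo, just an interactive check):

(pytorch_p36)$ python
>>> import torch
>>> torch.cuda.is_available()       # should print True if the GPU is visible
>>> torch.cuda.get_device_name(0)   # e.g. the T4 on a g4dn instance
>>> torch.version.cuda              # the CUDA version PyTorch was built against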


esebastian commented Nov 5, 2019

This is the output of the model training. It looks like we got some performance improvement here (at least for a single epoch) compared to running it on our laptops, where it takes around one hour:

2019-11-05 19:08:16,774 Reading data from corpus
2019-11-05 19:08:16,774 Train: corpus/train.txt
2019-11-05 19:08:16,774 Dev: corpus/dev.txt
2019-11-05 19:08:16,774 Test: corpus/test.txt
2019-11-05 19:08:17,170 Computing label dictionary. Progress:
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 25176/25176 [00:05<00:00, 4546.88it/s]
2019-11-05 19:08:23,134 [b'Deportes', b'RTVE', b'Noticias']
100%|██████████████████████████████████████████████████████████████████████████████████████████| 995526/995526 [00:00<00:00, 80408609.52B/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 521/521 [00:00<00:00, 794629.96B/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 714314041/714314041 [00:12<00:00, 58063599.03B/s]
2019-11-05 19:10:14,558 ----------------------------------------------------------------------------------------------------
2019-11-05 19:10:14,560 Model: "TextClassifier(
...
2019-11-05 19:10:14,560 ----------------------------------------------------------------------------------------------------
2019-11-05 19:10:14,560 Corpus: "Corpus: 25176 train + 3089 dev + 3011 test sentences"
2019-11-05 19:10:14,560 ----------------------------------------------------------------------------------------------------
2019-11-05 19:10:14,560 Parameters:
2019-11-05 19:10:14,560  - learning_rate: "0.1"
2019-11-05 19:10:14,560  - mini_batch_size: "32"
2019-11-05 19:10:14,560  - patience: "5"
2019-11-05 19:10:14,560  - anneal_factor: "0.5"
2019-11-05 19:10:14,560  - max_epochs: "1"
2019-11-05 19:10:14,560  - shuffle: "True"
2019-11-05 19:10:14,560  - train_with_dev: "False"
2019-11-05 19:10:14,560  - batch_growth_annealing: "False"
2019-11-05 19:10:14,560 ----------------------------------------------------------------------------------------------------
2019-11-05 19:10:14,560 Model training base path: "resources/taggers/rtve_topics"
2019-11-05 19:10:14,560 ----------------------------------------------------------------------------------------------------
2019-11-05 19:10:14,561 Device: cuda:0
2019-11-05 19:10:14,561 ----------------------------------------------------------------------------------------------------
2019-11-05 19:10:14,561 Embeddings storage mode: cpu
2019-11-05 19:10:14,562 ----------------------------------------------------------------------------------------------------
2019-11-05 19:10:22,379 epoch 1 - iter 0/787 - loss 1.15675247 - samples/sec: 327.57
2019-11-05 19:13:39,994 epoch 1 - iter 78/787 - loss 1.19370936 - samples/sec: 12.65
2019-11-05 19:17:01,396 epoch 1 - iter 156/787 - loss 0.87264143 - samples/sec: 12.42
2019-11-05 19:20:19,916 epoch 1 - iter 234/787 - loss 0.73360590 - samples/sec: 12.60
2019-11-05 19:23:42,711 epoch 1 - iter 312/787 - loss 0.63344293 - samples/sec: 12.34
2019-11-05 19:27:00,210 epoch 1 - iter 390/787 - loss 0.57319642 - samples/sec: 12.67
2019-11-05 19:30:19,503 epoch 1 - iter 468/787 - loss 0.52592247 - samples/sec: 12.56
2019-11-05 19:33:39,768 epoch 1 - iter 546/787 - loss 0.49509478 - samples/sec: 12.50
2019-11-05 19:36:58,838 epoch 1 - iter 624/787 - loss 0.46927108 - samples/sec: 12.56
2019-11-05 19:40:17,576 epoch 1 - iter 702/787 - loss 0.44567947 - samples/sec: 12.59
2019-11-05 19:43:33,929 epoch 1 - iter 780/787 - loss 0.42430900 - samples/sec: 12.74
2019-11-05 19:43:48,521 ----------------------------------------------------------------------------------------------------
2019-11-05 19:43:48,521 EPOCH 1 done: loss 0.4234 - lr 0.1000
2019-11-05 19:47:57,867 DEV : loss 0.21480277180671692 - score 0.9252
2019-11-05 19:47:58,386 BAD EPOCHS (no improvement): 0
2019-11-05 19:48:02,085 ----------------------------------------------------------------------------------------------------
2019-11-05 19:48:02,085 Testing using best model ...
2019-11-05 19:48:02,086 loading file resources/taggers/rtve_topics/best-model.pt
2019-11-05 19:52:06,447 0.9362	0.9362	0.9362
2019-11-05 19:52:06,447
MICRO_AVG: acc 0.8801 - f1-score 0.9362
MACRO_AVG: acc 0.8505 - f1-score 0.9178333333333333
Deportes   tp: 661 - fp: 26 - fn: 58 - tn: 2266 - precision: 0.9622 - recall: 0.9193 - accuracy: 0.8872 - f1-score: 0.9403
Noticias   tp: 1796 - fp: 151 - fn: 33 - tn: 1031 - precision: 0.9224 - recall: 0.9820 - accuracy: 0.9071 - f1-score: 0.9513
RTVE       tp: 362 - fp: 15 - fn: 101 - tn: 2533 - precision: 0.9602 - recall: 0.7819 - accuracy: 0.7573 - f1-score: 0.8619
2019-11-05 19:52:06,448 ----------------------------------------------------------------------------------------------------


dcabo commented Nov 6, 2019

My laptop does 11 samples/second, so the ~12.5 samples/second in AWS is not a huge improvement. :/ The one-hour figure was with a slightly bigger corpus; this one takes ~40 minutes on my laptop.

I actually thought maybe the GPU was not being used in the first test, so I wanted to redo it and see if we could configure something. I used the same AMI but, for some reason, I couldn't select the g4dn.xlarge instance type, so I went with a p2.xlarge one, which is more expensive ($0.90/hour). (OK, we looked into this: if we try to launch the AMI by selecting it from the Marketplace, the g4 is not an option. But if we search for "Deep Learning" in the AMI list and pick the Ubuntu 18.04 one without going through the Marketplace listing, then the g4 is an option. 🤷‍♂ )

Got the same errors when installing the dependencies in the p2.xlarge. Performance is almost double, but so is the cost:

2019-11-06 00:13:50,408 Device: cuda:0
2019-11-06 00:13:50,408 ----------------------------------------------------------------------------------------------------
2019-11-06 00:13:50,408 Embeddings storage mode: cpu
2019-11-06 00:13:50,410 ----------------------------------------------------------------------------------------------------
2019-11-06 00:14:05,243 epoch 1 - iter 0/787 - loss 1.13336742 - samples/sec: 171.82
2019-11-06 00:15:57,520 epoch 1 - iter 78/787 - loss 1.21229839 - samples/sec: 22.35
2019-11-06 00:17:51,000 epoch 1 - iter 156/787 - loss 0.85693566 - samples/sec: 22.14

While training, I ran nvidia-smi to monitor GPU usage, and it looked fine: we have a Tesla K80 at ~60-70% utilisation, so Flair/PyTorch is correctly detecting the GPU and using it, as it should:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   53C    P0    87W / 149W |   3426MiB / 11441MiB |     68%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2988      C   python                                      3415MiB |
+-----------------------------------------------------------------------------+

I then tried storing the embeddings in GPU memory, as suggested in the Flair documentation, but I saw no significant change (apart from that initial huge samples/sec figure, which I don't think means anything because training hasn't really started):

2019-11-06 00:18:48,407 Device: cuda:0
2019-11-06 00:18:48,407 ----------------------------------------------------------------------------------------------------
2019-11-06 00:18:48,407 Embeddings storage mode: gpu
2019-11-06 00:18:48,408 ----------------------------------------------------------------------------------------------------
2019-11-06 00:18:50,141 epoch 1 - iter 0/787 - loss 1.32896566 - samples/sec: 1593.03
2019-11-06 00:20:42,854 epoch 1 - iter 78/787 - loss 1.17365696 - samples/sec: 22.29
2019-11-06 00:22:33,357 epoch 1 - iter 156/787 - loss 0.83682916 - samples/sec: 22.73
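
For the record, this is just an argument to the trainer.train() call in the sketch earlier in the thread (assuming the Flair version on the AMI, which is what prints the "Embeddings storage mode" line above): 'none' recomputes embeddings each time, 'cpu' caches them in RAM and 'gpu' keeps them in GPU memory.

trainer.train('resources/taggers/rtve_topics',
              mini_batch_size=32,
              max_epochs=1,
              embeddings_storage_mode='gpu')   # was 'cpu' in the earlier runs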


dcabo commented Nov 6, 2019

Let's do some additional fine-tuning of the p2.xlarge instance. In particular, let's change the mini_batch_size of the model, which in the original test, based on a Flair example, was set to 32:

With 256, CUDA runs out of memory.

With 128, and embeddings stored in the GPU, the performance is much worse:

2019-11-06 12:29:37,341 epoch 1 - iter 0/197 - loss 1.42595005 - samples/sec: 137.73
2019-11-06 12:34:50,667 epoch 1 - iter 19/197 - loss 1.64276018 - samples/sec: 7.77
2019-11-06 12:40:00,134 epoch 1 - iter 38/197 - loss 1.28851641 - samples/sec: 7.87

Same performance if we store the embeddings in the CPU. By the way, nvidia-smi still reports GPU memory being used; I couldn't see much difference, but I didn't really look into it.

2019-11-06 12:41:27,804 Embeddings storage mode: cpu
2019-11-06 12:41:27,805 ----------------------------------------------------------------------------------------------------
2019-11-06 12:41:46,261 epoch 1 - iter 0/197 - loss 1.20292759 - samples/sec: 136.05
2019-11-06 12:47:00,316 epoch 1 - iter 19/197 - loss 1.71606332 - samples/sec: 7.76

Performance improves again with a batch size of 64:

2019-11-06 12:53:39,296 Embeddings storage mode: cpu
2019-11-06 12:53:39,297 ----------------------------------------------------------------------------------------------------
2019-11-06 12:53:44,199 epoch 1 - iter 0/394 - loss 1.14551151 - samples/sec: 547.40
2019-11-06 12:56:40,970 epoch 1 - iter 39/394 - loss 1.48501664 - samples/sec: 14.16

With batch size 32, our original setting, performance is as expected:

2019-11-06 12:49:13,155 Embeddings storage mode: cpu
2019-11-06 12:49:13,156 ----------------------------------------------------------------------------------------------------
2019-11-06 12:49:14,559 epoch 1 - iter 0/787 - loss 1.10012352 - samples/sec: 2020.56
2019-11-06 12:51:01,672 epoch 1 - iter 78/787 - loss 1.15864682 - samples/sec: 23.42
2019-11-06 12:52:49,198 epoch 1 - iter 156/787 - loss 0.84316317 - samples/sec: 23.33

16 as batch size:

2019-11-06 12:58:36,166 Embeddings storage mode: cpu
2019-11-06 12:58:36,168 ----------------------------------------------------------------------------------------------------
2019-11-06 12:58:37,170 epoch 1 - iter 0/1574 - loss 1.05729330 - samples/sec: 4563.19
2019-11-06 12:59:54,810 epoch 1 - iter 157/1574 - loss 1.02201880 - samples/sec: 32.57
2019-11-06 13:01:12,774 epoch 1 - iter 314/1574 - loss 0.75182536 - samples/sec: 32.45

8:

2019-11-06 13:01:58,183 Embeddings storage mode: cpu
2019-11-06 13:01:58,185 ----------------------------------------------------------------------------------------------------
2019-11-06 13:01:58,616 epoch 1 - iter 0/3147 - loss 1.04104173 - samples/sec: 9511.86
2019-11-06 13:03:04,240 epoch 1 - iter 314/3147 - loss 1.00071794 - samples/sec: 38.62
2019-11-06 13:04:10,995 epoch 1 - iter 628/3147 - loss 0.80355138 - samples/sec: 37.94

With 4 we get the best performance so far, a 2x improvement over batch size 32:

2019-11-06 13:05:25,257 epoch 1 - iter 0/6294 - loss 0.69491398 - samples/sec: 21814.22
2019-11-06 13:06:23,742 epoch 1 - iter 629/6294 - loss 1.05279895 - samples/sec: 43.48
2019-11-06 13:07:21,159 epoch 1 - iter 1258/6294 - loss 0.89595981 - samples/sec: 44.22

If we decrease the batch size to 1, performance suffers:

2019-11-06 13:08:30,481 Embeddings storage mode: cpu
2019-11-06 13:08:30,482 ----------------------------------------------------------------------------------------------------
2019-11-06 13:08:30,672 epoch 1 - iter 0/25176 - loss 1.40525293 - samples/sec: 50816.19
2019-11-06 13:10:00,345 epoch 1 - iter 2517/25176 - loss 1.23911262 - samples/sec: 28.33

I took a look at nvidia-smi, in a very unscientific way, and the GPU seems to be around 50-60% utilisation with a batch size of 32 or less, but 80-90% with bigger batch sizes. And GPU memory usage is way lower with small batches, down to 1.3GiB out of 11.5GiB. Still, performance is better with the smaller sizes. 🤷 None of this seems to have any effect on my laptop's performance, by the way.

Now that we seem to have found the best batch size for our instance type, 4, we can try other parameters. Storing the embeddings in the GPU may have some effect, from ~44 to ~46 samples/second, but it's hard to say for sure. In any case, it's not worse:

2019-11-06 14:20:08,056 Embeddings storage mode: gpu
2019-11-06 14:20:08,057 ----------------------------------------------------------------------------------------------------
2019-11-06 14:20:08,534 epoch 1 - iter 0/6294 - loss 1.64474583 - samples/sec: 7567.90
2019-11-06 14:21:03,870 epoch 1 - iter 629/6294 - loss 1.08841361 - samples/sec: 45.97
2019-11-06 14:21:59,124 epoch 1 - iter 1258/6294 - loss 0.91228417 - samples/sec: 45.95
2019-11-06 14:22:53,672 epoch 1 - iter 1887/6294 - loss 0.81415291 - samples/sec: 46.64
2019-11-06 14:23:48,130 epoch 1 - iter 2516/6294 - loss 0.76817472 - samples/sec: 46.63
2019-11-06 14:24:43,668 epoch 1 - iter 3145/6294 - loss 0.75128710 - samples/sec: 45.74

Changing checkpoint from True to False (True is needed if we want to be able to resume training) doesn't seem to have a big effect (~45 vs ~46 before; I think it's noise, there's no reason disabling it would make things slower), so we'll keep it on:

2019-11-06 14:27:55,861 Embeddings storage mode: gpu
2019-11-06 14:27:55,863 ----------------------------------------------------------------------------------------------------
2019-11-06 14:27:56,279 epoch 1 - iter 0/6294 - loss 1.63443124 - samples/sec: 17409.78
2019-11-06 14:28:54,008 epoch 1 - iter 629/6294 - loss 1.03466286 - samples/sec: 43.99
2019-11-06 14:29:50,403 epoch 1 - iter 1258/6294 - loss 0.87448108 - samples/sec: 45.04
2019-11-06 14:30:46,756 epoch 1 - iter 1887/6294 - loss 0.78426998 - samples/sec: 45.13

We try setting shuffle to False. We hadn't specified it before; it seems to default to True. I don't really know what it does (presumably it shuffles the training data between epochs), and it doesn't seem to have a big impact on performance, so we'll leave it as True, the default:

2019-11-06 14:32:31,205 Embeddings storage mode: gpu
2019-11-06 14:32:31,206 ----------------------------------------------------------------------------------------------------
2019-11-06 14:32:31,460 epoch 1 - iter 0/6294 - loss 1.71518946 - samples/sec: 22755.71
2019-11-06 14:33:26,920 epoch 1 - iter 629/6294 - loss 1.10240182 - samples/sec: 45.89
2019-11-06 14:34:21,253 epoch 1 - iter 1258/6294 - loss 0.88940667 - samples/sec: 46.76
2019-11-06 14:35:15,890 epoch 1 - iter 1887/6294 - loss 0.80563324 - samples/sec: 46.49

Finally, we try the fine-tuning suggested by Amazon itself regarding GPU clock rate and autoboost:

$ sudo nvidia-smi --auto-boost-default=0
All done.
$ sudo nvidia-smi -ac 2505,875
Applications clocks set to "(MEM 2505, SM 875)" for GPU 00000000:00:1E.0
All done.

And it seems to have an effect, another 5% or so. Why isn't this the default? 🤷‍♂

2019-11-06 14:36:35,209 ----------------------------------------------------------------------------------------------------
2019-11-06 14:36:35,445 epoch 1 - iter 0/6294 - loss 1.18829226 - samples/sec: 29413.78
2019-11-06 14:37:27,670 epoch 1 - iter 629/6294 - loss 1.10227158 - samples/sec: 48.78
2019-11-06 14:38:19,789 epoch 1 - iter 1258/6294 - loss 0.89232600 - samples/sec: 48.76
2019-11-06 14:39:11,819 epoch 1 - iter 1887/6294 - loss 0.84335122 - samples/sec: 48.85

We now get roughly 5x the performance of my 2013 MacBook Pro, which is starting to be reasonable, I guess. At a cost of $0.90/hour.
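
As a rough sanity check on that 5x figure, with 25,176 training sentences a single epoch works out to:

# Back-of-the-envelope epoch time from the samples/sec figures above
train_sentences = 25176
for setup, samples_per_sec in [("MacBook Pro 2013", 11), ("p2.xlarge, batch 4, tuned clocks", 48.8)]:
    print(f"{setup}: ~{train_sentences / samples_per_sec / 60:.0f} min per epoch")
# MacBook Pro 2013: ~38 min per epoch
# p2.xlarge, batch 4, tuned clocks: ~9 min per epoch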


dcabo commented Nov 6, 2019

We are going to try training on a Google Cloud instance, just to do a quick comparison of cost/speed. I got the $300 credits for new accounts. For some reason, I can't deploy an instance with a GPU until I raise my "GPU quota", which may take up to two business days. 🤷‍♂ Request sent.

...

We've got the quota now (it took only 1-2 hours), but trying to deploy in both us-east-1 and europe-west1-b results in an error:

{
"ResourceType":"compute.v1.instance",
"ResourceErrorCode":"ZONE_RESOURCE_POOL_EXHAUSTED",
"ResourceErrorMessage":"The zone 'projects/verba-258215/zones/europe-west1-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later."
}

I was finally successful with europe-west1-d. I logged in via the browser, since I didn't see a way to connect via normal SSH without installing Google stuff. Then I saw the image comes with Debian 9 and Python 3.5.3 (!), which is not recent enough for Flair:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.11 (stretch)
Release:        9.11
Codename:       stretch

I saw pages describing how to compile Python 3.6 from source, but it sounded long and boring. I tried upgrading to Debian 10, which seemed to work, even if I had to answer lots of scary questions about modified config files, but then pip3 kept failing, complaining about a missing internal module. 🤷‍♂ So I destroyed the instance and started from scratch; upgrading to Debian 10 is too much hassle.

...

Second attempt. We install Anaconda by downloading the shell script from its page and create a Python 3 environment:

$ conda create -n mypython3 python=3
$ source activate mypython3

We then install Verba, using static copies of the repository and the corpus to avoid setting up GitHub tokens (this is just a test):

$ wget https://www.dropbox.com/s/0ir9zt5h0mbet54/telelediario.zip?dl=0 -O telelediario.zip
$ unzip telelediario.zip
$ cd telelediario
$ pip install -r requirements.txt
$ cd test-classification
$ wget https://www.dropbox.com/s/hdd34pebte2cf77/corpus.zip?dl=0 -O corpus.zip
$ unzip corpus.zip
$ python train.py

There was an error when trying to install torch, as pip couldn't find a recent-enough version. I saw this discussion, so I looked at the official PyTorch page and ran conda install pytorch torchvision -c pytorch. Note that this apparently downgraded Python from the 3.8 that Conda initially put in the environment to 3.7. Whatever, it finally works now.

The performance with batch size 32 is not impressive, but note that it's not detecting the GPU: the Device line shows "cpu":

2019-11-06 17:49:43,495 Device: cpu
2019-11-06 17:49:43,496 ----------------------------------------------------------------------------------------------------
2019-11-06 17:49:43,496 Embeddings storage mode: gpu
2019-11-06 17:49:43,497 ----------------------------------------------------------------------------------------------------
2019-11-06 17:49:48,501 epoch 1 - iter 0/787 - loss 1.22800052 - samples/sec: 615.10
2019-11-06 17:54:13,368 epoch 1 - iter 78/787 - loss 1.28581877 - samples/sec: 9.45

nvidia-smi doesn't show any process running during training:

Wed Nov  6 17:57:34 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P0    79W / 149W |      0MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

So we reinstall the Nvidia driver using the command that appears when logging in, sudo /opt/deeplearning/install-driver.sh, but it doesn't change the situation. We try this simple test to make sure CUDA is fine, and it's not:

(mypython3) david_cabo@verba-5-vm:~$ python
Python 3.7.5 (default, Oct 25 2019, 15:51:11) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.current_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david_cabo/.conda/envs/mypython3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 386, in current_device
    _lazy_init()
  File "/home/david_cabo/.conda/envs/mypython3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 192, in _lazy_init
    _check_driver()
  File "/home/david_cabo/.conda/envs/mypython3/lib/python3.7/site-packages/torch/cuda/__init__.py", line 111, in _check_driver
    of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError: 
The NVIDIA driver on your system is too old (found version 10000).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.

An old driver?! We go to the Nvidia site and get the latest driver for CUDA 10.1 (important: 10.0 didn't work!) for our Tesla K80:

$ wget http://us.download.nvidia.com/tesla/418.87/NVIDIA-Linux-x86_64-418.87.01.run
$ chmod u+x NVIDIA*
$ sudo ./NVIDIA-Linux-x86_64-418.87.01.run

The installation is quite different from the old driver: it's graphical and asks a few more questions. But finally, it works:

>>> import torch
>>> torch.cuda.current_device()
0

Training performance with batch size 4 is good:

2019-11-06 18:26:28,564 Device: cuda:0
2019-11-06 18:26:28,564 ----------------------------------------------------------------------------------------------------
2019-11-06 18:26:28,564 Embeddings storage mode: gpu
2019-11-06 18:26:28,566 ----------------------------------------------------------------------------------------------------
2019-11-06 18:26:28,970 epoch 1 - iter 0/6294 - loss 1.47962570 - samples/sec: 12512.99
2019-11-06 18:27:32,065 epoch 1 - iter 629/6294 - loss 1.03310941 - samples/sec: 40.25
2019-11-06 18:28:32,771 epoch 1 - iter 1258/6294 - loss 0.88425283 - samples/sec: 41.81
2019-11-06 18:29:34,215 epoch 1 - iter 1887/6294 - loss 0.78575202 - samples/sec: 41.38

With 32, not so much:

2019-11-06 18:30:27,859 Device: cuda:0
2019-11-06 18:30:27,859 ----------------------------------------------------------------------------------------------------
2019-11-06 18:30:27,859 Embeddings storage mode: gpu
2019-11-06 18:30:27,861 ----------------------------------------------------------------------------------------------------
2019-11-06 18:30:29,715 epoch 1 - iter 0/787 - loss 1.50548244 - samples/sec: 1530.80
2019-11-06 18:32:31,427 epoch 1 - iter 78/787 - loss 1.16432978 - samples/sec: 20.62
2019-11-06 18:34:33,649 epoch 1 - iter 156/787 - loss 0.84494117 - samples/sec: 20.54
2019-11-06 18:36:35,296 epoch 1 - iter 234/787 - loss 0.69860660 - samples/sec: 20.62
2019-11-06 18:38:35,479 epoch 1 - iter 312/787 - loss 0.60444104 - samples/sec: 20.89
2019-11-06 18:40:37,419 epoch 1 - iter 390/787 - loss 0.55313267 - samples/sec: 20.59
2019-11-06 18:42:36,645 epoch 1 - iter 468/787 - loss 0.51158701 - samples/sec: 21.02


dcabo commented Nov 6, 2019

I've moved the training code into a Jupyter notebook to test it under Google Colab. The code doesn't need to be changed, and installing the dependencies is just !pip install flair. Uploading the files is a bit more of a pain: I ended up connecting my Google Drive, but it needs to be done every time, a few boring clicks...
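
For reference, the Drive dance is just a couple of cells; the paths here are examples, not the actual notebook:

# In a Colab cell; flair is installed with: !pip install flair
from google.colab import drive
drive.mount('/content/drive')   # prompts for an authorisation code every session

# then copy the corpus into the runtime, e.g.:
# !cp -r "/content/drive/My Drive/verba/corpus" corpus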

Setting the runtime with an additional GPU as accelerator, using the initial batch size of 32, we see similar performance to the small g4 instance, ~11 samples/second:

2019-11-06 16:01:57,811 Device: cuda:0
2019-11-06 16:01:57,817 ----------------------------------------------------------------------------------------------------
2019-11-06 16:01:57,819 Embeddings storage mode: gpu
2019-11-06 16:01:57,822 ----------------------------------------------------------------------------------------------------
2019-11-06 16:02:02,413 epoch 1 - iter 0/787 - loss 1.12191868 - samples/sec: 630.55
2019-11-06 16:05:48,336 epoch 1 - iter 78/787 - loss 1.11327839 - samples/sec: 11.10

Reducing the batch size to 4 increases performance, as we saw in the p2 Amazon instance, up to a very good 30 samples/second:

2019-11-06 16:13:37,040 Device: cuda:0
2019-11-06 16:13:37,042 ----------------------------------------------------------------------------------------------------
2019-11-06 16:13:37,044 Embeddings storage mode: gpu
2019-11-06 16:13:37,049 ----------------------------------------------------------------------------------------------------
2019-11-06 16:13:37,731 epoch 1 - iter 0/6294 - loss 1.74821568 - samples/sec: 10901.13
2019-11-06 16:15:01,127 epoch 1 - iter 629/6294 - loss 0.93221317 - samples/sec: 30.41
2019-11-06 16:16:24,575 epoch 1 - iter 1258/6294 - loss 0.85967380 - samples/sec: 30.37
2019-11-06 16:17:48,669 epoch 1 - iter 1887/6294 - loss 0.79925003 - samples/sec: 30.13

If I switch the runtime to use a TPU instead of a GPU, performance is much worse. It seems like the TPU is not detected automatically and has to be configured manually, and I don't know how to do that with Flair (a possible, untested route is sketched after the logs below), so whatever, I don't think it's a priority right now. With batch size 4:

2019-11-06 16:20:54,494 Device: cpu
2019-11-06 16:20:54,495 ----------------------------------------------------------------------------------------------------
2019-11-06 16:20:54,495 Embeddings storage mode: gpu
2019-11-06 16:20:54,498 ----------------------------------------------------------------------------------------------------
2019-11-06 16:20:55,764 epoch 1 - iter 0/6294 - loss 1.51522887 - samples/sec: 2340.08 
2019-11-06 16:26:47,461 epoch 1 - iter 629/6294 - loss 1.03311608 - samples/sec: 7.17

Not hugely different with 32:

2019-11-06 16:28:07,594 Device: cpu
2019-11-06 16:28:07,595 ----------------------------------------------------------------------------------------------------
2019-11-06 16:28:07,596 Embeddings storage mode: gpu
2019-11-06 16:28:07,600 ----------------------------------------------------------------------------------------------------
2019-11-06 16:28:14,041 epoch 1 - iter 0/787 - loss 1.39594066 - samples/sec: 453.21
2019-11-06 16:33:15,774 epoch 1 - iter 78/787 - loss 1.22570608 - samples/sec: 8.29
2019-11-06 16:38:17,683 epoch 1 - iter 156/787 - loss 0.89963674 - samples/sec: 8.29
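
If we ever want to try the TPU properly, the usual route seems to be the torch_xla package; something like the sketch below might point Flair at the XLA device, but this is completely untested and Flair's training loop may simply not work on top of it:

# Untested sketch: requires a Colab TPU runtime with torch_xla installed
import torch_xla.core.xla_model as xm
import flair

flair.device = xm.xla_device()   # make Flair place tensors on the TPU instead of cpu/cuda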

dcabo changed the title from "Try model training in AWS" to "Try model training in the cloud" on Nov 6, 2019

dcabo commented Nov 6, 2019

In terms of cost and performance, GCE vs AWS, we have:

  • GCE: 2 vCPUs highmem, 13GB memory, Nvidia K80: $0.404/hour ($295.20/month, 730 hours) => 41 samples/second
  • AWS p2.xlarge: 4 vCPUs (12 ECUs), 61GiB memory, Nvidia K80: $0.90/hour => 49 samples/second
  • AWS g4dn.xlarge: 4 vCPUs, 16GB memory, Nvidia T4: $0.526/hour => ~12 samples/second with batch size 32 and no GPU fine tuning. ~51 samples/second with batch size 4 and GPU configured with sudo nvidia-smi -ac 5001,1590 (MEM 5001, SM 1590).
  • Google Colab: free => 30 samples/second

For reference:

  • MacBook Pro 2013 (David): ~11 samples/second.
  • MacBook Pro 2018 (Eduardo): ~18 samples/second.
  • Physical Ubuntu server (UPM). 12 CPUs (i7-5930K @ 3.50GHz), 31.3GiB RAM. GPU Nvidia Titan RTX. => 31 samples / second with batch size 32. (I didn't want to bother them with other batch sizes.)

People who train models regularly seem to use instances with more CPU power, e.g. Manuel Garrido mentions the g3.4xlarge (16 vCPUs, 122GiB RAM, Nvidia M60 GPU), and Victor Peinado in GCE uses the n1-highmem-8 (8 vCPUs, 52 GB RAM and a Nvidia V100 GPU). It's quite probable that we're CPU-bound in our tests: when we reduce the batch size for optimal performance the GPU is not fully utilised, and we can see the Python process taking 100% of CPU. When we increase the batch size we're probably increasing thrashing and context switching in the CPU, hence the bad performance.
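
Dividing throughput by price, GCE and the g4dn come out roughly on par, with the p2.xlarge clearly behind (Colab aside, being free):

# samples/second per $/hour, from the figures above
options = {
    "GCE (K80)": (41, 0.404),
    "AWS p2.xlarge (K80)": (49, 0.90),
    "AWS g4dn.xlarge (T4)": (51, 0.526),
}
for name, (samples_sec, dollars_hour) in options.items():
    print(f"{name}: {samples_sec / dollars_hour:.0f} samples/sec per $/hour")
# GCE (K80): 101 samples/sec per $/hour
# AWS p2.xlarge (K80): 54 samples/sec per $/hour
# AWS g4dn.xlarge (T4): 97 samples/sec per $/hour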


dcabo commented Nov 11, 2019

We've learnt quite a lot here, enough for now to train our models.

dcabo closed this as completed on Nov 11, 2019