
How to reproduce the BLEU score on 2 GPU cards? #4

Open
Daisy-123 opened this issue Jul 28, 2020 · 4 comments
Labels
question Further information is requested

Comments


Daisy-123 commented Jul 28, 2020

My env:

2× NVIDIA GeForce RTX 2080 Ti
PyTorch 1.5.0

Data source: http://www.statmt.org/wmt17/translation-task.html

It includes "News Commentary v12" and "UN Parallel Corpus V1.0".

Data preprocessing follows prepare.sh.

Train:

CUDA_VISIBLE_DEVICES=0,1 fairseq-train data-bin/wmt17_en_zh -a transformer \
    --optimizer adam -s en -t zh --label-smoothing 0.1 --dropout 0.3 \
    --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt \
    --weight-decay 0.0001 --criterion label_smoothed_cross_entropy \
    --max-update 1000000 --warmup-updates 10000 --warmup-init-lr '1e-7' \
    --lr '0.001' --adam-betas '(0.9, 0.98)' --adam-eps '1e-09' \
    --clip-norm 25.0 --keep-last-epochs 10 --save-dir checkpoints_test \
    |& tee -a wmt17_train.test.log

Then I got a very bad score...

2020-07-28 11:22:01 | INFO | fairseq_cli.generate | Generate test with beam=5: BLEU4 = 0.00, 6.4/0.0/0.0/0.0 (BP=0.444, ratio=0.552, syslen=26013, reflen=47155)
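For reference, the score above comes from the standard fairseq-generate command, roughly like this (the checkpoint path is assumed from my --save-dir):

fairseq-generate data-bin/wmt17_en_zh --path checkpoints_test/checkpoint_best.pt --beam 5 --remove-bpe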

Training log is here:

https://drive.google.com/file/d/11l5c8VFH1nmZxjbVhD15U3PbWHBFkCtd/view?usp=sharing

Can you give me some suggestions about this result?
Thank you!

sanxing-chen added the question label on Jul 28, 2020
sanxing-chen (Owner) commented Jul 28, 2020

Please take a look at #3 (comment).

In your case, setting --update-freq to 3 might be appropriate, since the original setting was run on a 6-GPU server.
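Something like this should roughly match the effective batch size of the 6-GPU run (a sketch; keep the rest of your flags unchanged):

CUDA_VISIBLE_DEVICES=0,1 fairseq-train data-bin/wmt17_en_zh -a transformer \
    -s en -t zh --max-tokens 4000 --update-freq 3 \
    [other flags as in your original command]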

Daisy-123 (Author) commented

I tried setting --update-freq 3, but the training loss still decreases slowly.

2 GPUs / update-freq 3:

epoch | train_loss | valid_loss
  1   |   7.516    |   9.577
  2   |   6.807    |   9.442
  3   |   6.709    |   9.353
  4   |   6.660    |   9.357
  5   |   6.631    |   9.285
  6   |   6.610    |   9.294

Should I try increasing the max-tokens size, or is this result just due to the difference between 2 and 6 GPUs?
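(If I understand --update-freq correctly, my effective batch is 2 GPUs × 4000 max-tokens × update-freq 3 = 24,000 tokens per update, which should already match 6 GPUs × 4000 tokens × update-freq 1 = 24,000 on your server.)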
My log file:
https://drive.google.com/file/d/11l5c8VFH1nmZxjbVhD15U3PbWHBFkCtd/view?usp=sharing

Thank you.

sanxing-chen (Owner) commented

The update frequency in the log still looks weird to me; you can check out my log file here: #3 (comment). For example, after the first epoch, your learning rate is much lower than mine.
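As far as I understand fairseq's inverse_sqrt scheduler, the learning rate grows linearly from --warmup-init-lr to --lr over the first --warmup-updates optimizer updates, then decays as lr × sqrt(warmup_updates / num_updates). For example, with --lr 0.001 and --warmup-updates 10000, update 40000 gives 0.001 × sqrt(10000 / 40000) = 5e-4. So the learning rate at the end of an epoch is effectively a readout of how many optimizer updates have happened, and a mismatch with my log suggests --update-freq is not taking effect the way we expect.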

afaq-ahmad commented

Hi, can someone please guide me on how to start the training correctly? I created the dataset the same way. I have around 24 million sentences, but the loss is not decreasing, and the BLEU score is only around 0.85 after 5 epochs and 3 days of training on a single P100 GPU. I am using this command:

CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/wmt17_en_zh \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 8192 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --update-freq 4
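(For what it's worth, if I understand --update-freq correctly, this gives an effective batch of 1 GPU × 8192 max-tokens × update-freq 4 = 32,768 tokens per update.)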
