Issue about the training process #18

Open
melohux opened this issue Aug 23, 2021 · 4 comments
melohux commented Aug 23, 2021

Thanks for your excellent work. I ran into some problems when training your model following your instructions.

  1. You claimed in the paper that batch size 256 was used for all experimental results, but the command in your instructions, ./distributed_train.sh 8 <path_to_imagenet> --model ddf_mul_resnet50 --lr 0.4 --warmup-epochs 5 --epochs 120 --sched cosine -b 128 -j 6 --amp --dist-bn reduce, seems to launch training with a total batch size of 128 * 8 = 1024.

  2. When I follow that command, the training process seems correct, but the validation results look wrong:

Train: 129 [ 0/1251 ( 0%)] Loss: 1.837984 (1.8380) Time: 1.869s, 547.82/s (1.869s, 547.82/s) LR: 1.000e-05 Data: 1.558 (1.558)
Train: 129 [ 50/1251 ( 4%)] Loss: 1.915305 (1.8766) Time: 0.339s, 3023.06/s (0.370s, 2768.98/s) LR: 1.000e-05 Data: 0.016 (0.047)
Train: 129 [ 100/1251 ( 8%)] Loss: 1.936936 (1.8967) Time: 0.338s, 3028.31/s (0.355s, 2885.79/s) LR: 1.000e-05 Data: 0.019 (0.033)
Train: 129 [ 150/1251 ( 12%)] Loss: 1.877319 (1.8919) Time: 0.336s, 3045.61/s (0.349s, 2931.77/s) LR: 1.000e-05 Data: 0.017 (0.028)
Train: 129 [ 200/1251 ( 16%)] Loss: 1.827796 (1.8791) Time: 0.340s, 3011.16/s (0.347s, 2953.92/s) LR: 1.000e-05 Data: 0.022 (0.025)
Train: 129 [ 250/1251 ( 20%)] Loss: 1.865778 (1.8769) Time: 0.338s, 3031.46/s (0.345s, 2966.32/s) LR: 1.000e-05 Data: 0.018 (0.024)
Train: 129 [ 300/1251 ( 24%)] Loss: 1.879160 (1.8772) Time: 0.343s, 2982.10/s (0.344s, 2975.39/s) LR: 1.000e-05 Data: 0.017 (0.023)
Train: 129 [ 350/1251 ( 28%)] Loss: 1.857682 (1.8747) Time: 0.337s, 3039.01/s (0.344s, 2980.71/s) LR: 1.000e-05 Data: 0.018 (0.022)
Train: 129 [ 400/1251 ( 32%)] Loss: 1.845622 (1.8715) Time: 0.339s, 3017.76/s (0.343s, 2984.77/s) LR: 1.000e-05 Data: 0.019 (0.022)
Train: 129 [ 450/1251 ( 36%)] Loss: 1.938300 (1.8782) Time: 0.335s, 3052.68/s (0.343s, 2989.22/s) LR: 1.000e-05 Data: 0.018 (0.021)
Train: 129 [ 500/1251 ( 40%)] Loss: 1.805174 (1.8716) Time: 0.339s, 3018.73/s (0.342s, 2992.69/s) LR: 1.000e-05 Data: 0.021 (0.021)
Train: 129 [ 550/1251 ( 44%)] Loss: 1.859214 (1.8705) Time: 0.340s, 3013.13/s (0.342s, 2994.75/s) LR: 1.000e-05 Data: 0.018 (0.021)
Train: 129 [ 600/1251 ( 48%)] Loss: 1.872183 (1.8707) Time: 0.338s, 3029.93/s (0.342s, 2997.84/s) LR: 1.000e-05 Data: 0.017 (0.020)
Train: 129 [ 650/1251 ( 52%)] Loss: 1.859764 (1.8699) Time: 0.351s, 2916.84/s (0.341s, 2999.01/s) LR: 1.000e-05 Data: 0.016 (0.020)
Train: 129 [ 700/1251 ( 56%)] Loss: 1.845083 (1.8682) Time: 0.339s, 3021.80/s (0.341s, 3000.37/s) LR: 1.000e-05 Data: 0.019 (0.020)
Train: 129 [ 750/1251 ( 60%)] Loss: 1.987917 (1.8757) Time: 0.337s, 3038.19/s (0.341s, 3000.99/s) LR: 1.000e-05 Data: 0.017 (0.020)
Train: 129 [ 800/1251 ( 64%)] Loss: 1.889720 (1.8765) Time: 0.344s, 2979.40/s (0.341s, 3002.14/s) LR: 1.000e-05 Data: 0.018 (0.020)
Train: 129 [ 850/1251 ( 68%)] Loss: 1.952255 (1.8807) Time: 0.341s, 3006.07/s (0.341s, 3003.34/s) LR: 1.000e-05 Data: 0.019 (0.020)
Train: 129 [ 900/1251 ( 72%)] Loss: 1.884332 (1.8809) Time: 0.342s, 2996.99/s (0.341s, 3003.49/s) LR: 1.000e-05 Data: 0.017 (0.019)
Train: 129 [ 950/1251 ( 76%)] Loss: 1.888057 (1.8813) Time: 0.336s, 3045.26/s (0.341s, 3004.35/s) LR: 1.000e-05 Data: 0.019 (0.019)
Train: 129 [1000/1251 ( 80%)] Loss: 1.835092 (1.8791) Time: 0.342s, 2993.21/s (0.341s, 3004.89/s) LR: 1.000e-05 Data: 0.020 (0.019)
Train: 129 [1050/1251 ( 84%)] Loss: 1.847999 (1.8777) Time: 0.336s, 3047.83/s (0.341s, 3005.33/s) LR: 1.000e-05 Data: 0.016 (0.019)
Train: 129 [1100/1251 ( 88%)] Loss: 1.849290 (1.8764) Time: 0.336s, 3048.30/s (0.341s, 3005.95/s) LR: 1.000e-05 Data: 0.018 (0.019)
Train: 129 [1150/1251 ( 92%)] Loss: 1.883289 (1.8767) Time: 0.334s, 3068.20/s (0.341s, 3006.69/s) LR: 1.000e-05 Data: 0.016 (0.019)
Train: 129 [1200/1251 ( 96%)] Loss: 1.855369 (1.8759) Time: 0.335s, 3054.89/s (0.340s, 3007.43/s) LR: 1.000e-05 Data: 0.018 (0.019)
Train: 129 [1250/1251 (100%)] Loss: 1.903906 (1.8769) Time: 0.318s, 3224.92/s (0.340s, 3008.18/s) LR: 1.000e-05 Data: 0.000 (0.019)
Distributing BatchNorm running means and vars
Test: [ 0/48] Time: 1.501 (1.501) Loss: 9.5781 (9.5781) Acc@1: 0.0000 ( 0.0000) Acc@5: 0.2930 ( 0.2930)
Test: [ 48/48] Time: 0.068 (0.242) Loss: 9.5078 (9.5792) Acc@1: 0.0000 ( 0.0940) Acc@5: 0.2358 ( 0.3160)

Does this training log match yours? And do you have any idea what is going wrong in the testing part?

@theFoxofSky (Owner)

"The learning rate is set to 0.1 with batch size 256 and decays to 1e-5 following the cosine schedule. " This line in paper means that I set 0.1 for batch size 256, i.e.

lr = 0.1 * (batch_size / 256),

that is 0.4 for batch size 1024 (4*256).

Sorry for the unclear wording.
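
For concreteness, a minimal sketch of that scaling rule, assuming 8 GPUs and the per-GPU batch size from the command above (variable names are illustrative, not taken from the training script):

    # Linear scaling rule: lr = 0.1 * (batch_size / 256), applied to the example command.
    base_lr = 0.1                            # learning rate defined for a total batch size of 256
    per_gpu_batch = 128                      # value passed via -b
    num_gpus = 8                             # first argument to distributed_train.sh
    global_batch = per_gpu_batch * num_gpus  # 128 * 8 = 1024
    lr = base_lr * global_batch / 256        # 0.1 * 1024 / 256 = 0.4
    print(global_batch, lr)                  # 1024 0.4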

Since I am working on the next paper based on DDF, I have updated this repo several times. I will check the validation code. You can also verify it by using the released model parameters.
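
If it helps while the validation code is being checked, here is a rough sketch of such a check with the released weights, assuming a timm-style checkpoint and that importing the repo's model definitions registers ddf_mul_resnet50 with timm (the file name and the import line are assumptions, adjust to whatever the release actually contains):

    # Sanity check of released weights; paths and checkpoint layout are assumptions.
    import torch
    import timm
    import models  # assumption: importing the repo's model file registers 'ddf_mul_resnet50' with timm

    model = timm.create_model('ddf_mul_resnet50', pretrained=False)
    ckpt = torch.load('ddf_mul_resnet50.pth', map_location='cpu')   # placeholder path to released weights
    state_dict = ckpt.get('state_dict', ckpt)                       # timm checkpoints often nest the weights
    model.load_state_dict(state_dict)
    model.eval()
    # Evaluating this model on the ImageNet val set should roughly reproduce the reported top-1;
    # near-zero Acc@1 with the released weights would point at the eval pipeline instead.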


melohux commented Aug 23, 2021

So were the experimental results in your paper obtained with batch size 256 or 1024? And does my training log match yours in terms of the loss values? Also, it would be great if you could check the validation code, thanks.

@theFoxofSky (Owner)

> So were the experimental results in your paper obtained with batch size 256 or 1024? And does my training log match yours in terms of the loss values? Also, it would be great if you could check the validation code, thanks.

I use 1024 for R50, 512 for R101.

@xiaoachen98

> So were the experimental results in your paper obtained with batch size 256 or 1024? And does my training log match yours in terms of the loss values? Also, it would be great if you could check the validation code, thanks.
>
> I use 1024 for R50, 512 for R101.

And for R101, what is your data augmentation schedule?
I found random erasing, auto-augment, and color jitter in your training code.
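
For reference, a minimal sketch of how those three augmentations are typically enabled through timm's transform factory; the hyper-parameter values below are placeholders, not necessarily the ones used for R101:

    # Example train-time transform with color jitter, auto-augment, and random erasing (timm).
    from timm.data import create_transform

    train_transform = create_transform(
        input_size=224,
        is_training=True,
        color_jitter=0.4,                 # --color-jitter
        auto_augment='rand-m9-mstd0.5',   # --aa
        re_prob=0.25,                     # --reprob, random erasing probability
        re_mode='pixel',                  # --remode
        interpolation='bicubic',
    )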
