Issue about the training process #18

melohux · 2021-08-23T01:43:04Z

Thanks for your excellent work. I ran into some problems when training your model following your instructions.

Your paper states that batch size 256 was used for all experimental results, but your instructions give the command ./distributed_train.sh 8 <path_to_imagenet> --model ddf_mul_resnet50 --lr 0.4 --warmup-epochs 5 --epochs 120 --sched cosine -b 128 -j 6 --amp --dist-bn reduce, which appears to launch training with a total batch size of 128*8 = 1024.
When I run that command, training appears to work correctly, but validation does not:

Train: 129 [ 0/1251 ( 0%)] Loss: 1.837984 (1.8380) Time: 1.869s, 547.82/s (1.869s, 547.82/s) LR: 1.000e-05 Data: 1.558 (1.558)
Train: 129 [ 50/1251 ( 4%)] Loss: 1.915305 (1.8766) Time: 0.339s, 3023.06/s (0.370s, 2768.98/s) LR: 1.000e-05 Data: 0.016 (0.047)
Train: 129 [ 100/1251 ( 8%)] Loss: 1.936936 (1.8967) Time: 0.338s, 3028.31/s (0.355s, 2885.79/s) LR: 1.000e-05 Data: 0.019 (0.033)
Train: 129 [ 150/1251 ( 12%)] Loss: 1.877319 (1.8919) Time: 0.336s, 3045.61/s (0.349s, 2931.77/s) LR: 1.000e-05 Data: 0.017 (0.028)
Train: 129 [ 200/1251 ( 16%)] Loss: 1.827796 (1.8791) Time: 0.340s, 3011.16/s (0.347s, 2953.92/s) LR: 1.000e-05 Data: 0.022 (0.025)
Train: 129 [ 250/1251 ( 20%)] Loss: 1.865778 (1.8769) Time: 0.338s, 3031.46/s (0.345s, 2966.32/s) LR: 1.000e-05 Data: 0.018 (0.024)
Train: 129 [ 300/1251 ( 24%)] Loss: 1.879160 (1.8772) Time: 0.343s, 2982.10/s (0.344s, 2975.39/s) LR: 1.000e-05 Data: 0.017 (0.023)
Train: 129 [ 350/1251 ( 28%)] Loss: 1.857682 (1.8747) Time: 0.337s, 3039.01/s (0.344s, 2980.71/s) LR: 1.000e-05 Data: 0.018 (0.022)
Train: 129 [ 400/1251 ( 32%)] Loss: 1.845622 (1.8715) Time: 0.339s, 3017.76/s (0.343s, 2984.77/s) LR: 1.000e-05 Data: 0.019 (0.022)
Train: 129 [ 450/1251 ( 36%)] Loss: 1.938300 (1.8782) Time: 0.335s, 3052.68/s (0.343s, 2989.22/s) LR: 1.000e-05 Data: 0.018 (0.021)
Train: 129 [ 500/1251 ( 40%)] Loss: 1.805174 (1.8716) Time: 0.339s, 3018.73/s (0.342s, 2992.69/s) LR: 1.000e-05 Data: 0.021 (0.021)
Train: 129 [ 550/1251 ( 44%)] Loss: 1.859214 (1.8705) Time: 0.340s, 3013.13/s (0.342s, 2994.75/s) LR: 1.000e-05 Data: 0.018 (0.021)
Train: 129 [ 600/1251 ( 48%)] Loss: 1.872183 (1.8707) Time: 0.338s, 3029.93/s (0.342s, 2997.84/s) LR: 1.000e-05 Data: 0.017 (0.020)
Train: 129 [ 650/1251 ( 52%)] Loss: 1.859764 (1.8699) Time: 0.351s, 2916.84/s (0.341s, 2999.01/s) LR: 1.000e-05 Data: 0.016 (0.020)
Train: 129 [ 700/1251 ( 56%)] Loss: 1.845083 (1.8682) Time: 0.339s, 3021.80/s (0.341s, 3000.37/s) LR: 1.000e-05 Data: 0.019 (0.020)
Train: 129 [ 750/1251 ( 60%)] Loss: 1.987917 (1.8757) Time: 0.337s, 3038.19/s (0.341s, 3000.99/s) LR: 1.000e-05 Data: 0.017 (0.020)
Train: 129 [ 800/1251 ( 64%)] Loss: 1.889720 (1.8765) Time: 0.344s, 2979.40/s (0.341s, 3002.14/s) LR: 1.000e-05 Data: 0.018 (0.020)
Train: 129 [ 850/1251 ( 68%)] Loss: 1.952255 (1.8807) Time: 0.341s, 3006.07/s (0.341s, 3003.34/s) LR: 1.000e-05 Data: 0.019 (0.020)
Train: 129 [ 900/1251 ( 72%)] Loss: 1.884332 (1.8809) Time: 0.342s, 2996.99/s (0.341s, 3003.49/s) LR: 1.000e-05 Data: 0.017 (0.019)
Train: 129 [ 950/1251 ( 76%)] Loss: 1.888057 (1.8813) Time: 0.336s, 3045.26/s (0.341s, 3004.35/s) LR: 1.000e-05 Data: 0.019 (0.019)
Train: 129 [1000/1251 ( 80%)] Loss: 1.835092 (1.8791) Time: 0.342s, 2993.21/s (0.341s, 3004.89/s) LR: 1.000e-05 Data: 0.020 (0.019)
Train: 129 [1050/1251 ( 84%)] Loss: 1.847999 (1.8777) Time: 0.336s, 3047.83/s (0.341s, 3005.33/s) LR: 1.000e-05 Data: 0.016 (0.019)
Train: 129 [1100/1251 ( 88%)] Loss: 1.849290 (1.8764) Time: 0.336s, 3048.30/s (0.341s, 3005.95/s) LR: 1.000e-05 Data: 0.018 (0.019)
Train: 129 [1150/1251 ( 92%)] Loss: 1.883289 (1.8767) Time: 0.334s, 3068.20/s (0.341s, 3006.69/s) LR: 1.000e-05 Data: 0.016 (0.019)
Train: 129 [1200/1251 ( 96%)] Loss: 1.855369 (1.8759) Time: 0.335s, 3054.89/s (0.340s, 3007.43/s) LR: 1.000e-05 Data: 0.018 (0.019)
Train: 129 [1250/1251 (100%)] Loss: 1.903906 (1.8769) Time: 0.318s, 3224.92/s (0.340s, 3008.18/s) LR: 1.000e-05 Data: 0.000 (0.019)
Distributing BatchNorm running means and vars
Test: [ 0/48] Time: 1.501 (1.501) Loss: 9.5781 (9.5781) Acc@1: 0.0000 ( 0.0000) Acc@5: 0.2930 ( 0.2930)
Test: [ 48/48] Time: 0.068 (0.242) Loss: 9.5078 (9.5792) Acc@1: 0.0000 ( 0.0940) Acc@5: 0.2358 ( 0.3160)

Does this training log match what you see in your own training runs? Do you have any idea what is causing the problem in the validation step?
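For reference, the Test lines above can be compared against chance level. The sketch below assumes a 1000-class ImageNet-1k setup (an assumption, not stated in the log); `worse_than_uniform` is a hypothetical helper name, and the reported numbers are copied from the final Test line.

```python
import math

# Assumption: ImageNet-1k with 1000 classes (not stated in the log itself).
NUM_CLASSES = 1000

# A classifier that outputs a uniform distribution has cross-entropy ln(K).
chance_loss = math.log(NUM_CLASSES)
chance_top1 = 100.0 / NUM_CLASSES      # percent
chance_top5 = 100.0 * 5 / NUM_CLASSES  # percent

# Averages copied from the final "Test: [ 48/48]" line above.
reported_loss = 9.5792
reported_top1 = 0.0940
reported_top5 = 0.3160


def worse_than_uniform(loss: float, num_classes: int = NUM_CLASSES) -> bool:
    """Return True if the loss exceeds that of a uniform prediction."""
    return loss > math.log(num_classes)


print(f"chance loss: {chance_loss:.4f}, reported: {reported_loss:.4f}")
print(f"chance top-1: {chance_top1:.2f}%, reported: {reported_top1:.2f}%")
print(f"chance top-5: {chance_top5:.2f}%, reported: {reported_top5:.2f}%")
print("worse than uniform:", worse_than_uniform(reported_loss))
```

Under that assumption, a validation loss above the uniform-prediction level together with near-chance accuracy would suggest that the problem is in the evaluation path rather than in a model that simply has not converged, since the training loss is much lower.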

theFoxofSky · 2021-08-23T02:29:57Z

"The learning rate is set to 0.1 with batch size 256 and decays to 1e-5 following the cosine schedule. " This line in paper means that I set 0.1 for batch size 256, i.e.

lr = 0.1 * (batch_size / 256),

that is 0.4 for batch size 1024 (4*256).

Sorry for the unclear wording.
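The scaling rule above can be sketched as follows. This is only an illustration of the arithmetic; the function names `effective_batch_size` and `scaled_lr` are hypothetical and are not part of the repo or of the training script.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Total batch size across all GPUs for a data-parallel launch."""
    return per_gpu_batch * num_gpus


def scaled_lr(batch_size: int, base_lr: float = 0.1, base_batch: int = 256) -> float:
    """Linear scaling: lr = base_lr * (batch_size / base_batch)."""
    return base_lr * batch_size / base_batch


# Values from the command in this thread: -b 128 on 8 GPUs.
total_batch = effective_batch_size(per_gpu_batch=128, num_gpus=8)
lr = scaled_lr(total_batch)

print(f"total batch size: {total_batch}")
print(f"scaled learning rate: {lr:.3f}")
```

With `-b 128` on 8 GPUs this gives a total batch size of 1024 and a learning rate of 0.4, which matches the `--lr 0.4` in the command.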

Since I am working on the next paper based on DDF, I have updated this repo several times. I will check the validation code. You can also verify it by using the released model parameters.

melohux · 2021-08-23T02:46:19Z

So were the experimental results in your paper obtained by training with batch size 256 or 1024? And does my training log match yours in terms of the loss value? In addition, it would be great if you could check the validation code. Thanks.

theFoxofSky · 2021-08-23T03:41:19Z

> So were the experimental results in your paper obtained by training with batch size 256 or 1024? And does my training log match yours in terms of the loss value? In addition, it would be great if you could check the validation code. Thanks.

I use 1024 for R50, 512 for R101.

xiaoachen98 · 2022-01-07T03:16:54Z

> > So were the experimental results in your paper obtained by training with batch size 256 or 1024? And does my training log match yours in terms of the loss value? In addition, it would be great if you could check the validation code. Thanks.
>
> I use 1024 for R50, 512 for R101.

As for R101, what is your data augmentation schedule?
I found random erasing, auto-augment, and color jitter in your training code.
