Issue about the training process #18
"The learning rate is set to 0.1 with batch size 256 and decays to 1e-5 following the cosine schedule." This line in the paper means that the learning rate is 0.1 for a batch size of 256 and is scaled linearly with the batch size, i.e. lr = 0.1 * (batch_size / 256). That gives 0.4 for batch size 1024 (4 * 256). Sorry for the unclear wording. Since I am working on a follow-up paper based on DDF, I have updated this repo several times. I will check the validation code. You can also verify it using the released model parameters.
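The linear scaling rule described in the reply can be written as a one-line helper. This is a minimal sketch; the function name `scaled_lr` and its defaults are illustrative and not part of the DDF repo.

```python
def scaled_lr(batch_size: int, base_lr: float = 0.1, base_batch: int = 256) -> float:
    """Linear scaling rule: the learning rate grows in proportion to the global batch size.

    Hypothetical helper; base_lr=0.1 at base_batch=256 follows the sentence quoted from the paper.
    """
    return base_lr * batch_size / base_batch


if __name__ == "__main__":
    # Learning rates implied by the rule for a few global batch sizes.
    for bs in (256, 512, 1024):
        print(bs, scaled_lr(bs))
```

Under this rule, batch size 1024 corresponds to `--lr 0.4`, the value used in the training command quoted in this issue.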
So were the experimental results in your paper obtained by training with batch size 256 or 1024? And does my training log match yours in terms of the loss values? Also, it would be great if you could check the validation code. Thanks.
I used batch size 1024 for R50 and 512 for R101.
And for R101, what is your data augmentation schedule?
Thanks for your excellent work. I ran into some problems when training your model following your instructions.
Your paper states that you used batch size 256 for all experimental results, but your instructions give this command:
./distributed_train.sh 8 <path_to_imagenet> --model ddf_mul_resnet50 --lr 0.4 \
    --warmup-epochs 5 --epochs 120 --sched cosine -b 128 -j 6 --amp --dist-bn reduce
It seems that this command launches training with a total batch size of 128 * 8 = 1024. When I run this command, training seems to proceed correctly, but the validation process has some problems. Does my training log match yours? Do you have any idea what causes the problem in the testing part?
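For reference, the numbers in the command can be checked with a short sketch. It assumes that `-b` is the per-process batch size and that `distributed_train.sh 8` starts one process per GPU. Under those assumptions the global batch is 8 * 128 = 1024. The schedule function below is a simplified linear-warmup plus cosine-decay approximation of `--warmup-epochs 5 --epochs 120 --sched cosine` with a 1e-5 floor. It is not timm's exact scheduler, and the warmup starting value `warmup_init` is a hypothetical choice.

```python
import math


def global_batch_size(num_gpus: int, per_gpu_batch: int) -> int:
    # Assumption: each launched process loads `-b` samples per step.
    return num_gpus * per_gpu_batch


def lr_at_epoch(epoch: float, base_lr: float = 0.4, min_lr: float = 1e-5,
                warmup_epochs: int = 5, total_epochs: int = 120,
                warmup_init: float = 1e-4) -> float:
    """Simplified warmup + cosine schedule (illustrative, not timm's implementation)."""
    if epoch < warmup_epochs:
        # Linear warmup from the hypothetical warmup_init up to base_lr.
        return warmup_init + (base_lr - warmup_init) * epoch / warmup_epochs
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    bs = global_batch_size(8, 128)
    print("global batch:", bs, "linearly scaled lr:", 0.1 * bs / 256)
    for e in (0, 5, 60, 120):
        print(e, lr_at_epoch(e))
```

Because the global batch is 1024, `--lr 0.4` is consistent with the 0.1-per-256 rule rather than with a literal batch size of 256.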