LEARNING RATE SCHEDULER #238
Hello glenn-jocher
@announce1 this equation corresponds to the orange and green curves above. It's an inverse exponential that decays the LR to zero by the value
Hi glenn-jocher
@announce1 the function is a simple exponential; it's very common in statistics.
Hi glenn-jocher, During the first epoch itself, I got I've trained yolov2 on a similar dataset with So, If I use Thanks
LR burn-in during the first 1000 batches: Lines 221 to 226 in 1a9aa30
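For context, batch-level burn-in is just a scaling of the base LR over the first N batches. A minimal sketch of the idea (the quartic ramp and the 1000-batch length here are illustrative assumptions, not necessarily the exact code in the snippet above):

```python
def burnin_lr(base_lr, batch_idx, n_burn=1000, power=4):
    """Ramp the LR from ~0 up to base_lr over the first n_burn batches."""
    if batch_idx < n_burn:
        return base_lr * (batch_idx / n_burn) ** power
    return base_lr
```

The steep polynomial keeps the LR tiny for most of the ramp, which damps the large, noisy gradients of a randomly initialized network.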
@glenn-jocher
@www12345678 Hello, thank you for your interest in our work! Please note that most technical problems are due to:

```bash
sudo rm -rf yolov3  # remove existing repo
git clone https://github.com/ultralytics/yolov3 && cd yolov3  # git clone latest
python3 detect.py  # verify detection
python3 train.py  # verify training (a few batches only)
# CODE TO REPRODUCE YOUR ISSUE HERE
```
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
Lines 254 to 263 in 74b5750
Why is "burn in" disabled now?
@mozpp burn-in is no longer needed; GIoU stabilizes the previously unbounded wh loss.
Could you explain how "prebias" works? I think it is similar to "burn in".
@mozpp no, prebias aggressively optimizes the neuron biases of the Conv2d() layers preceding each YOLO layer. There are only 765 of these in yolov3-spp; the rest of the network is frozen and unaffected. There is no relation to burn-in, which is a reduced LR during the initial batches.
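The prebias idea described above can be sketched roughly as follows; `set_prebias` and the name-matching logic are hypothetical illustrations, not repo code:

```python
import torch.nn as nn

def set_prebias(model, yolo_conv_names, prebias=True):
    """Freeze everything except the biases of the Conv2d layers feeding
    the YOLO layers (yolo_conv_names is an assumed list of substrings
    identifying those layers by parameter name)."""
    for name, p in model.named_parameters():
        is_yolo_bias = name.endswith('bias') and any(n in name for n in yolo_conv_names)
        p.requires_grad = is_yolo_bias if prebias else True
```

After the prebias epochs, calling it with `prebias=False` would unfreeze the full network again.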
@glenn-jocher Sorry I'm still confused about what prebias does. |
@glenn-jocher Thanks! It seems simply reducing the batch size fixes it.
@yujianll reduce
@yujianll
@developer0hye Thanks! Is there an argument that I can set to use group normalization? |
@yujianll @glenn-jocher
This would be a great addition! |
@FranciscoReveriano
@developer0hye @FranciscoReveriano my understanding of groupnorm is that yes it can approach the results of batchnorm with smaller batch sizes, but not exceed them. I don't have any plans for it at the moment, but if you implement and compare results on coco_64img.data it might be a worthwhile PR. |
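If someone wants to run that comparison, one way to swap BatchNorm for GroupNorm is a recursive module replacement; this helper is a hypothetical sketch, not repo code (32 groups is the GroupNorm paper's default):

```python
import torch
import torch.nn as nn

def bn_to_gn(module, num_groups=32):
    """Recursively replace BatchNorm2d layers with GroupNorm; falls back
    to a single group when the channel count isn't divisible by num_groups."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            g = num_groups if child.num_features % num_groups == 0 else 1
            setattr(module, name, nn.GroupNorm(g, child.num_features))
        else:
            bn_to_gn(child, num_groups)
    return module
```

Unlike BatchNorm, GroupNorm's statistics are computed per sample, so its behavior is independent of batch size.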
@FranciscoReveriano yes, the new cosine scheduler does appear to be superior. It helps in the middle of training, and it also helps the final mAP a bit. I'm going to try it on a full training run of full COCO this week, and if it helps there too I'll make it the default scheduler. :)
Cosine LR scheduler is the new default. See #238 (comment) Lines 144 to 145 in 84371f6
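A cosine schedule of this kind can be sketched with PyTorch's `LambdaLR`; the constants (`epochs`, `lrf`, the initial LR) below are illustrative values, not the repo's actual settings:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

epochs, lrf = 100, 0.01        # illustrative: final LR = initial LR * lrf
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# cosine multiplier: 1.0 at epoch 0, decaying smoothly to lrf at the final epoch
lf = lambda e: ((1 + math.cos(e * math.pi / epochs)) / 2) * (1 - lrf) + lrf
scheduler = LambdaLR(optimizer, lr_lambda=lf)

for epoch in range(epochs):
    # train_one_epoch(...)
    scheduler.step()
```

The half-cosine shape keeps the LR high early, decays fastest mid-training, and flattens out near the end.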
@glenn-jocher I saw that you also made a declining pre-bias.
@FranciscoReveriano yes, I updated it to vary smoothly from the initial prebias conditions (lr and momentum) at epoch 0, to the normal conditions over 3 epochs. I think this might help a bit. |
@glenn-jocher I have seen that when I do a prebias of a higher magnitude than the initial one, it helps me with the gradients.
@glenn-jocher I have a question: PyTorch already has its own cosine scheduler, so why did you re-implement it? Learning rate warmup with cosine scheduler: this is my own implementation of learning rate warmup on top of the official PyTorch cosine LR scheduler! Have you ever tried applying learning rate warmup for stable training in the early epochs?
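For illustration, a warmup phase can be combined with the official `CosineAnnealingLR` roughly like this; the warmup length and LRs are assumed values, and the manual warmup loop is just one of several ways to wire it up:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

lr0, warmup_epochs, total_epochs = 0.01, 3, 100   # illustrative values
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=lr0)
cosine = CosineAnnealingLR(opt, T_max=total_epochs - warmup_epochs)

for epoch in range(total_epochs):
    if epoch < warmup_epochs:
        # linear warmup before handing control to the cosine schedule
        for g in opt.param_groups:
            g['lr'] = lr0 * (epoch + 1) / warmup_epochs
    else:
        cosine.step()
    # train_one_epoch(...)
```

Warmup ramps the LR up from near zero, which is roughly the opposite of the prebias trick discussed in this thread, so the two would need to be reconciled if used together.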
@developer0hye yes, this is true; perhaps we should use a warmup period, or 'burn-in' as some people call it. I'm a bit confused about how to handle the initial iterations, because right now we have a prebias period where we actually use much higher LRs for the model biases (the weight LRs stay the same) for the first few epochs, which is somewhat the opposite of a warmup: Lines 216 to 229 in 65eeb1b
The main problem is that particularly for the yolo output biases, the obj and classification biases should be extremely negative to reflect a very low chance of being predicted. For example, for the classification on COCO, the mean output bias should be Lines 88 to 92 in 65eeb1b
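For intuition on why that bias should be very negative: if an output should fire with prior probability p, the pre-sigmoid bias should be about log(p / (1 - p)). A small sketch (the 1% prior is illustrative, not a value from the repo):

```python
import math

def init_output_bias(prior_p):
    """Pre-sigmoid bias b such that sigmoid(b) equals the prior probability."""
    return math.log(prior_p / (1 - prior_p))

# e.g. a class expected to fire in ~1% of predictions
b = init_output_bias(0.01)   # roughly -4.6
```

Initializing the obj/cls biases this way means the network starts out predicting "background almost everywhere," which matches the true label distribution in detection.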
I am finding that holding a larger, stable learning rate for a longer period avoids NaN losses later down the road.
@glenn-jocher I think using the hard negative mining method with focal loss could be a solution for that.
@developer0hye Do you have a paper or article discussing it?
@developer0hye I've seen focal loss accelerate training and produce higher mAPs sooner, but I found that it also accelerates *over*training and results in a lower final mAP. I could never get it to help on COCO. I think towards the end of training the main challenge is suppressing overtraining (keeping the validation losses from increasing). If we could manage that, then longer training would result in better mAPs.
@FranciscoReveriano And... the method (the hard negative mining with focal loss) I mentioned is just my opinion. @glenn-jocher |
@developer0hye Thanks. I am going to look more into it. I tried focal loss on my dataset but didn't get very good results. Maybe this will help with focal loss.
Thanks. This repository showed focal loss improving the performance of YOLOv3.
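As background, focal loss (Lin et al.) down-weights easy examples on top of BCE. A minimal sketch of the standard formulation (`gamma=1.5` and `alpha=0.25` are common defaults assumed here, not values taken from this repo):

```python
import torch
import torch.nn as nn

def focal_bce(logits, targets, gamma=1.5, alpha=0.25):
    """BCE-with-logits down-weighted by (1 - p_t)^gamma, the focal loss
    modulating factor from Lin et al."""
    bce = nn.BCEWithLogitsLoss(reduction='none')(logits, targets)
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)         # prob of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

With `gamma=0` the modulating factor vanishes and the loss reduces to alpha-weighted BCE, which is a handy sanity check.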
@developer0hye label smoothing seems like an easy update, and it seems to help in the repo you pointed to. For the positive labels, I'm assigning the GIoU value to them rather than 1.0, as I thought this would help sort the boxes more by IOU for NMS rather than by confidence alone. At the lower end though, I assign all negative examples a target of 0.0. We could modify this to 0.1 as in the label smoothing example to see the effect. |
@developer0hye wait, I got myself confused. Can we apply label smoothing to nn.BCEWithLogitsLoss() as well as nn.CrossEntropyLoss()? In this repo we now use only nn.BCEWithLogitsLoss() for both obj and cls.
@glenn-jocher yeah, I think so. We can apply label smoothing to nn.BCEWithLogitsLoss(). |
Just a note, per WongKinYiu/CrossStagePartialNetworks#6 (comment) label smoothing should only be applied to class loss, not obj loss. From https://arxiv.org/pdf/1902.04103.pdf we have this, but they neglect to state the value of epsilon unfortunately. |
I found two different implementations of label smoothing in the TensorFlow code:
where I assume
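For reference, applying label smoothing with nn.BCEWithLogitsLoss() just means moving the 0/1 targets toward each other before computing the loss. A sketch of the symmetric form (eps=0.1 is only the commonly cited guess, since the paper doesn't state epsilon):

```python
import torch
import torch.nn as nn

def smooth_bce_targets(eps=0.1):
    """Positive/negative target values for BCE label smoothing (symmetric form)."""
    return 1.0 - 0.5 * eps, 0.5 * eps

cp, cn = smooth_bce_targets(0.1)           # 0.95 and 0.05
targets = torch.tensor([1.0, 0.0, 1.0])
smoothed = targets * cp + (1 - targets) * cn
loss = nn.BCEWithLogitsLoss()(torch.zeros(3), smoothed)
```

Per the CrossStagePartialNetworks discussion above, this would be applied to the cls targets only, not obj.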
@glenn-jocher
@developer0hye yes, a label smoothing value of 0.1 seems to be commonly used across various repositories that apply the label smoothing method. This can be a good starting point for experimentation, and further adjustments can be made based on specific dataset characteristics and performance evaluation. |
The original darknet learning rate (LR) scheduler parameters are set in a model's *.cfg file:

- `learning_rate`: initial LR
- `burn_in`: number of batches to ramp the LR from 0 to `learning_rate` in epoch 0
- `max_batches`: the number of batches to train the model to
- `policy`: type of LR scheduler
- `steps`: batch numbers at which the LR is reduced
- `scales`: LR multiple applied at `steps` (`gamma` in PyTorch)

In this repo LR scheduling is set in `train.py`. We set the initial and final LRs as hyperparameters `hyp['lr0']` and `hyp['lrf']`, where the final LR `= lr0 * (10 ** lrf)`. For example, if the initial LR is 0.001 and the final LR is 100 times (1e-2) smaller, then `hyp['lr0']=0.001` and `hyp['lrf']=-2`. This plot shows two of the available PyTorch LR schedulers, with the MultiStepLR scheduler following the original darknet implementation (at `batch_size=64` on COCO). To learn more please visit https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
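For example, the darknet `steps`/`scales` policy maps directly onto PyTorch's MultiStepLR; the milestone epochs and LR below are illustrative placeholders, not the actual COCO settings:

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

lr0 = 0.001
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=lr0)

# darknet `steps` -> milestones, `scales` of 0.1,0.1 -> gamma applied at each
scheduler = MultiStepLR(opt, milestones=[218, 245], gamma=0.1)

for epoch in range(273):
    # train_one_epoch(...)
    scheduler.step()
```

After both milestones pass, the LR has been multiplied by gamma twice, i.e. reduced 100x from `lr0`.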
The LR hyperparameters are tunable, along with all the rest of the model hyperparameters in `train.py`:

yolov3/train.py
Lines 13 to 25 in 1771ffb

Actual LR scheduling is set further down in `train.py`, and has been tuned for COCO training. You may want to set your own scheduler according to your specific custom dataset and training requirements, and adjust its hyperparameters accordingly.

yolov3/train.py
Lines 102 to 109 in bd2378f