Good catch! I also could not imagine it back then.
I think it's because the scale of the features learned by contrastive methods is very different. If you are not comfortable with a large learning rate, here are two ways to avoid it: (1) add a non-parametric BN (e.g., nn.BatchNorm1d(in_channel, affine=False)) right before the linear classification layer to normalize the features; then a normal learning rate works. (2) Use the Adam optimizer.
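For what it's worth, a minimal PyTorch sketch of option (1); `feat_dim`, `num_classes`, and the learning rates shown are illustrative assumptions, not values taken from this thread:

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration only.
feat_dim, num_classes = 2048, 1000

linear_probe = nn.Sequential(
    # Non-parametric BN: affine=False means no learnable scale/shift.
    # It only standardizes the feature scale, so an ordinary learning
    # rate becomes usable for the linear layer.
    nn.BatchNorm1d(feat_dim, affine=False),
    nn.Linear(feat_dim, num_classes),
)

# A "normal" learning rate with SGD...
optimizer = torch.optim.SGD(linear_probe.parameters(), lr=0.1, momentum=0.9)
# ...or, alternatively, option (2): the Adam optimizer.
# optimizer = torch.optim.Adam(linear_probe.parameters(), lr=1e-3)
```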
Ok, thanks for the quick reply. That would indeed be another way to deal with it.
However, I would further investigate the distribution of weights and features to figure out what exactly happened. This might even improve the stability of the model. I can look into it but won't have much time in the coming weeks.
@IgorSusmelj, it would be great to see the distribution of features and weights and understand why! Perhaps it is a good research problem; it is something I also wonder about a lot.
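In case it helps start that investigation, here is a rough sketch of how the feature statistics could be collected; `backbone` and `loader` are hypothetical placeholders for a frozen encoder and an evaluation DataLoader:

```python
import torch

@torch.no_grad()
def feature_stats(backbone, loader, device="cuda"):
    """Collect per-dimension mean/std of frozen backbone features."""
    backbone.eval()
    feats = []
    for images, _ in loader:
        feats.append(backbone(images.to(device)).flatten(1).cpu())
    feats = torch.cat(feats)
    # A very small per-dimension std would be consistent with the
    # explanation above: the linear layer then needs a huge learning
    # rate to move its weights at a useful speed.
    return feats.mean(dim=0), feats.std(dim=0)
```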
I saw your note and it seems rather unusual to use such a large learning rate:
Note: When training linear classifiers on top of ResNets, it's important to use large learning rate, e.g., 30~50.
Is there something I'm missing? I can't imagine how you get stable gradient descent with such high learning rates.
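For context, the setup the note describes is roughly the following sketch; the concrete values (2048-d features, lr=30.0) are assumptions based on the quoted 30~50 range, not an exact command from the repository:

```python
import torch
import torch.nn as nn

# Only the final linear layer is trained; the backbone is frozen.
feat_dim, num_classes = 2048, 1000
classifier = nn.Linear(feat_dim, num_classes)

# The unusually large learning rate the note refers to (within 30~50).
optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0,
                            momentum=0.9, weight_decay=0.0)
```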