Good catch! I also could not imagine it back then.
I think it's because the scale of the features learned by contrastive methods is very different. If you are not comfortable with a large learning rate, here are two ways to avoid it: (1) add a non-parametric BN (e.g., nn.BatchNorm1d(in_channel, affine=False)) right before the linear classification layer to normalize the features; then a normal learning rate works. (2) Use the Adam optimizer.
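For what it's worth, a minimal PyTorch sketch of option (1); `feat_dim`, `num_classes`, and the learning rates shown are illustrative assumptions, not values taken from this thread:

```python
import torch
import torch.nn as nn

# Assumed dimensions for illustration only.
feat_dim, num_classes = 2048, 1000

linear_probe = nn.Sequential(
    # Non-parametric BN: affine=False means no learnable scale/shift.
    # It only standardizes the feature scale, so an ordinary learning
    # rate becomes usable for the linear layer.
    nn.BatchNorm1d(feat_dim, affine=False),
    nn.Linear(feat_dim, num_classes),
)

# A "normal" learning rate with SGD...
optimizer = torch.optim.SGD(linear_probe.parameters(), lr=0.1, momentum=0.9)
# ...or, alternatively, option (2): the Adam optimizer.
# optimizer = torch.optim.Adam(linear_probe.parameters(), lr=1e-3)
```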
Ok, thanks for the quick reply. That would indeed be another way to deal with it.
However, I would further investigate the distribution of weights and features to figure out what exactly happened. This might even improve the stability of the model. I can look into it but won't have much time in the coming weeks.
@IgorSusmelj, it would be great to see the distribution of features and weights and understand why! Perhaps it is a good research problem; it is something I also wonder about a lot.
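In case it helps start that investigation, here is a rough sketch of how the feature statistics could be collected; `backbone` and `loader` are hypothetical placeholders for a frozen encoder and an evaluation DataLoader:

```python
import torch

@torch.no_grad()
def feature_stats(backbone, loader, device="cuda"):
    """Collect per-dimension mean/std of frozen backbone features."""
    backbone.eval()
    feats = []
    for images, _ in loader:
        feats.append(backbone(images.to(device)).flatten(1).cpu())
    feats = torch.cat(feats)
    # A very small per-dimension std would be consistent with the
    # explanation above: the linear layer then needs a huge learning
    # rate to move its weights at a useful speed.
    return feats.mean(dim=0), feats.std(dim=0)
```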
I saw your note and it seems rather unusual to use such a large learning rate:
Note: When training linear classifiers on top of ResNets, it's important to use large learning rate, e.g., 30~50.
Is there something I'm missing? I can't imagine how you get stable gradient descent with such high learning rates.
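For context, the setup the note describes is roughly the following sketch; the concrete values (2048-d features, lr=30.0) are assumptions based on the quoted 30~50 range, not an exact command from the repository:

```python
import torch
import torch.nn as nn

# Only the final linear layer is trained; the backbone is frozen.
feat_dim, num_classes = 2048, 1000
classifier = nn.Linear(feat_dim, num_classes)

# The unusually large learning rate the note refers to (within 30~50).
optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0,
                            momentum=0.9, weight_decay=0.0)
```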