# scram-pytorch


This repo contains various experiments I'm doing with optimizers and learning rate schedulers.

## scram_pytorch.scram

**SCRAM (Scale and Rotation Invariant Momentum)**

This is similar to the Lion optimizer, but it normalizes each parameter's update by its root mean square (RMS) instead of taking the sign. As a result the optimizer is invariant to orthonormal transformations that rotate channels into each other.

Recommended hyperparameters for a model where AdamW is best at lr=1e-4:

| eps   | learning rate | beta1 | beta2 |
|-------|---------------|-------|-------|
| 1e-15 | 1e-6          | 0.98  | 0.99  |

For best results, gradient clipping should be disabled.
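
To make the description above concrete, here is a minimal sketch of an RMS-normalized, Lion-style update step. It is an illustration under assumptions, not the actual implementation in `scram_pytorch.scram`: the function and buffer names are made up, and details such as weight decay are omitted.

```python
import torch

@torch.no_grad()
def scram_style_step(param, grad, exp_avg, lr=1e-6, beta1=0.98, beta2=0.99, eps=1e-15):
    # Lion-style interpolation between the momentum buffer and the fresh gradient.
    update = exp_avg.mul(beta1).add(grad, alpha=1 - beta1)

    # Normalize by the root mean square instead of taking the sign (as Lion does).
    # The RMS of the whole tensor is unchanged by orthonormal rotations of its
    # channels, which is what gives the rotation invariance described above.
    rms = update.pow(2).mean().sqrt().clamp_min(eps)
    param.add_(update.div_(rms), alpha=-lr)

    # The momentum buffer is updated with the second beta, as in Lion.
    exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
```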

## scram_pytorch.simon

**SIMON (Sigma Momentum)**

An AdaBelief derivative that incorporates the momentum modifications from Lion and calculates the standard deviation in a slightly different way. In my tests it is the best optimizer I've found for many problems.

Recommended hyperparameters for a model where AdamW is best at lr=1e-4:

| eps   | learning rate | beta1 | beta2 | rmsclip | layerwise | normalize |
|-------|---------------|-------|-------|---------|-----------|-----------|
| 1e-15 | 1e-4          | 0.98  | 0.99  | False   | False     | False     |

For best results, gradient clipping should be disabled.
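
As a rough, assumption-laden sketch of the idea: an AdaBelief-style second moment (the squared deviation of the gradient from the momentum estimate) combined with Lion-style momentum handling. The real `scram_pytorch.simon` computes the standard deviation somewhat differently, as noted above, and the `rmsclip`, `layerwise`, and `normalize` options are not shown; the names below are illustrative.

```python
import torch

@torch.no_grad()
def simon_style_step(param, grad, exp_avg, exp_var, lr=1e-4, beta1=0.98, beta2=0.99, eps=1e-15):
    # Lion-style interpolation for the step direction.
    direction = exp_avg.mul(beta1).add(grad, alpha=1 - beta1)

    # AdaBelief-style "belief" term: how far the gradient deviates from the
    # current momentum estimate, accumulated as an exponential moving average.
    deviation = grad - exp_avg
    exp_var.mul_(beta2).addcmul_(deviation, deviation, value=1 - beta2)

    # Scale the step by the estimated standard deviation.
    param.addcdiv_(direction, exp_var.sqrt().add_(eps), value=-lr)

    # The momentum buffer is updated with the second beta, as in Lion.
    exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
```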

## scram_pytorch.esgd

**ESGD (Ensemble Stochastic Gradient Descent)**

A modification of stochastic gradient descent with momentum and filterwise normalization that simulates a very large ensemble of models: it maintains two copies of each weight and, at each optimization step, independently selects at random which copy of each weight to use.

ESGD seems to be particularly good at adversarial training.

Recommended hyperparameters for a model where AdamW is best at lr=1e-4:

| eps   | learning rate | beta1 | beta2 | p   | swap_ratio |
|-------|---------------|-------|-------|-----|------------|
| 1e-15 | 1e-4          | 0.99  | 0.99  | 0.5 | 0.99       |

For best results, gradient clipping should be disabled.
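
Below is a minimal sketch of the two-copy ensemble idea described above, under assumptions: how the step is routed to the two copies is a guess, and momentum details, filterwise normalization, and the `swap_ratio` mechanism of the real `scram_pytorch.esgd` are omitted.

```python
import torch

@torch.no_grad()
def esgd_style_step(weight_a, weight_b, grad, exp_avg, lr=1e-4, beta1=0.99, p=0.5):
    # Independently for each weight, choose which of the two copies is active
    # this step; over many steps this behaves like sampling from a very large
    # ensemble of models built from all combinations of the two copies.
    mask = torch.rand_like(weight_a) < p

    # Plain momentum accumulation, shown for illustration only.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)

    # Assumption for this sketch: only the copy selected for each element is
    # stepped, so the two copies drift apart and act as distinct ensemble members.
    step = exp_avg * lr
    weight_a.sub_(step * mask)
    weight_b.sub_(step * ~mask)
```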