Fixes gradient application in SGD/SGDW
Pre-release
It helps if you use the gradients. 🤦
The PyTorch SGD implementation is a little confusing, in that it uses the gradient variable `d_p` to represent three different quantities depending on the combination of options, and some of those values aren't the gradient. In seeking to clarify the variable names, I inadvertently dropped an important behavior: applying the gradient when the momentum is zero.
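A minimal sketch of the bug class (not the actual library source; the function name and structure here are illustrative): with a nonzero momentum the update comes from the momentum buffer, but with zero momentum the update must fall back to the raw gradient, and dropping that branch leaves the parameters unchanged.

```python
import torch

def sgd_step(params, lr, momentum=0.0, momentum_buffers=None):
    """Illustrative SGD step, not the PyTorch/SGDW implementation."""
    if momentum_buffers is None:
        momentum_buffers = {}
    for i, p in enumerate(params):
        if p.grad is None:
            continue
        grad = p.grad
        if momentum != 0:
            buf = momentum_buffers.get(i)
            if buf is None:
                # Initialize the momentum buffer from the first gradient.
                buf = torch.clone(grad).detach()
                momentum_buffers[i] = buf
            else:
                buf.mul_(momentum).add_(grad)
            update = buf
        else:
            # The behavior that was accidentally dropped: with zero
            # momentum, the update is simply the gradient itself.
            update = grad
        p.data.add_(update, alpha=-lr)
    return momentum_buffers
```

With the zero-momentum branch missing, a step with `momentum=0.0` would be a no-op; with it restored, `p` moves by `-lr * grad` as expected.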