is NLL correct? #5

Open
MokkeMeguru opened this issue Mar 17, 2020 · 6 comments

@MokkeMeguru

Hello, I wonder whether your NLL is correct.

In,
https://github.com/VLL-HD/conditional_invertible_neural_networks/blob/master/mnist_minimal_example/eval.py#L44-L45

I think your z is the latent variable and jac is the log-determinant of the Jacobian (log_det_jacobian) of the normalizing flow, so the code computes

    nll = -(log_prob(z) + log_det_jacobian) / pixels

But I think you forgot the cost of the transformation from a discrete image to a continuous image (dequantization).

In RealNVP, this cost is computed image by image:
https://github.com/tensorflow/models/blob/master/research/real_nvp/real_nvp_multiscale_dataset.py#L1063-L1077 (the same cost term can be found in Glow and Flow++, too)
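
For illustration, the missing piece usually looks something like the following (an untested PyTorch sketch, not code from this repo; dequantize and nll_per_dim are made-up names):

```python
import math
import torch

def dequantize(x_int, n_bits=8):
    """Map integer pixel values {0, ..., 2**n_bits - 1} to continuous values in [0, 1)."""
    n_bins = 2 ** n_bits
    x = x_int.float()
    x = x + torch.rand_like(x)   # add uniform noise u ~ U(0, 1) per pixel
    return x / n_bins

def nll_per_dim(log_prob_z, log_det_jacobian, pixels, n_bits=8):
    """NLL per dimension, including the -D * log(n_bins) correction
    for the discrete-to-continuous transformation above."""
    n_bins = 2 ** n_bits
    log_likelihood = log_prob_z + log_det_jacobian - pixels * math.log(n_bins)
    return -log_likelihood / pixels
```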

@psteinb

psteinb commented Jun 17, 2020

Maybe related to this: the term penalizing the magnitude of the trainable parameters theta seems to be missing.

The preprint reads:
[screenshot of equation (6) from the preprint: the loss includes an additional weight-regularization term tau * ||theta||^2]
but AFAIK the minimal examples do not list this last term:
https://github.com/VLL-HD/conditional_invertible_neural_networks/blob/45dc7250ebfecf10d1a278edafde0fe899f30aa1/colorization_minimal_example/train.py#L25

I may be overlooking something, or tau*norm(theta)**2 is always constant, which is hard to believe at first sight.
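
In symbols, the complaint is that the total loss should contain this last term, i.e. roughly (a sketch reconstructed from this thread, not a verbatim quote of equation (6)):

```latex
\mathcal{L}(\theta)
  = \underbrace{\mathbb{E}_x\!\left[-\log p_Z\!\big(f_\theta(x)\big)
      - \log\bigl|\det J_{f_\theta}(x)\bigr|\right]}_{\text{negative log-likelihood}}
  \;+\; \tau\,\lVert\theta\rVert_2^2
```

whereas the linked training scripts only seem to implement the first part.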

@MokkeMeguru

MokkeMeguru commented Jun 17, 2020

@psteinb
theta means the model's parameters, so this term is an L2 regularization over all of the trainable parameters, e.g. those of the invertible 1x1 convolution networks. (L2 regularization is often applied implicitly, because TensorFlow and PyTorch can add it through layer or optimizer arguments, for example a Keras kernel_regularizer. I hate this implicit loss pollution.)

Tip: L2 regularization is good for the invertible 1x1 convolution, as discussed in OpenAI's Glow repository: openai/glow#40 (comment)
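
(For illustration, the kind of per-layer argument meant here on the TensorFlow/Keras side looks roughly like this; the layer size and coefficient are arbitrary placeholders:)

```python
import tensorflow as tf

# the regularizer silently adds 1e-4 * sum(w ** 2) for this layer's kernel
# to the model loss, without the term appearing in the training script
layer = tf.keras.layers.Dense(
    128,
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),
)
```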

@psteinb

psteinb commented Jun 18, 2020

Thanks for the hint. I must confess that I think I understand the math, but maybe I am just too new to PyTorch and FrEIA.

Can you or @ardizzone et al. give me a direct hint/pointer to where in this repo or its dependencies this L2 weight regularisation is performed? I'd appreciate that.

@MokkeMeguru

MokkeMeguru commented Jun 18, 2020

@ardizzone
No, there is no explicit term in the code; you have to look at how the neural networks and their optimization are set up.

You should look at the hyperparameter weight_decay of the Adam optimizer:
https://github.com/VLL-HD/conditional_invertible_neural_networks/blob/master/colorization_cINN/model.py#L249

See the explanation of weight_decay in the PyTorch optimizer docs:
https://pytorch.org/docs/stable/optim.html

(But we know L2 regularization is good for kernel/weight parameters and not good for biases...
https://stats.stackexchange.com/a/167608)
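
(A rough, untested sketch of how the decay could be kept for the weights but dropped for the biases, using PyTorch parameter groups; make_adam and all values here are placeholders, not code from this repo:)

```python
import torch

def make_adam(model, lr=1e-3, weight_decay=1e-5):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # heuristic: biases and other 1-d parameters are excluded from weight decay
        (no_decay if p.ndim == 1 or name.endswith(".bias") else decay).append(p)
    return torch.optim.Adam(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )
```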

@psteinb

psteinb commented Jun 18, 2020

Alright, triggered by your reply I looked into this a bit. As mentioned, please excuse my novice expertise with PyTorch and feel free to correct me at any point.

So if I understand correctly, the L2 regularisation term mentioned in equation (6) of the preprint to this repo is assumed to be baked into lines like https://github.com/VLL-HD/conditional_invertible_neural_networks/blob/45dc7250ebfecf10d1a278edafde0fe899f30aa1/colorization_cINN/model.py#L249

But I am starting to believe that this assumption does not hold and that AdamW should rather be used. So here is the story:

  • first, I found this excellent Stack Overflow post, which to me supports the notion that many people believe weight decay in Adam imposes L2 regularisation
  • however, as demonstrated by this preprint, this appears not to be true (later accepted to ICLR 2019; see also this nice post for a more approachable and shorter discussion of this issue)
  • to me, the PyTorch documentation is not really explicit about what is going on, and quite frankly I currently lack the insight into the framework to get a deeper impression quickly; however, the AdamW optimizer claims to implement Adam with the changes proposed in the preprint above, i.e. it performs the L2 regularisation / weight decay as expected
  • reading through this PyTorch PR supports the discussion above; it turns out that AdamW was merged last year, but the implementation is still under discussion as it apparently diverges from the preprint above, see Implement AdamW optimizer pytorch/pytorch#21250 (comment)

So this supports my notion expressed above that the code does not do what the paper promises, i.e. perform L2 regularisation on the weights in the loss term. However, the only thing I can suggest in order to mitigate this is to add the L2 regularisation explicitly to the loss term, e.g. here: https://github.com/VLL-HD/conditional_invertible_neural_networks/blob/45dc7250ebfecf10d1a278edafde0fe899f30aa1/colorization_cINN/train.py#L67

# untested code! assuming model.params_trainable is the list of trainable parameter tensors
l = torch.mean(neg_log_likeli) / tot_output_size + tau * sum(p.pow(2).sum() for p in model.params_trainable)

Or use plain SGD instead.
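
(For completeness, the AdamW route would look roughly like this; an untested sketch, the lr/weight_decay values are placeholders:)

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),      # or the repo's model.params_trainable
    lr=1e-3,
    weight_decay=1e-5,       # decoupled weight decay, applied directly to the parameters
)
```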

@MokkeMeguru

Yeah, you are correct. Adam is a black box for me, too.

Also, we would have to read all of their release notes, but I don't recommend that. We should either implement a custom Adam, use SGD, or use AdamW.

Or... in OpenAI's Glow and many other flow-based model implementations, L2 regularization is not applied to all the training parameters anyway.
