
Trying to replicate sp_v6 and descriptor loss #288

Closed
martinarroyo opened this issue Jan 20, 2023 · 9 comments

@martinarroyo

[This is somewhat related to #287, but I'll open a separate issue so as not to pollute that one.]

Hi @rpautrat, thanks for this work and for providing support for the repo! I am trying to reproduce the results you report in the README on HPatches as a first step towards making some changes to the model. To save some time, I labeled the COCO dataset using the pretrained model listed in the README (MagicPoint (COCO)) and launched a training with the superpoint_coco.yaml config in its current state at HEAD. I had to make minor modifications to the codebase to get it to work in my infrastructure (mostly I/O), but there should be no changes that affect training. I noticed that the negative and positive distances reported in TensorBoard oscillate within a very small range of values (~1e-7 to 1e-5), which got me worried. This seems strange given the values reported in #277 (comment). For reference, here is how it looks in my current training (my machine restarted, so the graphs look a bit funny, apologies for that):

[TensorBoard screenshot: positive and negative distance curves]

Precision and recall are also much lower than those in the sp_v6 log, where recall goes up to 0.6; I can only get it to ~0.37.

I looked into the yaml file in the sp_v6 tarfile and noticed that the loss weights seemed to be adapted for the 'unnormalized' descriptors, so I reverted the changes introduced in 95d1cfd. This helps with the distances (the values are now ~0.03 and 0.02 for positive and negative, respectively):

[TensorBoard screenshot: positive and negative distances after reverting 95d1cfd]

But recall is still quite low (~0.38).
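
For context on why those weights need adapting: with L2-normalized descriptors the dot products are bounded in [-1, 1], whereas raw descriptors can produce arbitrarily large values, so margins and `lambda_d` tuned for one setting act on a very different scale in the other. A minimal illustration (not the repository's code, names are made up):

```python
import torch
import torch.nn.functional as F

# Minimal illustration (not the repository's code): unit-normalized descriptors
# have dot products bounded in [-1, 1], while raw descriptors do not, so loss
# margins and weights tuned for one case do not transfer to the other.
desc_a = torch.randn(16, 256)  # hypothetical raw descriptors
desc_b = torch.randn(16, 256)

raw_dot = (desc_a * desc_b).sum(dim=1)  # unbounded magnitude
unit_dot = (F.normalize(desc_a, dim=1) * F.normalize(desc_b, dim=1)).sum(dim=1)  # in [-1, 1]

print(raw_dot.abs().max().item(), unit_dot.abs().max().item())
```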

I also evaluated the model with normalization on HPatches and the results look reasonable. For comparison, I also loaded the sp_v6 checkpoint and ran the same evaluation:

|                      | Mine  | sp_v6 ckpt | Claimed |
|----------------------|-------|------------|---------|
| Viewpoint changes    | 0.613 | 0.645      | 0.674   |
| Illumination changes | 0.630 | 0.655      | 0.662   |

|     | Illumination (Mine / sp_v6 ckpt / Claimed) | Viewpoint (Mine / sp_v6 ckpt / Claimed) | All (Mine / sp_v6 ckpt / Claimed) |
|-----|--------------------------------------------|-----------------------------------------|-----------------------------------|
| e=1 | –                                          | –                                       | 0.460 / 0.477 / 0.483             |
| e=3 | 0.940 / 0.936 / 0.965                      | 0.650 / 0.654 / 0.712                   | 0.793 / 0.793 / 0.836             |
| e=5 | –                                          | –                                       | 0.894 / 0.881 / 0.91              |

The change in quality is not too bad, but I am still concerned that the numbers are always slightly below the ones reported in the README of this repository, so I would like to ask the following:

  • What version of the code was used to train the sp_v6 model? I am assuming it was trained on 2 GPUs using COCO data labeled with this checkpoint, but please correct me if I am wrong.
  • What model was used to compute the metric values claimed in the README?
  • Have you experienced the same behaviour with the positive and negative distances in the descriptor loss?

Thanks a lot in advance for your help, much appreciated!

@rpautrat
Owner

Hi, I can try to help you replicate the original results, but only up to a point. This work is indeed more than 4 years old, and I don't remember all the details of the experiments anymore. But regarding your three points:

  • The sp_v6 model was indeed trained on 2 GPUs on the COCO dataset, but with pseudo labels generated by this model.
  • I am not 100% sure, but I seem to remember that the metrics in the README were computed with the sp_v6 model. What is quite likely is that the evaluation metrics themselves changed over the last 4 years, which would explain the gap between the claimed values and the new ones you computed. If you also evaluate the original SuperPoint of Magic Leap, you might observe a similar difference.
  • I think the positive and negative distances are rather noisy and hard to interpret. I don't remember the trend during my trainings, but you can check the tensorboard log in the zip file of the pre-trained models to compare with your own values; a rough sketch of how to read those scalars programmatically is below.
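
If it helps, here is a rough, untested sketch of how to read those scalar curves from a TensorBoard event file (the file path is a placeholder, and the tag names are an assumption based on the summary calls in the repo):

```python
# Rough, untested sketch: load scalar curves from a TensorBoard event file so
# they can be compared against the curves of your own run.
from tensorboard.backend.event_processing import event_accumulator

ea = event_accumulator.EventAccumulator('sp_v6/events.out.tfevents.<...>')  # placeholder path
ea.Reload()

print(ea.Tags()['scalars'])  # list the scalar tags present in the log
positive = [(e.step, e.value) for e in ea.Scalars('positive_dist')]
negative = [(e.step, e.value) for e in ea.Scalars('negative_dist')]
```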

I hope this can be of some help to you.

@martinarroyo
Author

Hi, thanks for the quick reply. I mixed up the links in my message and indeed meant the one pointing to mp_synth-v11_ha1_trained, sorry for the confusion.

I compared the logs for the distances and the rest of the metrics. They look similar; however, I had not noticed before that the detector loss is actually much higher than for sp_v6 (green is my experiment, orange is sp_v6). Could this mean that I need to tune $\lambda$?

[TensorBoard screenshot: detector loss, my run (green) vs. sp_v6 (orange)]

I'll try to also evaluate the Magic Leap SP implementation to see if the discrepancy is similar.

I think this more or less answers my questions. If you could comment on the discrepancy in the detector loss, that would be great. I will close the issue once I run the Magic Leap model.

@rpautrat
Owner

Indeed, you may want to tune $\lambda$ to better balance the descriptor and detector losses. The latter seems a bit too high in comparison.
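
For reference, my recollection is that the total loss here is roughly of the form

$$\mathcal{L} = \mathcal{L}_{\text{det}} + \mathcal{L}_{\text{det}}^{\text{warped}} + \lambda \, \mathcal{L}_{\text{desc}},$$

where $\lambda$ weights the descriptor term, so adjusting it changes the relative contribution of the two losses (the exact name of this weight in the config may differ).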

@ericzzj1989

Since this issue is somewhat related to my issue #287 (comment), would you mind, @martinarroyo, explaining what changes you made in the code to get your results?

@martinarroyo
Author

Apologies for the belated response. My changes were minimal: I only modified the I/O logic so that it would work in my infrastructure, and fixed some imports that were not working in my setup. The training logic was unaltered.

@shreyasr-upenn

Hi @rpautrat, I have been getting negative distances of zero at every step. Is this normal? What might have gone wrong? I used the same hyperparameter values as you.

@rpautrat
Owner

rpautrat commented Mar 5, 2024

Hi, having a zero negative loss is not impossible, but surprising. If you look at its definition here:

```python
negative_dist = tf.maximum(0., dot_product_desc - config['negative_margin'])
```

it is obtained as a hinge loss max(0, desc_distance - m). So if desc_distance is lower than the margin m (set to 0.2 by default), the negative loss becomes 0, which means that the model was perfectly able to distinguish different descriptors.
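
If it helps for a PyTorch port, here is a minimal sketch of the two hinge terms (names are illustrative and the positive-margin default is an assumption; double-check against the config):

```python
import torch

# Minimal sketch of the hinge terms discussed above (illustrative names; the
# positive margin default here is an assumption, the negative margin follows
# the 0.2 default mentioned above).
def hinge_distances(dot_product_desc, positive_margin=1.0, negative_margin=0.2):
    # Positive term: penalizes corresponding descriptors whose similarity is
    # still below the positive margin.
    positive_dist = torch.clamp(positive_margin - dot_product_desc, min=0.0)
    # Negative term: max(0, desc_distance - m); it is zero once the similarity
    # of non-corresponding descriptors falls below the margin.
    negative_dist = torch.clamp(dot_product_desc - negative_margin, min=0.0)
    return positive_dist, negative_dist
```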

However, getting 0 at every step seems a bit fishy and too good to be true. I would expect it to be positive for at least a few samples. Maybe you can try to print a few values at the line linked above to understand what is happening. Checking the positive loss would also be interesting.

@shreyasr-upenn

shreyasr-upenn commented Mar 6, 2024

[TensorBoard screenshots]
I have done the equivalent of this code block in PyTorch:

```python
# (truncated preview of the linked TF summary line)
tf.summary.scalar('positive_dist', tf.reduce_sum(valid_mask * config['lambda_d'] * ...
```

```python
positive_sum = torch.sum(valid_mask * lambda_d * s * positive_dist) / valid_mask_norm
negative_sum = torch.sum(valid_mask * (1 - s) * negative_dist) / valid_mask_norm
```

@rpautrat
Owner

rpautrat commented Mar 6, 2024

I would suggest printing a few values to debug your code and understand why the negative loss becomes zero in your case. This sounds too good to be true.
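
Something along these lines might help (untested sketch; variable names follow the PyTorch snippet above):

```python
# Untested debugging sketch: check where dot_product_desc sits relative to the
# negative margin and how often the negative hinge term is actually non-zero.
with torch.no_grad():
    active = (negative_dist > 0).float().mean().item()
    print(f"dot_product_desc: min={dot_product_desc.min():.4f}, "
          f"mean={dot_product_desc.mean():.4f}, max={dot_product_desc.max():.4f}")
    print(f"fraction of non-zero negative terms: {active:.4f}")
```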
