Proper way to finetune on own data #56

Open
mqadri9 opened this issue Aug 13, 2020 · 3 comments

@mqadri9

mqadri9 commented Aug 13, 2020

Hello! Thanks for the great work. I have a question about the proper way to finetune the pretrained network on a custom stereo dataset (~3000 images) with no ground truth.
I am currently using Stereo_Online_Adaptation.py to finetune the provided pretrained network (MADNet/kitti) with MODE=FULL, epochs=100, is_training=False, batch_size=1, shuffle=False, augment=False, using all 3000 images. My aim is not real-time stereo but a good disparity map for each of the 3000 images, computed in an offline post-processing step.

Are there any constraints on the disparity range that the network can detect? For example, the network seems to output a very low disparity when the object is close to the camera.

I am passing a dummy GT disparity, so should I expect any of EPE, SSIM, or bad3 to decrease?

Step:16800 bad3:0.95 EPE:66.88 SSIM:0.07 f/b time:0.077593 Missing time:5:39:12.643258
Step:16900 bad3:0.98 EPE:68.59 SSIM:0.10 f/b time:0.073424 Missing time:5:20:51.903022
Step:17000 bad3:0.95 EPE:68.69 SSIM:0.04 f/b time:0.073451 Missing time:5:20:51.445041
Step:17100 bad3:0.97 EPE:71.85 SSIM:0.05 f/b time:0.073457 Missing time:5:20:45.630999
Step:17200 bad3:0.98 EPE:88.16 SSIM:0.21 f/b time:0.073391 Missing time:5:20:21.049521
Step:17300 bad3:0.94 EPE:64.01 SSIM:0.03 f/b time:0.073562 Missing time:5:20:58.482237
Step:17400 bad3:0.98 EPE:91.38 SSIM:0.19 f/b time:0.073448 Missing time:5:20:21.286057
Step:17500 bad3:0.97 EPE:66.32 SSIM:0.04 f/b time:0.073532 Missing time:5:20:35.964231

Does this seem like the correct procedure to finetune the network? Do you have any other recommendations? Thanks

@AlessioTonioni
Member

Hi, sorry for the late reply.
Yes, the procedure seems correct. The network has no strict upper bound on disparity, but since it was fine-tuned on KITTI it is tuned for disparities mostly below 150.
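To check whether close objects seen by a custom rig fall outside that range, the standard rectified-stereo relation d = f * B / Z can be evaluated for the camera in question. The sketch below is illustrative only: the focal length and baseline are made-up values that do not come from this thread, the helper `expected_disparity` is hypothetical, and the 150 px threshold is just the rough figure quoted above.

```python
def expected_disparity(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Disparity in pixels for a rectified stereo pair: d = f * B / Z."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


# Hypothetical camera parameters (assumptions; replace with your calibration).
FOCAL_PX = 1000.0
BASELINE_M = 0.12
APPROX_MAX_DISPARITY = 150.0  # rough figure quoted in the reply, not a hard limit

for depth in (0.5, 1.0, 2.0, 5.0):
    d = expected_disparity(FOCAL_PX, BASELINE_M, depth)
    where = "outside" if d > APPROX_MAX_DISPARITY else "inside"
    print(f"depth {depth:4.1f} m -> disparity {d:6.1f} px ({where} the ~150 px range)")
```

Disparity grows as objects get closer, so a near object should produce a large value, not a small one. If the expected disparity for near objects is well above the range the weights were tuned for, low predictions there are more likely a range mismatch than a geometric effect.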
Since you are feeding a dummy GT, EPE and bad3 are meaningless. SSIM, on the other hand, is the loss you are minimizing, so it should go down.
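Single-step SSIM values are noisy with batch_size=1, so it can help to average them over a window before judging whether the loss is actually decreasing. A minimal sketch, assuming the log format shown in the first comment; the helpers `parse_ssim` and `moving_average` are illustrative names and not part of the repository.

```python
import re

# Log lines copied from the first comment in this thread.
LOG = """\
Step:16800 bad3:0.95 EPE:66.88 SSIM:0.07 f/b time:0.077593 Missing time:5:39:12.643258
Step:16900 bad3:0.98 EPE:68.59 SSIM:0.10 f/b time:0.073424 Missing time:5:20:51.903022
Step:17000 bad3:0.95 EPE:68.69 SSIM:0.04 f/b time:0.073451 Missing time:5:20:51.445041
Step:17100 bad3:0.97 EPE:71.85 SSIM:0.05 f/b time:0.073457 Missing time:5:20:45.630999
Step:17200 bad3:0.98 EPE:88.16 SSIM:0.21 f/b time:0.073391 Missing time:5:20:21.049521
Step:17300 bad3:0.94 EPE:64.01 SSIM:0.03 f/b time:0.073562 Missing time:5:20:58.482237
Step:17400 bad3:0.98 EPE:91.38 SSIM:0.19 f/b time:0.073448 Missing time:5:20:21.286057
Step:17500 bad3:0.97 EPE:66.32 SSIM:0.04 f/b time:0.073532 Missing time:5:20:35.964231
"""

LINE_RE = re.compile(r"Step:(\d+)\s.*?SSIM:([0-9.]+)")


def parse_ssim(log_text):
    """Return a list of (step, ssim) tuples found in the training log."""
    pairs = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs


def moving_average(values, window):
    """Trailing moving average; returns one value per full window."""
    return [
        sum(values[end - window:end]) / window
        for end in range(window, len(values) + 1)
    ]


pairs = parse_ssim(LOG)
ssim = [value for _, value in pairs]
for avg in moving_average(ssim, window=4):
    print(f"{avg:.3f}")
```

Eight samples are far too few to draw a conclusion; the same averaging applied to the full log gives a clearer picture of the trend.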

@mqadri9
Author

mqadri9 commented Aug 25, 2020

@AlessioTonioni Thanks for your answer! This helps. I have a few more quick questions:

The SSIM values seem to oscillate during training rather than go down. Do you think increasing the batch size above 1, or setting shuffle or augment to True, would help? I am not sure whether these affect the finetuning process. The 3000 images I am training on are a sequence of frames from a video feed. I can give this a try.

The images are also captured by a robot moving in a straight line with the stereo camera facing the left side of the robot (as opposed to the KITTI dataset, where the camera faces forward and images are captured along the direction of motion). For some stereo pairs, the left part of the left image is not captured in the right image. Would this be a problem when finetuning, since the network tries to sample the left image from the right image?

Do you recommend retraining from scratch with the unsupervised loss (since I don't have GT disparities) instead of initializing with the KITTI weights? If so, do you have a hyperparameter configuration that I can start experimenting with? Thanks!

@AlessioTonioni
Member

Yes, definitely use a bigger batch size and randomize the data with shuffle and augment. If possible, also use more data, since the 3K frames all come from a single sequence.
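The advice above amounts to presenting the frames in random order and in groups larger than one. In Stereo_Online_Adaptation.py this is presumably controlled by the batch_size, shuffle and augment settings mentioned earlier in the thread. The sketch below only illustrates the idea on a list of file paths; the file names and the helper `shuffled_batches` are hypothetical and not part of the repository.

```python
import random


def shuffled_batches(pairs, batch_size, seed=None):
    """Yield lists of (left, right) path pairs in random order.

    The last batch may be smaller than batch_size.
    """
    rng = random.Random(seed)
    order = list(pairs)  # copy so the caller's list is left untouched
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


# Hypothetical file layout for a 3000-frame sequence.
pairs = [(f"left/{i:06d}.png", f"right/{i:06d}.png") for i in range(3000)]

batches = list(shuffled_batches(pairs, batch_size=4, seed=0))
print(len(batches), "batches; first batch:", batches[0])
```

Each batch then mixes frames from different parts of the video, so consecutive updates are less correlated than when neighbouring frames are fed one at a time.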

> For some stereo pairs, the left part of the left image is not captured in the right image. Would this be a problem when finetuning since the network tries to sample the left image from the right image?

That's the common case in stereo. It is an implicit problem related to occlusions: for those pixels, no useful training gradient can be computed.
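To make this concrete: in a rectified pair, a left-image pixel at column x with disparity d corresponds to column x - d in the right image. When x - d is negative, the match lies outside the right image, so a reprojection loss has nothing valid to compare against. The NumPy sketch below computes such a validity mask; it illustrates the geometry only, `in_view_mask` is a hypothetical helper, and nothing here is taken from the MADNet code.

```python
import numpy as np


def in_view_mask(disparity):
    """Boolean mask of left-image pixels whose match falls inside the right image.

    disparity: (H, W) array of non-negative left-to-right disparities in pixels.
    """
    _, width = disparity.shape
    x = np.arange(width, dtype=np.float64)[None, :]  # column index of each pixel
    x_right = x - disparity                          # matching column in the right image
    return (x_right >= 0) & (x_right <= width - 1)


# Toy example: constant disparity of 3 px on a 2x8 image.
disp = np.full((2, 8), 3.0)
mask = in_view_mask(disp)
print(mask[0])      # validity of each column in the first row
print(mask.mean())  # fraction of pixels with a match inside the right image
```

Pixels where the mask is False are exactly the strip on the left border described in the question; they contribute no meaningful signal whichever way the loss treats them.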

> Do you recommend retraining using the unsupervised loss from scratch (since I don't have GT disparities) instead of initializing with the Kitti dataset weights?

No, but maybe try to start from the network trained on synthetic data if your scenario is quite different from KITTI (as it seems).
