Validation mAP not consistent #109
Hi. There are a few things to consider here. However, it seems that you have already changed that, so only one thing remains. Can you please check the log file in the training output directory and see whether the final epoch's log matches the evaluation results? I will need to check as well, but from my multiple experiments I can confirm that the mAP should be the same. Instead of the WandB logs, please check the model logs from the directory once.
Hi. Thank you for your response. By model logs, do you mean the train.log file? If so, it did not have any contents (0 KB file size). I am currently guessing that it was overwritten when I executed the resume training command.
I just executed that resume training command whenever an error stopped the training. Thus, I think that resuming this way may have overwritten the log file. That is only my guess as to why the content of train.log is empty or blank. Please share your thoughts if you think so too or have other guesses.
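For reference, if the pipeline configures train.log with Python's built-in logging module (an assumption here, not something confirmed from the code), the overwrite-on-resume behavior would come down to the file handler's mode. A minimal sketch of an append-mode setup that would preserve earlier lines across resumes:

```python
# Minimal sketch (assumption: train.log is written via Python's logging module).
# Opening the FileHandler in append mode ("a") keeps the previous run's lines
# when training is resumed; mode="w" would truncate the existing file.
import logging
import os

def setup_train_logger(out_dir: str) -> logging.Logger:
    os.makedirs(out_dir, exist_ok=True)
    logger = logging.getLogger("train")
    logger.setLevel(logging.INFO)
    # mode="a" appends on resume instead of wiping the file.
    handler = logging.FileHandler(os.path.join(out_dir, "train.log"), mode="a")
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger

if __name__ == "__main__":
    logger = setup_train_logger("outputs/training/res_1")
    logger.info("resumed training")
```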
Yes, that may be the reason it is blank.
Also, please check the image size in the opt.yaml file. I will also check one thing from my side: whether WandB at the end shows the mAP from the best epoch or the last epoch, because from multiple experiments I know that the numbers should match.
With that said, should I give a different output directory name when I resume training?
The image size in my opt.yaml file is 320x320. This is the content of the file: model: fasterrcnn_resnet50_fpn_v2
Yes, you can give a different name. It will resume training, but the resulting directory will be different. You will have separate logs for both training runs for later analysis.
Noted. May I ask if the graphs (.png files of mAP and losses) will still be consistent or continuous across those separate resulting directories?
Yes, they will be consistent.
Okay, noted. I will try training again with stop and resume, then observe and provide an update about the outcome, especially the consistency of the mAP output between WandB and the local model directory. Thank you very much again for your answers.
No issues. Let me know.
Hi. I would like to provide an update regarding my observation. I tried training the model for 4 epochs. I stopped the training during the 3rd epoch (2 if zero-indexed) and resumed up to the fourth and final epoch using the resume training command. Thus, there were two folders in outputs/training, res_1 and res_2. I noticed the following:
The results of the final epoch (4th) from WandB are the following:

wandb: Run summary:

The results of the final epoch (4th) using eval.py with the command:

{'classes': tensor([1, 2], dtype=torch.int32),

As can be seen, the results for mAP@0.5 and mAP@0.5:0.95 were still different between WandB and eval.py.

For additional information, here are the contents of the opt.yaml:

model: fasterrcnn_resnet50_fpn_v2

And here are the contents of the custom_data.yaml:

TRAIN_DIR_IMAGES: 'custom_data/train'
CLASSES: [
NC: 3
SAVE_VALID_PREDICTION_IMAGES: True

As of now, I am thinking that the problem might come from how the training pipeline resumes training from the most recent epoch, but I am not sure specifically why or where, because when I trained continuously (no stop and resume), the mAPs (0.5 and 0.5:0.95) from WandB and eval.py matched. May I know your thoughts about this and the possible fix we can do? I hope for your response. Thank you very much.
If you are getting the same mAP with continuous training and not with resume training, then I will take a look. I have not compared mAP values with resume training yet, so I guess I missed this issue. Thanks for bringing it up. For now, I can say that you can safely resume training. Just be sure not to rely on the WandB logs and run eval.py for the final numbers. Let me know if you want to keep this issue open, or I can close it and keep you posted on the progress.
Okay, noted. I would like to keep this issue open until it is resolved, and please keep me posted on this thread as well. I would like to try training (with stop and resume) and observing again as soon as the fix gets pushed and merged into the main branch. Thank you very much.
So, I stopped training and resumed. This is from WandB for the last epoch after resuming.
And this is after running evaluation using the following command.
I think they are pretty close. Whatever floating-point difference there is may be because the training/validation loop uses pycocotools while the evaluation script uses torchmetrics. Let me know your thoughts.
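To see whether the metric implementation alone explains such small differences, here is a minimal, hedged sketch of computing COCO-style mAP with torchmetrics' MeanAveragePrecision; the boxes, scores, and labels are toy values for illustration only, not taken from the pipeline:

```python
# Minimal sketch: COCO-style mAP via torchmetrics on toy data.
# In practice, preds/targets would be collected from the model's forward pass
# on the validation set, exactly as the training loop does.
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")

preds = [{
    "boxes": torch.tensor([[10.0, 20.0, 110.0, 220.0]]),
    "scores": torch.tensor([0.92]),
    "labels": torch.tensor([1]),
}]
targets = [{
    "boxes": torch.tensor([[12.0, 18.0, 108.0, 225.0]]),
    "labels": torch.tensor([1]),
}]

metric.update(preds, targets)
result = metric.compute()
print(result["map_50"], result["map"])  # mAP@0.5 and mAP@0.5:0.95
```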
Hi. Thanks for the update. May I ask for your initial training command, how you stopped the initial training, and then the resume training command you used? Maybe the probable cause lies there; just a guess for now though.
May I also take a look at the dataset YAML file you used?
Sure. I used the aquarium dataset whose YAML file already comes with the repository. I commented out the test paths.
Hi, I was experimenting with the aquarium dataset using Google Colab. I had the same result as yours when the image size was 640.

Initial training command:
Resume training command:
Evaluation command:

This is the result from WandB (from the resumed epoch (third) up to the fifth and last epoch):

wandb: Waiting for W&B process to finish... (success).

This is the result using eval.py:

{'classes': tensor([1, 2, 3, 4, 5, 6, 7], dtype=torch.int32),

Thus, I am hypothesizing that the problem might be caused by the image size. Please let me know your thoughts on this as well. Thank you.
I am laying out all the details here. Also, it is very odd that it is only happening with the 320 size and not with 640. I had not expected that and did not test for that either. Please try with 320 and square training enabled. In short, square training will resize the images to 320x320 or 640x640, depending on the value. Otherwise, aspect-ratio resizing happens depending on the image size value.
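To make the resizing distinction concrete, here is a minimal sketch (not the pipeline's actual transform code) contrasting a forced square resize with an aspect-ratio-preserving resize of the longer side:

```python
# Minimal sketch (not the pipeline's own transform code): the difference
# between "square" resizing and aspect-ratio-preserving resizing.
from PIL import Image

def square_resize(img: Image.Image, size: int) -> Image.Image:
    # Forces size x size, e.g. 320x320 or 640x640, possibly distorting the image.
    return img.resize((size, size))

def aspect_ratio_resize(img: Image.Image, size: int) -> Image.Image:
    # Scales so the longer side becomes `size`, keeping the original aspect ratio.
    w, h = img.size
    scale = size / max(w, h)
    return img.resize((int(w * scale), int(h * scale)))

if __name__ == "__main__":
    img = Image.new("RGB", (1280, 720))
    print(square_resize(img, 320).size)        # (320, 320)
    print(aspect_ratio_resize(img, 320).size)  # (320, 180)
```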
Hi, I tried adding square training.

From WandB:

wandb: Waiting for W&B process to finish... (success).

From eval.py:

{'classes': tensor([1, 2, 3, 4, 5, 6, 7], dtype=torch.int32),

Let me confirm with ...
Hi. I tried again with ...
Hello. Can you open the training log file and check the best mAP results? I have a feeling that WandB is reporting the best model results while you are evaluating the last model. I may be wrong though. Please keep me posted.
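One quick way to see which checkpoint is actually being evaluated is to inspect what the saved .pth files contain. The best_model.pth filename and the "epoch" key below are assumptions about how the pipeline saves checkpoints, so adapt them to whatever the printout shows:

```python
# Minimal sketch: inspect saved checkpoints to tell the best and last models apart.
# "best_model.pth" and the "epoch" key are assumptions; adjust to the keys that
# the printout below actually reports.
import torch

for path in ["outputs/training/res_2/last_model.pth",
             "outputs/training/res_2/best_model.pth"]:
    # weights_only=False is needed when the checkpoint stores more than raw
    # tensors (optimizer state, epoch counters, ...).
    checkpoint = torch.load(path, map_location="cpu", weights_only=False)
    print(path, "->", list(checkpoint.keys()))
    if "epoch" in checkpoint:
        print("  saved at epoch:", checkpoint["epoch"])
```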
Sure. This is for the ...
Ok. I will take a look.
Okay, thank you very much.
Hi. I did a thorough check and it seems the issue is not with pycocotools/torchmetrics. Here are the evaluation results from both:
pycocotools
Somehow the last epoch results and the evaluation script results do not match when the image size is less than 640. At this point, I will need to dig deeper into what is happening.
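For anyone who wants to reproduce the comparison, here is a small hedged sketch that simply re-runs the evaluation script at both image sizes with the same flags used in this thread; the weights and data-config paths are placeholders to adjust:

```python
# Minimal sketch: run eval.py at two image sizes and compare the printed mAPs.
# The weights and data paths are placeholders; substitute your own files.
import subprocess

for size in (320, 640):
    print(f"--- evaluating at imgsz={size} ---")
    subprocess.run([
        "python", "eval.py",
        "--weights", "outputs/training/res_2/last_model.pth",
        "--data", "data_configs/aquarium.yaml",
        "--model", "fasterrcnn_resnet50_fpn_v2",
        "--imgsz", str(size),
        "--verbose",
    ], check=True)
```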
Hi. Thanks for your update. May I ask how you used torchmetrics to get the correct mAP metrics when the image size is less than 640, so I can try it myself while you're digging deeper into what is happening?
No, I said the results do not match when the image size is less than 640. When the image size is 640 or higher, they match. I hope this is clear. Let me know if you have any more questions.
Oh, wait. Are you asking about the matching numbers between torchmetrics and pycocotools?
Apologies, let me edit my reply there. Yes, I would like to try what you did as well. Did you still use torchmetrics for that?
So, I just used the latest version of torchmetrics, which uses pycocotools as the backend by default. You can update yours as well; it is more reliable.
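If you want to be explicit about which backend torchmetrics uses, recent releases expose a backend argument on MeanAveragePrecision (an assumption worth verifying against your installed version); a minimal sketch:

```python
# Minimal sketch, assuming a recent torchmetrics release where
# MeanAveragePrecision accepts a `backend` argument. Pinning it makes the
# evaluation backend explicit instead of relying on the default.
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="bbox", backend="pycocotools")
# metric.update(preds, targets) and metric.compute() are then used as usual.
print(metric)
```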
Okay, noted. Apologies, as I am not too knowledgeable about the library/framework, but may I ask whether I should still use pycocotools or switch to torchmetrics?
You can use torchmetrics. It's pretty good now.
After finishing 20 epochs, the terminal displayed this:
wandb: Run summary:
wandb: train_loss_box_reg 0.01077
wandb: train_loss_cls 0.00603
wandb: train_loss_epoch 0.01768
wandb: train_loss_iter 0.01653
wandb: train_loss_obj 0.00029
wandb: train_loss_rpn 0.00059
wandb: val_map_05 0.87624
wandb: val_map_05_95 0.58491
As can be seen, the mAP@0.5 and mAP@0.5:0.95 are relatively high for the last model state.
To verify it, I tried evaluating the last_model.pth using the eval.py script with this command:
python eval.py --weights outputs/training/custom_training/last_model.pth --data data_configs/custom_data.yaml --model fasterrcnn_resnet50_fpn_v2 --imgsz 320 --verbose
Which gave me this result:
{'classes': tensor([1, 2], dtype=torch.int32),
'map': tensor(0.3669),
'map_50': tensor(0.6886),
'map_75': tensor(0.3405),
'map_large': tensor(0.4444),
'map_medium': tensor(0.3408),
'map_per_class': tensor([0.4621, 0.2718]),
'map_small': tensor(0.2078),
'mar_1': tensor(0.4233),
'mar_10': tensor(0.4997),
'mar_100': tensor(0.5007),
'mar_100_per_class': tensor([0.6162, 0.3851]),
'mar_large': tensor(0.5776),
'mar_medium': tensor(0.4468),
'mar_small': tensor(0.3199)}
As can be seen, the resulting mAP@0.5 and mAP@0.5:0.95 were different from the WandB values above. The mAPs shown by eval.py were lower.
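One way to compare the two sources without copying numbers by hand is to pull the run summary through the WandB public API; the entity, project, and run ID below are placeholders, and val_map_05/val_map_05_95 are the summary keys shown above:

```python
# Minimal sketch: read the WandB run summary to compare against eval.py output.
# "my-entity", "my-project" and the run ID are placeholders for your workspace.
import wandb

api = wandb.Api()
run = api.run("my-entity/my-project/run-id")
print("val_map_05:   ", run.summary.get("val_map_05"))
print("val_map_05_95:", run.summary.get("val_map_05_95"))
```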
Also, it seems that the script automatically uses the TEST_DIR_IMAGES/TEST_DIR_LABELS paths if they exist, so I changed them to the validation set paths like this:
TRAIN_DIR_IMAGES: 'custom_data/train'
TRAIN_DIR_LABELS: 'custom_data/train'
VALID_DIR_IMAGES: 'custom_data/valid'
VALID_DIR_LABELS: 'custom_data/valid'
TEST_DIR_IMAGES: 'custom_data/valid'
TEST_DIR_LABELS: 'custom_data/valid'
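As a quick sanity check on this config (a minimal sketch assuming PyYAML is available and the file sits in data_configs/), the paths can be loaded and verified to exist:

```python
# Minimal sketch: confirm the data config's test entries point at the
# validation directories and that those directories exist on disk.
import os
import yaml  # assumes PyYAML is installed

with open("data_configs/custom_data.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("VALID_DIR_IMAGES", "VALID_DIR_LABELS",
            "TEST_DIR_IMAGES", "TEST_DIR_LABELS"):
    path = cfg.get(key)
    print(f"{key}: {path} (exists: {os.path.isdir(str(path))})")
```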
I am not sure where I went wrong. Do you happen to know the probable cause?