
Confused about the train loss, size_average, and the performance #58

Open
chengcchn opened this issue Mar 4, 2020 · 6 comments

@chengcchn

Hi, @hirotomusiker.
I'm back again. As the title says, I am confused about the train loss, size_average, and the performance. I have trained both the original darknet repo and this repo on my own dataset (3 classes), and I want to share the results here.
The params are the same in both cases: MAXITER: 6000, STEPS: (4800, 5400), IMGSIZE: 608 (both for train and test).
With darknet, I got an mAP@0.5 of 79.0, and the final loss was 0.76 (avg).
With this repo, the mAP@0.5 was 76.9, and the final loss was 4.7 (total).
It seems that with this repo the loss converges more slowly, so I changed the params for this repo (MAXITER: 8000, STEPS: (6400, 7200)) and got an mAP@0.5 of 78.3; the final loss was 8.2 (total).
So I have some questions.

  1. The performance seems different; could this be caused by the shuffling of the dataset?
  2. The loss of this repo is larger and converges more slowly than darknet's. What is the reason?
  3. In #44, you talked about the size_average param and said that the loss of darknet is also high?
@hirotomusiker
Contributor

hirotomusiker commented Mar 4, 2020

  1. I cannot reproduce your training, but AP can vary randomly if your dataset is not large enough and the training has not converged. I recommend plotting the val AP and making sure it has reached a plateau.
  2. The variation of loss values between iterations is large because the number of GT objects affects the loss.
  3. The logged loss of darknet (0.76 in your case) is the batch-summed loss. If the batch size is 64, the darknet log-loss is 64x higher than ours (see the sketch below). The loss value is only for logging and does not affect the training performance.
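A minimal sketch of the logging difference (a stand-in with dummy values, not either repo's actual loss code):

import torch

batch_size = 64
per_image_loss = torch.rand(batch_size)  # dummy per-image loss values

batch_sum = per_image_loss.sum()   # darknet-style batch-summed log value
batch_avg = per_image_loss.mean()  # batch-averaged log value
print(batch_sum / batch_avg)       # == batch_size, i.e. a 64x difference in the log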

@chengcchn
Author

Hi, @hirotomusiker. Sorry for the late reply.
I did as you said and got a good result. However, I found there is no setting for reproducing a run, so I added the following seed setup before starting the training loop.

import random
import numpy as np
import torch

def setup_seed(seed):
    # Fix every RNG the training loop touches.
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable non-deterministic autotuning
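For example (hypothetical seed value), it is called once before the model and data loader are built:

setup_seed(42)  # fix all RNGs before constructing anything stateful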
But I failed to get the same result. Any suggestions?

@hirotomusiker
Contributor

Thank you, I've tried your seed setting and got the same loss results.

@chengcchn
Author

Yes, in the first several epochs (around 100~200) the loss seems the same, but it is still slightly different if you look at the decimal places, as in the screenshots below:
(two screenshots of loss logs that differ only in the last decimal places)
And as the number of iterations increases, the loss difference becomes larger and larger, leading to the difference in the mAP:
(two screenshots of loss logs that have clearly diverged)
I think this is due to randomness in the underlying implementation of PyTorch, such as the CUDA implementation of the upsample layer. Any suggestions?
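A minimal sketch of what I mean (hypothetical, not from the repo; bilinear mode is used here because PyTorch documents its CUDA backward as non-deterministic): back-propagating twice through the same upsample with identical inputs can give different gradients, since the backward kernel accumulates with atomic adds whose order varies.

import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 64, 64, device='cuda', requires_grad=True)

grads = []
for _ in range(2):
    x.grad = None
    out = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
    out.pow(2).sum().backward()
    grads.append(x.grad.clone())

# Can print False even with cudnn.deterministic = True, because this
# backward is a plain CUDA kernel, not a cuDNN one.
print(torch.equal(grads[0], grads[1]))

Later PyTorch releases added torch.use_deterministic_algorithms(True), which raises an error when an op with no deterministic implementation is hit.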

@hirotomusiker
Contributor

I have tried again and checked 40 iterations on COCO:
1st:

[Iter 0/500000] [lr 0.000000] [Losses: xy 43.622276, wh 16.042191, conf 67708.421875, cls 892.703674, total 25170.322266, imgsize 608]
[Iter 10/500000] [lr 0.000000] [Losses: xy 63.709991, wh 25.143564, conf 18768.097656, cls 1275.747925, total 7396.792969, imgsize 320]
[Iter 20/500000] [lr 0.000000] [Losses: xy 116.392715, wh 48.034309, conf 31668.382812, cls 2430.618652, total 12567.701172, imgsize 416]

2nd:

[Iter 0/500000] [lr 0.000000] [Losses: xy 43.622276, wh 16.042191, conf 67708.421875, cls 892.703674, total 25170.322266, imgsize 608]
[Iter 10/500000] [lr 0.000000] [Losses: xy 63.709991, wh 25.143564, conf 18768.097656, cls 1275.747925, total 7396.792969, imgsize 320]
[Iter 20/500000] [lr 0.000000] [Losses: xy 116.392715, wh 48.034309, conf 31668.382812, cls 2430.618652, total 12567.701172, imgsize 416]

The results are exactly the same.

  • Please set the learning rate to 0.0 and see what happens (see the sketch below).
  • Please try it again with this repo unmodified, except for the random-seed part.
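For the first check, a minimal sketch (assuming a standard torch.optim optimizer; the model here is a stand-in, not the repo's training script):

import torch

model = torch.nn.Linear(4, 2)  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

# Zero the learning rate so the weights never change; any remaining
# run-to-run loss difference must then come from the data pipeline or
# the forward pass, not from the weight updates.
for param_group in optimizer.param_groups:
    param_group['lr'] = 0.0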

@Renascence6

Hi @chengcchn, I want to know how you got the AP. I followed the author's instructions but couldn't evaluate the trained model.
