
Something strange when training #47

Open
@liwenssss

Description


Hi, I modified train.sh as follows:

python train.py --name resnet_radvani_32000_20190415 --model resnet --netD conv-up --batch_size 4 --max_dataset_size 32000 --niter 20 --niter_decay 50 --save_result_freq 250 --save_epoch_freq 2 --ndown 6 --data_root /home/liwensh/data
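For reference, a quick back-of-envelope check of the iteration count per epoch (my own arithmetic, not something printed by the code), which matches the iteration counter in the log below wrapping around just before 8000:

```python
# With --max_dataset_size 32000 and --batch_size 4, one epoch should be
# 32000 / 4 = 8000 iterations (assuming no samples are dropped).
max_dataset_size = 32000
batch_size = 4
iters_per_epoch = max_dataset_size // batch_size
print(iters_per_epoch)  # 8000
```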

I ran it for about 16 hours (44 epochs), and the end of log.txt shows:

epoch 43 iter 6499:  l1: 11.288669 tv: 1.522304 total: 11.288669
epoch 43 iter 6749:  l1: 11.599895 tv: 0.667862 total: 11.599895
epoch 43 iter 6999:  l1: 11.125267 tv: 1.277602 total: 11.125267
epoch 43 iter 7249:  l1: 11.893361 tv: 1.366742 total: 11.893361
epoch 43 iter 7499:  l1: 11.343329 tv: 1.228081 total: 11.343329
epoch 43 iter 7749:  l1: 11.397069 tv: 1.426213 total: 11.397069
epoch 43 iter 7999:  l1: 11.519998 tv: 0.664876 total: 11.519998
epoch 44 iter 249:  l1: 11.183926 tv: 1.258252 total: 11.183926
epoch 44 iter 499:  l1: 11.555054 tv: 1.201256 total: 11.555054
epoch 44 iter 749:  l1: 12.041154 tv: 1.312884 total: 12.041154
epoch 44 iter 999:  l1: 11.605458 tv: 0.706056 total: 11.605458
epoch 44 iter 1249:  l1: 11.589639 tv: 1.093558 total: 11.589639
epoch 44 iter 1499:  l1: 11.533211 tv: 1.338729 total: 11.533211
epoch 44 iter 1749:  l1: 11.822362 tv: 1.297630 total: 11.822362
epoch 44 iter 1999:  l1: 12.410873 tv: 1.159959 total: 12.410873
epoch 44 iter 2249:  l1: 11.855060 tv: 1.531642 total: 11.855060
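One thing I noticed while reading these lines: in every printed line the "total" value equals the "l1" value exactly, so the tv term does not seem to be included in "total". A minimal sketch of the check I did (assuming only the log format shown above):

```python
import re

# One of the logged lines copied verbatim; the same pattern holds for all of them.
log_line = "epoch 44 iter 2249:  l1: 11.855060 tv: 1.531642 total: 11.855060"
l1, tv, total = (float(x) for x in re.findall(r"(?:l1|tv|total): ([\d.]+)", log_line))
print(total == l1)  # True for every line I checked
```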

The total loss has not changed much since the 5th epoch, and the intermediate output is strange (epoch 44):

(screenshot of intermediate output at epoch 44)

I wonder if it is because the batch size is too small, since I don't have enough GPU memory. Or have I set some other option incorrectly?
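In case the small batch really is the problem, one workaround I was considering is gradient accumulation, to get a larger effective batch without needing more GPU memory. This is only a generic PyTorch sketch with a toy model and random data, not this repo's training loop:

```python
import torch
from torch import nn

# Generic gradient-accumulation sketch: accumulate gradients over accum_steps
# mini-batches so the effective batch size becomes batch_size * accum_steps.
model = nn.Linear(16, 1)
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
batch_size, accum_steps = 4, 8           # effective batch = 4 * 8 = 32

optimizer.zero_grad()
for i in range(64):                      # stand-in for iterating a DataLoader
    inputs = torch.randn(batch_size, 16)
    targets = torch.randn(batch_size, 1)
    loss = criterion(model(inputs), targets) / accum_steps  # scale to keep the average
    loss.backward()                      # gradients accumulate across mini-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                 # one update per accumulated "large" batch
        optimizer.zero_grad()
```

Would something like this be a reasonable way to approximate a larger batch here, or is there a recommended batch size / option setting for this model?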
