training failed to converge #37

Open
omerbrandis opened this issue Mar 4, 2021 · 4 comments

@omerbrandis

Hello,

I've just tried training on my own dataset (4 classes, 24 images),
but after 90k iterations the loss has not progressed past 1.56:
iter: 90000 loss: 1.5658 (1.5819) loss_cls: 0.0000 (0.0114) loss_reg: 0.9772 (0.9775) loss_centerness: 0.5881 (0.5927) loss_mask: 0.0000 (0.0003) time: 0.3531 (0.3745) data: 0.0145 (0.0164) lr: 0.000010 max mem: 2017

Evaluate annotation type bbox
DONE (t=0.94s).
Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
(The test dataset is the same as the training dataset.)

There seems to have been very little progress since iter 5000:
iter: 5000 loss: 1.5704 (1.8018) loss_cls: 0.0025 (0.2015) loss_reg: 0.9779 (0.9775) loss_centerness: 0.5880 (0.6171) loss_mask: 0.0000 (0.0057) time: 0.3798 (0.3738) data: 0.0161 (0.0164) lr: 0.001000 max mem: 2017
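
To make the "very little progress" claim concrete, here is a minimal sketch that diffs the loss fields of the two log lines quoted above. It assumes each field has the form `name: value (running value)`, which is inferred from the log layout rather than confirmed anywhere in this thread; the helper names are made up for illustration.

```python
import re

# Loss fields copied from the two log lines in the post (timing fields trimmed).
LOG_5K = ("iter: 5000 loss: 1.5704 (1.8018) loss_cls: 0.0025 (0.2015) "
          "loss_reg: 0.9779 (0.9775) loss_centerness: 0.5880 (0.6171) "
          "loss_mask: 0.0000 (0.0057)")
LOG_90K = ("iter: 90000 loss: 1.5658 (1.5819) loss_cls: 0.0000 (0.0114) "
           "loss_reg: 0.9772 (0.9775) loss_centerness: 0.5881 (0.5927) "
           "loss_mask: 0.0000 (0.0003)")

# Assumed field layout: "name: value (running_value)".
FIELD = re.compile(r"(\w+): ([\d.]+) \(([\d.]+)\)")


def parse_losses(line):
    """Return {name: first value} for every 'name: x (y)' field in a log line."""
    return {name: float(value) for name, value, _running in FIELD.findall(line)}


def loss_deltas(early, late):
    """Change in each loss term between two log lines (late minus early)."""
    a, b = parse_losses(early), parse_losses(late)
    return {name: round(b[name] - a[name], 4) for name in a if name in b}


deltas = loss_deltas(LOG_5K, LOG_90K)
for name, delta in deltas.items():
    print(f"{name:16s} {delta:+.4f}")
```

By these numbers, loss_reg and loss_centerness make up almost all of the total loss at iter 90000 and have barely moved in 85k iterations.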

I'm using a modified sipmask_R_50_FPN_1x.yaml configuration:

# FCOS with improvements

MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  RPN_ONLY: True
  SIPMASK_ON: True
  BACKBONE:
    CONV_BODY: "R-50-FPN-RETINANET"
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
  RETINANET:
    USE_C5: False # FCOS uses P5 instead of C5x
    NUM_CLASSES : 4
  SIPMASK:
    # normalizing the regression targets with FPN strides
    NORM_REG_TARGETS: True
    # positioning centerness on the regress branch.
    # Please refer to tianzhi0549/FCOS#89 (comment)
    CENTERNESS_ON_REG: True
    # using center sampling and GIoU.
    # Please refer to https://github.com/yqyao/FCOS_PLUS
    CENTER_SAMPLING_RADIUS: 1.5
    IOU_LOSS_TYPE: "giou"
    NUM_CLASSES : 4
  ROI_KEYPOINT_HEAD:
    NUM_CLASSES : 4
  ROI_KEYPOINT_HEAD:
    NUM_CLASSES : 4
  FCOS:
    NUM_CLASSES : 4

DATASETS:
  TRAIN: ("ffr2coco",)
  TEST: ("ffr2coco",)
INPUT:
  MIN_SIZE_TRAIN: (720,)
  MAX_SIZE_TRAIN: 1280
  MIN_SIZE_TEST: 720
  MAX_SIZE_TEST: 1280
  PIXEL_MEAN : [103, 103, 103]
  PIXEL_STD: [66.0, 66.0, 66.0]
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.001
  WEIGHT_DECAY: 0.0001
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  IMS_PER_BATCH: 1
  WARMUP_METHOD: "constant"
  WARMUP_ITERS : 0
  CHECKPOINT_PERIOD : 1000
TEST:
  IMS_PER_BATCH : 1
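
For reference, the schedule implied by the SOLVER settings above can be worked out with a small sketch. The decay factor `GAMMA = 0.1` is an assumption, since the config does not set it; it is at least consistent with the logs, which show lr 0.001000 at iter 5000 and 0.000010 at iter 90000. The function names are illustrative only.

```python
from bisect import bisect_right

BASE_LR = 0.001          # SOLVER.BASE_LR from the config above
STEPS = (60000, 80000)   # SOLVER.STEPS
MAX_ITER = 90000         # SOLVER.MAX_ITER
IMS_PER_BATCH = 1        # SOLVER.IMS_PER_BATCH
NUM_IMAGES = 24          # dataset size stated in the post
GAMMA = 0.1              # ASSUMPTION: decay factor, not set in the config


def lr_at(iteration):
    """Step decay: scale the base rate by GAMMA once per milestone reached."""
    return BASE_LR * GAMMA ** bisect_right(STEPS, iteration)


def passes_over_dataset(iterations):
    """How many times the whole dataset has been seen after `iterations` steps."""
    return iterations * IMS_PER_BATCH / NUM_IMAGES


for it in (5000, 60000, 80000, 90000):
    print(it, f"{lr_at(it):.6f}")
print("passes over the 24 images:", passes_over_dataset(MAX_ITER))
```

Under these assumptions, 90k iterations with one image per batch means each of the 24 images is seen 3750 times, so the flat loss is unlikely to be a matter of too few iterations.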

Any ideas?

Omer.

@JialeCao001
Owner

@omerbrandis I am not sure what the problem is. Can you reduce the learning rate and try again?

@omerbrandis
Author

I'm now trying with SOLVER.IMS_PER_BATCH = 2.
(In the past I've seen issues from other maskrcnn_benchmark users, something about batch normalization requiring at least 2 images in order to train.)

It looks better now: the loss has broken past the 1.56 barrier. Training is still running and I'm not sure how far it will get.
I'll update when ...
:-)
Omer.

@omerbrandis
Author

omerbrandis commented Mar 4, 2021

With SOLVER.IMS_PER_BATCH = 2, training progressed nicely up to
iter: 6980 loss: 0.6838 (0.9119) loss_cls: 0.0003 (0.0445) loss_reg: 0.0465 (0.1292) loss_centerness: 0.5894 (0.5999) loss_mask: 0.0516 (0.1382) time: 0.7345 (0.8078) data: 0.0297 (0.0304) lr: 0.001000 max mem: 8542

From there, though, it did not make any significant progress.

I have a checkpoint from
iter: 10000 loss: 0.6815 (0.8467) loss_cls: 0.0002 (0.0312) loss_reg: 0.0485 (0.1055) loss_centerness: 0.5905 (0.5976) loss_mask: 0.0425 (0.1124) time: 0.7686 (0.7993) data: 0.0305 (0.0305) lr: 0.001000

Its test results are:
Evaluate annotation type bbox
DONE (t=0.38s).
Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.727
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.836
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.742
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.730
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.746
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.893
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.177
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.594
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.754
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.732
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.763
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.971
Maximum f-measures for classes:
[1.0, 0.9932432432432432, 0.6358695652173912]
Score thresholds for classes (used in demos for visualization purposes):
[0.7197750806808472, 0.5451260805130005, 0.5425870418548584]
Loading and preparing results...
DONE (t=0.02s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type segm
DONE (t=0.42s).
Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.493
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.696
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.545
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.434
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.511
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.799
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.063
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.396
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.537
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.483
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.551
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.822

That is a noticeable improvement. :-)

Is there a way to visualize inference?
(I tried demo/fcos_demo.py but was not successful.)

Thanks,
Omer

@omerbrandis
Author

I tried "fine-tuning" with a lower LR: I changed BASE_LR in the config file and restarted training.
The training logs show that the LR in use is the original one, not the current value in the config file. (I'm guessing that the optimizer object was loaded from the .pth file and not recreated ...)

Omer.
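
If the guess above is right and the saved optimizer state is what overrides the config, one possible workaround is to fine-tune from a copy of the checkpoint that holds only the model weights. The sketch below works on a plain dict so it runs without PyTorch. The key names `optimizer`, `scheduler` and `iteration` are assumptions about the checkpoint layout, and `strip_training_state` and the file names in the note below are hypothetical, not part of this repository.

```python
def strip_training_state(checkpoint, drop=("optimizer", "scheduler", "iteration")):
    """Return a shallow copy of a checkpoint dict without saved training state.

    ASSUMPTION: the checkpoint is a dict whose training state sits under the
    keys listed in `drop`; adjust them to match the actual file.
    """
    return {key: value for key, value in checkpoint.items() if key not in drop}


# Stand-in for the dict that loading a .pth checkpoint would return.
fake_checkpoint = {
    "model": {"layer.weight": [0.1, 0.2]},
    "optimizer": {"param_groups": [{"lr": 0.001}]},
    "scheduler": {"last_epoch": 10000},
    "iteration": 10000,
}

weights_only = strip_training_state(fake_checkpoint)
print(sorted(weights_only))  # only the "model" entry is left
```

With PyTorch, the real file could be read with `torch.load(path, map_location="cpu")` and the stripped dict written back with `torch.save`, for example from a hypothetical `model_0010000.pth` to `model_weights_only.pth`. Whether the training script then picks up the new BASE_LR has not been confirmed in this thread.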
