training failed to converge #37

omerbrandis · 2021-03-04T06:52:33Z

hello,

I've just tried training on my own dataset (4 classes, 24 images),
but after 90k iterations the loss has not progressed past 1.56:
iter: 90000 loss: 1.5658 (1.5819) loss_cls: 0.0000 (0.0114) loss_reg: 0.9772 (0.9775) loss_centerness: 0.5881 (0.5927) loss_mask: 0.0000 (0.0003) time: 0.3531 (0.3745) data: 0.0145 (0.0164) lr: 0.000010 max mem: 2017

There seems to have been very little progress since iteration 5000:
iter: 5000 loss: 1.5704 (1.8018) loss_cls: 0.0025 (0.2015) loss_reg: 0.9779 (0.9775) loss_centerness: 0.5880 (0.6171) loss_mask: 0.0000 (0.0057) time: 0.3798 (0.3738) data: 0.0161 (0.0164) lr: 0.001000 max mem: 2017
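
To see which loss terms are stuck, the two log lines above can be compared field by field. The following is only a minimal sketch: `parse_log_line` is a hypothetical helper (not part of the training code), and it assumes every metric is printed as `name: value (value)` as in the lines quoted here.

```python
import re

# Matches fields of the form "name: 1.2345 (1.2345)"; "iter" and "lr" carry no
# parenthesised second value in the quoted logs, so they are skipped.
FIELD = re.compile(r"(\w+): ([\d.]+) \(([\d.]+)\)")

def parse_log_line(line):
    """Return {name: (first_value, value_in_parentheses)} for each field."""
    return {name: (float(a), float(b)) for name, a, b in FIELD.findall(line)}

early = parse_log_line(
    "iter: 5000 loss: 1.5704 (1.8018) loss_cls: 0.0025 (0.2015) "
    "loss_reg: 0.9779 (0.9775) loss_centerness: 0.5880 (0.6171) "
    "loss_mask: 0.0000 (0.0057) time: 0.3798 (0.3738) lr: 0.001000"
)
late = parse_log_line(
    "iter: 90000 loss: 1.5658 (1.5819) loss_cls: 0.0000 (0.0114) "
    "loss_reg: 0.9772 (0.9775) loss_centerness: 0.5881 (0.5927) "
    "loss_mask: 0.0000 (0.0003) time: 0.3531 (0.3745) lr: 0.000010"
)

# Print how each loss term moved between iteration 5000 and 90000.
for name in ("loss_cls", "loss_reg", "loss_centerness", "loss_mask"):
    print(f"{name}: {early[name][0]:.4f} -> {late[name][0]:.4f}")
```

On these two lines, loss_reg and loss_centerness are essentially unchanged, while loss_cls and loss_mask are already close to zero.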

I'm using a modified sipmask_R_50_FPN_1x.yaml configuration:

# FCOS with improvements

MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  RPN_ONLY: True
  SIPMASK_ON: True
  BACKBONE:
    CONV_BODY: "R-50-FPN-RETINANET"
  RESNETS:
    BACKBONE_OUT_CHANNELS: 256
  RETINANET:
    USE_C5: False # FCOS uses P5 instead of C5x
    NUM_CLASSES: 4
  SIPMASK:
    # normalizing the regression targets with FPN strides
    NORM_REG_TARGETS: True
    # positioning centerness on the regress branch.
    # Please refer to tianzhi0549/FCOS#89 (comment)
    CENTERNESS_ON_REG: True
    # using center sampling and GIoU.
    # Please refer to https://github.com/yqyao/FCOS_PLUS
    CENTER_SAMPLING_RADIUS: 1.5
    IOU_LOSS_TYPE: "giou"
    NUM_CLASSES: 4
  ROI_KEYPOINT_HEAD:
    NUM_CLASSES: 4
  ROI_KEYPOINT_HEAD:
    NUM_CLASSES: 4
  FCOS:
    NUM_CLASSES: 4

DATASETS:
  TRAIN: ("ffr2coco",)
  TEST: ("ffr2coco",)
INPUT:
  MIN_SIZE_TRAIN: (720,)
  MAX_SIZE_TRAIN: 1280
  MIN_SIZE_TEST: 720
  MAX_SIZE_TEST: 1280
  PIXEL_MEAN: [103, 103, 103]
  PIXEL_STD: [66.0, 66.0, 66.0]
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.001
  WEIGHT_DECAY: 0.0001
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  IMS_PER_BATCH: 1
  WARMUP_METHOD: "constant"
  WARMUP_ITERS: 0
  CHECKPOINT_PERIOD: 1000
TEST:
  IMS_PER_BATCH: 1

Any ideas?

Omer.
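
For reference, the `lr` values in the two log lines are consistent with a plain multi-step schedule built from `BASE_LR` and `STEPS` in the config above. The sketch below is an illustration, not the trainer's own scheduler: `step_lr` is a hypothetical helper, and the decay factor `gamma=0.1` is an assumption (it is not set in the posted config, but it reproduces the logged values).

```python
from bisect import bisect_right

def step_lr(iteration, base_lr=0.001, steps=(60000, 80000), gamma=0.1):
    """Learning rate under a multi-step schedule with no warmup.

    gamma is assumed; BASE_LR and STEPS are taken from the posted config.
    """
    # The rate is multiplied by gamma once for every step boundary
    # that the current iteration has reached.
    return base_lr * gamma ** bisect_right(steps, iteration)

for it in (5000, 60000, 80000, 90000):
    print(it, f"{step_lr(it):.6f}")
```

Under this assumption the rate is 0.001 at iteration 5000 and 0.00001 at iteration 90000, matching the logs, so the plateau was already present while training still ran at the full base rate.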

JialeCao001 · 2021-03-04T09:11:47Z

@omerbrandis I am not sure what the problem is. Can you reduce the learning rate and try again?

omerbrandis · 2021-03-04T09:26:11Z

I'm now trying with SOLVER.IMS_PER_BATCH = 2.
(In the past I've seen issues reported by other maskrcnn-benchmark users, something about batch normalization requiring at least 2 images in order to train.)

It looks better now: the loss has broken past the 1.56 barrier. Training is still running and I'm not sure how far it will get.
I'll update when ...
:-)
Omer.

omerbrandis · 2021-03-04T10:34:25Z

With SOLVER.IMS_PER_BATCH = 2, training progressed nicely up to
iter: 6980 loss: 0.6838 (0.9119) loss_cls: 0.0003 (0.0445) loss_reg: 0.0465 (0.1292) loss_centerness: 0.5894 (0.5999) loss_mask: 0.0516 (0.1382) time: 0.7345 (0.8078) data: 0.0297 (0.0304) lr: 0.001000 max mem: 8542

But from there it did not make any significant progress.

I have a checkpoint from
iter: 10000 loss: 0.6815 (0.8467) loss_cls: 0.0002 (0.0312) loss_reg: 0.0485 (0.1055) loss_centerness: 0.5905 (0.5976) loss_mask: 0.0425 (0.1124) time: 0.7686 (0.7993) data: 0.0305 (0.0305) lr: 0.001000

Its test results are:
Evaluate annotation type bbox
DONE (t=0.38s).
Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.727
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.836
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.742
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.730
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.746
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.893
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.177
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.594
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.754
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.732
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.763
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.971
Maximum f-measures for classes:
[1.0, 0.9932432432432432, 0.6358695652173912]
Score thresholds for classes (used in demos for visualization purposes):
[0.7197750806808472, 0.5451260805130005, 0.5425870418548584]
Loading and preparing results...
DONE (t=0.02s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type segm
DONE (t=0.42s).
Accumulating evaluation results...
DONE (t=0.03s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.493
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.696
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.545
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.434
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.511
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.799
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.063
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.396
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.537
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.483
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.551
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.822

This is a noticeable improvement. :-)

Is there a way to visualize inference?
(I've tried demo/fcos_demo.py but was not successful.)

Thanks,
Omer

omerbrandis · 2021-03-04T12:47:59Z

I tried "fine tuning" with a lower LR: I changed BASE_LR in the config file and restarted training.
The training logs show that the LR in use is the original one and not the current value in the config file. (I'm guessing that the optimizer object was loaded from the .pth file and not recreated ...)

Omer.
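
If the guess above is right, one possible workaround is to save a copy of the checkpoint that keeps only the model weights, so that the optimizer and LR schedule are rebuilt from the config on the next run. This is a sketch under assumptions: `strip_training_state` is a hypothetical helper, the key names `optimizer`, `scheduler` and `iteration` are what I would expect in the .pth file but have not verified for this repository, and the file paths in the comments are placeholders.

```python
def strip_training_state(checkpoint, keys=("optimizer", "scheduler", "iteration")):
    """Return a copy of a checkpoint dict without the saved training state.

    The key names are assumptions about the checkpoint layout.
    """
    return {k: v for k, v in checkpoint.items() if k not in keys}

# Hypothetical usage with PyTorch (placeholder paths):
#   import torch
#   ckpt = torch.load("model_0010000.pth", map_location="cpu")
#   torch.save(strip_training_state(ckpt), "model_0010000_weights_only.pth")

# Stand-in dictionary so the helper can be tried without a real checkpoint.
example = {
    "model": {"w": 1.0},
    "optimizer": {"lr": 0.001},
    "scheduler": {"last_epoch": 10000},
    "iteration": 10000,
}
print(sorted(strip_training_state(example)))
```

Note that dropping the iteration counter means training would start again from iteration 0 with the full schedule. It may also be necessary to point the run at a fresh output directory, in case the trainer prefers the last checkpoint recorded there over the weights file named in the config; that behaviour is likewise an assumption.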
