Hello, thanks for your amazing work :)

I tried to reproduce the ICDAR 2015 result from the paper, but I can't match the reported numbers with the pre-trained weights. I'm not changing any code: I downloaded the dataset and the pre-trained weights, then fine-tuned starting from the pre-trained checkpoint. However, the loss stays at around 30 or above, and it does not look like training is converging.
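Could the problem be that the pre-trained weights are not actually being loaded into the model? As a quick sanity check I can inspect the checkpoint file directly (a minimal sketch; the nesting of the weights under a "model" key follows the usual detectron2 checkpoint layout and is an assumption on my part):

```python
# Sketch: inspect the pre-trained checkpoint before fine-tuning.
# Assumption: standard detectron2 layout, i.e. a dict holding the
# state_dict under the "model" key.
import torch

ckpt = torch.load("weights/TESTR/pretrain_testr_R_50_polygon.pth", map_location="cpu")
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
print(len(state), "tensors in checkpoint")
for name in sorted(state)[:5]:
    print(name, tuple(state[name].shape))  # spot-check parameter names/shapes
```

If those parameter names did not match the model's state_dict, the weights would not be applied and training would effectively start from scratch, which could explain a high, non-decreasing loss.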
Below is my log:
[07/25 14:21:07] detectron2 INFO: Rank of current process: 0. World size: 8
[07/25 14:21:11] detectron2 INFO: Environment info:
sys.platform linux
Python 3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]
numpy 1.23.4
detectron2 0.6 @/usr/local/lib/python3.8/dist-packages/detectron2
Compiler GCC 9.4
CUDA compiler CUDA 11.3
detectron2 arch flags 8.6
DETECTRON2_ENV_MODULE
PyTorch 1.12.1+cu113 @/usr/local/lib/python3.8/dist-packages/torch
PyTorch debug build False
torch._C._GLIBCXX_USE_CXX11_ABI False
GPU available Yes
GPU 0,1,2,3,4,5,6,7 Tesla T4 (arch=7.5)
Driver version 450.80.02
CUDA_HOME /usr/local/cuda
Pillow 9.2.0
torchvision 0.13.1+cu113 @/usr/local/lib/python3.8/dist-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.1.2
PyTorch built with:
[07/25 14:21:11] detectron2 INFO: Command line arguments: Namespace(config_file='configs/TESTR/ICDAR15/TESTR_R_50_Polygon.yaml', dist_url='tcp://127.0.0.1:59588', eval_only=False, machine_rank=0, num_gpus=8, num_machines=1, opts=[], resume=False)
[07/25 14:21:11] detectron2 INFO: Contents of args.config_file=configs/TESTR/ICDAR15/TESTR_R_50_Polygon.yaml:
_BASE_: "Base-ICDAR15-Polygon.yaml"
MODEL:
  WEIGHTS: "weights/TESTR/pretrain_testr_R_50_polygon.pth"
  RESNETS:
    DEPTH: 50
  TRANSFORMER:
    NUM_FEATURE_LEVELS: 4
    INFERENCE_TH_TEST: 0.3
    ENC_LAYERS: 6
    DEC_LAYERS: 6
    DIM_FEEDFORWARD: 1024
    HIDDEN_DIM: 256
    DROPOUT: 0.1
    NHEADS: 8
    NUM_QUERIES: 100
    ENC_N_POINTS: 4
    DEC_N_POINTS: 4
SOLVER:
  IMS_PER_BATCH: 8
  BASE_LR: 1e-5
  LR_BACKBONE: 1e-6
  WARMUP_ITERS: 0
  STEPS: (200000,)
  MAX_ITER: 200000
  CHECKPOINT_PERIOD: 10000
TEST:
  EVAL_PERIOD: 10000
OUTPUT_DIR: "output/TESTR/icdar15/TESTR_R_50_Polygon"
[07/25 14:21:11] detectron2 INFO: Running with full config:
CUDNN_BENCHMARK: false
DATALOADER:
ASPECT_RATIO_GROUPING: true
FILTER_EMPTY_ANNOTATIONS: true
NUM_WORKERS: 4
REPEAT_THRESHOLD: 0.0
SAMPLER_TRAIN: TrainingSampler
DATASETS:
PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
PROPOSAL_FILES_TEST: []
PROPOSAL_FILES_TRAIN: []
TEST:
TRAIN:
GLOBAL:
HACK: 1.0
INPUT:
CROP:
CROP_INSTANCE: false
ENABLED: true
SIZE:
TYPE: relative_range
FORMAT: RGB
HFLIP_TRAIN: false
MASK_FORMAT: polygon
MAX_SIZE_TEST: 4000
MAX_SIZE_TRAIN: 2333
MIN_SIZE_TEST: 1440
MIN_SIZE_TRAIN:
MIN_SIZE_TRAIN_SAMPLING: choice
RANDOM_FLIP: horizontal
MODEL:
ANCHOR_GENERATOR:
ANGLES:
ASPECT_RATIOS:
NAME: DefaultAnchorGenerator
OFFSET: 0.0
SIZES:
BACKBONE:
ANTI_ALIAS: false
FREEZE_AT: 2
NAME: build_resnet_backbone
BASIS_MODULE:
ANN_SET: coco
COMMON_STRIDE: 8
CONVS_DIM: 128
IN_FEATURES:
LOSS_ON: false
LOSS_WEIGHT: 0.3
NAME: ProtoNet
NORM: SyncBN
NUM_BASES: 4
NUM_CLASSES: 80
NUM_CONVS: 3
BATEXT:
CANONICAL_SIZE: 96
CONV_DIM: 256
CUSTOM_DICT: ''
IN_FEATURES:
NUM_CHARS: 25
NUM_CONV: 2
POOLER_RESOLUTION:
POOLER_SCALES:
RECOGNITION_LOSS: ctc
RECOGNIZER: attn
SAMPLING_RATIO: 1
USE_AET: false
USE_COORDCONV: false
VOC_SIZE: 96
BLENDMASK:
ATTN_SIZE: 14
BOTTOM_RESOLUTION: 56
INSTANCE_LOSS_WEIGHT: 1.0
POOLER_SAMPLING_RATIO: 1
POOLER_SCALES:
POOLER_TYPE: ROIAlignV2
TOP_INTERP: bilinear
VISUALIZE: false
BOXINST:
BOTTOM_PIXELS_REMOVED: 10
ENABLED: false
PAIRWISE:
COLOR_THRESH: 0.3
DILATION: 2
SIZE: 3
WARMUP_ITERS: 10000
BiFPN:
IN_FEATURES:
NORM: ''
NUM_REPEATS: 6
OUT_CHANNELS: 160
CONDINST:
BOTTOM_PIXELS_REMOVED: -1
MASK_BRANCH:
CHANNELS: 128
IN_FEATURES:
NORM: BN
NUM_CONVS: 4
OUT_CHANNELS: 8
SEMANTIC_LOSS_ON: false
MASK_HEAD:
CHANNELS: 8
DISABLE_REL_COORDS: false
NUM_LAYERS: 3
USE_FP16: false
MASK_OUT_STRIDE: 4
MAX_PROPOSALS: -1
TOPK_PROPOSALS_PER_IM: -1
DEVICE: cuda
DLA:
CONV_BODY: DLA34
NORM: FrozenBN
OUT_FEATURES:
FCOS:
BOX_QUALITY: ctrness
CENTER_SAMPLE: true
FPN_STRIDES:
INFERENCE_TH_TEST: 0.05
INFERENCE_TH_TRAIN: 0.05
IN_FEATURES:
LOC_LOSS_TYPE: giou
LOSS_ALPHA: 0.25
LOSS_GAMMA: 2.0
LOSS_NORMALIZER_CLS: fg
LOSS_WEIGHT_CLS: 1.0
NMS_TH: 0.6
NORM: GN
NUM_BOX_CONVS: 4
NUM_CLASSES: 80
NUM_CLS_CONVS: 4
NUM_SHARE_CONVS: 0
POST_NMS_TOPK_TEST: 100
POST_NMS_TOPK_TRAIN: 100
POS_RADIUS: 1.5
PRE_NMS_TOPK_TEST: 1000
PRE_NMS_TOPK_TRAIN: 1000
PRIOR_PROB: 0.01
SIZES_OF_INTEREST:
THRESH_WITH_CTR: false
TOP_LEVELS: 2
USE_DEFORMABLE: false
USE_RELU: true
USE_SCALE: true
YIELD_BOX_FEATURES: false
YIELD_PROPOSAL: false
FPN:
FUSE_TYPE: sum
IN_FEATURES: []
NORM: ''
OUT_CHANNELS: 256
KEYPOINT_ON: false
LOAD_PROPOSALS: false
MASK_ON: false
MEInst:
AGNOSTIC: true
CENTER_SAMPLE: true
DIM_MASK: 60
FLAG_PARAMETERS: false
FPN_STRIDES:
GCN_KERNEL_SIZE: 9
INFERENCE_TH_TEST: 0.05
INFERENCE_TH_TRAIN: 0.05
IN_FEATURES:
IOU_LABELS:
IOU_THRESHOLDS:
LAST_DEFORMABLE: false
LOC_LOSS_TYPE: giou
LOSS_ALPHA: 0.25
LOSS_GAMMA: 2.0
LOSS_ON_MASK: false
MASK_LOSS_TYPE: mse
MASK_ON: true
MASK_SIZE: 28
NMS_TH: 0.6
NORM: GN
NUM_BOX_CONVS: 4
NUM_CLASSES: 80
NUM_CLS_CONVS: 4
NUM_MASK_CONVS: 4
NUM_SHARE_CONVS: 0
PATH_COMPONENTS: datasets/coco/components/coco_2017_train_class_agnosticTrue_whitenTrue_sigmoidTrue_60.npz
POST_NMS_TOPK_TEST: 100
POST_NMS_TOPK_TRAIN: 100
POS_RADIUS: 1.5
PRE_NMS_TOPK_TEST: 1000
PRE_NMS_TOPK_TRAIN: 1000
PRIOR_PROB: 0.01
SIGMOID: true
SIZES_OF_INTEREST:
THRESH_WITH_CTR: false
TOP_LEVELS: 2
TYPE_DEFORMABLE: DCNv1
USE_DEFORMABLE: false
USE_GCN_IN_MASK: false
USE_RELU: true
USE_SCALE: true
WHITEN: true
META_ARCHITECTURE: TransformerDetector
MOBILENET: false
PANOPTIC_FPN:
COMBINE:
ENABLED: true
INSTANCES_CONFIDENCE_THRESH: 0.5
OVERLAP_THRESH: 0.5
STUFF_AREA_LIMIT: 4096
INSTANCE_LOSS_WEIGHT: 1.0
PIXEL_MEAN:
PIXEL_STD:
PROPOSAL_GENERATOR:
MIN_SIZE: 0
NAME: RPN
RESNETS:
DEFORM_INTERVAL: 1
DEFORM_MODULATED: false
DEFORM_NUM_GROUPS: 1
DEFORM_ON_PER_STAGE:
DEPTH: 50
NORM: FrozenBN
NUM_GROUPS: 1
OUT_FEATURES:
RES2_OUT_CHANNELS: 256
RES5_DILATION: 1
STEM_OUT_CHANNELS: 64
STRIDE_IN_1X1: false
WIDTH_PER_GROUP: 64
RETINANET:
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_WEIGHTS: &id002
FOCAL_LOSS_ALPHA: 0.25
FOCAL_LOSS_GAMMA: 2.0
IN_FEATURES:
IOU_LABELS:
IOU_THRESHOLDS:
NMS_THRESH_TEST: 0.5
NORM: ''
NUM_CLASSES: 80
NUM_CONVS: 4
PRIOR_PROB: 0.01
SCORE_THRESH_TEST: 0.05
SMOOTH_L1_LOSS_BETA: 0.1
TOPK_CANDIDATES_TEST: 1000
ROI_BOX_CASCADE_HEAD:
BBOX_REG_WEIGHTS:
IOUS:
ROI_BOX_HEAD:
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_LOSS_WEIGHT: 1.0
BBOX_REG_WEIGHTS: *id001
CLS_AGNOSTIC_BBOX_REG: false
CONV_DIM: 256
FC_DIM: 1024
FED_LOSS_FREQ_WEIGHT_POWER: 0.5
FED_LOSS_NUM_CLASSES: 50
NAME: ''
NORM: ''
NUM_CONV: 0
NUM_FC: 0
POOLER_RESOLUTION: 14
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
SMOOTH_L1_BETA: 0.0
TRAIN_ON_PRED_BOXES: false
USE_FED_LOSS: false
USE_SIGMOID_CE: false
ROI_HEADS:
BATCH_SIZE_PER_IMAGE: 512
IN_FEATURES:
IOU_LABELS:
IOU_THRESHOLDS:
NAME: Res5ROIHeads
NMS_THRESH_TEST: 0.5
NUM_CLASSES: 80
POSITIVE_FRACTION: 0.25
PROPOSAL_APPEND_GT: true
SCORE_THRESH_TEST: 0.05
ROI_KEYPOINT_HEAD:
CONV_DIMS:
LOSS_WEIGHT: 1.0
MIN_KEYPOINTS_PER_IMAGE: 1
NAME: KRCNNConvDeconvUpsampleHead
NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
NUM_KEYPOINTS: 17
POOLER_RESOLUTION: 14
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
ROI_MASK_HEAD:
CLS_AGNOSTIC_MASK: false
CONV_DIM: 256
NAME: MaskRCNNConvUpsampleHead
NORM: ''
NUM_CONV: 0
POOLER_RESOLUTION: 14
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
RPN:
BATCH_SIZE_PER_IMAGE: 256
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_LOSS_WEIGHT: 1.0
BBOX_REG_WEIGHTS: *id002
BOUNDARY_THRESH: -1
CONV_DIMS:
HEAD_NAME: StandardRPNHead
IN_FEATURES:
IOU_LABELS:
IOU_THRESHOLDS:
LOSS_WEIGHT: 1.0
NMS_THRESH: 0.7
POSITIVE_FRACTION: 0.5
POST_NMS_TOPK_TEST: 1000
POST_NMS_TOPK_TRAIN: 2000
PRE_NMS_TOPK_TEST: 6000
PRE_NMS_TOPK_TRAIN: 12000
SMOOTH_L1_BETA: 0.0
SEM_SEG_HEAD:
COMMON_STRIDE: 4
CONVS_DIM: 128
IGNORE_VALUE: 255
IN_FEATURES:
LOSS_WEIGHT: 1.0
NAME: SemSegFPNHead
NORM: GN
NUM_CLASSES: 54
SOLOV2:
FPN_INSTANCE_STRIDES:
FPN_SCALE_RANGES:
INSTANCE_CHANNELS: 512
INSTANCE_IN_CHANNELS: 256
INSTANCE_IN_FEATURES:
LOSS:
DICE_WEIGHT: 3.0
FOCAL_ALPHA: 0.25
FOCAL_GAMMA: 2.0
FOCAL_USE_SIGMOID: true
FOCAL_WEIGHT: 1.0
MASK_CHANNELS: 128
MASK_IN_CHANNELS: 256
MASK_IN_FEATURES:
MASK_THR: 0.5
MAX_PER_IMG: 100
NMS_KERNEL: gaussian
NMS_PRE: 500
NMS_SIGMA: 2
NMS_TYPE: matrix
NORM: GN
NUM_CLASSES: 80
NUM_GRIDS:
NUM_INSTANCE_CONVS: 4
NUM_KERNELS: 256
NUM_MASKS: 256
PRIOR_PROB: 0.01
SCORE_THR: 0.1
SIGMA: 0.2
TYPE_DCN: DCN
UPDATE_THR: 0.05
USE_COORD_CONV: true
USE_DCN_IN_INSTANCE: false
TOP_MODULE:
DIM: 16
NAME: conv
TRANSFORMER:
AUX_LOSS: true
DEC_LAYERS: 6
DEC_N_POINTS: 4
DIM_FEEDFORWARD: 1024
DROPOUT: 0.1
ENABLED: true
ENC_LAYERS: 6
ENC_N_POINTS: 4
HIDDEN_DIM: 256
INFERENCE_TH_TEST: 0.3
LOSS:
AUX_LOSS: true
BOX_CLASS_WEIGHT: 2.0
BOX_COORD_WEIGHT: 5.0
BOX_GIOU_WEIGHT: 2.0
FOCAL_ALPHA: 0.25
FOCAL_GAMMA: 2.0
POINT_CLASS_WEIGHT: 2.0
POINT_COORD_WEIGHT: 5.0
POINT_TEXT_WEIGHT: 4.0
NHEADS: 8
NUM_CHARS: 25
NUM_CTRL_POINTS: 16
NUM_FEATURE_LEVELS: 4
NUM_QUERIES: 100
POSITION_EMBEDDING_SCALE: 6.283185307179586
USE_POLYGON: true
VOC_SIZE: 96
VOVNET:
BACKBONE_OUT_CHANNELS: 256
CONV_BODY: V-39-eSE
NORM: FrozenBN
OUT_CHANNELS: 256
OUT_FEATURES:
WEIGHTS: weights/TESTR/pretrain_testr_R_50_polygon.pth
OUTPUT_DIR: output/TESTR/icdar15/TESTR_R_50_Polygon
SEED: -1
SOLVER:
AMP:
ENABLED: false
BASE_LR: 1.0e-05
BASE_LR_END: 0.0
BIAS_LR_FACTOR: 1.0
CHECKPOINT_PERIOD: 10000
CLIP_GRADIENTS:
CLIP_TYPE: full_model
CLIP_VALUE: 0.1
ENABLED: true
NORM_TYPE: 2.0
GAMMA: 0.1
IMS_PER_BATCH: 8
LR_BACKBONE: 1.0e-06
LR_BACKBONE_NAMES:
LR_LINEAR_PROJ_MULT: 0.1
LR_LINEAR_PROJ_NAMES:
LR_SCHEDULER_NAME: WarmupMultiStepLR
MAX_ITER: 200000
MOMENTUM: 0.9
NESTEROV: false
NUM_DECAYS: 3
OPTIMIZER: ADAMW
REFERENCE_WORLD_SIZE: 0
RESCALE_INTERVAL: false
STEPS:
WARMUP_FACTOR: 0.001
WARMUP_ITERS: 0
WARMUP_METHOD: linear
WEIGHT_DECAY: 0.0001
WEIGHT_DECAY_BIAS: null
WEIGHT_DECAY_NORM: 0.0
TEST:
AUG:
ENABLED: false
FLIP: true
MAX_SIZE: 4000
MIN_SIZES:
DETECTIONS_PER_IMAGE: 100
EVAL_PERIOD: 10000
EXPECTED_RESULTS: []
KEYPOINT_OKS_SIGMAS: []
LEXICON_TYPE: 3
PRECISE_BN:
ENABLED: false
NUM_ITER: 200
USE_LEXICON: true
WEIGHTED_EDIT_DIST: true
VERSION: 2
VIS_PERIOD: 0
[07/25 14:21:11] detectron2 INFO: Full config saved to output/TESTR/icdar15/TESTR_R_50_Polygon/config.yaml
[07/25 14:21:11] d2.utils.env INFO: Using a generated random seed 11819301
[07/25 14:21:13] d2.engine.defaults INFO: Model:
TransformerDetector(
(testr): TESTR(
(backbone): Joiner(
(0): MaskedBackbone(
(backbone): ResNet(
(stem): BasicStem(
(conv1): Conv2d(
3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
)
(res2): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv1): Conv2d(
64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
)
(res3): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv1): Conv2d(
256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(3): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
)
(res4): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
(conv1): Conv2d(
512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(3): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(4): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(5): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
)
(res5): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
(conv1): Conv2d(
1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
)
)
)
(1): PositionalEncoding2D()
)
(text_pos_embed): PositionalEncoding1D()
(transformer): DeformableTransformer(
(encoder): DeformableTransformerEncoder(
(layers): ModuleList(
(0): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(1): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(2): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(3): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(4): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(5): DeformableTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): DeformableCompositeTransformerDecoder(
(layers): ModuleList(
(0): DeformableCompositeTransformerDecoderLayer(
(attn_cross): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross): Dropout(p=0.1, inplace=False)
(norm_cross): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra): Dropout(p=0.1, inplace=False)
(norm_intra): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter): Dropout(p=0.1, inplace=False)
(norm_inter): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout4): Dropout(p=0.1, inplace=False)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra_text): Dropout(p=0.1, inplace=False)
(norm_intra_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter_text): Dropout(p=0.1, inplace=False)
(norm_inter_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_cross_text): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross_text): Dropout(p=0.1, inplace=False)
(norm_cross_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1_text): Linear(in_features=256, out_features=1024, bias=True)
(dropout3_text): Dropout(p=0.1, inplace=False)
(linear2_text): Linear(in_features=1024, out_features=256, bias=True)
(dropout4_text): Dropout(p=0.1, inplace=False)
(norm3_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(1): DeformableCompositeTransformerDecoderLayer(
(attn_cross): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross): Dropout(p=0.1, inplace=False)
(norm_cross): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra): Dropout(p=0.1, inplace=False)
(norm_intra): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter): Dropout(p=0.1, inplace=False)
(norm_inter): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout4): Dropout(p=0.1, inplace=False)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra_text): Dropout(p=0.1, inplace=False)
(norm_intra_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter_text): Dropout(p=0.1, inplace=False)
(norm_inter_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_cross_text): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross_text): Dropout(p=0.1, inplace=False)
(norm_cross_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1_text): Linear(in_features=256, out_features=1024, bias=True)
(dropout3_text): Dropout(p=0.1, inplace=False)
(linear2_text): Linear(in_features=1024, out_features=256, bias=True)
(dropout4_text): Dropout(p=0.1, inplace=False)
(norm3_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(2): DeformableCompositeTransformerDecoderLayer(
(attn_cross): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross): Dropout(p=0.1, inplace=False)
(norm_cross): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra): Dropout(p=0.1, inplace=False)
(norm_intra): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter): Dropout(p=0.1, inplace=False)
(norm_inter): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout4): Dropout(p=0.1, inplace=False)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra_text): Dropout(p=0.1, inplace=False)
(norm_intra_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter_text): Dropout(p=0.1, inplace=False)
(norm_inter_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_cross_text): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross_text): Dropout(p=0.1, inplace=False)
(norm_cross_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1_text): Linear(in_features=256, out_features=1024, bias=True)
(dropout3_text): Dropout(p=0.1, inplace=False)
(linear2_text): Linear(in_features=1024, out_features=256, bias=True)
(dropout4_text): Dropout(p=0.1, inplace=False)
(norm3_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(3): DeformableCompositeTransformerDecoderLayer(
(attn_cross): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross): Dropout(p=0.1, inplace=False)
(norm_cross): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra): Dropout(p=0.1, inplace=False)
(norm_intra): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter): Dropout(p=0.1, inplace=False)
(norm_inter): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout4): Dropout(p=0.1, inplace=False)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra_text): Dropout(p=0.1, inplace=False)
(norm_intra_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter_text): Dropout(p=0.1, inplace=False)
(norm_inter_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_cross_text): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross_text): Dropout(p=0.1, inplace=False)
(norm_cross_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1_text): Linear(in_features=256, out_features=1024, bias=True)
(dropout3_text): Dropout(p=0.1, inplace=False)
(linear2_text): Linear(in_features=1024, out_features=256, bias=True)
(dropout4_text): Dropout(p=0.1, inplace=False)
(norm3_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(4): DeformableCompositeTransformerDecoderLayer(
(attn_cross): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross): Dropout(p=0.1, inplace=False)
(norm_cross): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra): Dropout(p=0.1, inplace=False)
(norm_intra): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter): Dropout(p=0.1, inplace=False)
(norm_inter): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout4): Dropout(p=0.1, inplace=False)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra_text): Dropout(p=0.1, inplace=False)
(norm_intra_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter_text): Dropout(p=0.1, inplace=False)
(norm_inter_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_cross_text): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross_text): Dropout(p=0.1, inplace=False)
(norm_cross_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1_text): Linear(in_features=256, out_features=1024, bias=True)
(dropout3_text): Dropout(p=0.1, inplace=False)
(linear2_text): Linear(in_features=1024, out_features=256, bias=True)
(dropout4_text): Dropout(p=0.1, inplace=False)
(norm3_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(5): DeformableCompositeTransformerDecoderLayer(
(attn_cross): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross): Dropout(p=0.1, inplace=False)
(norm_cross): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra): Dropout(p=0.1, inplace=False)
(norm_intra): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter): Dropout(p=0.1, inplace=False)
(norm_inter): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout3): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout4): Dropout(p=0.1, inplace=False)
(norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_intra_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_intra_text): Dropout(p=0.1, inplace=False)
(norm_intra_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_inter_text): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(dropout_inter_text): Dropout(p=0.1, inplace=False)
(norm_inter_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(attn_cross_text): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=256, bias=True)
(attention_weights): Linear(in_features=256, out_features=128, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout_cross_text): Dropout(p=0.1, inplace=False)
(norm_cross_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1_text): Linear(in_features=256, out_features=1024, bias=True)
(dropout3_text): Dropout(p=0.1, inplace=False)
(linear2_text): Linear(in_features=1024, out_features=256, bias=True)
(dropout4_text): Dropout(p=0.1, inplace=False)
(norm3_text): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
)
(enc_output): Linear(in_features=256, out_features=256, bias=True)
(enc_output_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(pos_trans): Linear(in_features=256, out_features=256, bias=True)
(pos_trans_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(bbox_class_embed): Linear(in_features=256, out_features=1, bias=True)
(bbox_embed): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
)
(ctrl_point_class): ModuleList(
(0): Linear(in_features=256, out_features=1, bias=True)
(1): Linear(in_features=256, out_features=1, bias=True)
(2): Linear(in_features=256, out_features=1, bias=True)
(3): Linear(in_features=256, out_features=1, bias=True)
(4): Linear(in_features=256, out_features=1, bias=True)
(5): Linear(in_features=256, out_features=1, bias=True)
)
(ctrl_point_coord): ModuleList(
(0): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=2, bias=True)
)
)
(1): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=2, bias=True)
)
)
(2): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=2, bias=True)
)
)
(3): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=2, bias=True)
)
)
(4): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=2, bias=True)
)
)
(5): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=2, bias=True)
)
)
)
(bbox_coord): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=4, bias=True)
)
)
(bbox_class): Linear(in_features=256, out_features=1, bias=True)
(text_class): Linear(in_features=256, out_features=97, bias=True)
(ctrl_point_embed): Embedding(16, 256)
(text_embed): Embedding(25, 256)
(input_proj): ModuleList(
(0): Sequential(
(0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(1): Sequential(
(0): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(2): Sequential(
(0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(3): Sequential(
(0): Conv2d(2048, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
)
)
(criterion): SetCriterion(
(enc_matcher): BoxHungarianMatcher()
(dec_matcher): CtrlPointHungarianMatcher()
)
)
[07/25 14:21:13] d2.data.dataset_mapper INFO: [DatasetMapper] Augmentations used in training: [RandomCrop(crop_type='relative_range', crop_size=[0.1, 0.1]), ResizeShortestEdge(short_edge_length=(800, 832, 864, 896, 1000, 1200, 1400), max_size=2333, sample_style='choice'), RandomFlip()]
[07/25 14:21:13] adet.data.dataset_mapper INFO: Rebuilding the augmentations. The previous augmentations will be overridden.
[07/25 14:21:13] adet.data.detection_utils INFO: Augmentations used in training: [ResizeShortestEdge(short_edge_length=(800, 832, 864, 896, 1000, 1200, 1400), max_size=2333, sample_style='choice')]
[07/25 14:21:13] adet.data.dataset_mapper INFO: Cropping used in training: RandomCropWithInstance(crop_type='relative_range', crop_size=[0.1, 0.1], crop_instance=False)
[07/25 14:21:13] adet.data.datasets.text INFO: Loaded 1000 images in COCO format from datasets/icdar2015/train_poly.json
[07/25 14:21:13] d2.data.build INFO: Removed 21 images with no usable annotations. 979 images left.
[07/25 14:21:13] d2.data.build INFO: Distribution of instances among all 1 categories:
| category | #instances |
|:--------:|:-----------|
|   text   | 4468       |
[07/25 14:21:13] d2.data.build INFO: Using training sampler TrainingSampler
[07/25 14:21:13] d2.data.common INFO: Serializing the dataset using: <class 'detectron2.data.common.TorchSerializedList'>
[07/25 14:21:13] d2.data.common INFO: Serializing 979 elements to byte tensors and concatenating them all ...
[07/25 14:21:13] d2.data.common INFO: Serialized dataset takes 1.64 MiB
[07/25 14:21:13] d2.checkpoint.detection_checkpoint INFO: [DetectionCheckpointer] Loading from weights/TESTR/pretrain_testr_R_50_polygon.pth ...
[07/25 14:21:13] fvcore.common.checkpoint INFO: [Checkpointer] Loading from weights/TESTR/pretrain_testr_R_50_polygon.pth ...
[07/25 14:21:14] adet.trainer INFO: Starting training from iteration 0
[07/25 17:20:06] d2.utils.events INFO: eta: 2 days, 13:01:22 iter: 9359 total_loss: 44.08 loss_ce: 0.783 loss_ctrl_points: 2.31 loss_texts: 3.764 loss_ce_0: 0.8143 loss_ctrl_points_0: 2.423 loss_texts_0: 3.801 loss_ce_1: 0.8142 loss_ctrl_points_1: 2.4 loss_texts_1: 3.759 loss_ce_2: 0.8032 loss_ctrl_points_2: 2.351 loss_texts_2: 3.756 loss_ce_3: 0.7866 loss_ctrl_points_3: 2.334 loss_texts_3: 3.758 loss_ce_4: 0.7786 loss_ctrl_points_4: 2.311 loss_texts_4: 3.77 loss_ce_enc: 0.8066 loss_bbox_enc: 0.3008 loss_giou_enc: 0.7569 time: 1.1431 last_time: 0.8115 data_time: 0.0088 last_data_time: 0.0066 lr: 1e-05 max_mem: 12183M
[07/25 17:20:28] d2.utils.events INFO: eta: 2 days, 13:02:11 iter: 9379 total_loss: 42.63 loss_ce: 0.7653 loss_ctrl_points: 2.407 loss_texts: 3.758 loss_ce_0: 0.8062 loss_ctrl_points_0: 2.635 loss_texts_0: 3.792 loss_ce_1: 0.7863 loss_ctrl_points_1: 2.568 loss_texts_1: 3.736 loss_ce_2: 0.7788 loss_ctrl_points_2: 2.537 loss_texts_2: 3.737 loss_ce_3: 0.77 loss_ctrl_points_3: 2.508 loss_texts_3: 3.748 loss_ce_4: 0.7641 loss_ctrl_points_4: 2.456 loss_texts_4: 3.748 loss_ce_enc: 0.7962 loss_bbox_enc: 0.2918 loss_giou_enc: 0.73 time: 1.1431 last_time: 0.9134 data_time: 0.0084 last_data_time: 0.0075 lr: 1e-05 max_mem: 12183M
[07/25 17:20:51] d2.utils.events INFO: eta: 2 days, 13:05:45 iter: 9399 total_loss: 44.09 loss_ce: 0.7944 loss_ctrl_points: 2.32 loss_texts: 3.633 loss_ce_0: 0.8154 loss_ctrl_points_0: 2.634 loss_texts_0: 3.668 loss_ce_1: 0.802 loss_ctrl_points_1: 2.506 loss_texts_1: 3.633 loss_ce_2: 0.8023 loss_ctrl_points_2: 2.369 loss_texts_2: 3.626 loss_ce_3: 0.7987 loss_ctrl_points_3: 2.281 loss_texts_3: 3.624 loss_ce_4: 0.7966 loss_ctrl_points_4: 2.309 loss_texts_4: 3.62 loss_ce_enc: 0.8003 loss_bbox_enc: 0.2937 loss_giou_enc: 0.7454 time: 1.1431 last_time: 1.1894 data_time: 0.0081 last_data_time: 0.0227 lr: 1e-05 max
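One note on reading these numbers: as far as I can tell, total_loss is the sum of 21 already-weighted terms — the three losses of the final decoder layer, five auxiliary deep-supervision copies (the _0 to _4 suffixes), and the three encoder losses — so a total around 44 corresponds to individual terms of only about 0.3 to 3.8. A quick check against the iter 9359 line (a sketch; the values are copied from the log above):

```python
# Reconstruct total_loss at iter 9359 from the logged per-term values.
# The per-term values appear to be logged after weighting, so the total
# should be roughly their plain sum.
terms = [
    0.783, 2.31, 3.764,      # final decoder layer: ce, ctrl_points, texts
    0.8143, 2.423, 3.801,    # aux layer 0
    0.8142, 2.4, 3.759,      # aux layer 1
    0.8032, 2.351, 3.756,    # aux layer 2
    0.7866, 2.334, 3.758,    # aux layer 3
    0.7786, 2.311, 3.77,     # aux layer 4
    0.8066, 0.3008, 0.7569,  # encoder: ce, bbox, giou
]
print(sum(terms))  # ~43.4, close to the logged total_loss of 44.08
```

The sum does not match exactly because detectron2's event writer reports a smoothed (median) value per scalar over a window, so the sum of the medians need not equal the median of the sums.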