Build model using fasterrcnn_mobilenetv3_large_fpn #147

Open
WawanFirgiawan opened this issue Jun 24, 2024 · 1 comment


@WawanFirgiawan

I want to run my training process with the command:

```
!python train.py --data data_configs/data_training.yaml --epochs 40 --model fasterrcnn_mobilenetv3_large_fpn --project-dir fasterrcnn_mobilenetv3_large_fpn --seed 8
```

and I get an error in my program as follows:

```
2024-06-24 15:23:20.794655: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-24 15:23:20.794717: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-24 15:23:20.796062: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-24 15:23:20.803158: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-24 15:23:21.919523: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Not using distributed mode
wandb: Currently logged in as: pusatstudiaiunsulbar (pusatsudiaiusb). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.17.2
wandb: Run data is saved locally in /content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/wandb/run-20240624_152326-bw79izjd
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run expert-fire-4
wandb: ⭐️ View project at https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline
wandb: 🚀 View run at https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline/runs/bw79izjd
device cuda
Checking Labels and images...
100% 886/886 [00:00<00:00, 116878.55it/s]
Checking Labels and images...
0it [00:00, ?it/s]
Creating data loaders
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Number of training samples: 886
Number of validation samples: 0

Building model from scratch...

Layer (type (var_name)) Input Shape Output Shape Param #

FasterRCNN (FasterRCNN) [4, 3, 640, 640] [0, 4] --
├─GeneralizedRCNNTransform (transform) [4, 3, 640, 640] [4, 3, 640, 640] --
├─BackboneWithFPN (backbone) [4, 3, 640, 640] [4, 256, 10, 10] --
│ └─IntermediateLayerGetter (body) [4, 3, 640, 640] [4, 960, 20, 20] --
│ │ └─Conv2dNormActivation (0) [4, 3, 640, 640] [4, 16, 320, 320] (432)
│ │ └─InvertedResidual (1) [4, 16, 320, 320] [4, 16, 320, 320] (400)
│ │ └─InvertedResidual (2) [4, 16, 320, 320] [4, 24, 160, 160] (3,136)
│ │ └─InvertedResidual (3) [4, 24, 160, 160] [4, 24, 160, 160] (4,104)
│ │ └─InvertedResidual (4) [4, 24, 160, 160] [4, 40, 80, 80] (9,960)
│ │ └─InvertedResidual (5) [4, 40, 80, 80] [4, 40, 80, 80] (20,432)
│ │ └─InvertedResidual (6) [4, 40, 80, 80] [4, 40, 80, 80] (20,432)
│ │ └─InvertedResidual (7) [4, 40, 80, 80] [4, 80, 40, 40] 30,960
│ │ └─InvertedResidual (8) [4, 80, 40, 40] [4, 80, 40, 40] 33,800
│ │ └─InvertedResidual (9) [4, 80, 40, 40] [4, 80, 40, 40] 31,096
│ │ └─InvertedResidual (10) [4, 80, 40, 40] [4, 80, 40, 40] 31,096
│ │ └─InvertedResidual (11) [4, 80, 40, 40] [4, 112, 40, 40] 212,280
│ │ └─InvertedResidual (12) [4, 112, 40, 40] [4, 112, 40, 40] 383,208
│ │ └─InvertedResidual (13) [4, 112, 40, 40] [4, 160, 20, 20] 426,216
│ │ └─InvertedResidual (14) [4, 160, 20, 20] [4, 160, 20, 20] 793,200
│ │ └─InvertedResidual (15) [4, 160, 20, 20] [4, 160, 20, 20] 793,200
│ │ └─Conv2dNormActivation (16) [4, 160, 20, 20] [4, 960, 20, 20] 153,600
│ └─FeaturePyramidNetwork (fpn) [4, 160, 20, 20] [4, 256, 10, 10] --
│ │ └─ModuleList (inner_blocks) -- -- (recursive)
│ │ └─ModuleList (layer_blocks) -- -- (recursive)
│ │ └─ModuleList (inner_blocks) -- -- (recursive)
│ │ └─ModuleList (layer_blocks) -- -- (recursive)
│ │ └─LastLevelMaxPool (extra_blocks) [4, 256, 20, 20] [4, 256, 20, 20] --
├─RegionProposalNetwork (rpn) [4, 3, 640, 640] [0, 4] --
│ └─RPNHead (head) [4, 256, 20, 20] [4, 15, 20, 20] --
│ │ └─Sequential (conv) [4, 256, 20, 20] [4, 256, 20, 20] 590,080
│ │ └─Conv2d (cls_logits) [4, 256, 20, 20] [4, 15, 20, 20] 3,855
│ │ └─Conv2d (bbox_pred) [4, 256, 20, 20] [4, 60, 20, 20] 15,420
│ │ └─Sequential (conv) [4, 256, 20, 20] [4, 256, 20, 20] (recursive)
│ │ └─Conv2d (cls_logits) [4, 256, 20, 20] [4, 15, 20, 20] (recursive)
│ │ └─Conv2d (bbox_pred) [4, 256, 20, 20] [4, 60, 20, 20] (recursive)
│ │ └─Sequential (conv) [4, 256, 10, 10] [4, 256, 10, 10] (recursive)
│ │ └─Conv2d (cls_logits) [4, 256, 10, 10] [4, 15, 10, 10] (recursive)
│ │ └─Conv2d (bbox_pred) [4, 256, 10, 10] [4, 60, 10, 10] (recursive)
│ └─AnchorGenerator (anchor_generator) [4, 3, 640, 640] [13500, 4] --
├─RoIHeads (roi_heads) [4, 256, 20, 20] [0, 4] --
│ └─MultiScaleRoIAlign (box_roi_pool) [4, 256, 20, 20] [0, 256, 7, 7] --
│ └─TwoMLPHead (box_head) [0, 256, 7, 7] [0, 1024] --
│ │ └─Linear (fc6) [0, 12544] [0, 1024] 12,846,080
│ │ └─Linear (fc7) [0, 1024] [0, 1024] 1,049,600
│ └─FastRCNNPredictor (box_predictor) [0, 1024] [0, 3] --
│ │ └─Linear (cls_score) [0, 1024] [0, 3] 3,075
│ │ └─Linear (bbox_pred) [0, 1024] [0, 12] 12,300

Total params: 18,935,354
Trainable params: 18,876,458
Non-trainable params: 58,896
Total mult-adds (G): 11.49

Input size (MB): 19.66
Forward/backward pass size (MB): 1172.14
Params size (MB): 75.74
Estimated Total Size (MB): 1267.54

18,935,354 total parameters.
18,876,458 training parameters.
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
Epoch: [0] [ 0/222] eta: 0:11:02 lr: 0.000006 loss: 1.8196 (1.8196) loss_classifier: 1.4352 (1.4352) loss_box_reg: 0.3557 (0.3557) loss_objectness: 0.0227 (0.0227) loss_rpn_box_reg: 0.0060 (0.0060) time: 2.9830 data: 1.9134 max mem: 704
Epoch: [0] [100/222] eta: 0:00:22 lr: 0.000458 loss: 1.2597 (1.3672) loss_classifier: 0.5182 (0.6553) loss_box_reg: 0.7019 (0.6994) loss_objectness: 0.0014 (0.0098) loss_rpn_box_reg: 0.0025 (0.0027) time: 0.1611 data: 0.0257 max mem: 811
Epoch: [0] [200/222] eta: 0:00:03 lr: 0.000910 loss: 0.8597 (1.1901) loss_classifier: 0.2865 (0.5291) loss_box_reg: 0.5280 (0.6531) loss_objectness: 0.0006 (0.0057) loss_rpn_box_reg: 0.0013 (0.0023) time: 0.1735 data: 0.0235 max mem: 811
Epoch: [0] [221/222] eta: 0:00:00 lr: 0.001000 loss: 0.8436 (1.1645) loss_classifier: 0.3099 (0.5145) loss_box_reg: 0.5193 (0.6426) loss_objectness: 0.0005 (0.0053) loss_rpn_box_reg: 0.0012 (0.0022) time: 0.1591 data: 0.0203 max mem: 811
Epoch: [0] Total time: 0:00:34 (0.1552 s / it)
creating index...
index created!
Traceback (most recent call last):
File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/train.py", line 571, in
main(args)
File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/train.py", line 423, in main
stats, val_pred_image = evaluate(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/torch_utils/engine.py", line 136, in evaluate
for images, targets in metric_logger.log_every(data_loader, 100, header):
File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/torch_utils/utils.py", line 202, in log_every
log(f"{header} Total time: {total_time_str} ({total_time / len(iterable):.4f} s / it)")
ZeroDivisionError: float division by zero
wandb: 🚀 View run expert-fire-4 at: https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline/runs/bw79izjd
wandb: ⭐️ View project at: https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240624_152326-bw79izjd/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.

@maexrakete (Contributor)

> Number of validation samples: 0

Looks like you have no validation data. The evaluation loop then runs over an empty data loader, so the per-iteration average in `torch_utils/utils.py` divides by `len(iterable) == 0`, which is exactly the `ZeroDivisionError` in your traceback.
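For context, here is a minimal sketch of the failing pattern (illustrative only; the function and variable names differ from the repository's actual `log_every` code):

```python
# Illustrative sketch only; not the repo's torch_utils/utils.py.
def summarize(header: str, total_time: float, iterable) -> str:
    num_iters = len(iterable)          # 0 when the validation loader is empty
    if num_iters == 0:                 # guard that would avoid the ZeroDivisionError
        return f"{header} Total time: {total_time:.4f}s (no iterations, empty loader)"
    return f"{header} Total time: {total_time:.4f}s ({total_time / num_iters:.4f} s / it)"
```

Even with such a guard, training without any validation samples is rarely what you want, so the real fix is to supply a validation split.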

FYI: if you wrap your error log in triple backticks, it will be much more readable.
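As a quick pre-flight check, you could verify that the validation split your `data_training.yaml` points to is actually non-empty before launching `train.py`. The paths below are placeholders; substitute whatever validation image/label directories your config uses:

```python
# Hypothetical pre-flight check; substitute the validation paths from data_training.yaml.
from pathlib import Path

valid_images = Path("data/valid/images")
valid_labels = Path("data/valid/annotations")

num_images = sum(1 for p in valid_images.glob("*") if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
num_labels = sum(1 for _ in valid_labels.glob("*.xml"))

print(f"validation images: {num_images}, validation labels: {num_labels}")
if num_images == 0 or num_labels == 0:
    raise SystemExit("Validation split is empty; point the validation paths in "
                     "data_training.yaml at a non-empty directory before training.")
```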
