Build model using fasterrcnn_mobilenetv3_large_fpn #147

Open
WawanFirgiawan opened this issue Jun 24, 2024 · 1 comment


@WawanFirgiawan

I want to run my training process with the command:

```
!python train.py --data data_configs/data_training.yaml --epochs 40 --model fasterrcnn_mobilenetv3_large_fpn --project-dir fasterrcnn_mobilenetv3_large_fpn --seed 8
```

and I get an error in my program as follows:

```
2024-06-24 15:23:20.794655: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-24 15:23:20.794717: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-24 15:23:20.796062: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-24 15:23:20.803158: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-24 15:23:21.919523: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Not using distributed mode
wandb: Currently logged in as: pusatstudiaiunsulbar (pusatsudiaiusb). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.17.2
wandb: Run data is saved locally in /content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/wandb/run-20240624_152326-bw79izjd
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run expert-fire-4
wandb: ⭐️ View project at https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline
wandb: 🚀 View run at https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline/runs/bw79izjd
device cuda
Checking Labels and images...
100% 886/886 [00:00<00:00, 116878.55it/s]
Checking Labels and images...
0it [00:00, ?it/s]
Creating data loaders
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
Number of training samples: 886
Number of validation samples: 0

Building model from scratch...

Layer (type (var_name)) Input Shape Output Shape Param #

FasterRCNN (FasterRCNN) [4, 3, 640, 640] [0, 4] --
├─GeneralizedRCNNTransform (transform) [4, 3, 640, 640] [4, 3, 640, 640] --
├─BackboneWithFPN (backbone) [4, 3, 640, 640] [4, 256, 10, 10] --
│ └─IntermediateLayerGetter (body) [4, 3, 640, 640] [4, 960, 20, 20] --
│ │ └─Conv2dNormActivation (0) [4, 3, 640, 640] [4, 16, 320, 320] (432)
│ │ └─InvertedResidual (1) [4, 16, 320, 320] [4, 16, 320, 320] (400)
│ │ └─InvertedResidual (2) [4, 16, 320, 320] [4, 24, 160, 160] (3,136)
│ │ └─InvertedResidual (3) [4, 24, 160, 160] [4, 24, 160, 160] (4,104)
│ │ └─InvertedResidual (4) [4, 24, 160, 160] [4, 40, 80, 80] (9,960)
│ │ └─InvertedResidual (5) [4, 40, 80, 80] [4, 40, 80, 80] (20,432)
│ │ └─InvertedResidual (6) [4, 40, 80, 80] [4, 40, 80, 80] (20,432)
│ │ └─InvertedResidual (7) [4, 40, 80, 80] [4, 80, 40, 40] 30,960
│ │ └─InvertedResidual (8) [4, 80, 40, 40] [4, 80, 40, 40] 33,800
│ │ └─InvertedResidual (9) [4, 80, 40, 40] [4, 80, 40, 40] 31,096
│ │ └─InvertedResidual (10) [4, 80, 40, 40] [4, 80, 40, 40] 31,096
│ │ └─InvertedResidual (11) [4, 80, 40, 40] [4, 112, 40, 40] 212,280
│ │ └─InvertedResidual (12) [4, 112, 40, 40] [4, 112, 40, 40] 383,208
│ │ └─InvertedResidual (13) [4, 112, 40, 40] [4, 160, 20, 20] 426,216
│ │ └─InvertedResidual (14) [4, 160, 20, 20] [4, 160, 20, 20] 793,200
│ │ └─InvertedResidual (15) [4, 160, 20, 20] [4, 160, 20, 20] 793,200
│ │ └─Conv2dNormActivation (16) [4, 160, 20, 20] [4, 960, 20, 20] 153,600
│ └─FeaturePyramidNetwork (fpn) [4, 160, 20, 20] [4, 256, 10, 10] --
│ │ └─ModuleList (inner_blocks) -- -- (recursive)
│ │ └─ModuleList (layer_blocks) -- -- (recursive)
│ │ └─ModuleList (inner_blocks) -- -- (recursive)
│ │ └─ModuleList (layer_blocks) -- -- (recursive)
│ │ └─LastLevelMaxPool (extra_blocks) [4, 256, 20, 20] [4, 256, 20, 20] --
├─RegionProposalNetwork (rpn) [4, 3, 640, 640] [0, 4] --
│ └─RPNHead (head) [4, 256, 20, 20] [4, 15, 20, 20] --
│ │ └─Sequential (conv) [4, 256, 20, 20] [4, 256, 20, 20] 590,080
│ │ └─Conv2d (cls_logits) [4, 256, 20, 20] [4, 15, 20, 20] 3,855
│ │ └─Conv2d (bbox_pred) [4, 256, 20, 20] [4, 60, 20, 20] 15,420
│ │ └─Sequential (conv) [4, 256, 20, 20] [4, 256, 20, 20] (recursive)
│ │ └─Conv2d (cls_logits) [4, 256, 20, 20] [4, 15, 20, 20] (recursive)
│ │ └─Conv2d (bbox_pred) [4, 256, 20, 20] [4, 60, 20, 20] (recursive)
│ │ └─Sequential (conv) [4, 256, 10, 10] [4, 256, 10, 10] (recursive)
│ │ └─Conv2d (cls_logits) [4, 256, 10, 10] [4, 15, 10, 10] (recursive)
│ │ └─Conv2d (bbox_pred) [4, 256, 10, 10] [4, 60, 10, 10] (recursive)
│ └─AnchorGenerator (anchor_generator) [4, 3, 640, 640] [13500, 4] --
├─RoIHeads (roi_heads) [4, 256, 20, 20] [0, 4] --
│ └─MultiScaleRoIAlign (box_roi_pool) [4, 256, 20, 20] [0, 256, 7, 7] --
│ └─TwoMLPHead (box_head) [0, 256, 7, 7] [0, 1024] --
│ │ └─Linear (fc6) [0, 12544] [0, 1024] 12,846,080
│ │ └─Linear (fc7) [0, 1024] [0, 1024] 1,049,600
│ └─FastRCNNPredictor (box_predictor) [0, 1024] [0, 3] --
│ │ └─Linear (cls_score) [0, 1024] [0, 3] 3,075
│ │ └─Linear (bbox_pred) [0, 1024] [0, 12] 12,300

Total params: 18,935,354
Trainable params: 18,876,458
Non-trainable params: 58,896
Total mult-adds (G): 11.49

Input size (MB): 19.66
Forward/backward pass size (MB): 1172.14
Params size (MB): 75.74
Estimated Total Size (MB): 1267.54

18,935,354 total parameters.
18,876,458 training parameters.
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
Epoch: [0] [ 0/222] eta: 0:11:02 lr: 0.000006 loss: 1.8196 (1.8196) loss_classifier: 1.4352 (1.4352) loss_box_reg: 0.3557 (0.3557) loss_objectness: 0.0227 (0.0227) loss_rpn_box_reg: 0.0060 (0.0060) time: 2.9830 data: 1.9134 max mem: 704
Epoch: [0] [100/222] eta: 0:00:22 lr: 0.000458 loss: 1.2597 (1.3672) loss_classifier: 0.5182 (0.6553) loss_box_reg: 0.7019 (0.6994) loss_objectness: 0.0014 (0.0098) loss_rpn_box_reg: 0.0025 (0.0027) time: 0.1611 data: 0.0257 max mem: 811
Epoch: [0] [200/222] eta: 0:00:03 lr: 0.000910 loss: 0.8597 (1.1901) loss_classifier: 0.2865 (0.5291) loss_box_reg: 0.5280 (0.6531) loss_objectness: 0.0006 (0.0057) loss_rpn_box_reg: 0.0013 (0.0023) time: 0.1735 data: 0.0235 max mem: 811
Epoch: [0] [221/222] eta: 0:00:00 lr: 0.001000 loss: 0.8436 (1.1645) loss_classifier: 0.3099 (0.5145) loss_box_reg: 0.5193 (0.6426) loss_objectness: 0.0005 (0.0053) loss_rpn_box_reg: 0.0012 (0.0022) time: 0.1591 data: 0.0203 max mem: 811
Epoch: [0] Total time: 0:00:34 (0.1552 s / it)
creating index...
index created!
Traceback (most recent call last):
File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/train.py", line 571, in
main(args)
File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/train.py", line 423, in main
stats, val_pred_image = evaluate(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/torch_utils/engine.py", line 136, in evaluate
for images, targets in metric_logger.log_every(data_loader, 100, header):
File "/content/drive/MyDrive/Program/CupangDetection/fasterrcnn-pytorch-training-pipeline/torch_utils/utils.py", line 202, in log_every
log(f"{header} Total time: {total_time_str} ({total_time / len(iterable):.4f} s / it)")
ZeroDivisionError: float division by zero
wandb: 🚀 View run expert-fire-4 at: https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline/runs/bw79izjd
wandb: ⭐️ View project at: https://wandb.ai/pusatsudiaiusb/fasterrcnn-pytorch-training-pipeline
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240624_152326-bw79izjd/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.

@maexrakete (Contributor)

> Number of validation samples: 0

Looks like you have no validation data. The evaluation loop then runs over an empty data loader, so the per-iteration average in `torch_utils/utils.py` divides by `len(iterable) == 0`, which is exactly the `ZeroDivisionError` in your traceback.
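For context, here is a minimal sketch of the failing pattern (illustrative only; the function and variable names differ from the repository's actual `log_every` code):

```python
# Illustrative sketch only; not the repo's torch_utils/utils.py.
def summarize(header: str, total_time: float, iterable) -> str:
    num_iters = len(iterable)          # 0 when the validation loader is empty
    if num_iters == 0:                 # guard that would avoid the ZeroDivisionError
        return f"{header} Total time: {total_time:.4f}s (no iterations, empty loader)"
    return f"{header} Total time: {total_time:.4f}s ({total_time / num_iters:.4f} s / it)"
```

Even with such a guard, training without any validation samples is rarely what you want, so the real fix is to supply a validation split.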

FYI: if you wrap your error log in triple backticks, it will be much more readable.
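As a quick pre-flight check, you could verify that the validation split your `data_training.yaml` points to is actually non-empty before launching `train.py`. The paths below are placeholders; substitute whatever validation image/label directories your config uses:

```python
# Hypothetical pre-flight check; substitute the validation paths from data_training.yaml.
from pathlib import Path

valid_images = Path("data/valid/images")
valid_labels = Path("data/valid/annotations")

num_images = sum(1 for p in valid_images.glob("*") if p.suffix.lower() in {".jpg", ".jpeg", ".png"})
num_labels = sum(1 for _ in valid_labels.glob("*.xml"))

print(f"validation images: {num_images}, validation labels: {num_labels}")
if num_images == 0 or num_labels == 0:
    raise SystemExit("Validation split is empty; point the validation paths in "
                     "data_training.yaml at a non-empty directory before training.")
```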
