How can I properly resume my training? #108

JerickoDG · 2023-10-12T13:54:22Z

My initial training command was the following:

python train.py --data data_configs/custom_data.yaml --epochs 20 --model fasterrcnn_resnet50_fpn_v2 --name custom_training --batch 4 --imgsz 320

I want to train it for 20 epochs with a batch size of 4 wherein the images are 320x320.

I cancelled my training partway through the 5th epoch (4 if zero-indexed). Now I would like to resume from where I stopped, and to be able to stop and resume like this as often as needed until the 20th epoch finishes. Do you know the proper command for resuming training?

This is the command I tried:

python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --resume

It worked but stopped after trying to save the best model for epoch 5 (4 if zero-indexed).

This is the error shown on the terminal:

OSError: [WinError 1314] A required privilege is not held by the client: 'C:\Users\X\Desktop\pd_faster_rcnn_pytorch\fasterrcnn-pytorch-training-pipeline\outputs\training\custom_training\best_model.pth' -> 'C:\Users\X\Desktop\pd_faster_rcnn_pytorch\fasterrcnn-pytorch-training-pipeline\wandb\offline-run-20231012_212412-eh5ul1xy\files\outputs\training\custom_training\best_model.pth'

I hope for your response regarding my inquiry. Thank you.


sovit-123 · 2023-10-12T14:06:53Z

Hello. It seems that you do not have write access to the disk. Can you try running the code in a terminal as Administrator?
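For context, WinError 1314 is the Windows code for a missing privilege. It typically appears when a process tries to create a symbolic link without the symlink privilege, which is one reason an elevated terminal can help. The "source -> destination" form of the message above suggests a link-style operation rather than a plain write, though that is an inference from the message and not confirmed in this thread. A minimal sketch of a fallback, where `link_or_copy` is a hypothetical helper and not part of this pipeline or of wandb:

```python
import os
import shutil


def link_or_copy(src: str, dst: str) -> str:
    """Try to symlink src to dst; fall back to copying if linking is not permitted.

    Hypothetical helper for illustration only. Returns "symlink" or "copy".
    """
    os.makedirs(os.path.dirname(os.path.abspath(dst)), exist_ok=True)
    # Remove a stale destination (file or dangling link) so both paths start clean.
    if os.path.lexists(dst):
        os.remove(dst)
    try:
        os.symlink(os.path.abspath(src), dst)
        return "symlink"
    except OSError:
        # On Windows without the symlink privilege, os.symlink raises OSError
        # (WinError 1314); copying needs only ordinary write access.
        shutil.copy2(src, dst)
        return "copy"
```

Either branch leaves a readable file at `dst`; the copy branch simply costs extra disk space.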

JerickoDG · 2023-10-12T14:24:31Z

Hi. Thanks for replying to my question. I tried adding the --epochs parameter and setting it to 20, resulting in this command:

python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --epochs 20 --resume

It ran past the 6th epoch (5 if zero-indexed) and is currently at the 7th epoch (6 if zero-indexed).

I'll keep observing the training and will try your suggestion if the problem appears again. I'll post an update if anything changes. Thank you again.

sovit-123 · 2023-10-12T14:27:17Z

Got it. I had not noticed that you did not pass the epochs argument in the previous command. Glad that it is solved.
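One way to see why the --epochs value matters when resuming is sketched below. It assumes the checkpoint records the last completed epoch and that --epochs is the total target rather than the number of additional epochs. Both are assumptions about how a resume loop is commonly written, not a description of this repository's train.py. The default of 5 used in the example is also hypothetical.

```python
def epochs_to_run(last_completed_epoch: int, total_epochs: int) -> range:
    """Return the zero-indexed epochs still to run after resuming.

    Illustrative only: assumes total_epochs is the overall target,
    not the number of extra epochs to add.
    """
    start_epoch = last_completed_epoch + 1
    return range(start_epoch, total_epochs)


# Training was cancelled during epoch 4 (zero-indexed), so epoch 3 was the
# last one fully completed and saved.
with_default = epochs_to_run(last_completed_epoch=3, total_epochs=5)   # hypothetical default
with_target = epochs_to_run(last_completed_epoch=3, total_epochs=20)   # --epochs 20

print(list(with_default))  # a single remaining epoch
print(len(with_target))    # the rest of the 20-epoch schedule
```

Under these assumptions, resuming without --epochs would run only one more epoch, while passing --epochs 20 again continues through the original schedule.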

JerickoDG changed the title ~~How can I resume my training?~~ How can I properly resume my training? Oct 12, 2023
