How can I properly resume my training? #108

Open
JerickoDG opened this issue Oct 12, 2023 · 3 comments
Comments


JerickoDG commented Oct 12, 2023

My initial training command was the following:

python train.py --data data_configs/custom_data.yaml --epochs 20 --model fasterrcnn_resnet50_fpn_v2 --name custom_training --batch 4 --imgsz 320

I want to train for 20 epochs with a batch size of 4 on 320x320 images.

I cancelled the training partway through the 5th epoch (4 if zero-indexed). Now I would like to resume from where I stopped, and to be able to do so repeatedly until the 20th epoch finishes. What is the proper command for resuming training?

This is the command I tried:

python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --resume

It ran, but stopped when it tried to save the best model for epoch 5 (4 if zero-indexed).

This is the error shown on the terminal:

OSError: [WinError 1314] A required privilege is not held by the client: 'C:\Users\X\Desktop\pd_faster_rcnn_pytorch\fasterrcnn-pytorch-training-pipeline\outputs\training\custom_training\best_model.pth' -> 'C:\Users\X\Desktop\pd_faster_rcnn_pytorch\fasterrcnn-pytorch-training-pipeline\wandb\offline-run-20231012_212412-eh5ul1xy\files\outputs\training\custom_training\best_model.pth'

I hope for your response regarding my inquiry. Thank you.

JerickoDG changed the title from "How can I resume my training?" to "How can I properly resume my training?" on Oct 12, 2023
@sovit-123
Owner

Hello. It seems that you do not have write access to the disk. Can you try running the code in a terminal as Administrator?
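
For reference, WinError 1314 is the Windows code for a missing privilege, and the `'source' -> 'destination'` form of the message is how Python reports a failed two-path operation such as `os.symlink`. Assuming the failure comes from creating a symbolic link into the wandb run directory (an assumption, since the full traceback is not shown), the sketch below checks whether the current process is allowed to create symlinks at all. The helper name `can_create_symlinks` is hypothetical and not part of the training pipeline.

```python
import os
import tempfile


def can_create_symlinks() -> bool:
    """Try to create a symlink in a temporary directory and report success.

    Hypothetical diagnostic helper; not part of the training pipeline.
    """
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        # Create a small file for the link to point at.
        with open(target, "w") as f:
            f.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # On Windows this is where a missing privilege would surface.
            return False
        return os.path.islink(link)


if __name__ == "__main__":
    print("symlinks allowed:", can_create_symlinks())
```

If this reports that symlinks are not allowed in a normal terminal but are allowed in an Administrator terminal, the missing privilege is a likely cause of the error above.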

@JerickoDG
Author

Hi. Thanks for replying to my question. I tried adding the --epochs parameter and setting it to 20, resulting in this command:

python train.py --model fasterrcnn_resnet50_fpn_v2 --weights outputs\training\custom_training\last_model.pth --data data_configs\custom_data.yaml --imgsz 320 --name custom_training --epochs 20 --resume

It ran past the 6th epoch (5 if zero-indexed) and is currently at the 7th epoch (6 if zero-indexed).

I'll keep monitoring the training and will try your suggestion if the problem appears again. I'll post an update if anything changes. Thank you again.

@sovit-123
Owner

Got it. I had not noticed that you did not pass the --epochs argument in the previous command. Glad that it is solved.
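
The difference between the two resume commands can be pictured as an epoch range. This is a minimal sketch, not the pipeline's actual code: it assumes the checkpoint records how many epochs have completed and that the training loop runs from that point up to the value of --epochs. The names `remaining_epochs`, `completed_epochs`, and `total_epochs` are hypothetical.

```python
def remaining_epochs(completed_epochs: int, total_epochs: int) -> range:
    """Return the zero-indexed epochs still to run after a resume.

    Hypothetical helper for illustration only.
    """
    if completed_epochs < 0 or total_epochs < 0:
        raise ValueError("epoch counts must be non-negative")
    # An empty range means there is nothing left to train.
    return range(completed_epochs, total_epochs)


# Four epochs finished before the interruption, with a target of 20 epochs.
todo = remaining_epochs(completed_epochs=4, total_epochs=20)
print(len(todo), todo[0], todo[-1])  # 16 4 19
```

Under that assumption, leaving out --epochs on resume would bound the loop by whatever default the script defines rather than by the original 20, which is why passing --epochs 20 again matters.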
