This repository has been archived by the owner on Jan 26, 2022. It is now read-only.

Update subprocess.py #102

Open
wants to merge 1 commit into
base: master


Constannnnnt commented Jul 6, 2018

Description: I trained the network on 3 GPUs and interrupted training before the final step, so only the checkpoint was saved, not the config. When I then tried to test the model, it failed unexpectedly with the error list index out of range at start = subinds[i][0].

Issue: At line 64, I think gpu_inds = range(NUM_GPUS) is more reasonable than gpu_inds = range(cfg.NUM_GPUS). Let me explain.

After subprocess.py imports the yaml and config file, cfg.NUM_GPUS is 8 instead of 3. (In train_net_step there is a statement that assigns cfg.NUM_GPUS = torch.cuda.device_count(), which is why training does not crash.) In my case NUM_GPUS = torch.cuda.device_count() = 3, so at line 56 the size of subinds is 3.

I let CUDA see all my GPUs. Later, at line 64, if gpu_inds = range(cfg.NUM_GPUS) is used, the size of gpu_inds is 8, and the loop then crashes at line 68 once the index goes past the 3 entries of subinds. Therefore, gpu_inds = range(NUM_GPUS) is more reasonable at line 64.
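
The mismatch can be reproduced in a small standalone sketch. This is not the code from subprocess.py: split_range and assign_ranges are hypothetical helpers, VISIBLE_GPUS stands in for torch.cuda.device_count(), CFG_NUM_GPUS stands in for the stale cfg.NUM_GPUS, and the total of 100 items is arbitrary. It only assumes that the work is split into one chunk per visible GPU and that the loop indexes those chunks by position in gpu_inds.

```python
def split_range(total, num_chunks):
    """Split range(total) into num_chunks nearly equal contiguous chunks."""
    base, extra = divmod(total, num_chunks)
    chunks, start = [], 0
    for k in range(num_chunks):
        size = base + (1 if k < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def assign_ranges(total, visible_gpus, gpu_inds):
    """Return one (gpu_ind, start, end) tuple per entry of gpu_inds."""
    # One chunk per visible GPU (assumed to mirror the split at line 56).
    subinds = split_range(total, visible_gpus)
    jobs = []
    for i, gpu_ind in enumerate(gpu_inds):
        # Raises IndexError as soon as i >= len(subinds).
        start = subinds[i][0]
        end = subinds[i][-1] + 1
        jobs.append((gpu_ind, start, end))
    return jobs


VISIBLE_GPUS = 3  # assumption: what torch.cuda.device_count() returned
CFG_NUM_GPUS = 8  # assumption: stale value loaded into cfg.NUM_GPUS

# Using the config value: 8 GPU indices but only 3 chunks.
try:
    assign_ranges(100, VISIBLE_GPUS, range(CFG_NUM_GPUS))
    failed = False
except IndexError:
    failed = True

# Using the device count: 3 GPU indices for 3 chunks.
fixed_jobs = assign_ranges(100, VISIBLE_GPUS, range(VISIBLE_GPUS))
print(failed, fixed_jobs)
```

With both lengths derived from the same device count, the loop can never index past the end of subinds, which is the point of the proposed change.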

Please check whether my solution is correct. Thanks.


ternaus commented Sep 30, 2018

👍
