Description: I use 3 GPUs to train the network and interrupt training at some point before the final step, which means only the checkpoint is saved, not the config. Then, when I try to test the model, it unexpectedly fails, and the error message is
start = subinds[i][0], list index out of range
Issue: I think that at line 64, instead of writing gpu_inds = range(cfg.NUM_GPUS), it would be much more reasonable to write gpu_inds = range(NUM_GPUS). Let me explain. After the yaml and config file are imported in subprocess.py, cfg.NUM_GPUS is 8 instead of 3 (in train_net_step there is a statement that assigns cfg.NUM_GPUS = torch.cuda.device_count(), which is why training itself does not crash), while the local NUM_GPUS = torch.cuda.device_count() is 3 in my case, so at line 56 the size of subinds is 3.
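Below is a minimal, self-contained sketch of that size mismatch; the numbers are hypothetical (5000 stands in for the actual test range size, and NUM_GPUS is hard-coded to the value torch.cuda.device_count() returns on my machine):

```python
import numpy as np

# Hypothetical numbers, only to illustrate the mismatch.
total_range_size = 5000   # stands in for the actual test range size
NUM_GPUS = 3              # what torch.cuda.device_count() returns on my machine
cfg_NUM_GPUS = 8          # the stale cfg.NUM_GPUS value loaded from the config file

# Line 56: the range is split into NUM_GPUS (= 3) chunks ...
subinds = np.array_split(range(total_range_size), NUM_GPUS)
print(len(subinds))       # 3

# ... but line 64 builds gpu_inds from cfg.NUM_GPUS (= 8), so line 68
# eventually indexes subinds[i] with i = 3..7:
for i in range(cfg_NUM_GPUS):
    start = subinds[i][0] # IndexError: list index out of range once i >= 3
```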
is 3.I choose to let cuda see all my GPUs, Later, at line 64, if
gpu_inds = range(cfg.NUM_GPUS)
is used, the size ofgpu_indx
is 8, which then will crash at line 68. Therefore, at line 64,gpus_inds = range(NUM_GPUs)
is much more reasonable.Please check and see if my solution is correct or not. Thanks.
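For reference, here is a sketch of the change I am proposing. The surrounding lines are paraphrased and simplified from subprocess.py, not copied verbatim, and the total range size is again hypothetical:

```python
import numpy as np

# Paraphrased and simplified from subprocess.py; 5000 again stands in
# for the actual test range size.
total_range_size = 5000
NUM_GPUS = 3              # torch.cuda.device_count() on my machine
subinds = np.array_split(range(total_range_size), NUM_GPUS)   # line 56

# Line 64 (proposed): size gpu_inds by the local NUM_GPUS rather than
# cfg.NUM_GPUS, so it always matches len(subinds).
gpu_inds = range(NUM_GPUS)        # instead of gpu_inds = range(cfg.NUM_GPUS)

for i, gpu_ind in enumerate(gpu_inds):
    start = subinds[i][0]         # line 68: i never exceeds len(subinds) - 1
    end = subinds[i][-1] + 1
    print('GPU %d handles range [%d, %d)' % (gpu_ind, start, end))
```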