Train Script #12

Open
seTalent opened this issue Sep 25, 2024 · 0 comments
Hello, I am trying to train the model with the following script:
CUDA_VISIBLE_DEVICES=0,2 python -m torch.distributed.launch \
    --nproc_per_node=2 main_deit.py \
    --model cf_deit_small \
    --batch-size 16 \
    --data-path ImageNet/ \
    --coarse-stage-size 9 \
    --dist-eval \
    --output train_log
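(For reference, the deprecation warning in the log below recommends torchrun. An equivalent invocation might look like the following untested sketch, assuming the same arguments; note that torchrun does not pass --local-rank to the script, it exports the LOCAL_RANK environment variable instead.)

```shell
# Hypothetical torchrun equivalent of the launch command above.
# torchrun sets LOCAL_RANK in each worker's environment rather than
# appending a --local-rank argument.
CUDA_VISIBLE_DEVICES=0,2 torchrun --nproc_per_node=2 main_deit.py \
    --model cf_deit_small \
    --batch-size 16 \
    --data-path ImageNet/ \
    --coarse-stage-size 9 \
    --dist-eval \
    --output train_log
```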

(cf-vit) zky_1@4090-03:~/codes/CF-ViT$ bash train.bash
/data/zky_1/.local/lib/python3.9/site-packages/torch/distributed/launch.py:208: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
The error output is as follows:
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779]
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779] *****************************************
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0925 03:19:14.610449 139999741781824 torch/distributed/run.py:779] *****************************************
usage: DeiT training and evaluation script [-h] [--batch-size BATCH_SIZE] [--epochs EPOCHS] [--model MODEL] [--coarse-stage-size COARSE_STAGE_SIZE] [--drop PCT] [--drop-path PCT] [--model-ema] [--no-model-ema] [--model-ema-decay MODEL_EMA_DECAY] [--model-ema-force-cpu] [--opt OPTIMIZER]
[--opt-eps EPSILON] [--opt-betas BETA [BETA ...]] [--clip-grad NORM] [--momentum M] [--weight-decay WEIGHT_DECAY] [--sched SCHEDULER] [--lr LR] [--lr-noise pct, pct [pct, pct ...]] [--lr-noise-pct PERCENT] [--lr-noise-std STDDEV] [--warmup-lr LR]
[--min-lr LR] [--decay-epochs N] [--warmup-epochs N] [--cooldown-epochs N] [--patience-epochs N] [--decay-rate RATE] [--color-jitter PCT] [--aa NAME] [--smoothing SMOOTHING] [--train-interpolation TRAIN_INTERPOLATION] [--repeated-aug]
[--no-repeated-aug] [--reprob PCT] [--remode REMODE] [--recount RECOUNT] [--resplit] [--mixup MIXUP] [--cutmix CUTMIX] [--cutmix-minmax CUTMIX_MINMAX [CUTMIX_MINMAX ...]] [--mixup-prob MIXUP_PROB] [--mixup-switch-prob MIXUP_SWITCH_PROB]
[--mixup-mode MIXUP_MODE] [--teacher-model MODEL] [--teacher-path TEACHER_PATH] [--distillation-type {none,soft,hard}] [--distillation-alpha DISTILLATION_ALPHA] [--distillation-tau DISTILLATION_TAU] [--finetune FINETUNE] [--data-path DATA_PATH]
[--data-set {CIFAR,IMNET,INAT,INAT19,IMNET10,IMNET100}] [--inat-category {kingdom,phylum,class,order,supercategory,family,genus,name}] [--output_dir OUTPUT_DIR] [--device DEVICE] [--seed SEED] [--resume RESUME] [--start_epoch N] [--eval] [--dist-eval]
[--num_workers NUM_WORKERS] [--pin-mem] [--no-pin-mem] [--world_size WORLD_SIZE] [--dist_url DIST_URL]
DeiT training and evaluation script: error: unrecognized arguments: --local-rank=0
[the same usage message is printed again verbatim by the second worker]
DeiT training and evaluation script: error: unrecognized arguments: --local-rank=1
W0925 03:19:27.336198 139999741781824 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1807982 closing signal SIGTERM
E0925 03:19:27.400585 139999741781824 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 1 (pid: 1807983) of binary: /data/anaconda3/envs/cf-vit/bin/python
Traceback (most recent call last):
File "/data/anaconda3/envs/cf-vit/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/anaconda3/envs/cf-vit/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 208, in <module>
main()
File "/data/.local/lib/python3.9/site-packages/typing_extensions.py", line 2853, in wrapper
return arg(*args, **kwargs)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 204, in main
launch(args)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in launch
run(args)
File "/data/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_deit.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-09-25_03:19:27
host : 4090-03
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 1807983)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I then worked around this by adding one line to your argument parser (parser.add_argument("--local-rank")):
[screenshot: the added parser.add_argument line]
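A sketch of that workaround (hypothetical, since the real parser lives in main_deit.py): accept the --local-rank argument that torch.distributed.launch appends to each worker's command line, and fall back to the LOCAL_RANK environment variable so the same script also works under torchrun.

```python
import argparse
import os

# Hypothetical stand-in for the parser in main_deit.py; the fix is
# only the added --local-rank argument.
parser = argparse.ArgumentParser("DeiT training and evaluation script")
parser.add_argument("--local-rank", "--local_rank", type=int, default=0,
                    dest="local_rank",
                    help="passed automatically by torch.distributed.launch")

# torch.distributed.launch invokes each worker as: script --local-rank=<N>
args = parser.parse_args(["--local-rank=1"])
print(args.local_rank)  # → 1

# torchrun does not pass --local-rank at all; it exports LOCAL_RANK,
# so reading the environment covers both launchers.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
```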
However, I then ran into a new problem, which roughly says that some parameters did not participate in the computation during training:
[screenshot: the new error message about unused parameters]
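(That message is typically PyTorch's DistributedDataParallel complaining that some parameters received no gradient during backward. Assuming main_deit.py wraps the model with torch.nn.parallel.DistributedDataParallel, the usual remedy is to pass find_unused_parameters=True at wrap time; a minimal single-process, CPU-only sketch:)

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process process group (gloo backend, CPU) just to
# demonstrate the flag; in the real script this is done by the launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 4)

# find_unused_parameters=True lets DDP tolerate parameters that do not
# contribute to the loss in a given iteration, at some per-step cost.
ddp_model = DDP(model, find_unused_parameters=True)

out = ddp_model(torch.randn(2, 4))
out.sum().backward()
dist.destroy_process_group()
```

Note this only masks the symptom: if a parameter is *never* used, it is usually better to find out why (e.g. a branch of the model skipped in the coarse/fine forward) than to leave the flag on permanently.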

I hope you can help with this, thank you.
