
Stuck once in a while during training #1312

Open
zhuohaoyu opened this issue May 1, 2021 · 1 comment

@zhuohaoyu

Describe the bug

I'm fine-tuning some pretrained models on the task ReCoRD.

During training, the progress bar shows that the process gets stuck at regular intervals.
By "stuck" I mean that training pauses for a while, still occupying GPU memory, but with GPU utilization at 0%.

For example, when fine-tuning roberta-base on record, training pauses for a few minutes every 313 iterations.
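
One way to narrow down where the pause happens is to time each batch fetch separately from the GPU work. A minimal sketch (the timed_batches helper below is hypothetical, not part of jiant):

    import time

    def timed_batches(dataloader, slow_threshold_s=5.0):
        """Yield batches from a dataloader, flagging unusually slow fetches.

        If the multi-minute pause shows up here as a fetch-time spike, the
        stall is in data loading rather than in the forward/backward pass.
        """
        it = iter(dataloader)
        step = 0
        while True:
            start = time.time()
            try:
                batch = next(it)
            except StopIteration:
                break
            fetch_s = time.time() - start
            if fetch_s > slow_threshold_s:
                print(f"step {step}: batch fetch took {fetch_s:.1f}s", flush=True)
            step += 1
            yield batch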

To Reproduce

  1. Tell us which version of jiant you're using
    The current master branch of jiant.

  2. Describe the environment where you're using jiant, e.g., "2 P40 GPUs"
    1 RTX 3090 GPU; the problem also occurred when I used 2 or more RTX 3090 GPUs.

  3. Provide the experiment config artifact (e.g., defaults.conf)
    Script used to train:

    export MODEL_NAME=roberta-base
    export TASK_NAME=record   # the single task trained below; used in the log path
    # EXP_DIR is assumed to be set in the environment
    nohup python jiant/proj/simple/runscript.py \
        run_with_continue \
        --run_name simple \
        --exp_dir ${EXP_DIR}/ \
        --data_dir ${EXP_DIR}/tasks \
        --hf_pretrained_model_name_or_path $MODEL_NAME \
        --tasks record \
        --train_batch_size 28 \
        --num_train_epochs 5 \
        --max_seq_length 512 \
        --learning_rate 1e-5 \
        --seed 2021 \
        --eval_every_steps 10000 \
        --save_checkpoint_every_steps 5000 \
        --do_save_best \
        --write_test_preds \
        --fp16 \
        > ./nohup/$TASK_NAME-$MODEL_NAME.log 2>&1 &
    
    
    Generated config file:
    
    {
        "task_config_path_dict": {
            "record": "/my-directory/tasks/configs/record_config.json"
        },
        "task_cache_config_dict": {
            "record": {
                "train": "/my-directory/cache/roberta/record/train",
                "val": "/my-directory/cache/roberta/record/val",
                "val_labels": "/my-directory/cache/roberta/record/val_labels",
                "test": "/my-directory/cache/roberta/record/test"
            }
        },
        "sampler_config": {
            "sampler_type": "ProportionalMultiTaskSampler"
        },
        "global_train_config": {
            "max_steps": 320870,
            "warmup_steps": 32087
        },
        "task_specific_configs_dict": {
            "record": {
                "train_batch_size": 28,
                "eval_batch_size": 56,
                "gradient_accumulation_steps": 1,
                "eval_subset_num": 500
            }
        },
        "taskmodels_config": {
            "task_to_taskmodel_map": {
                "record": "record"
            },
            "taskmodel_config_map": {
                "record": null
            }
        },
        "task_run_config": {
            "train_task_list": ["record"],
            "train_val_task_list": ["record"],
            "val_task_list": ["record"],
            "test_task_list": ["record"]
        },
        "metric_aggregator_config": {
            "metric_aggregator_type": "EqualMetricAggregator"
        }
    }
    

    The corresponding log file is shown below. In this run, training stalled exactly every 313 iterations.


(omitted)

Training:   0%|          | 310/280760 [07:04<37:22:41,  2.08it/s]
Training:   0%|          | 311/280760 [07:04<36:11:52,  2.15it/s]
Training:   0%|          | 312/280760 [07:05<35:27:45,  2.20it/s]
Training:   0%|          | 313/280760 [11:55<6809:51:21, 87.42s/it]
Training:   0%|          | 314/280760 [11:55<4777:00:16, 61.32s/it]
Training:   0%|          | 315/280760 [11:56<3354:14:50, 43.06s/it]
Training:   0%|          | 316/280760 [11:56<2358:11:53, 30.27s/it]
Training:   0%|          | 317/280760 [11:57<1660:46:41, 21.32s/it]
Training:   0%|          | 318/280760 [11:57<1172:59:53, 15.06s/it]
Training:   0%|          | 319/280760 [11:58<831:11:00, 10.67s/it] 
Training:   0%|          | 320/280760 [11:58<592:04:20,  7.60s/it]
Training:   0%|          | 321/280760 [11:58<424:25:38,  5.45s/it]
Training:   0%|          | 322/280760 [11:59<307:35:48,  3.95s/it]
Training:   0%|          | 323/280760 [11:59<225:18:09,  2.89s/it]
Training:   0%|          | 324/280760 [12:00<168:16:41,  2.16s/it]
Training:   0%|          | 325/280760 [12:00<127:47:12,  1.64s/it]
Training:   0%|          | 326/280760 [12:01<100:03:22,  1.28s/it]
Training:   0%|          | 327/280760 [12:01<80:19:50,  1.03s/it] 
Training:   0%|          | 328/280760 [12:02<66:49:48,  1.17it/s]
Training:   0%|          | 329/280760 [12:02<57:22:14,  1.36it/s]
Training:   0%|          | 330/280760 [12:02<50:46:43,  1.53it/s]
Training:   0%|          | 331/280760 [12:03<46:09:03,  1.69it/s]
Training:   0%|          | 332/280760 [12:03<42:54:50,  1.82it/s]

(omitted)

Training:   0%|          | 611/280760 [14:27<36:43:06,  2.12it/s]
Training:   0%|          | 612/280760 [14:27<35:43:06,  2.18it/s]
Training:   0%|          | 613/280760 [14:28<35:06:53,  2.22it/s]
Training:   0%|          | 614/280760 [14:28<34:35:18,  2.25it/s]
Training:   0%|          | 615/280760 [14:29<34:22:15,  2.26it/s]
Training:   0%|          | 616/280760 [14:29<34:05:47,  2.28it/s]
Training:   0%|          | 617/280760 [14:30<33:53:34,  2.30it/s]
Training:   0%|          | 618/280760 [14:30<33:57:44,  2.29it/s]
Training:   0%|          | 619/280760 [14:30<33:49:07,  2.30it/s]
Training:   0%|          | 620/280760 [14:31<33:44:00,  2.31it/s]
Training:   0%|          | 621/280760 [14:31<33:38:29,  2.31it/s]
Training:   0%|          | 622/280760 [14:32<33:37:11,  2.31it/s]
Training:   0%|          | 623/280760 [14:32<33:43:37,  2.31it/s]
Training:   0%|          | 624/280760 [14:33<33:42:49,  2.31it/s]
Training:   0%|          | 625/280760 [14:33<33:38:34,  2.31it/s]
Training:   0%|          | 626/280760 [19:40<7181:05:11, 92.28s/it]
Training:   0%|          | 627/280760 [19:40<5036:47:44, 64.73s/it]
Training:   0%|          | 628/280760 [19:40<3535:46:17, 45.44s/it]
Training:   0%|          | 629/280760 [19:41<2484:54:44, 31.93s/it]
Training:   0%|          | 630/280760 [19:41<1749:19:43, 22.48s/it]
Training:   0%|          | 631/280760 [19:42<1234:26:03, 15.86s/it]
Training:   0%|          | 632/280760 [19:42<873:58:52, 11.23s/it] 
Training:   0%|          | 633/280760 [19:43<621:49:39,  7.99s/it]
Training:   0%|          | 634/280760 [19:43<445:15:46,  5.72s/it]
Training:   0%|          | 635/280760 [19:43<321:35:22,  4.13s/it]
Training:   0%|          | 636/280760 [19:44<234:59:42,  3.02s/it]
Training:   0%|          | 637/280760 [19:44<174:25:04,  2.24s/it]
Training:   0%|          | 638/280760 [19:45<131:59:24,  1.70s/it]
Training:   0%|          | 639/280760 [19:45<102:18:44,  1.31s/it]
Training:   0%|          | 640/280760 [19:46<81:30:45,  1.05s/it] 
Training:   0%|          | 641/280760 [19:46<66:59:51,  1.16it/s]
Training:   0%|          | 642/280760 [19:46<56:46:48,  1.37it/s]
Training:   0%|          | 643/280760 [19:47<49:45:31,  1.56it/s]
Training:   0%|          | 644/280760 [19:47<44:43:03,  1.74it/s]
Training:   0%|          | 645/280760 [19:48<41:12:42,  1.89it/s]
Training:   0%|          | 646/280760 [19:48<38:47:01,  2.01it/s]
Training:   0%|          | 647/280760 [19:49<37:11:19,  2.09it/s]
Training:   0%|          | 648/280760 [19:49<35:59:04,  2.16it/s]
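
A back-of-the-envelope check on the stall period (the chunk size below is an assumption about jiant's chunked example cache, worth verifying against the chunk files under the cache directory; note also that this log's progress-bar total of 280760 steps differs from the config's max_steps of 320870, so the log may come from a run with a slightly different batch size):

    import math

    # Assumption: cached examples are stored in fixed-size chunk files of
    # roughly 10,000 examples each, and crossing a chunk boundary forces a
    # read from disk.
    chunk_size = 10_000
    for batch_size in (28, 32):
        steps_per_chunk = math.ceil(chunk_size / batch_size)
        print(f"batch_size={batch_size}: chunk boundary every ~{steps_per_chunk} steps")
    # batch_size=28: chunk boundary every ~358 steps
    # batch_size=32: chunk boundary every ~313 steps  (matches the stalls at 313 and 626)

If the stalls really do line up with chunk boundaries, faster storage for the cache or more aggressive DataLoader prefetching would be the first things to try.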

Expected behavior

Training without interruption, or at least some indication of why it is stuck.
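
For the "show the reason" part, one option that needs only the standard library: register faulthandler at the top of the run script so Python periodically dumps every thread's stack; a dump that lands inside a stall shows exactly which call is blocking. (Alternatively, py-spy dump --pid <PID> attaches to the stuck process without any code change.) A minimal sketch:

    import faulthandler
    import sys

    # Dump the traceback of all threads to stderr every 120 seconds until
    # cancelled; during a stall the dump reveals the blocking call (e.g., a
    # disk read inside a dataloader worker).
    faulthandler.dump_traceback_later(timeout=120, repeat=True, file=sys.stderr)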

Screenshots

The only output I got was the log file shown above.


@Mollylulu

same problem.
