
Stuck once in a while during training #1312

Open
zhuohaoyu opened this issue May 1, 2021 · 1 comment

@zhuohaoyu

Describe the bug

I'm fine-tuning some pretrained models on the task ReCoRD.

During training, the progress bar shows that the process gets stuck at regular intervals.
By "stuck" I mean that training pauses for a while, still occupying GPU memory, but with GPU utilization at 0%.

For example, when fine-tuning roberta-base on record, training pauses for a few minutes every 313 iterations.
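
One way to narrow down where the pause happens is to time each batch fetch separately from the GPU work. A minimal sketch (the timed_batches helper below is hypothetical, not part of jiant):

    import time

    def timed_batches(dataloader, slow_threshold_s=5.0):
        """Yield batches from a dataloader, flagging unusually slow fetches.

        If the multi-minute pause shows up here as a fetch-time spike, the
        stall is in data loading rather than in the forward/backward pass.
        """
        it = iter(dataloader)
        step = 0
        while True:
            start = time.time()
            try:
                batch = next(it)
            except StopIteration:
                break
            fetch_s = time.time() - start
            if fetch_s > slow_threshold_s:
                print(f"step {step}: batch fetch took {fetch_s:.1f}s", flush=True)
            step += 1
            yield batch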

To Reproduce

  1. Tell us which version of jiant you're using
    The current master branch of jiant.

  2. Describe the environment where you're using jiant, e.g., "2 P40 GPUs"
    1 RTX 3090 GPU; the problem also occurred when I used 2 or more RTX 3090 GPUs.

  3. Provide the experiment config artifact (e.g., defaults.conf)
    Script used to train:

    export MODEL_NAME=roberta-base
    export TASK_NAME=record   # the single task trained below; used in the log path
    # EXP_DIR is assumed to be set in the environment
    nohup python jiant/proj/simple/runscript.py \
        run_with_continue \
        --run_name simple \
        --exp_dir ${EXP_DIR}/ \
        --data_dir ${EXP_DIR}/tasks \
        --hf_pretrained_model_name_or_path $MODEL_NAME \
        --tasks record \
        --train_batch_size 28 \
        --num_train_epochs 5 \
        --max_seq_length 512 \
        --learning_rate 1e-5 \
        --seed 2021 \
        --eval_every_steps 10000 \
        --save_checkpoint_every_steps 5000 \
        --do_save_best \
        --write_test_preds \
        --fp16 \
        > ./nohup/$TASK_NAME-$MODEL_NAME.log 2>&1 &
    
    
    Generated config file:
    
    {
        "task_config_path_dict": {
            "record": "/my-directory/tasks/configs/record_config.json"
        },
        "task_cache_config_dict": {
            "record": {
                "train": "/my-directory/cache/roberta/record/train",
                "val": "/my-directory/cache/roberta/record/val",
                "val_labels": "/my-directory/cache/roberta/record/val_labels",
                "test": "/my-directory/cache/roberta/record/test"
            }
        },
        "sampler_config": {
            "sampler_type": "ProportionalMultiTaskSampler"
        },
        "global_train_config": {
            "max_steps": 320870,
            "warmup_steps": 32087
        },
        "task_specific_configs_dict": {
            "record": {
                "train_batch_size": 28,
                "eval_batch_size": 56,
                "gradient_accumulation_steps": 1,
                "eval_subset_num": 500
            }
        },
        "taskmodels_config": {
            "task_to_taskmodel_map": {
                "record": "record"
            },
            "taskmodel_config_map": {
                "record": null
            }
        },
        "task_run_config": {
            "train_task_list": ["record"],
            "train_val_task_list": ["record"],
            "val_task_list": ["record"],
            "test_task_list": ["record"]
        },
        "metric_aggregator_config": {
            "metric_aggregator_type": "EqualMetricAggregator"
        }
    }
    

    The corresponding log file is shown below. In this run, training stalled exactly every 313 iterations.


(omitted)

Training:   0%|          | 310/280760 [07:04<37:22:41,  2.08it/s]
Training:   0%|          | 311/280760 [07:04<36:11:52,  2.15it/s]
Training:   0%|          | 312/280760 [07:05<35:27:45,  2.20it/s]
Training:   0%|          | 313/280760 [11:55<6809:51:21, 87.42s/it]
Training:   0%|          | 314/280760 [11:55<4777:00:16, 61.32s/it]
Training:   0%|          | 315/280760 [11:56<3354:14:50, 43.06s/it]
Training:   0%|          | 316/280760 [11:56<2358:11:53, 30.27s/it]
Training:   0%|          | 317/280760 [11:57<1660:46:41, 21.32s/it]
Training:   0%|          | 318/280760 [11:57<1172:59:53, 15.06s/it]
Training:   0%|          | 319/280760 [11:58<831:11:00, 10.67s/it] 
Training:   0%|          | 320/280760 [11:58<592:04:20,  7.60s/it]
Training:   0%|          | 321/280760 [11:58<424:25:38,  5.45s/it]
Training:   0%|          | 322/280760 [11:59<307:35:48,  3.95s/it]
Training:   0%|          | 323/280760 [11:59<225:18:09,  2.89s/it]
Training:   0%|          | 324/280760 [12:00<168:16:41,  2.16s/it]
Training:   0%|          | 325/280760 [12:00<127:47:12,  1.64s/it]
Training:   0%|          | 326/280760 [12:01<100:03:22,  1.28s/it]
Training:   0%|          | 327/280760 [12:01<80:19:50,  1.03s/it] 
Training:   0%|          | 328/280760 [12:02<66:49:48,  1.17it/s]
Training:   0%|          | 329/280760 [12:02<57:22:14,  1.36it/s]
Training:   0%|          | 330/280760 [12:02<50:46:43,  1.53it/s]
Training:   0%|          | 331/280760 [12:03<46:09:03,  1.69it/s]
Training:   0%|          | 332/280760 [12:03<42:54:50,  1.82it/s]

(omitted)

Training:   0%|          | 611/280760 [14:27<36:43:06,  2.12it/s]
Training:   0%|          | 612/280760 [14:27<35:43:06,  2.18it/s]
Training:   0%|          | 613/280760 [14:28<35:06:53,  2.22it/s]
Training:   0%|          | 614/280760 [14:28<34:35:18,  2.25it/s]
Training:   0%|          | 615/280760 [14:29<34:22:15,  2.26it/s]
Training:   0%|          | 616/280760 [14:29<34:05:47,  2.28it/s]
Training:   0%|          | 617/280760 [14:30<33:53:34,  2.30it/s]
Training:   0%|          | 618/280760 [14:30<33:57:44,  2.29it/s]
Training:   0%|          | 619/280760 [14:30<33:49:07,  2.30it/s]
Training:   0%|          | 620/280760 [14:31<33:44:00,  2.31it/s]
Training:   0%|          | 621/280760 [14:31<33:38:29,  2.31it/s]
Training:   0%|          | 622/280760 [14:32<33:37:11,  2.31it/s]
Training:   0%|          | 623/280760 [14:32<33:43:37,  2.31it/s]
Training:   0%|          | 624/280760 [14:33<33:42:49,  2.31it/s]
Training:   0%|          | 625/280760 [14:33<33:38:34,  2.31it/s]
Training:   0%|          | 626/280760 [19:40<7181:05:11, 92.28s/it]
Training:   0%|          | 627/280760 [19:40<5036:47:44, 64.73s/it]
Training:   0%|          | 628/280760 [19:40<3535:46:17, 45.44s/it]
Training:   0%|          | 629/280760 [19:41<2484:54:44, 31.93s/it]
Training:   0%|          | 630/280760 [19:41<1749:19:43, 22.48s/it]
Training:   0%|          | 631/280760 [19:42<1234:26:03, 15.86s/it]
Training:   0%|          | 632/280760 [19:42<873:58:52, 11.23s/it] 
Training:   0%|          | 633/280760 [19:43<621:49:39,  7.99s/it]
Training:   0%|          | 634/280760 [19:43<445:15:46,  5.72s/it]
Training:   0%|          | 635/280760 [19:43<321:35:22,  4.13s/it]
Training:   0%|          | 636/280760 [19:44<234:59:42,  3.02s/it]
Training:   0%|          | 637/280760 [19:44<174:25:04,  2.24s/it]
Training:   0%|          | 638/280760 [19:45<131:59:24,  1.70s/it]
Training:   0%|          | 639/280760 [19:45<102:18:44,  1.31s/it]
Training:   0%|          | 640/280760 [19:46<81:30:45,  1.05s/it] 
Training:   0%|          | 641/280760 [19:46<66:59:51,  1.16it/s]
Training:   0%|          | 642/280760 [19:46<56:46:48,  1.37it/s]
Training:   0%|          | 643/280760 [19:47<49:45:31,  1.56it/s]
Training:   0%|          | 644/280760 [19:47<44:43:03,  1.74it/s]
Training:   0%|          | 645/280760 [19:48<41:12:42,  1.89it/s]
Training:   0%|          | 646/280760 [19:48<38:47:01,  2.01it/s]
Training:   0%|          | 647/280760 [19:49<37:11:19,  2.09it/s]
Training:   0%|          | 648/280760 [19:49<35:59:04,  2.16it/s]
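
A back-of-the-envelope check on the stall period (the chunk size below is an assumption about jiant's chunked example cache, worth verifying against the chunk files under the cache directory; note also that this log's progress-bar total of 280760 steps differs from the config's max_steps of 320870, so the log may come from a run with a slightly different batch size):

    import math

    # Assumption: cached examples are stored in fixed-size chunk files of
    # roughly 10,000 examples each, and crossing a chunk boundary forces a
    # read from disk.
    chunk_size = 10_000
    for batch_size in (28, 32):
        steps_per_chunk = math.ceil(chunk_size / batch_size)
        print(f"batch_size={batch_size}: chunk boundary every ~{steps_per_chunk} steps")
    # batch_size=28: chunk boundary every ~358 steps
    # batch_size=32: chunk boundary every ~313 steps  (matches the stalls at 313 and 626)

If the stalls really do line up with chunk boundaries, faster storage for the cache or more aggressive DataLoader prefetching would be the first things to try.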

Expected behavior

Training without interruption, or at least some indication of why it is stuck.
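
For the "show the reason" part, one option that needs only the standard library: register faulthandler at the top of the run script so Python periodically dumps every thread's stack; a dump that lands inside a stall shows exactly which call is blocking. (Alternatively, py-spy dump --pid <PID> attaches to the stuck process without any code change.) A minimal sketch:

    import faulthandler
    import sys

    # Dump the traceback of all threads to stderr every 120 seconds until
    # cancelled; during a stall the dump reveals the blocking call (e.g., a
    # disk read inside a dataloader worker).
    faulthandler.dump_traceback_later(timeout=120, repeat=True, file=sys.stderr)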

Screenshots

The only output I got was the log file shown above.


@Mollylulu

same problem.
