Crash when testing on a re-run experiment #218

Open

Description

@ddaspit

A FileNotFoundError is raised when trying to read the trainer_state.json file during testing.

{'eval_loss': 3.2840306758880615, 'eval_bleu': 68.508, 'eval_gen_len': 86.12, 'eval_runtime': 76.8471, 'eval_samples_per_second': 3.253, 'eval_steps_per_second': 0.208, 'epoch': 103.23}
 12% 12000/100000 [4:48:07<28:41:36,  1.17s/it]
100% 16/16 [01:11<00:00,  4.24s/it]
[INFO|trainer.py:2939] 2023-10-30 04:21:56,649 >> Saving model checkpoint to /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000
[INFO|configuration_utils.py:460] 2023-10-30 04:21:56,650 >> Configuration saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/config.json
[INFO|configuration_utils.py:544] 2023-10-30 04:21:56,650 >> Configuration saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/generation_config.json
2023-10-29 23:22:06
[INFO|modeling_utils.py:2118] 2023-10-30 04:22:04,113 >> Model weights saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2420] 2023-10-30 04:22:04,118 >> tokenizer config file saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2429] 2023-10-30 04:22:04,118 >> Special tokens file saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/special_tokens_map.json
2023-10-29 23:22:32
[INFO|trainer.py:3026] 2023-10-30 04:22:28,396 >> Deleting older checkpoint [/tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-11000] due to args.save_total_limit
[INFO|trainer.py:2017] 2023-10-30 04:22:30,237 >> 
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:2196] 2023-10-30 04:22:30,238 >> Loading best model from /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-8000 (score: 68.6528).
2023-10-29 23:22:37
2023-10-30 04:22:32,316 - clearml.model - INFO - Selected model id: d66434b9630546c6b218b7a6b8e1936f
{'train_runtime': 17325.1009, 'train_samples_per_second': 369.406, 'train_steps_per_second': 5.772, 'train_loss': 3.3483218892415363, 'epoch': 103.23}
 12% 12000/100000 [4:48:45<28:41:36,  1.17s/it]
2023-10-29 23:22:47
 12% 12000/100000 [4:48:54<35:18:39,  1.44s/it]
***** train metrics *****
  epoch                    =     103.23
  train_loss               =     3.3483
  train_runtime            = 4:48:45.10
  train_samples            =       7430
  train_samples_per_second =    369.406
  train_steps_per_second   =      5.772
Training completed
2023-10-29 23:22:53
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,386 >> loading file sentencepiece.bpe.model
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file tokenizer_config.json
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 141, in <module>
    main()
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 137, in main
    exp.run()
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 41, in run
    self.test()
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 69, in test
    test(
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/test.py", line 526, in test
    if best and config.has_best_checkpoint:
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/hugging_face_config.py", line 294, in has_best_checkpoint
    return has_best_checkpoint(self.model_dir)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/hugging_face_config.py", line 140, in has_best_checkpoint
    with trainer_state_path.open("r", encoding="utf-8") as f:
  File "/usr/lib/python3.8/pathlib.py", line 1222, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/usr/lib/python3.8/pathlib.py", line 1078, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmppiw6woeh/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/trainer_state.json'
2023-10-29 23:22:58
Process failed, exit code 1

This happens when an existing experiment is reset and re-enqueued from the ClearML UI. The crash occurs because the train step ends up using a different model directory than the test step: when an experiment is re-run, ClearML overrides the HF training arguments with the values recorded on the original run, including the model directory. That is why training saves checkpoints under the original temp directory (tmpn5j4znz3) while the test step looks for trainer_state.json under the new one (tmppiw6woeh). The override is intended to make experiments reproducible, but in our case it causes the crash. The ClearMLCallback in HF transformers connects the HF training arguments to the experiment, which is what allows ClearML to override their values. To fix this, we would need to disable that behavior in ClearMLCallback.
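One possible interim workaround (a sketch only, not a tested fix, and not the actual fix described above) would be to re-apply the directory the current run intends to use after ClearML has connected and potentially rewritten the training arguments, e.g. via a small extra TrainerCallback registered after the ClearML integration. The callback name and the wiring into our trainer are hypothetical:

```python
from transformers import TrainerCallback


class RestoreOutputDirCallback(TrainerCallback):
    """Hypothetical workaround: force output_dir back to the directory the
    current run actually uses, after ClearML has overridden the HF training
    arguments with the values recorded on the original run."""

    def __init__(self, output_dir: str):
        self._output_dir = output_dir

    def on_train_begin(self, args, state, control, **kwargs):
        # Assumes this callback fires after ClearMLCallback, so the ClearML
        # task has already connected (and possibly rewritten) the arguments.
        if args.output_dir != self._output_dir:
            args.output_dir = self._output_dir


# Hypothetical usage in our training code, pointing the trainer back at the
# same model directory the test step will read from:
# trainer.add_callback(RestoreOutputDirCallback(str(config.model_dir)))
```

This only papers over the mismatch; the proper fix is still to stop ClearMLCallback from connecting (and therefore overriding) the HF training arguments on a re-run.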

Metadata

Assignees: none
Labels: bug (Something isn't working)
Type: none
Projects: 🏗 In progress
Milestone: none
