Description
A FileNotFoundError is raised when trying to read the trainer_state.json file during testing.
{'eval_loss': 3.2840306758880615, 'eval_bleu': 68.508, 'eval_gen_len': 86.12, 'eval_runtime': 76.8471, 'eval_samples_per_second': 3.253, 'eval_steps_per_second': 0.208, 'epoch': 103.23}
12% 12000/100000 [4:48:07<28:41:36, 1.17s/it]
100% 16/16 [01:11<00:00, 4.24s/it]
A[INFO|trainer.py:2939] 2023-10-30 04:21:56,649 >> Saving model checkpoint to /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000
[INFO|configuration_utils.py:460] 2023-10-30 04:21:56,650 >> Configuration saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/config.json
[INFO|configuration_utils.py:544] 2023-10-30 04:21:56,650 >> Configuration saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/generation_config.json
2023-10-29 23:22:06
[INFO|modeling_utils.py:2118] 2023-10-30 04:22:04,113 >> Model weights saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2420] 2023-10-30 04:22:04,118 >> tokenizer config file saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2429] 2023-10-30 04:22:04,118 >> Special tokens file saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/special_tokens_map.json
2023-10-29 23:22:32
[INFO|trainer.py:3026] 2023-10-30 04:22:28,396 >> Deleting older checkpoint [/tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-11000] due to args.save_total_limit
[INFO|trainer.py:2017] 2023-10-30 04:22:30,237 >>
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:2196] 2023-10-30 04:22:30,238 >> Loading best model from /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-8000 (score: 68.6528).
2023-10-29 23:22:37
2023-10-30 04:22:32,316 - clearml.model - INFO - Selected model id: d66434b9630546c6b218b7a6b8e1936f
{'train_runtime': 17325.1009, 'train_samples_per_second': 369.406, 'train_steps_per_second': 5.772, 'train_loss': 3.3483218892415363, 'epoch': 103.23}
12% 12000/100000 [4:48:45<28:41:36, 1.17s/it]
2023-10-29 23:22:47
12% 12000/100000 [4:48:54<35:18:39, 1.44s/it]
***** train metrics *****
epoch = 103.23
train_loss = 3.3483
train_runtime = 4:48:45.10
train_samples = 7430
train_samples_per_second = 369.406
train_steps_per_second = 5.772
Training completed
2023-10-29 23:22:53
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,386 >> loading file sentencepiece.bpe.model
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file tokenizer_config.json
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 141, in <module>
    main()
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 137, in main
    exp.run()
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 41, in run
    self.test()
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 69, in test
    test(
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/test.py", line 526, in test
    if best and config.has_best_checkpoint:
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/hugging_face_config.py", line 294, in has_best_checkpoint
    return has_best_checkpoint(self.model_dir)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/hugging_face_config.py", line 140, in has_best_checkpoint
    with trainer_state_path.open("r", encoding="utf-8") as f:
  File "/usr/lib/python3.8/pathlib.py", line 1222, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/usr/lib/python3.8/pathlib.py", line 1078, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmppiw6woeh/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/trainer_state.json'
2023-10-29 23:22:58
Process failed, exit code 1
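
For context, the check that fails is has_best_checkpoint in silnlp/nmt/hugging_face_config.py. The sketch below is a reconstruction, not the actual source: the body is an assumption pieced together from the traceback and from what HF writes into trainer_state.json.

```python
import json
from pathlib import Path


def has_best_checkpoint(model_dir: Path) -> bool:
    trainer_state_path = model_dir / "trainer_state.json"
    # This open() is the call that raises FileNotFoundError when the test
    # step resolves a different run directory than the train step used.
    with trainer_state_path.open("r", encoding="utf-8") as f:
        state = json.load(f)
    # trainer_state.json records "best_model_checkpoint" when training runs
    # with load_best_model_at_end, so its presence implies a best checkpoint.
    return state.get("best_model_checkpoint") is not None
```

Guarding the read with trainer_state_path.is_file() would turn the crash into a soft "no best checkpoint", but it would only mask the real problem: the train and test steps disagree about where the run directory is.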
This occurs when an existing experiment is reset and enqueued from the ClearML UI, because the train step then ends up using a different model directory than the test step; the mismatch is visible above, where checkpoints are saved under /tmp/tmpn5j4znz3/... but the test step reads from /tmp/tmppiw6woeh/.... When an experiment is re-run, ClearML overrides the HF training arguments with their original values, including the model directory. This is done to make experiments reproducible, but in our case it causes a crash. The ClearMLCallback in HF transformers is what connects the HF training arguments to the experiment and thereby allows ClearML to override them. To fix this, we would need to disable that behavior in ClearMLCallback, as sketched below.
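
A minimal sketch of the kind of change this would take, assuming the override happens because ClearMLCallback.setup() hands the live TrainingArguments object to Task.connect(). The subclass name is illustrative, and the exact setup() signature varies between transformers versions, so the passthrough below is deliberately loose:

```python
import copy

from transformers.integrations import ClearMLCallback


class NonOverridingClearMLCallback(ClearMLCallback):
    """Log the HF training arguments to ClearML without letting a
    reset-and-enqueued task push the stored values back into them."""

    def setup(self, args, state, model, *extra, **kwargs):
        # Connect a throwaway copy: the values still show up in the ClearML
        # UI, but on a remote re-run ClearML overrides the copy rather than
        # the TrainingArguments the Trainer (and the test step) actually use.
        args = copy.deepcopy(args)
        super().setup(args, state, model, *extra, **kwargs)
```

Swapping it in would look something like:

```python
trainer.remove_callback(ClearMLCallback)            # drop the stock callback
trainer.add_callback(NonOverridingClearMLCallback)  # use the non-overriding one
```

The trade-off is that re-running the experiment would no longer restore the original training arguments, which is exactly the reproducibility behavior ClearML is trying to provide.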