Crash when testing on a re-run experiment #218

Open

Description

@ddaspit

A FileNotFoundError is raised when trying to read the trainer_state.json file during testing.

{'eval_loss': 3.2840306758880615, 'eval_bleu': 68.508, 'eval_gen_len': 86.12, 'eval_runtime': 76.8471, 'eval_samples_per_second': 3.253, 'eval_steps_per_second': 0.208, 'epoch': 103.23}
 12% 12000/100000 [4:48:07<28:41:36,  1.17s/it]
100% 16/16 [01:11<00:00,  4.24s/it]
[INFO|trainer.py:2939] 2023-10-30 04:21:56,649 >> Saving model checkpoint to /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000
[INFO|configuration_utils.py:460] 2023-10-30 04:21:56,650 >> Configuration saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/config.json
[INFO|configuration_utils.py:544] 2023-10-30 04:21:56,650 >> Configuration saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/generation_config.json
2023-10-29 23:22:06
[INFO|modeling_utils.py:2118] 2023-10-30 04:22:04,113 >> Model weights saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2420] 2023-10-30 04:22:04,118 >> tokenizer config file saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2429] 2023-10-30 04:22:04,118 >> Special tokens file saved in /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-12000/special_tokens_map.json
2023-10-29 23:22:32
[INFO|trainer.py:3026] 2023-10-30 04:22:28,396 >> Deleting older checkpoint [/tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-11000] due to args.save_total_limit
[INFO|trainer.py:2017] 2023-10-30 04:22:30,237 >> 
Training completed. Do not forget to share your model on huggingface.co/models =)
[INFO|trainer.py:2196] 2023-10-30 04:22:30,238 >> Loading best model from /tmp/tmpn5j4znz3/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/checkpoint-8000 (score: 68.6528).
2023-10-29 23:22:37
2023-10-30 04:22:32,316 - clearml.model - INFO - Selected model id: d66434b9630546c6b218b7a6b8e1936f
{'train_runtime': 17325.1009, 'train_samples_per_second': 369.406, 'train_steps_per_second': 5.772, 'train_loss': 3.3483218892415363, 'epoch': 103.23}
 12% 12000/100000 [4:48:45<28:41:36,  1.17s/it]
2023-10-29 23:22:47
 12% 12000/100000 [4:48:54<35:18:39,  1.44s/it]
***** train metrics *****
  epoch                    =     103.23
  train_loss               =     3.3483
  train_runtime            = 4:48:45.10
  train_samples            =       7430
  train_samples_per_second =    369.406
  train_steps_per_second   =      5.772
Training completed
2023-10-29 23:22:53
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,386 >> loading file sentencepiece.bpe.model
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2013] 2023-10-30 04:22:48,387 >> loading file tokenizer_config.json
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 141, in <module>
    main()
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 137, in main
    exp.run()
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 41, in run
    self.test()
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/experiment.py", line 69, in test
    test(
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/test.py", line 526, in test
    if best and config.has_best_checkpoint:
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/hugging_face_config.py", line 294, in has_best_checkpoint
    return has_best_checkpoint(self.model_dir)
  File "/root/.clearml/venvs-builds/3.8/task_repository/silnlp.git/silnlp/nmt/hugging_face_config.py", line 140, in has_best_checkpoint
    with trainer_state_path.open("r", encoding="utf-8") as f:
  File "/usr/lib/python3.8/pathlib.py", line 1222, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/usr/lib/python3.8/pathlib.py", line 1078, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmppiw6woeh/FT-Nepal/NLLB.1.3B.ne_YakNbt-ybh_Yak.NT.2/run/trainer_state.json'
2023-10-29 23:22:58
Process failed, exit code 1

This happens when an existing experiment is reset and re-enqueued from the ClearML UI. The crash occurs because the train step ends up using a different model directory than the test step: when an experiment is re-run, ClearML overrides the HF training arguments with the values recorded on the original run, including the model directory. That is why training saves checkpoints under the original temp directory (tmpn5j4znz3) while the test step looks for trainer_state.json under the new one (tmppiw6woeh). The override is intended to make experiments reproducible, but in our case it causes the crash. The ClearMLCallback in HF transformers connects the HF training arguments to the experiment, which is what allows ClearML to override their values. To fix this, we would need to disable that behavior in ClearMLCallback.
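One possible interim workaround (a sketch only, not a tested fix, and not the actual fix described above) would be to re-apply the directory the current run intends to use after ClearML has connected and potentially rewritten the training arguments, e.g. via a small extra TrainerCallback registered after the ClearML integration. The callback name and the wiring into our trainer are hypothetical:

```python
from transformers import TrainerCallback


class RestoreOutputDirCallback(TrainerCallback):
    """Hypothetical workaround: force output_dir back to the directory the
    current run actually uses, after ClearML has overridden the HF training
    arguments with the values recorded on the original run."""

    def __init__(self, output_dir: str):
        self._output_dir = output_dir

    def on_train_begin(self, args, state, control, **kwargs):
        # Assumes this callback fires after ClearMLCallback, so the ClearML
        # task has already connected (and possibly rewritten) the arguments.
        if args.output_dir != self._output_dir:
            args.output_dir = self._output_dir


# Hypothetical usage in our training code, pointing the trainer back at the
# same model directory the test step will read from:
# trainer.add_callback(RestoreOutputDirCallback(str(config.model_dir)))
```

This only papers over the mismatch; the proper fix is still to stop ClearMLCallback from connecting (and therefore overriding) the HF training arguments on a re-run.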

Metadata

Assignees: none
Labels: bug (Something isn't working)
Type: none
Projects: 🏗 In progress
Milestone: none
