punkt_tab resource not found - llama2 70b #305

Open
anandhu-eng opened this issue Sep 27, 2024 · 0 comments
Error: the accuracy-evaluation step for llama2-70b fails because NLTK cannot locate the punkt_tab resource (full log below):

INFO:root:* cm run script "run accuracy mlperf _open-orca _int32"
DEBUG:root:  - Number of scripts found: 1
DEBUG:root:  - Found script::process-mlperf-accuracy,6e809013816b42ea in /home/cmuser/CM/repos/anandhu-eng@cm4mlops/script/process-mlperf-accuracy
DEBUG:root:    Prepared variations: _open-orca,_int32,_default-pycocotools
DEBUG:root:  - Checking dependencies on other CM scripts:
INFO:root:  * cm run script "get python3"
DEBUG:root:    - Number of scripts found: 1
DEBUG:root:    - Searching for cached script outputs with the following tags: -tmp,get,python3
DEBUG:root:      - Number of cached script outputs found: 1
DEBUG:root:    - Found script::get-python3,d0b5dd74373f4a62 in /home/cmuser/CM/repos/anandhu-eng@cm4mlops/script/get-python3
DEBUG:root:    - Checking if script execution is already cached ...
DEBUG:root:      - Searching for cached script outputs with the following tags: -tmp,get,python3,python,get-python,get-python3
DEBUG:root:      - Found cached script output: /home/cmuser/CM/repos/local/cache/3db121e03f6b487c
DEBUG:root:      - Checking prehook dependencies on other CM scripts:
DEBUG:root:        - Loading state from cached entry ...
INFO:root:       ! load /home/cmuser/CM/repos/local/cache/3db121e03f6b487c/cm-cached-state.json
DEBUG:root:      - Checking posthook dependencies on other CM scripts:
DEBUG:root:      - Checking post dependencies on other CM scripts:
INFO:root:    - running time of script "get,python,python3,get-python,get-python3": 0.00 sec.
INFO:root:Path to Python: /home/cmuser/venv/cm/bin/python3
INFO:root:Python version: 3.10.12
INFO:root:  * cm run script "get mlcommons inference src"
DEBUG:root:    - Number of scripts found: 1
DEBUG:root:    - Searching for cached script outputs with the following tags: -tmp,get,mlcommons,inference,src
DEBUG:root:      - Number of cached script outputs found: 2
DEBUG:root:    - Found script::get-mlperf-inference-src,4b57186581024797 in /home/cmuser/CM/repos/anandhu-eng@cm4mlops/script/get-mlperf-inference-src
DEBUG:root:      Prepared variations: _short-history
DEBUG:root:    - Checking if script execution is already cached ...
DEBUG:root:      - Searching for cached script outputs with the following tags: -tmp,get,mlcommons,inference,src,source,inference-src,inference-source,mlperf
DEBUG:root:      - Found cached script output: /home/cmuser/CM/repos/local/cache/3b16006ffdbb4e92
DEBUG:root:    - Checking dynamic dependencies on other CM scripts:
DEBUG:root:    - Processing env after dependencies ...
DEBUG:root:      - Checking prehook dependencies on other CM scripts:
DEBUG:root:        - Loading state from cached entry ...
INFO:root:       ! load /home/cmuser/CM/repos/local/cache/3b16006ffdbb4e92/cm-cached-state.json
DEBUG:root:      - Checking posthook dependencies on other CM scripts:
DEBUG:root:      - Checking post dependencies on other CM scripts:
INFO:root:    - running time of script "get,src,source,inference,inference-src,inference-source,mlperf,mlcommons": 0.00 sec.
INFO:root:  * cm run script "get dataset openorca preprocessed"
DEBUG:root:    - Number of scripts found: 1
DEBUG:root:    - Searching for cached script outputs with the following tags: -tmp,get,dataset,openorca,preprocessed
DEBUG:root:      - Number of cached script outputs found: 1
DEBUG:root:    - Found script::get-preprocessed-dataset-openorca,5614c39cb1564d72 in /home/cmuser/CM/repos/anandhu-eng@cm4mlops/script/get-preprocessed-dataset-openorca
DEBUG:root:      Prepared variations: _full,_validation
DEBUG:root:    - Checking if script execution is already cached ...
DEBUG:root:      - Searching for cached script outputs with the following tags: -tmp,get,dataset,openorca,preprocessed,language-processing
DEBUG:root:      - Found cached script output: /home/cmuser/CM/repos/local/cache/ea164036556a4c03
DEBUG:root:    - Checking dynamic dependencies on other CM scripts:
DEBUG:root:    - Processing env after dependencies ...
DEBUG:root:      - Checking prehook dependencies on other CM scripts:
DEBUG:root:        - Loading state from cached entry ...
INFO:root:       ! load /home/cmuser/CM/repos/local/cache/ea164036556a4c03/cm-cached-state.json
DEBUG:root:      - Checking posthook dependencies on other CM scripts:
DEBUG:root:      - Checking post dependencies on other CM scripts:
INFO:root:    - running time of script "get,dataset,openorca,language-processing,preprocessed": 0.00 sec.
INFO:root:  * cm run script "get ml-model llama2"
DEBUG:root:    - Number of scripts found: 1
DEBUG:root:    - Searching for cached script outputs with the following tags: -tmp,get,ml-model,llama2
DEBUG:root:      - Number of cached script outputs found: 1
DEBUG:root:    - Found script::get-ml-model-llama2,5db97be9f61244c6 in /home/cmuser/CM/repos/anandhu-eng@cm4mlops/script/get-ml-model-llama2
DEBUG:root:      Prepared variations: _meta-llama/Llama-2-70b-chat-hf,_fp32,_pytorch
DEBUG:root:    - Checking if script execution is already cached ...
DEBUG:root:      - Searching for cached script outputs with the following tags: -tmp,get,ml-model,llama2,raw,language-processing,llama2-70b,text-summarization
DEBUG:root:      - Found cached script output: /home/cmuser/CM/repos/local/cache/892fcb8caeb84045
DEBUG:root:      - Checking prehook dependencies on other CM scripts:
DEBUG:root:        - Loading state from cached entry ...
INFO:root:       ! load /home/cmuser/CM/repos/local/cache/892fcb8caeb84045/cm-cached-state.json
DEBUG:root:      - Checking posthook dependencies on other CM scripts:
DEBUG:root:      - Checking post dependencies on other CM scripts:
INFO:root:    - running time of script "get,raw,ml-model,language-processing,llama2,llama2-70b,text-summarization": 0.00 sec.
INFO:root:LLAMA2 checkpoint path: /home/cmuser/CM/repos/local/cache/a1373a86051e45f7/repo
DEBUG:root:  - Processing env after dependencies ...
DEBUG:root:  - Running preprocess ...
DEBUG:root:  - Running native script "/home/cmuser/CM/repos/anandhu-eng@cm4mlops/script/process-mlperf-accuracy/run.sh" from temporal script "tmp-run.sh" in "/home/cmuser" ...
INFO:root:       ! cd /home/cmuser
INFO:root:       ! call /home/cmuser/CM/repos/anandhu-eng@cm4mlops/script/process-mlperf-accuracy/run.sh from tmp-run.sh
[nltk_data] Downloading package punkt to /home/cmuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Traceback (most recent call last):
  File "/home/cmuser/CM/repos/local/cache/5868cd261a8a4ebc/inference/language/llama2-70b/evaluate-accuracy.py", line 111, in <module>
    main()
  File "/home/cmuser/CM/repos/local/cache/5868cd261a8a4ebc/inference/language/llama2-70b/evaluate-accuracy.py", line 91, in main
    preds, targets = postprocess_text(preds_decoded_text, target_required)
  File "/home/cmuser/CM/repos/local/cache/5868cd261a8a4ebc/inference/language/llama2-70b/evaluate-accuracy.py", line 37, in postprocess_text
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
  File "/home/cmuser/CM/repos/local/cache/5868cd261a8a4ebc/inference/language/llama2-70b/evaluate-accuracy.py", line 37, in <listcomp>
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
  File "/home/cmuser/venv/cm/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 119, in sent_tokenize
    tokenizer = _get_punkt_tokenizer(language)
  File "/home/cmuser/venv/cm/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
    return PunktTokenizer(language)
  File "/home/cmuser/venv/cm/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
    self.load_lang(lang)
  File "/home/cmuser/venv/cm/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
  File "/home/cmuser/venv/cm/lib/python3.10/site-packages/nltk/data.py", line 579, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/home/cmuser/nltk_data'
    - '/home/cmuser/venv/cm/nltk_data'
    - '/home/cmuser/venv/cm/share/nltk_data'
    - '/home/cmuser/venv/cm/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


CM error: Portable CM script failed (name = process-mlperf-accuracy, return code = 256)


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
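Likely cause: recent NLTK releases (3.9 and later) load sentence tokenizers from the new punkt_tab resource instead of the pickled punkt package, so downloading punkt alone (as the log shows happening at the start of run.sh) no longer satisfies nltk.sent_tokenize. Below is a minimal sketch of a possible workaround, run inside the same virtualenv before the evaluation step; downloading both resources is an assumption on my part to keep older NLTK versions working, and it has not been verified against this setup:

```python
# Sketch: ensure both tokenizer resources are available, since
# nltk.sent_tokenize resolves to punkt_tab on NLTK >= 3.9 and to
# punkt on older releases.
import nltk

for resource in ("punkt", "punkt_tab"):
    # nltk.download returns True on success (or if already up to date)
    # and False if the resource id is unknown to this NLTK version.
    ok = nltk.download(resource)
    print(f"{resource}: {'ok' if ok else 'not available in this NLTK version'}")
```

If this resolves the error, the extra download probably belongs next to whatever currently fetches punkt in the process-mlperf-accuracy script, so the CM recipe stays portable across NLTK versions.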

anandhu-eng self-assigned this Sep 27, 2024