Weasel serializes commands field as string #94

Open
caiorcferreira opened this issue Dec 12, 2024 · 4 comments

Comments

@caiorcferreira

Description

It looks like Weasel is reading the commands field of project.yml as a string rather than a list. Iterating over it then yields single characters, so looking up the name field raises TypeError: string indices must be integers.
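
A minimal sketch of the failure, assuming the loaded config holds commands as a JSON-encoded string, as the locals in the traceback below suggest. The config dict and the index_commands helper are illustrative stand-ins, not Weasel code; only the dict comprehension mirrors line 81 of run.py.

```python
import json

# Hypothetical stand-in for the loaded project config: "commands" is a
# JSON-encoded string instead of a list of dicts.
config = {
    "commands": '[{"name": "download", "help": "Download the pretrained pipeline"}]'
}


def index_commands(config):
    # Same comprehension as line 81 of weasel/cli/run.py in the traceback.
    return {cmd["name"]: cmd for cmd in config.get("commands", [])}


error = None
try:
    index_commands(config)
except TypeError as err:
    # Iterating a str yields single characters ('[' first), and
    # indexing a character with "name" raises TypeError.
    error = err
    print(type(err).__name__, err)

# Decoding the string first restores the expected list of dicts.
config["commands"] = json.loads(config["commands"])
commands = index_commands(config)
print(list(commands))
```

This only reproduces the symptom. It does not show why the field ends up as a string in the first place.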

Environment

Name Version Build Channel
weasel 0.3.4 py39hca03da5_0

Error

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /opt/homebrew/Caskroom/miniconda/base/envs/spacy_dev_pt_core_chat_lg/lib/python3.9/site-packages │
│ /weasel/cli/run.py:42 in project_run_cli                                                         │
│                                                                                                  │
│    39 │   │   print_run_help(project_dir, subcommand, parent_command)                            │
│    40 │   else:                                                                                  │
│    41 │   │   overrides = parse_config_overrides(ctx.args)                                       │
│ ❱  42 │   │   project_run(                                                                       │
│    43 │   │   │   project_dir,                                                                   │
│    44 │   │   │   subcommand,                                                                    │
│    45 │   │   │   overrides=overrides,                                                           │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │            ctx = <click.core.Context object at 0x11b83e9a0>                                  │ │
│ │            dry = False                                                                       │ │
│ │          force = False                                                                       │ │
│ │      overrides = {                                                                           │ │
│ │                  │   'vars.experiment': 29,                                                  │ │
│ │                  │   'vars.enabled_gazetteers': 'person,address',                            │ │
│ │                  │   'vars.input_data':                                                      │ │
│ │                  'experiments/028/data/oversampled_merged_dataset.json',                     │ │
│ │                  │   'vars.address_gazetteer':                                               │ │
│ │                  'assets/datasets/addresses/pt_br_address-gazetter-2.jsonl'                  │ │
│ │                  }                                                                           │ │
│ │ parent_command = 'python -m weasel'                                                          │ │
│ │    project_dir = PosixPath('.')                                                              │ │
│ │      show_help = False                                                                       │ │
│ │     subcommand = 'experiment'                                                                │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /opt/homebrew/Caskroom/miniconda/base/envs/spacy_dev_pt_core_chat_lg/lib/python3.9/site-packages │
│ /weasel/cli/run.py:81 in project_run                                                             │
│                                                                                                  │
│    78 │   skip_requirements_check (bool): No longer used, deprecated.                            │
│    79 │   """                                                                                    │
│    80 │   config = load_project_config(project_dir, overrides=overrides)                         │
│ ❱  81 │   commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}                    │
│    82 │   workflows = config.get("workflows", {})                                                │
│    83 │   validate_subcommand(list(commands.keys()), list(workflows.keys()), subcommand)         │
│    84                                                                                            │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │                 capture = False                                                              │ │
│ │                  config = {                                                                  │ │
│ │                           │   'title': 'NER portuguese chat',                                │ │
│ │                           │   'description': 'Project tunning NER component in portuguese    │ │
│ │                           model using chat corpus',                                          │ │
│ │                           │   'directories': [                                               │ │
│ │                           │   │   'assets',                                                  │ │
│ │                           │   │   'scripts',                                                 │ │
│ │                           │   │   'experiments',                                             │ │
│ │                           │   │   'baseline',                                                │ │
│ │                           │   │   'packages'                                                 │ │
│ │                           │   ],                                                             │ │
│ │                           │   'assets': [                                                    │ │
│ │                           │   │   {                                                          │ │
│ │                           │   │   │   'dest': 'assets/train.json',                           │ │
│ │                           │   │   │   'description': 'Training data'                         │ │
│ │                           │   │   },                                                         │ │
│ │                           │   │   {                                                          │ │
│ │                           │   │   │   'dest': 'assets/dev.json',                             │ │
│ │                           │   │   │   'description': 'Development data'                      │ │
│ │                           │   │   }                                                          │ │
│ │                           │   ],                                                             │ │
│ │                           │   'commands': '[{"name":"download","help":"Download the          │ │
│ │                           pretrained pipeline","script":["python '+5046,                     │ │
│ │                           │   'env': {},                                                     │ │
│ │                           │   'vars': {                                                      │ │
│ │                           │   │   'name': 'core_chat_lg',                                    │ │
│ │                           │   │   'lang': 'pt',                                              │ │
│ │                           │   │   'pipeline': 'pt_core_news_lg',                             │ │
│ │                           │   │   'version': '0.0.0',                                        │ │
│ │                           │   │   'dataset': 'raw.json',                                     │ │
│ │                           │   │   'train': 'train.json',                                     │ │
│ │                           │   │   'dev': 'dev.json',                                         │ │
│ │                           │   │   'test': 'test.json',                                       │ │
│ │                           │   │   'test_data':                                               │ │
│ │                           'assets/datasets/chats/sample-chats-manual-labeled-test.json',     │ │
│ │                           │   │   'input_data':                                              │ │
│ │                           'assets/datasets/chats/sample-chats-manual-labeled-train.json',    │ │
│ │                           │   │   ... +9                                                     │ │
│ │                           │   },                                                             │ │
│ │                           │   'workflows': {                                                 │ │
│ │                           │   │   'experiment': [                                            │ │
│ │                           │   │   │   'fetch-data',                                          │ │
│ │                           │   │   │   'split-data',                                          │ │
│ │                           │   │   │   'create-gazetteer',                                    │ │
│ │                           │   │   │   'convert',                                             │ │
│ │                           │   │   │   'train',                                               │ │
│ │                           │   │   │   'evaluate'                                             │ │
│ │                           │   │   ],                                                         │ │
│ │                           │   │   'experiment_search': [                                     │ │
│ │                           │   │   │   'fetch-data',                                          │ │
│ │                           │   │   │   'split-data',                                          │ │
│ │                           │   │   │   'create-gazetteer',                                    │ │
│ │                           │   │   │   'convert',                                             │ │
│ │                           │   │   │   'train-search',                                        │ │
│ │                           │   │   │   'evaluate'                                             │ │
│ │                           │   │   ],                                                         │ │
│ │                           │   │   'experiment_new': ['setup_experiment', 'create-config']    │ │
│ │                           │   }                                                              │ │
│ │                           }                                                                  │ │
│ │                     dry = False                                                              │ │
│ │                   force = False                                                              │ │
│ │               overrides = {                                                                  │ │
│ │                           │   'vars.experiment': 29,                                         │ │
│ │                           │   'vars.enabled_gazetteers': 'person,address',                   │ │
│ │                           │   'vars.input_data':                                             │ │
│ │                           'experiments/028/data/oversampled_merged_dataset.json',            │ │
│ │                           │   'vars.address_gazetteer':                                      │ │
│ │                           'assets/datasets/addresses/pt_br_address-gazetter-2.jsonl'         │ │
│ │                           }                                                                  │ │
│ │          parent_command = 'python -m weasel'                                                 │ │
│ │             project_dir = PosixPath('.')                                                     │ │
│ │ skip_requirements_check = False                                                              │ │
│ │              subcommand = 'experiment'                                                       │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /opt/homebrew/Caskroom/miniconda/base/envs/spacy_dev_pt_core_chat_lg/lib/python3.9/site-packages │
│ /weasel/cli/run.py:81 in <dictcomp>                                                              │
│                                                                                                  │
│    78 │   skip_requirements_check (bool): No longer used, deprecated.                            │
│    79 │   """                                                                                    │
│    80 │   config = load_project_config(project_dir, overrides=overrides)                         │
│ ❱  81 │   commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}                    │
│    82 │   workflows = config.get("workflows", {})                                                │
│    83 │   validate_subcommand(list(commands.keys()), list(workflows.keys()), subcommand)         │
│    84                                                                                            │
│                                                                                                  │
│ ╭────────────────── locals ──────────────────╮                                                   │
│ │  .0 = <str_iterator object at 0x107dc13a0> │                                                   │
│ │ cmd = '['                                  │                                                   │
│ ╰────────────────────────────────────────────╯                                                   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: string indices must be integers
@honnibal
Member

Thanks.

Could you paste your workflow file? This can't be the right error behaviour no matter what, but I'm trying to figure out whether it's doing this on a workflow that should work, or whether it's just heading down the wrong error path.
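
For illustration, one way a clearer error could look is a type check before the commands are indexed by name. This is only a sketch: get_command_list is a hypothetical helper and the error wording is invented here; neither is part of Weasel.

```python
from typing import Any, Dict, List


def get_command_list(config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return config["commands"], raising a descriptive error if it is malformed.

    Hypothetical helper for illustration only, not Weasel's API.
    """
    commands = config.get("commands", [])
    if not isinstance(commands, list):
        raise ValueError(
            f"Expected 'commands' to be a list, got {type(commands).__name__}"
        )
    for i, cmd in enumerate(commands):
        if not isinstance(cmd, dict) or "name" not in cmd:
            raise ValueError(
                f"Command at index {i} must be a mapping with a 'name' key"
            )
    return commands


message = None
try:
    # Same shape as the config in the traceback: a string, not a list.
    get_command_list({"commands": '[{"name": "download"}]'})
except ValueError as err:
    message = str(err)
    print(message)
```

With a check like this, the report would point at the malformed field rather than at a string index deep inside a dict comprehension.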

@caiorcferreira
Author

My workflow is a bit customized in how it uses variables, but it used to work. I wonder whether a package version changed in my conda environment and introduced this behavior.
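
To compare environments, the installed versions can be listed with the standard library. A small sketch follows; the package list is an assumption about what might be relevant, not a diagnosis.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: these are the packages worth comparing between a working and a
# broken environment; adjust the list as needed.
PACKAGES = ["weasel", "confection", "srsly", "spacy"]


def installed_versions(names):
    """Map each distribution name to its installed version, if any."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver}")
```

Running this in both an environment where the workflow still works and the failing one would show which versions differ.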

Workflow file:

title: "NER portuguese chat"
description: "Project tunning NER component in portuguese model using chat corpus"

# Variables can be referenced across the project.yml using ${vars.var_name}
vars:
  name: "core_chat_lg"
  lang: "pt"
  pipeline: "pt_core_news_lg"
  version: "0.0.0"

  dataset: "raw.json"
  train: "train.json"
  dev: "dev.json"
  test: "test.json"

  test_data: "assets/datasets/chats/sample-chats-manual-labeled-test.json"
  input_data: "assets/datasets/chats/sample-chats-manual-labeled-train.json"

  experiment: "01"
  train_size: 0.8

  enabled_gazetteers: "null"
  person_gazetteer: "assets/datasets/names_surnames/pt_br_names-gazetteer.jsonl"
  address_gazetteer: "assets/datasets/addresses/pt_br_address-gazetter.jsonl"
  person_entity_ruler_patterns: "null"
  loc_entity_ruler_patterns: "null"

  gazetteers_pattern: "gazetteers_patterns.jsonl"

  # Set your GPU ID, -1 is CPU
  gpu_id: -1

# These are the directories that the project needs. The project CLI will make
# sure that they always exist.
directories: ["assets", "scripts", "experiments", "baseline", "packages"]

# Assets that should be downloaded or available in the directory. We're shipping
# them with the project, so they won't have to be downloaded.
assets:
  - dest: "assets/train.json"
    description: "Training data"
  - dest: "assets/dev.json"
    description: "Development data"

# Workflows are sequences of commands (see below) executed in order. You can
# run them via "spacy project run [workflow]". If a command's inputs/outputs
# haven't changed, it won't be re-run.
workflows:
  experiment:
    - fetch-data
    - split-data
    - create-gazetteer
    - convert
    - train
    - evaluate

  experiment_search:
    - fetch-data
    - split-data
    - create-gazetteer
    - convert
    - train-search
    - evaluate

  experiment_new:
    - setup_experiment
    - create-config

# Project commands, specified in a style similar to CI config files (e.g. Azure
# pipelines). The name is the command name that lets you trigger the command
# via "spacy project run [command] [path]". The help message is optional and
# shown when executing "spacy project run [optional command] [path] --help".
commands:
- name: "download"
  help: "Download the pretrained pipeline"
  script:
    - "python -m spacy download ${vars.pipeline}"

- name: "setup_experiment"
  help: "Setup experiment directory structure"
  script:
    - "mkdir -p experiments/0${vars.experiment}/data experiments/0${vars.experiment}/configs experiments/0${vars.experiment}/training experiments/0${vars.experiment}/corpus experiments/0${vars.experiment}/scripts"
    - "touch experiments/0${vars.experiment}/README.md"

- name: "create-config"
  help: "Create a config for updating only NER from an existing pipeline"
  script:
    - "python scripts/create_config.py ${vars.pipeline} ner experiments/0${vars.experiment}/data/${vars.gazetteers_pattern} ${vars.enabled_gazetteers} experiments/0${vars.experiment}/configs/config.cfg"
  deps:
    - "scripts/create_config.py"
  outputs:
    - "experiments/0${vars.experiment}/configs/config.cfg"

- name: "fetch-data"
  help: "Fetch the training and test data"
  script:
    - "cp ${vars.input_data} experiments/0${vars.experiment}/data/${vars.dataset}"
    - "cp ${vars.test_data} experiments/0${vars.experiment}/data/${vars.test}"
  deps:
    - "${vars.input_data}"
    - "${vars.test_data}"
  outputs:
    - "experiments/0${vars.experiment}/data/${vars.dataset}"
    - "experiments/0${vars.experiment}/data/${vars.test}"

- name: "split-data"
  help: "Split the data into training and eval sets, and copy the test data"
  script:
    - "python scripts/split_train_test.py experiments/0${vars.experiment}/data/${vars.dataset} ${vars.train_size} experiments/0${vars.experiment}/data/${vars.train} experiments/0${vars.experiment}/data/${vars.dev}"
  deps:
    - "experiments/0${vars.experiment}/data/${vars.dataset}"
    - "scripts/split_train_test.py"
  outputs:
    - "experiments/0${vars.experiment}/data/${vars.train}"
    - "experiments/0${vars.experiment}/data/${vars.dev}"

- name: "create-gazetteer"
  help: "Merge gazetter into single pattern file"
  script:
    - "python scripts/merge_gazetters.py ${vars.enabled_gazetteers} ${vars.person_gazetteer} ${vars.address_gazetteer} experiments/0${vars.experiment}/data/${vars.gazetteers_pattern}"
  deps:
    - "${vars.person_gazetteer}"
    - "${vars.address_gazetteer}"
    - "scripts/merge_gazetters.py"
  outputs:
    - "experiments/0${vars.experiment}/data/${vars.gazetteers_pattern}"

- name: "convert"
  help: "Convert the data to spaCy's binary format"
  script:
    - "mkdir -p experiments/0${vars.experiment}/corpus"
    - "python scripts/convert.py ${vars.lang} experiments/0${vars.experiment}/data/${vars.train} experiments/0${vars.experiment}/corpus/train.spacy"
    - "python scripts/convert.py ${vars.lang} experiments/0${vars.experiment}/data/${vars.dev} experiments/0${vars.experiment}/corpus/dev.spacy"
    - "python scripts/convert.py ${vars.lang} experiments/0${vars.experiment}/data/${vars.test} experiments/0${vars.experiment}/corpus/test.spacy"
  deps:
    - "experiments/0${vars.experiment}/data/${vars.train}"
    - "experiments/0${vars.experiment}/data/${vars.dev}"
    - "experiments/0${vars.experiment}/data/${vars.test}"
    - "scripts/convert.py"
  outputs:
    - "experiments/0${vars.experiment}/corpus/train.spacy"
    - "experiments/0${vars.experiment}/corpus/dev.spacy"
    - "experiments/0${vars.experiment}/corpus/test.spacy"

- name: "train"
  help: "Update the NER model"
  script:
    - "mkdir -p experiments/0${vars.experiment}/training"
    - "python -m spacy train experiments/0${vars.experiment}/configs/config.cfg --output experiments/0${vars.experiment}/training/ --paths.entity_ruler_patterns experiments/0${vars.experiment}/data/${vars.gazetteers_pattern} --paths.person_entity_ruler_patterns ${vars.person_entity_ruler_patterns} --paths.loc_entity_ruler_patterns ${vars.loc_entity_ruler_patterns} --paths.train experiments/0${vars.experiment}/corpus/train.spacy --paths.dev experiments/0${vars.experiment}/corpus/dev.spacy --gpu-id ${vars.gpu_id}"
  deps:
    - "experiments/0${vars.experiment}/configs/config.cfg"
    - "experiments/0${vars.experiment}/corpus/train.spacy"
    - "experiments/0${vars.experiment}/corpus/dev.spacy"
  outputs:
    - "experiments/0${vars.experiment}/training/model-best"

- name: "train-search"
  help: "Run customized training runs for hyperparameter search using [Weights & Biases Sweeps](https://docs.wandb.ai/guides/sweeps)"
  script:
    - "mkdir -p experiments/0${vars.experiment}/training"
    - "python scripts/train/wandb_sweeps.py experiments/0${vars.experiment}/configs/config.cfg experiments/0${vars.experiment}/training/ experiments/0${vars.experiment}/corpus/train.spacy experiments/0${vars.experiment}/corpus/dev.spacy experiments/0${vars.experiment}/corpus/train.spacy --gazetteer-path experiments/0${vars.experiment}/data/${vars.gazetteers_pattern}"
  deps:
    - "scripts/train/wandb_sweeps.py"
    - "experiments/0${vars.experiment}/configs/config.cfg"
    - "experiments/0${vars.experiment}/corpus/train.spacy"
    - "experiments/0${vars.experiment}/corpus/dev.spacy"
  outputs:
    - "experiments/0${vars.experiment}/training/model-best"

- name: "evaluate"
  help: "Evaluate the model and export metrics"
  script:
    - "python -m spacy evaluate experiments/0${vars.experiment}/training/model-best experiments/0${vars.experiment}/corpus/test.spacy --output experiments/0${vars.experiment}/metrics.json"
  deps:
    - "experiments/0${vars.experiment}/corpus/test.spacy"
    - "experiments/0${vars.experiment}/training/model-best"
  outputs:
    - "experiments/0${vars.experiment}/metrics.json"

- name: package
  help: "Package the trained model as a pip package"
  script:
    - "python -m spacy package experiments/0${vars.experiment}/training/model-best packages --name ${vars.name} --version ${vars.version} --force"
  deps:
    - "experiments/0${vars.experiment}/training/model-best"
  outputs_no_cache:
    - "packages/${vars.lang}_${vars.name}-${vars.version}/dist/${vars.lang}_${vars.name}-${vars.version}.tar.gz"

- name: visualize-model
  help: Visualize the model's output interactively using Streamlit
  # https://github.com/explosion/spacy-streamlit/issues/55
  script:
    - 'python -m streamlit run scripts/visualize_model.py experiments/0${vars.experiment}/training/model-best "AUTOMATION: Não aceite cobrança na entrega se o pedido foi pago pelo app e nunca compartilhe dados pessoais em conversas de chat ou telefone.'
  deps:
    - "scripts/visualize_model.py"
    - "experiments/0${vars.experiment}/training/model-best"

@honnibal
Member

Is the indentation right in 'commands' (maybe it's just a paste thing)? I'd have a quick look at how the file parses in a yaml-to-json converter, just to see if there's some stupid yaml whitespace thing.

@caiorcferreira
Author

I've tried adding indentation, but the error persists. Per the YAML spec, a list under a key can be written with or without indentation relative to that key.

Try the following YAML at https://onlineyamltools.com/convert-yaml-to-json

list:
- one
- two

And the output will be:

{
  "list": [
    "one",
    "two"
  ]
}
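
The same check can be run locally. The sketch below assumes the third-party PyYAML package is installed; it exercises only the YAML parser, not Weasel's own config loading.

```python
import yaml  # PyYAML, a third-party package (assumed to be installed)

# The list written flush with its key, as in the example above.
flush = yaml.safe_load("list:\n- one\n- two\n")

# The same list indented under its key.
indented = yaml.safe_load("list:\n  - one\n  - two\n")

print(flush)
print(flush == indented)
```

Both forms load to the same structure, so indentation of the list items alone should not turn commands into a string.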
