Weasel serializes `commands` field as string #94
Thanks. Could you paste your workflow file? This can't be the right error behaviour no matter what, but I'm trying to figure out whether it's doing this on a workflow that should work, or whether it's just heading down the wrong error path.
My workflow is a bit customized in how it uses variables, but it worked for a time. I wonder if some package version in my conda environment changed and led to this new behavior. Workflow file:

```yaml
title: "NER portuguese chat"
description: "Project tunning NER component in portuguese model using chat corpus"
# Variables can be referenced across the project.yml using ${vars.var_name}
vars:
  name: "core_chat_lg"
  lang: "pt"
  pipeline: "pt_core_news_lg"
  version: "0.0.0"
  dataset: "raw.json"
  train: "train.json"
  dev: "dev.json"
  test: "test.json"
  test_data: "assets/datasets/chats/sample-chats-manual-labeled-test.json"
  input_data: "assets/datasets/chats/sample-chats-manual-labeled-train.json"
  experiment: "01"
  train_size: 0.8
  enabled_gazetteers: "null"
  person_gazetteer: "assets/datasets/names_surnames/pt_br_names-gazetteer.jsonl"
  address_gazetteer: "assets/datasets/addresses/pt_br_address-gazetter.jsonl"
  person_entity_ruler_patterns: "null"
  loc_entity_ruler_patterns: "null"
  gazetteers_pattern: "gazetteers_patterns.jsonl"
  # Set your GPU ID, -1 is CPU
  gpu_id: -1

# These are the directories that the project needs. The project CLI will make
# sure that they always exist.
directories: ["assets", "scripts", "experiments", "baseline", "packages"]

# Assets that should be downloaded or available in the directory. We're shipping
# them with the project, so they won't have to be downloaded.
assets:
  - dest: "assets/train.json"
    description: "Training data"
  - dest: "assets/dev.json"
    description: "Development data"

# Workflows are sequences of commands (see below) executed in order. You can
# run them via "spacy project run [workflow]". If a command's inputs/outputs
# haven't changed, it won't be re-run.
workflows:
  experiment:
    - fetch-data
    - split-data
    - create-gazetteer
    - convert
    - train
    - evaluate
  experiment_search:
    - fetch-data
    - split-data
    - create-gazetteer
    - convert
    - train-search
    - evaluate
  experiment_new:
    - setup_experiment
    - create-config

# Project commands, specified in a style similar to CI config files (e.g. Azure
# pipelines). The name is the command name that lets you trigger the command
# via "spacy project run [command] [path]". The help message is optional and
# shown when executing "spacy project run [optional command] [path] --help".
commands:
  - name: "download"
    help: "Download the pretrained pipeline"
    script:
      - "python -m spacy download ${vars.pipeline}"
  - name: "setup_experiment"
    help: "Setup experiment directory structure"
    script:
      - "mkdir -p experiments/0${vars.experiment}/data experiments/0${vars.experiment}/configs experiments/0${vars.experiment}/training experiments/0${vars.experiment}/corpus experiments/0${vars.experiment}/scripts"
      - "touch experiments/0${vars.experiment}/README.md"
  - name: "create-config"
    help: "Create a config for updating only NER from an existing pipeline"
    script:
      - "python scripts/create_config.py ${vars.pipeline} ner experiments/0${vars.experiment}/data/${vars.gazetteers_pattern} ${vars.enabled_gazetteers} experiments/0${vars.experiment}/configs/config.cfg"
    deps:
      - "scripts/create_config.py"
    outputs:
      - "experiments/0${vars.experiment}/configs/config.cfg"
  - name: "fetch-data"
    help: "Fetch the training and test data"
    script:
      - "cp ${vars.input_data} experiments/0${vars.experiment}/data/${vars.dataset}"
      - "cp ${vars.test_data} experiments/0${vars.experiment}/data/${vars.test}"
    deps:
      - "${vars.input_data}"
      - "${vars.test_data}"
    outputs:
      - "experiments/0${vars.experiment}/data/${vars.dataset}"
      - "experiments/0${vars.experiment}/data/${vars.test}"
  - name: "split-data"
    help: "Split the data into training and eval sets, and copy the test data"
    script:
      - "python scripts/split_train_test.py experiments/0${vars.experiment}/data/${vars.dataset} ${vars.train_size} experiments/0${vars.experiment}/data/${vars.train} experiments/0${vars.experiment}/data/${vars.dev}"
    deps:
      - "experiments/0${vars.experiment}/data/${vars.dataset}"
      - "scripts/split_train_test.py"
    outputs:
      - "experiments/0${vars.experiment}/data/${vars.train}"
      - "experiments/0${vars.experiment}/data/${vars.dev}"
  - name: "create-gazetteer"
    help: "Merge gazetter into single pattern file"
    script:
      - "python scripts/merge_gazetters.py ${vars.enabled_gazetteers} ${vars.person_gazetteer} ${vars.address_gazetteer} experiments/0${vars.experiment}/data/${vars.gazetteers_pattern}"
    deps:
      - "${vars.person_gazetteer}"
      - "${vars.address_gazetteer}"
      - "scripts/merge_gazetters.py"
    outputs:
      - "experiments/0${vars.experiment}/data/${vars.gazetteers_pattern}"
  - name: "convert"
    help: "Convert the data to spaCy's binary format"
    script:
      - "mkdir -p experiments/0${vars.experiment}/corpus"
      - "python scripts/convert.py ${vars.lang} experiments/0${vars.experiment}/data/${vars.train} experiments/0${vars.experiment}/corpus/train.spacy"
      - "python scripts/convert.py ${vars.lang} experiments/0${vars.experiment}/data/${vars.dev} experiments/0${vars.experiment}/corpus/dev.spacy"
      - "python scripts/convert.py ${vars.lang} experiments/0${vars.experiment}/data/${vars.test} experiments/0${vars.experiment}/corpus/test.spacy"
    deps:
      - "experiments/0${vars.experiment}/data/${vars.train}"
      - "experiments/0${vars.experiment}/data/${vars.dev}"
      - "experiments/0${vars.experiment}/data/${vars.test}"
      - "scripts/convert.py"
    outputs:
      - "experiments/0${vars.experiment}/corpus/train.spacy"
      - "experiments/0${vars.experiment}/corpus/dev.spacy"
      - "experiments/0${vars.experiment}/corpus/test.spacy"
  - name: "train"
    help: "Update the NER model"
    script:
      - "mkdir -p experiments/0${vars.experiment}/training"
      - "python -m spacy train experiments/0${vars.experiment}/configs/config.cfg --output experiments/0${vars.experiment}/training/ --paths.entity_ruler_patterns experiments/0${vars.experiment}/data/${vars.gazetteers_pattern} --paths.person_entity_ruler_patterns ${vars.person_entity_ruler_patterns} --paths.loc_entity_ruler_patterns ${vars.loc_entity_ruler_patterns} --paths.train experiments/0${vars.experiment}/corpus/train.spacy --paths.dev experiments/0${vars.experiment}/corpus/dev.spacy --gpu-id ${vars.gpu_id}"
    deps:
      - "experiments/0${vars.experiment}/configs/config.cfg"
      - "experiments/0${vars.experiment}/corpus/train.spacy"
      - "experiments/0${vars.experiment}/corpus/dev.spacy"
    outputs:
      - "experiments/0${vars.experiment}/training/model-best"
  - name: "train-search"
    help: "Run customized training runs for hyperparameter search using [Weights & Biases Sweeps](https://docs.wandb.ai/guides/sweeps)"
    script:
      - "mkdir -p experiments/0${vars.experiment}/training"
      - "python scripts/train/wandb_sweeps.py experiments/0${vars.experiment}/configs/config.cfg experiments/0${vars.experiment}/training/ experiments/0${vars.experiment}/corpus/train.spacy experiments/0${vars.experiment}/corpus/dev.spacy experiments/0${vars.experiment}/corpus/train.spacy --gazetteer-path experiments/0${vars.experiment}/data/${vars.gazetteers_pattern}"
    deps:
      - "scripts/train/wandb_sweeps.py"
      - "experiments/0${vars.experiment}/configs/config.cfg"
      - "experiments/0${vars.experiment}/corpus/train.spacy"
      - "experiments/0${vars.experiment}/corpus/dev.spacy"
    outputs:
      - "experiments/0${vars.experiment}/training/model-best"
  - name: "evaluate"
    help: "Evaluate the model and export metrics"
    script:
      - "python -m spacy evaluate experiments/0${vars.experiment}/training/model-best experiments/0${vars.experiment}/corpus/test.spacy --output experiments/0${vars.experiment}/metrics.json"
    deps:
      - "experiments/0${vars.experiment}/corpus/test.spacy"
      - "experiments/0${vars.experiment}/training/model-best"
    outputs:
      - "experiments/0${vars.experiment}/metrics.json"
  - name: package
    help: "Package the trained model as a pip package"
    script:
      - "python -m spacy package experiments/0${vars.experiment}/training/model-best packages --name ${vars.name} --version ${vars.version} --force"
    deps:
      - "experiments/0${vars.experiment}/training/model-best"
    outputs_no_cache:
      - "packages/${vars.lang}_${vars.name}-${vars.version}/dist/${vars.lang}_${vars.name}-${vars.version}.tar.gz"
  - name: visualize-model
    help: Visualize the model's output interactively using Streamlit
    # https://github.com/explosion/spacy-streamlit/issues/55
    script:
      - 'python -m streamlit run scripts/visualize_model.py experiments/0${vars.experiment}/training/model-best "AUTOMATION: Não aceite cobrança na entrega se o pedido foi pago pelo app e nunca compartilhe dados pessoais em conversas de chat ou telefone.'
    deps:
      - "scripts/visualize_model.py"
      - "experiments/0${vars.experiment}/training/model-best"
```
Is the indentation right in 'commands' (maybe it's just a paste thing)? I'd have a quick look at how the file parses in a yaml-to-json converter, just to see if there's some stupid yaml whitespace thing.
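The same check can be done locally with PyYAML; this is just a sketch using a trimmed-down `commands` section rather than the full file. A correct parse should yield a list of dicts:

```python
import yaml

# Minimal stand-in for the project.yml "commands" section.
snippet = """
commands:
- name: "download"
  help: "Download the pretrained pipeline"
"""

data = yaml.safe_load(snippet)
print(type(data["commands"]))       # expect a list, not a str
print(data["commands"][0]["name"])  # expect: download
```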
I've tried adding indentation, but the error persists. Per the YAML spec, lists can be declared with or without indentation. Try the following YAML at https://onlineyamltools.com/convert-yaml-to-json:

```yaml
list:
- one
- two
```

And the output will be:

```json
{
  "list": [
    "one",
    "two"
  ]
}
```
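That equivalence is easy to confirm programmatically as well (a small sketch with PyYAML; both spellings of the list parse to the same structure):

```python
import yaml

# A block sequence nested under a mapping key may be indented or left
# flush with the key; YAML treats both forms identically.
flush = "list:\n- one\n- two\n"
indented = "list:\n  - one\n  - two\n"

assert yaml.safe_load(flush) == yaml.safe_load(indented) == {"list": ["one", "two"]}
print(yaml.safe_load(flush))  # {'list': ['one', 'two']}
```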
Description

It looks like Weasel is reading `commands` as a string rather than a list. This causes access to the `name` field to raise the error `TypeError: string indices must be integers`.

Environment

```
Name      Version   Build            Channel
weasel    0.3.4     py39hca03da5_0
```
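The reported `TypeError` is consistent with that reading: iterating a string yields single characters, and indexing a character with a string key fails. A hypothetical minimal reproduction in plain Python (not Weasel's actual code):

```python
# If "commands" comes back from the YAML parse as one big string instead
# of a list of dicts, code that expects a list of command mappings breaks.
commands = '- name: "download"'  # a str, not a list

err = None
try:
    for cmd in commands:  # iterates the characters of the string
        cmd["name"]       # a str can only be indexed by int or slice
except TypeError as exc:
    err = exc

print(err)  # "string indices must be integers" on Python 3.9
```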