Add separate textcat_multilabel project
adrianeboyd committed Feb 22, 2021
1 parent 62598b3 commit 88d1628
Showing 13 changed files with 4,776 additions and 0 deletions.
3 changes: 3 additions & 0 deletions pipelines/textcat_multilabel_demo/.gitignore
@@ -0,0 +1,3 @@
corpus
packages
training
49 changes: 49 additions & 0 deletions pipelines/textcat_multilabel_demo/README.md
@@ -0,0 +1,49 @@
<!-- SPACY PROJECT: AUTO-GENERATED DOCS START (do not remove) -->

# 🪐 spaCy Project: Demo Textcat (Text Classification)

A minimal demo textcat_multilabel project for spaCy v3.

## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[spaCy projects documentation](https://spacy.io/usage/projects).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `convert` | Convert the data to spaCy's binary format |
| `train` | Train the textcat model |
| `evaluate` | Evaluate the model and export metrics |
| `package` | Package the trained model as a pip package |
| `visualize-model` | Visualize the model's output interactively using Streamlit |

### ⏭ Workflows

The following workflows are defined by the project. They
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `convert` &rarr; `train` &rarr; `evaluate` &rarr; `package` |

### 🗂 Assets

The following assets are defined by the project. They can
be fetched by running [`spacy project assets`](https://spacy.io/api/cli#project-assets)
in the project directory.

| File | Source | Description |
| --- | --- | --- |
| [`assets/cooking-train.jsonl`](assets/cooking-train.jsonl) | Local | Demo training data |
| [`assets/cooking-dev.jsonl`](assets/cooking-dev.jsonl) | Local | Demo development data |

<!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) -->
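
As a rough sketch (not part of this commit's README), the same commands and workflows can also be driven programmatically, mirroring the smoke test included at the bottom of this commit. The project path below is an assumption about where the project has been cloned:

```python
# Minimal sketch: run the "all" workflow (convert -> train -> evaluate -> package)
# from Python instead of the `spacy project run` CLI. Mirrors the included test.
from pathlib import Path

from spacy.cli.project.assets import project_assets
from spacy.cli.project.run import project_run

root = Path("pipelines/textcat_multilabel_demo")  # adjust to your checkout
project_assets(root)      # verify/fetch the assets listed in project.yml
project_run(root, "all")  # commands are skipped if their inputs are unchanged
```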
6 changes: 6 additions & 0 deletions pipelines/textcat_multilabel_demo/assets/.gitattributes
@@ -0,0 +1,6 @@
# This is needed to ensure that text-based assets included with project
# templates and cloned via Git end up with consistent line endings and
# the same checksums. It will prevent Git from converting line endings.
# Otherwise, a user cloning assets on Windows may end up with a different
# checksum due to different line endings.
* -text
428 changes: 428 additions & 0 deletions pipelines/textcat_multilabel_demo/assets/CC_BY-SA-4.0.txt

Large diffs are not rendered by default.

9 changes: 9 additions & 0 deletions pipelines/textcat_multilabel_demo/assets/README.md
@@ -0,0 +1,9 @@
### Data Source

* https://cooking.stackexchange.com. The meta IDs link to the
original question as `https://cooking.stackexchange.com/questions/ID`, e.g.,
`https://cooking.stackexchange.com/questions/2` for the first instance.
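
For illustration, a small sketch of resolving a record back to its source question. The `"meta"`/`"id"` key names are assumptions here; check the JSONL schema before relying on them:

```python
# Hedged sketch: print the source URL for the first record in the dev set.
import srsly

record = next(iter(srsly.read_jsonl("assets/cooking-dev.jsonl")))
qid = record.get("meta", {}).get("id")  # assumed key names
if qid is not None:
    print(f"https://cooking.stackexchange.com/questions/{qid}")
```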

### Data License

* CC BY-SA 4.0 ([`CC_BY-SA-4.0.txt`](CC_BY-SA-4.0.txt))
2,000 changes: 2,000 additions & 0 deletions pipelines/textcat_multilabel_demo/assets/cooking-dev.jsonl

Large diffs are not rendered by default.

2,000 changes: 2,000 additions & 0 deletions pipelines/textcat_multilabel_demo/assets/cooking-train.jsonl

Large diffs are not rendered by default.

140 changes: 140 additions & 0 deletions pipelines/textcat_multilabel_demo/configs/config.cfg
@@ -0,0 +1,140 @@
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
seed = 0
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.textcat_multilabel.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 64
rows = [2000,2000,1000,1000,1000,1000]
attrs = ["ORTH","LOWER","PREFIX","SUFFIX","SHAPE","ID"]
include_static_vectors = false

[components.textcat_multilabel.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 64
window_size = 1
maxout_pieces = 3
depth = 2

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1000
max_epochs = 0
max_steps = 2000
eval_frequency = 100
frozen_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null
cats_score = 1.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
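
As a rough sketch (not part of this commit), the config above can be loaded with spaCy's config utilities to build the untrained pipeline; actual training is driven by the `train` command in project.yml:

```python
# Sketch: build the (untrained) pipeline described by config.cfg.
# Training itself runs via `spacy train` in the project's "train" command.
from spacy import util

config = util.load_config(
    "configs/config.cfg",
    overrides={"paths.train": "corpus/train.spacy", "paths.dev": "corpus/dev.spacy"},
)
nlp = util.load_model_from_config(config, auto_fill=True)
print(nlp.pipe_names)  # expected: ['textcat_multilabel']
```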
92 changes: 92 additions & 0 deletions pipelines/textcat_multilabel_demo/project.yml
@@ -0,0 +1,92 @@
title: "Demo Textcat (Text Classification)"
description: "A minimal demo textcat_multilabel project for spaCy v3."
# Variables can be referenced across the project.yml using ${vars.var_name}
vars:
name: "textcat_multilabel_demo"
# Supported languages: all except ja, ko, th, vi, and zh, which would require
# custom tokenizer settings in config.cfg
lang: "en"
# Set your GPU ID, -1 is CPU
gpu_id: -1
version: "0.0.0"
train: "cooking-train.jsonl"
dev: "cooking-dev.jsonl"
config: "config.cfg"

# These are the directories that the project needs. The project CLI will make
# sure that they always exist.
directories: ["assets", "corpus", "configs", "training", "scripts", "packages"]

# Assets that should be downloaded or available in the directory. We're shipping
# them with the project, so they won't have to be downloaded.
assets:
- dest: "assets/${vars.train}"
description: "Training data from cooking.stackexchange.com"
- dest: "assets/${vars.dev}"
description: "Development data from cooking.stackexchange.com"

# Workflows are sequences of commands (see below) executed in order. You can
# run them via "spacy project run [workflow]". If a commands's inputs/outputs
# haven't changed, it won't be re-run.
workflows:
  all:
    - convert
    - train
    - evaluate
    - package

# Project commands, specified in a style similar to CI config files (e.g. Azure
# pipelines). The name is the command name that lets you trigger the command
# via "spacy project run [command] [path]". The help message is optional and
# shown when executing "spacy project run [optional command] [path] --help".
commands:
- name: "convert"
help: "Convert the data to spaCy's binary format"
script:
- "python scripts/convert.py ${vars.lang} assets/${vars.train} corpus/train.spacy"
- "python scripts/convert.py ${vars.lang} assets/${vars.dev} corpus/dev.spacy"
deps:
- "assets/${vars.train}"
- "assets/${vars.dev}"
- "scripts/convert.py"
outputs:
- "corpus/train.spacy"
- "corpus/dev.spacy"

- name: "train"
help: "Train the textcat model"
script:
- "python -m spacy train configs/${vars.config} --output training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --nlp.lang ${vars.lang} --gpu-id ${vars.gpu_id}"
deps:
- "configs/${vars.config}"
- "corpus/train.spacy"
- "corpus/dev.spacy"
outputs:
- "training/model-best"

- name: "evaluate"
help: "Evaluate the model and export metrics"
script:
- "python -m spacy evaluate training/model-best corpus/dev.spacy --output training/metrics.json"
deps:
- "corpus/dev.spacy"
- "training/model-best"
outputs:
- "training/metrics.json"

- name: package
help: "Package the trained model as a pip package"
script:
- "python -m spacy package training/model-best packages --name ${vars.name} --version ${vars.version} --force"
deps:
- "training/model-best"
outputs_no_cache:
- "packages/en_${vars.name}-${vars.version}/dist/en_${vars.name}-${vars.version}.tar.gz"

- name: visualize-model
help: Visualize the model's output interactively using Streamlit
script:
- "streamlit run scripts/visualize_model.py training/model-best \"How can I get chewy chocolate chip cookies?\n<p>My chocolate chips cookies are always too crisp. How can I get chewy cookies, like those of Starbucks?</p>\n<hr/>\n<p>Thank you to everyone who has answered. So far the tip that had the biggest impact was to chill and rest the dough, however I also increased the brown sugar ratio and increased a bit the butter. Also adding maple syrup helped. </p>\""
deps:
- "scripts/visualize_model.py"
- "training/model-best"
2 changes: 2 additions & 0 deletions pipelines/textcat_multilabel_demo/requirements.txt
@@ -0,0 +1,2 @@
spacy-streamlit>=1.0.0a0
streamlit
23 changes: 23 additions & 0 deletions pipelines/textcat_multilabel_demo/scripts/convert.py
@@ -0,0 +1,23 @@
"""Convert textcat annotation from JSONL to spaCy v3 .spacy format."""
import srsly
import typer
import warnings
from pathlib import Path

import spacy
from spacy.tokens import DocBin


def convert(lang: str, input_path: Path, output_path: Path):
    nlp = spacy.blank(lang)
    docs = []
    for line in srsly.read_jsonl(input_path):
        doc = nlp.make_doc(line["text"])
        doc.cats = line["cats"]
        docs.append(doc)
    db = DocBin(docs=docs)
    db.to_disk(output_path)


if __name__ == "__main__":
    typer.run(convert)
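
A short sketch of the shape `convert()` expects on the way in and produces on the way out; the example categories are illustrative, not the dataset's actual label set:

```python
# Input: JSONL with one record per line, e.g. (illustrative labels):
#   {"text": "How can I get chewy cookies?", "cats": {"baking": 1.0, "equipment": 0.0}}
# Output: a DocBin (.spacy file) whose docs carry the cats as annotations.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin().from_disk("corpus/dev.spacy")  # written by convert()
doc = next(db.get_docs(nlp.vocab))
print(doc.text[:60], doc.cats)
```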
14 changes: 14 additions & 0 deletions pipelines/textcat_multilabel_demo/scripts/visualize_model.py
@@ -0,0 +1,14 @@
import spacy_streamlit
import typer


def main(models: str, default_text: str):
    models = [name.strip() for name in models.split(",")]
    spacy_streamlit.visualize(models, default_text, visualizers=["textcat"])


if __name__ == "__main__":
    try:
        typer.run(main)
    except SystemExit:
        pass
@@ -0,0 +1,10 @@
from spacy.cli.project.run import project_run
from spacy.cli.project.assets import project_assets
from pathlib import Path


def test_textcat_multilabel_demo_project():
    root = Path(__file__).parent
    project_assets(root)
    project_run(root, "all", capture=True)
    project_run(root, "package", capture=True)
