# CASPAR frame semantics parser

CASPAR is a frame semantics parser trained on OntoNotes 5 data. Mentions,
entity types (partial), and
[PropBank](https://propbank.github.io/) semantic role labels are extracted from
this corpus to produce a frame semantic corpus in SLING document format.

We use the standard [CoNLL-2012](http://conll.cemantix.org/2012/data.html) split
of the data to produce training, development, and test corpora.

## Preparing the training data

The LDC2013T19 OntoNotes 5 corpus is needed to produce the training data for
CASPAR. The corpus is licensed by the LDC, and you need an LDC license to use it:

https://catalog.ldc.upenn.edu/LDC2013T19

To prepare the training data for the parser, place `LDC2013T19.tar.gz` in
`local/data/corpora/ontonotes` and run the `make_corpus.sh` script:

```
sling/nlp/parser/ontonotes/make_corpus.sh
```

This script will perform the following steps to produce the training data:

* Unpack OntoNotes 5.
* Download and unpack the CoNLL-formatted OntoNotes 5 data and tools.
* Generate CoNLL files from the OntoNotes data.
* Convert CoNLL files to SLING format.
* Shuffle the training data.

This will put the training and evaluation data into `local/data/corpora/caspar`:
* `train.rec` contains the training data.
* `dev.rec` contains the development data.
* `test.rec` contains the evaluation data.
* `train_shuffled.rec` contains the shuffled training data.

## Pre-trained word embeddings

The CASPAR parser uses pre-trained word embeddings which can be downloaded from
here:
```
curl http://www.jbox.dk/sling/word2vec-32-embeddings.bin -o /tmp/word2vec-32-embeddings.bin
```

These are 32-dimensional word embeddings trained on news text in
[Mikolov's word2vec format](https://github.com/tmikolov/word2vec/blob/master/word2vec.c).

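The binary layout used by word2vec is simple enough to sketch: a text header
line `"<vocab_size> <dim>\n"`, then for each word the token bytes, a single
space, and `<dim>` little-endian float32 values. The following is a
self-contained round-trip illustration, not the SLING loader; writer details
(such as a trailing newline after each vector) vary between implementations:

```python
import os
import struct
import tempfile

# Write a tiny embedding file in word2vec binary layout.
vectors = {"eric": [0.1, 0.2, 0.3], "hannah": [0.4, 0.5, 0.6]}
dim = 3
path = os.path.join(tempfile.mkdtemp(), "tiny-embeddings.bin")
with open(path, "wb") as f:
    f.write(b"%d %d\n" % (len(vectors), dim))  # header: vocab size and dim
    for word, vec in vectors.items():
        f.write(word.encode("utf-8") + b" ")   # token bytes + separator
        f.write(struct.pack("<%df" % dim, *vec))  # dim little-endian float32s

# Read it back.
loaded = {}
with open(path, "rb") as f:
    vocab_size, d = map(int, f.readline().split())
    for _ in range(vocab_size):
        word = bytearray()
        while (c := f.read(1)) != b" ":        # token ends at the space
            word.extend(c)
        loaded[word.decode("utf-8")] = list(
            struct.unpack("<%df" % d, f.read(4 * d)))

print(loaded["eric"])  # round-trips the stored vector (float32 precision)
```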
## Train a CASPAR parser model

The `sling/nlp/parser/tools/train_caspar.py` Python script contains an example of
how to train the CASPAR parser model:

```python
import sling
import sling.flags as flags
import sling.task.workflow as workflow

# Start up workflow system.
flags.parse()
workflow.startup()

# Create workflow.
wf = workflow.Workflow("parser-training")

# Parser trainer inputs and outputs.
training_corpus = wf.resource(
    "local/data/corpora/caspar/train_shuffled.rec",
    format="record/document"
)

evaluation_corpus = wf.resource(
    "local/data/corpora/caspar/dev.rec",
    format="record/document"
)

word_embeddings = wf.resource(
    "local/data/corpora/caspar/word2vec-32-embeddings.bin",
    format="embeddings"
)

parser_model = wf.resource(
    "local/data/e/caspar/caspar.flow",
    format="flow"
)

# Parser trainer task.
trainer = wf.task("caspar-trainer")

trainer.add_params({
    "learning_rate": 1.0,
    "learning_rate_decay": 0.8,
    "clipping": 1,
    "optimizer": "sgd",
    "epochs": 50000,
    "batch_size": 32,
    "rampup": 120,
    "report_interval": 500
})

trainer.attach_input("training_corpus", training_corpus)
trainer.attach_input("evaluation_corpus", evaluation_corpus)
trainer.attach_input("word_embeddings", word_embeddings)
trainer.attach_output("model", parser_model)

# Run parser trainer.
workflow.run(wf)

# Shut down.
workflow.shutdown()
```

This model takes ~90 minutes to train. It will output evaluation metrics
every 500 epochs, and when it is done, the final parser model will be written
to `local/data/e/caspar/caspar.flow`.

If you don't have access to OntoNotes 5, you can download a pre-trained model
from [here](http://www.jbox.dk/sling/caspar.flow).

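The trainer parameters above select plain SGD with gradient clipping
(`"clipping": 1`). As a rough illustration of what norm clipping does, here is
a minimal sketch; the exact semantics of the `clipping` parameter in the SLING
trainer (e.g. whether it clips per tensor or globally) are an assumption here:

```python
def clip_by_norm(grad, max_norm):
    """Rescale a gradient vector so its L2 norm does not exceed max_norm."""
    norm = sum(g * g for g in grad) ** 0.5
    if norm <= max_norm:
        return grad
    return [g / norm * max_norm for g in grad]

# A gradient of norm 5 is rescaled to norm 1; small gradients pass through.
print(clip_by_norm([3.0, 4.0], 1.0))  # [0.6, 0.8]
print(clip_by_norm([0.3, 0.4], 1.0))  # [0.3, 0.4]
```

Clipping like this keeps individual SGD steps bounded, which matters with the
relatively high initial learning rate (1.0) used above.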
## Testing the CASPAR parser

SLING comes with a [parsing tool](../../sling/nlp/parser/tools/parse.cc)
for annotating a corpus of documents with frames using a parser model,
benchmarking this annotation process, and optionally evaluating the annotated
frames against supplied gold frames.

This tool takes the following command-line arguments:

* `--parser` : This should point to a Myelin flow, e.g. one created by the
  parser trainer.
* If `--text` is specified then the parser is run over the supplied text, and
  the annotated frame(s) are printed in text mode, e.g.:
  ```
  $ bazel-bin/sling/nlp/parser/tools/parse \
     --parser local/data/e/caspar/caspar.flow --text="Eric loves Hannah."

  {=#1
  :document
  text: "Eric loves Hannah."
  tokens: [{=#2
    start: 0
    size: 4
  }, {=#3
    start: 5
    size: 5
  }, {=#4
    start: 11
    size: 6
  }, {=#5
    start: 17
    break: 0
  }]
  mention: {=#6
    begin: 0
    evokes: {=#7
      :PERSON
    }
  }
  mention: {=#8
    begin: 1
    evokes: {=#9
      :/pb/predicate
      /pb/ARG0: #7
      /pb/ARG1: {=#10
        :PERSON
      }
    }
  }
  mention: {=#11
    begin: 2
    evokes: #10
  }
  }

  ```
* If `--benchmark` is specified then the parser is run on the document
  corpus specified via `--corpus`. This corpus should be prepared similarly to
  how the training/dev corpora were created. The processing can be limited to
  the first N documents by specifying `--maxdocs N`.

  ```
  $ bazel-bin/sling/nlp/parser/tools/parse --parser local/data/e/caspar/caspar.flow \
      --benchmark --corpus local/data/corpora/caspar/dev.rec
  [... I sling/nlp/parser/tools/parse.cc:131] Load parser from local/data/e/caspar/caspar.flow
  [... I sling/nlp/parser/tools/parse.cc:140] 34.7227 ms loading parser
  [... I sling/nlp/parser/tools/parse.cc:204] Benchmarking parser on local/data/corpora/caspar/dev.rec
  [... I sling/nlp/parser/tools/parse.cc:227] 9603 documents, 163104 tokens, 7970.69 tokens/sec
  ```

  If `--profile` is specified, the parser will run with profiling
  instrumentation enabled and output a detailed profile report with execution
  timing for each operation in the neural network.

* If `--evaluate` is specified then the tool expects `--corpus` to specify
  a corpus with gold frames. It then runs the parser model over a frame-less
  version of this corpus and evaluates the annotated frames against the gold
  frames. Again, one can use `--maxdocs` to limit the evaluation to the first N
  documents.
  ```
  $ bazel-bin/sling/nlp/parser/tools/parse --parser local/data/e/caspar/caspar.flow \
      --evaluate --corpus local/data/corpora/caspar/dev.rec
  [... I sling/nlp/parser/tools/parse.cc:131] Load parser from local/data/e/caspar/caspar.flow
  [... I sling/nlp/parser/tools/parse.cc:140] 34.7368 ms loading parser
  [... I sling/nlp/parser/tools/parse.cc:235] Evaluating parser on local/data/corpora/caspar/dev.rec
  SPAN_P+=76898
  SPAN_P-=6800
  SPAN_R+=76898
  SPAN_R-=6192
  SPAN_Precision=91.8755
  SPAN_Recall=92.5478
  SPAN_F1=92.2105
  FRAME_P+=77866
  FRAME_P-=5859
  FRAME_R+=77859
  FRAME_R-=5233
  FRAME_Precision=93.0021
  FRAME_Recall=93.7022
  FRAME_F1=93.3508
  TYPE_P+=74277
  TYPE_P-=9448
  TYPE_R+=74275
  TYPE_R-=8817
  TYPE_Precision=88.7154
  TYPE_Recall=89.3889
  TYPE_F1=89.0509
  ROLE_P+=37762
  ROLE_P-=16848
  ROLE_R+=37755
  ROLE_R-=16397
  ROLE_Precision=69.1485
  ROLE_Recall=69.7204
  ROLE_F1=69.4333
  LABEL_P+=0
  LABEL_P-=0
  LABEL_R+=0
  LABEL_R-=0
  LABEL_Precision=0
  LABEL_Recall=0
  LABEL_F1=0
  SLOT_P+=112039
  SLOT_P-=26296
  SLOT_R+=112030
  SLOT_R-=25214
  SLOT_Precision=80.9911
  SLOT_Recall=81.6283
  SLOT_F1=81.3085
  COMBINED_P+=266803
  COMBINED_P-=38955
  COMBINED_R+=266787
  COMBINED_R-=36639
  COMBINED_Precision=87.2595
  COMBINED_Recall=87.9249
  COMBINED_F1=87.591
  #GOLDEN_SPANS=83090
  #PREDICTED_SPANS=83698
  #GOLDEN_FRAMES=83092
  #PREDICTED_FRAMES=83725
  ```

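Each block of counts in the evaluation output follows the same pattern:
`Precision = P+ / (P+ + P-)`, `Recall = R+ / (R+ + R-)`, and F1 is their
harmonic mean. The counts are also consistent with `SLOT` summing the `TYPE`,
`ROLE`, and `LABEL` counts, and `COMBINED` summing `SPAN`, `FRAME`, and `SLOT`;
this aggregation is inferred from the numbers, not from the evaluator's source.
A quick check:

```python
def prf(p_plus, p_minus, r_plus, r_minus):
    """Precision/recall as percentages from true/false counts; F1 as harmonic mean."""
    precision = 100.0 * p_plus / (p_plus + p_minus)
    recall = 100.0 * r_plus / (r_plus + r_minus)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# SPAN counts from the evaluation run above.
p, r, f1 = prf(76898, 6800, 76898, 6192)
# Reproduces SPAN_Precision=91.8755, SPAN_Recall=92.5478, SPAN_F1=92.2105.

# The aggregate counts line up with the sub-metrics:
assert 74277 + 37762 + 0 == 112039       # TYPE + ROLE + LABEL == SLOT_P+
assert 76898 + 77866 + 112039 == 266803  # SPAN + FRAME + SLOT == COMBINED_P+
```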
## Using the CASPAR parser in Python

You can use the parser in Python via the `Parser` class in the Python SLING
API, e.g.:
```python
import sling

parser = sling.Parser("local/data/e/caspar/caspar.flow")

text = input("text: ")
doc = parser.parse(text)
print(doc.frame.data(pretty=True))
for m in doc.mentions:
  print("mention", doc.phrase(m.begin, m.end))
```

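Here `doc.phrase(m.begin, m.end)` recovers a mention's surface text from token
offsets. Conceptually it works like the plain-Python sketch below, which is
independent of the SLING API and uses the token layout from the `--text`
example earlier (each token is a `(start, size)` pair of byte offsets into the
text, and mention boundaries are token indices `[begin, end)`):

```python
# Tokens for "Eric loves Hannah." as in the parser output above; the final
# token's size defaults to 1 when omitted.
text = "Eric loves Hannah."
tokens = [(0, 4), (5, 5), (11, 6), (17, 1)]

def phrase(begin, end):
    """Surface text spanning tokens [begin, end)."""
    start = tokens[begin][0]
    stop = tokens[end - 1][0] + tokens[end - 1][1]
    return text[start:stop]

print(phrase(0, 1))  # Eric
print(phrase(2, 3))  # Hannah
```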
## Using the CASPAR parser in C++

SLING has a C++ API for the parser. The model can be loaded and initialized in
the following way:

```c++
#include <iostream>

#include "sling/frame/store.h"
#include "sling/nlp/document/document.h"
#include "sling/nlp/document/document-tokenizer.h"
#include "sling/nlp/parser/parser.h"

// Load parser model.
sling::Store commons;
sling::nlp::Parser parser;
parser.Load(&commons, "local/data/e/caspar/caspar.flow");
commons.Freeze();

// Create document tokenizer.
sling::nlp::DocumentTokenizer tokenizer;
```

In order to parse some text, it first needs to be tokenized. The document with
text, tokens, and frames is stored in a local document frame store.

```c++
// Create frame store for document.
sling::Store store(&commons);
sling::nlp::Document document(&store);

// Tokenize text.
string text = "Eric loves Hannah.";
tokenizer.Tokenize(&document, text);

// Parse document.
parser.Parse(&document);
document.Update();

// Output document annotations.
std::cout << sling::ToText(document.top(), 2);
```