This repository was archived by the owner on Jan 10, 2023. It is now read-only.

Commit 93541e9

Integrate link graph into main wiki pipeline (#429)

1 parent 8d85040 commit 93541e9

19 files changed: +679 -366 lines
.bazelrc

-3

```
@@ -8,6 +8,3 @@ build --cxxopt=-Wno-undefined-var-template
 build --cxxopt=-Wno-attributes
 build --spawn_strategy=standalone

-# You may want to comment out the following line on older versions of GCC (< v5).
-build --cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0
```
README.md

+3 -3

```
@@ -44,7 +44,8 @@ intervening symbolic representation.

 The SLING framework includes an efficient and scalable
 [frame store](doc/guide/frames.md) implementation as well as a
-[neural network JIT compiler](doc/guide/myelin.md) for fast parsing at runtime.
+[neural network JIT compiler](doc/guide/myelin.md) for fast training and
+parsing.

 A more detailed description of the SLING parser can be found in this paper:

@@ -56,8 +57,7 @@ A more detailed description of the SLING parser can be found in this paper:
 ## More information ...

 * [Installation and building](doc/guide/install.md)
-* [Training a parser](doc/guide/training.md)
-* [Running the parser](doc/guide/parsing.md)
+* [CASPAR frame semantics parser](doc/guide/caspar.md)
 * [Semantic frames](doc/guide/frames.md)
 * [SLING Python API](doc/guide/pyapi.md)
 * [Myelin neural network JIT compiler](doc/guide/myelin.md)
```

doc/guide/README.md

+6 -2

```
@@ -1,10 +1,14 @@
 # SLING Guides

 * [SLING installation and building](install.md)
-* [Training a SLING parser](training.md)
-* [Parsing with SLING](parsing.md)
+* [CASPAR frame semantics parser](caspar.md)
 * [SLING frames](frames.md)
 * [SLING Python API](pyapi.md)
 * [Myelin neural network JIT compiler](myelin.md)
 * [Wikipedia and Wikidata processing](wikiflow.md)
+
+## Outdated guides
+
+* [Training a SLING parser](training.md)
+* [Parsing with SLING](parsing.md)
```
doc/guide/caspar.md

+316 (new file)

# CASPAR frame semantics parser

CASPAR is a frame semantics parser trained on OntoNotes 5 data. Mentions,
entity types (partial), and
[PropBank](https://propbank.github.io/) semantic role labels are extracted from
this corpus to produce a frame semantic corpus in SLING document format.

We use the standard [CoNLL-2012](http://conll.cemantix.org/2012/data.html) split
of the data to produce training, development, and test corpora.

## Preparing the training data

The LDC2013T19 OntoNotes 5 corpus is needed to produce the training data for
CASPAR. The corpus is licensed by LDC, and you need an LDC license to use it:

https://catalog.ldc.upenn.edu/LDC2013T19

To prepare the training data for the parser, place `LDC2013T19.tar.gz` in
`local/data/corpora/ontonotes` and run the `make_corpus.sh` script:

```
sling/nlp/parser/ontonotes/make_corpus.sh
```

This script performs the following steps to produce the training data:

* Unpack OntoNotes 5.
* Download and unpack the CoNLL-formatted OntoNotes 5 data and tools.
* Generate CoNLL files from the OntoNotes data.
* Convert the CoNLL files to SLING format.
* Shuffle the training data.

This puts the training and evaluation data into `local/data/corpora/caspar`:

* `train.rec` contains the training data.
* `dev.rec` contains the development data.
* `test.rec` contains the evaluation data.
* `train_shuffled.rec` contains the shuffled training data.

## Pre-trained word embeddings

The CASPAR parser uses pre-trained word embeddings, which can be downloaded
with:

```
curl http://www.jbox.dk/sling/word2vec-32-embeddings.bin -o local/data/corpora/caspar/word2vec-32-embeddings.bin
```

These are 32-dimensional word embeddings trained on news text in
[Mikolov's word2vec format](https://github.com/tmikolov/word2vec/blob/master/word2vec.c).
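
The word2vec binary format itself is simple: an ASCII header with the
vocabulary size and vector dimensionality, followed by each word and its raw
float32 vector. The following reader is a hypothetical helper for inspecting
such a file, not part of SLING, and it assumes the common variant that writes a
newline after each vector:

```python
import io
import struct

def read_word2vec(stream, max_words=None):
    """Parse embeddings in word2vec binary format from a binary stream."""
    # Header line: "<vocab_size> <dim>\n" in ASCII.
    header = b""
    while not header.endswith(b"\n"):
        header += stream.read(1)
    vocab_size, dim = (int(x) for x in header.split())

    count = vocab_size if max_words is None else min(vocab_size, max_words)
    embeddings = {}
    for _ in range(count):
        # The word is stored as bytes terminated by a space.
        word = b""
        while True:
            ch = stream.read(1)
            if ch == b" ":
                break
            word += ch
        # The vector is dim little-endian float32 values.
        vector = struct.unpack("<%df" % dim, stream.read(4 * dim))
        embeddings[word.decode("utf8")] = vector
        stream.read(1)  # skip the newline that follows each vector
    return vocab_size, dim, embeddings

# Example with an in-memory file holding two 3-dimensional vectors.
data = (b"2 3\n" +
        b"the " + struct.pack("<3f", 1.0, 2.0, 3.0) + b"\n" +
        b"cat " + struct.pack("<3f", 4.0, 5.0, 6.0) + b"\n")
vocab_size, dim, embeddings = read_word2vec(io.BytesIO(data))
```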
## Train a CASPAR parser model

The `sling/nlp/parser/tools/train_caspar.py` Python script contains an example
of how to train the CASPAR parser model:

```python
import sling
import sling.flags as flags
import sling.task.workflow as workflow

# Start up workflow system.
flags.parse()
workflow.startup()

# Create workflow.
wf = workflow.Workflow("parser-training")

# Parser trainer inputs and outputs.
training_corpus = wf.resource(
  "local/data/corpora/caspar/train_shuffled.rec",
  format="record/document"
)

evaluation_corpus = wf.resource(
  "local/data/corpora/caspar/dev.rec",
  format="record/document"
)

word_embeddings = wf.resource(
  "local/data/corpora/caspar/word2vec-32-embeddings.bin",
  format="embeddings"
)

parser_model = wf.resource(
  "local/data/e/caspar/caspar.flow",
  format="flow"
)

# Parser trainer task.
trainer = wf.task("caspar-trainer")

trainer.add_params({
  "learning_rate": 1.0,
  "learning_rate_decay": 0.8,
  "clipping": 1,
  "optimizer": "sgd",
  "epochs": 50000,
  "batch_size": 32,
  "rampup": 120,
  "report_interval": 500
})

trainer.attach_input("training_corpus", training_corpus)
trainer.attach_input("evaluation_corpus", evaluation_corpus)
trainer.attach_input("word_embeddings", word_embeddings)
trainer.attach_output("model", parser_model)

# Run parser trainer.
workflow.run(wf)

# Shut down.
workflow.shutdown()
```

This model takes ~90 minutes to train. It outputs evaluation metrics every
500 epochs, and when training is done, the final parser model is written
to `local/data/e/caspar/caspar.flow`.
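
For orientation, `learning_rate_decay: 0.8` gives a geometrically shrinking
step size: each time the trainer applies decay, the current rate is multiplied
by 0.8. When exactly decay is triggered is internal to the trainer, so treat
this as a sketch of the schedule shape, not of SLING's trainer logic:

```python
def decayed_rate(base_rate, decay, num_decays):
    """Learning rate after decay has been applied num_decays times."""
    return base_rate * decay ** num_decays

# With learning_rate=1.0 and learning_rate_decay=0.8 from the parameters above:
schedule = [decayed_rate(1.0, 0.8, n) for n in range(4)]
# schedule ≈ [1.0, 0.8, 0.64, 0.512]
```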

If you don't have access to OntoNotes 5, you can download a pre-trained model
from [here](http://www.jbox.dk/sling/caspar.flow).

## Testing the CASPAR parser

SLING comes with a [parsing tool](../../sling/nlp/parser/tools/parse.cc)
for annotating a corpus of documents with frames using a parser model,
benchmarking this annotation process, and optionally evaluating the annotated
frames against supplied gold frames.

This tool takes the following command-line arguments:

* `--parser`: This should point to a Myelin flow, e.g. one created by the
  parser trainer.
* If `--text` is specified, the parser is run over the supplied text and
  prints the annotated frame(s) in text mode, e.g.:

  ```
  $ bazel-bin/sling/nlp/parser/tools/parse \
      --parser local/data/e/caspar/caspar.flow --text="Eric loves Hannah."

  {=#1
    :document
    text: "Eric loves Hannah."
    tokens: [{=#2
      start: 0
      size: 4
    }, {=#3
      start: 5
      size: 5
    }, {=#4
      start: 11
      size: 6
    }, {=#5
      start: 17
      break: 0
    }]
    mention: {=#6
      begin: 0
      evokes: {=#7
        :PERSON
      }
    }
    mention: {=#8
      begin: 1
      evokes: {=#9
        :/pb/predicate
        /pb/ARG0: #7
        /pb/ARG1: {=#10
          :PERSON
        }
      }
    }
    mention: {=#11
      begin: 2
      evokes: #10
    }
  }
  ```
* If `--benchmark` is specified, the parser is run on the document corpus
  specified via `--corpus`. This corpus should be prepared similarly to how
  the training/dev corpora were created. Processing can be limited to the
  first N documents by specifying `--maxdocs N`.

  ```
  $ bazel-bin/sling/nlp/parser/tools/parse --parser local/data/e/caspar/caspar.flow \
      --benchmark --corpus local/data/corpora/caspar/dev.rec
  [... I sling/nlp/parser/tools/parse.cc:131] Load parser from local/data/e/caspar/caspar.flow
  [... I sling/nlp/parser/tools/parse.cc:140] 34.7227 ms loading parser
  [... I sling/nlp/parser/tools/parse.cc:204] Benchmarking parser on local/data/corpora/caspar/dev.rec
  [... I sling/nlp/parser/tools/parse.cc:227] 9603 documents, 163104 tokens, 7970.69 tokens/sec
  ```

  If `--profile` is specified, the parser runs with profiling instrumentation
  enabled and outputs a detailed profile report with execution timing for each
  operation in the neural network.
* If `--evaluate` is specified, the tool expects `--corpus` to specify a
  corpus with gold frames. It then runs the parser model over a frame-less
  version of this corpus and evaluates the annotated frames against the gold
  frames. Again, `--maxdocs` can be used to limit the evaluation to the first
  N documents.

  ```
  $ bazel-bin/sling/nlp/parser/tools/parse --parser local/data/e/caspar/caspar.flow \
      --evaluate --corpus local/data/corpora/caspar/dev.rec
  [... I sling/nlp/parser/tools/parse.cc:131] Load parser from local/data/e/caspar/caspar.flow
  [... I sling/nlp/parser/tools/parse.cc:140] 34.7368 ms loading parser
  [... I sling/nlp/parser/tools/parse.cc:235] Evaluating parser on local/data/corpora/caspar/dev.rec
  SPAN_P+=76898
  SPAN_P-=6800
  SPAN_R+=76898
  SPAN_R-=6192
  SPAN_Precision=91.8755
  SPAN_Recall=92.5478
  SPAN_F1=92.2105
  FRAME_P+=77866
  FRAME_P-=5859
  FRAME_R+=77859
  FRAME_R-=5233
  FRAME_Precision=93.0021
  FRAME_Recall=93.7022
  FRAME_F1=93.3508
  TYPE_P+=74277
  TYPE_P-=9448
  TYPE_R+=74275
  TYPE_R-=8817
  TYPE_Precision=88.7154
  TYPE_Recall=89.3889
  TYPE_F1=89.0509
  ROLE_P+=37762
  ROLE_P-=16848
  ROLE_R+=37755
  ROLE_R-=16397
  ROLE_Precision=69.1485
  ROLE_Recall=69.7204
  ROLE_F1=69.4333
  LABEL_P+=0
  LABEL_P-=0
  LABEL_R+=0
  LABEL_R-=0
  LABEL_Precision=0
  LABEL_Recall=0
  LABEL_F1=0
  SLOT_P+=112039
  SLOT_P-=26296
  SLOT_R+=112030
  SLOT_R-=25214
  SLOT_Precision=80.9911
  SLOT_Recall=81.6283
  SLOT_F1=81.3085
  COMBINED_P+=266803
  COMBINED_P-=38955
  COMBINED_R+=266787
  COMBINED_R-=36639
  COMBINED_Precision=87.2595
  COMBINED_Recall=87.9249
  COMBINED_F1=87.591
  #GOLDEN_SPANS=83090
  #PREDICTED_SPANS=83698
  #GOLDEN_FRAMES=83092
  #PREDICTED_FRAMES=83725
  ```
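
The evaluator's raw counters map onto the usual definitions: precision is
P+/(P+ + P-), recall is R+/(R+ + R-), and F1 is their harmonic mean, all
reported in percent. The SPAN numbers above can be cross-checked from the raw
counts with a standalone sketch (not SLING's evaluator code):

```python
def prf(p_correct, p_wrong, r_correct, r_missed):
    """Precision, recall, and F1 in percent from P+/P- and R+/R- counters."""
    precision = 100.0 * p_correct / (p_correct + p_wrong)
    recall = 100.0 * r_correct / (r_correct + r_missed)
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1

# SPAN counters from the evaluation run above.
precision, recall, f1 = prf(76898, 6800, 76898, 6192)
# precision ≈ 91.8755, recall ≈ 92.5478, f1 ≈ 92.2105
```

The same computation reproduces the other metric rows from their counters.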

## Using the CASPAR parser in Python

You can use the parser in Python through the `Parser` class in the SLING Python
API, e.g.:

```python
import sling

parser = sling.Parser("local/data/e/caspar/caspar.flow")

text = input("text: ")
doc = parser.parse(text)
print(doc.frame.data(pretty=True))
for m in doc.mentions:
  print("mention", doc.phrase(m.begin, m.end))
```

## Using the CASPAR parser in C++

SLING has a C++ API for the parser. The model can be loaded and initialized in
the following way:

```c++
#include "sling/frame/store.h"
#include "sling/nlp/document/document.h"
#include "sling/nlp/document/document-tokenizer.h"
#include "sling/nlp/parser/parser.h"

// Load parser model.
sling::Store commons;
sling::nlp::Parser parser;
parser.Load(&commons, "local/data/e/caspar/caspar.flow");
commons.Freeze();

// Create document tokenizer.
sling::nlp::DocumentTokenizer tokenizer;
```

In order to parse some text, it first needs to be tokenized. The document with
text, tokens, and frames is stored in a local document frame store.

```c++
// Create frame store for document.
sling::Store store(&commons);
sling::nlp::Document document(&store);

// Tokenize text.
std::string text = "Eric loves Hannah.";
tokenizer.Tokenize(&document, text);

// Parse document.
parser.Parse(&document);
document.Update();

// Output document annotations.
std::cout << sling::ToText(document.top(), 2);
```

doc/guide/frames.md

+1 -1

```
@@ -2,7 +2,7 @@

 ## Introduction <a name="intro">

-SLING is a framework for storing, inspecting, manipulating, and transporting
+SLING has a framework for storing, inspecting, manipulating, and transporting
 semantic frames compactly and efficiently. The frames can be used both for
 linguistic annotations as well as knowledge representations. SLING is not tied
 to any particular frame semantic theory or knowledge ontology, but allows you
```
0 commit comments