Copyright (C) 2011-2017 mailto:[email protected]
NatLang is an English parser with an extensible grammar. It generates abstract syntax trees for all possible interpretations of an English sentence accepted by a grammar. The algorithm is completely deterministic. No training data is required.
This project is no longer being maintained. See new version here: parse-english
It works as follows:
- The user inputs a sentence.
- WordNet identifies one or more POS identities for each word in the sentence.
- All POS configurations of the sentence are evaluated using the Yacc-generated parser.
- A parse tree is generated for each successful parse.
For example:
The sentence "eats shoots and leaves" has four interpretations depending on the POS of "shoots" and "leaves".
eats shoots and leaves | | | | (V)---(V)----(C)-(V) \ / \ *-(N)--* (N) Path #1: {V V C V} Path #2: {V V C N} Path #3: {V N C V} Path #4: {V N C N}
The sentence "flying saucers are dangerous" has two interpretations depending on the POS of "flying".
flying saucers are dangerous | | | | (Adj)--(N)-----(Aux)-(Adj) / (V)--* Path #1: {Adj N Aux Adj} Path #2: {V N Aux Adj}
TODO: Additional analyses passes can be applied to the generated trees for further processing.
input:
the quick brown fox jumps over the lazy dog
output:
best solution (5th tree from left to right):
English is a non-context-free language. That means the same word used in different contexts can have different meanings. But Yacc cannot interpret the same lexer terminal differently if it is represented using the same lexer terminal ID. When this happens, it is necessary to split ambiguous terminals.
To do so:
- Identify lexer terminal with ambiguous meaning.
- Identify parser rules that use the lexer terminals with ambiguous meaning, and assign to each use case a different lexer terminal ID.
- Take advantage of stateful lexing to return a different lexer terminal ID for the same lexer terminal.
For example:
The sentence "She and I run and he jumps and shouts." has three conjugations "and".
A possible parse tree may look like this:
(S) | *-------*-------* | | | (S) | (S) | | | *---*---* | *---*----* | | | | | (NP) (VP) | (NP) (VP) | | | | | *---*---* | | | *----*----* | | | | | | | | | (N) (C) (N) (V) (C) (N) (V) (C) (V) | | | | | | | | | She and I run and he jumps and shouts.
Yacc chokes on this input due to the ambiguity of "and".
The solution is to split "and" into multiple lexer terminals, each representing a different abstraction level in the grammar.
- C_NP for noun-part level conjugations.
- C_VP for verb-part level conjugations.
- C_S for sentence level conjugations.
And to try all 27 permutations to see which ones work.
She and I run and he jumps and shouts. | | | | | | | | | (N) (C#1) (N) (V) (C#2) (N) (V) (C#3) (V) C#1 C#2 C#3 | | | Path #1: {C_NP C_NP C_NP} -- fail! Path #2: {C_NP C_NP C_VP} -- fail! Path #3: {C_NP C_NP C_S} -- fail! Path #4: {C_NP C_VP C_NP} -- fail! Path #5: {C_NP C_VP C_VP} -- fail! Path #6: {C_NP C_VP C_S} -- fail! Path #7: {C_NP C_S C_NP} -- fail! Path #8: {C_NP C_S C_VP} -- success! ... Path #27: {C_S C_S C_S} -- fail!
Here, path #8's POS configuration results in a successful parse.
She and I run and he jumps and shouts. | | | | | | | | | (N) (C_NP) (N) (V) (C_S) (N) (V) (C_VP) (V)
./app/bin/NatLang -e "the quick brown fox jumps over the lazy dog" -d | dot -Tpng -oast_fox.png
Unix tools and 3rd party components (accessible from $PATH):
gcc flex bison wordnet valgrind cppcheck doxygen graphviz ticpp
Environment variables:
- $INCLUDE_PATH_EXTERN -- where "ticpp/ticpp.h" resides
- $LIB_PATH_EXTERN -- where "libticppd.a" resides
- Present tense
- Progressive tense
- Past tense
- Past perfect tense
- Passive voice not supported.
- Questions not supported.
- Conditionals not supported.
- WordNet does not provide POS look-up for inflected verb forms and mechanical words such as prepositions, leading to a reliance on hard-coded POS definitions in the lexer for some words.
- A brute force algorithm tries all supported interpretations of a sentence. This is slow for long sentences.
- BNF rules are suitable for specifying constituent-based phrase structure grammars, but are a poor fit for expressing non-local dependencies.
target | action |
---|---|
all | make binaries |
test | all + run tests |
pure | test + use valgrind to check for memory leaks |
dot | test + generate .png graph for tests |
lint | use cppcheck to perform static analysis on .cpp files |
doc | use doxygen to generate documentation |
xml | test + generate .xml for tests |
import | test + use ticpp to serialize-to/deserialize-from xml |
clean | remove all intermediate files |
- "Part-of-speech tagging"
- http://en.wikipedia.org/wiki/Part-of-speech_tagging
- "Princeton WordNet"
- http://wordnet.princeton.edu/
- "Syntactic Theory: A Unified Approach"
- ISBN: 0340706104
- "Enju - A fast, accurate, and deep parser for English"
- http://www.nactem.ac.uk/enju/
Lex, Yacc, Flex, Bison, NLP, Natural Language Processing, WordNet, Part-of-Speech Tagging, Yacc Grammar for English, English parser, parsing English, Linguistics, Phrase Structure Grammar, X-bar Theory, POS, PSG, BNF, CFG, GLR