Have you ever tried using word counts to analyze a collection of documents? Lots of important concepts get missed, since they don't appear as single words (unigrams). For example, the words "social" and "security" don't fully represent the concept "social security"; the words "New" and "York" don't really represent "New York." Phrasemachine identifies these sorts of multiword phrases automatically so you can use them in text analysis. Here's how it works in Python.

>>> import phrasemachine
>>> text = "Barack Obama supports expanding social security."
>>> phrasemachine.get_phrases(text)
{'num_tokens': 7, 'counts': Counter({'barack obama': 1, 'social security': 1})}

For more details, see our paper, "Bag of What?", or this slide deck. By default, this package uses the (FilterFSA, k=8, SimpleNP) method from the paper.

The software only supports English texts.

Installation

We have implementations in both R and Python. For Python, install with:

pip install phrasemachine

For the R version, see the R vignette here.

Near duplicates and merging

You might notice that phrasemachine sometimes extracts nested phrases. For instance,

text = "The Omnibus Crime Control and Safe Streets Act of 1968 was signed into law by President Lyndon B. Johnson"
phrasemachine.get_phrases(text)

extracts 'lyndon b. johnson' and 'b. johnson'.

This is intentional: phrasemachine tries to extract all phrases that might be useful for downstream analysis. In some cases, you might want to merge similar, overlapping, or coreferent terms. For strategies, see section 4.3.1 of our paper, "Bag of What?"
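As one rough illustration of merging, the sketch below drops any phrase whose words appear contiguously inside a longer extracted phrase. It is a simple heuristic, not the strategy from section 4.3.1 of the paper. The function name `drop_nested_phrases` is hypothetical, and the counts are hand-written for illustration rather than real phrasemachine output.

```python
from collections import Counter


def drop_nested_phrases(counts):
    """Keep only phrases that are not contained in a longer extracted phrase.

    Hypothetical helper for illustration; not part of phrasemachine.
    """
    phrases = list(counts)
    kept = Counter()
    for phrase in phrases:
        words = phrase.split()
        nested = False
        for other in phrases:
            other_words = other.split()
            if len(other_words) <= len(words):
                continue
            # Check whether `words` occurs as a contiguous run inside `other_words`.
            for i in range(len(other_words) - len(words) + 1):
                if other_words[i:i + len(words)] == words:
                    nested = True
                    break
            if nested:
                break
        if not nested:
            kept[phrase] = counts[phrase]
    return kept


# Hand-written counts in the same shape as the 'counts' value shown earlier.
example_counts = Counter({"lyndon b. johnson": 1, "b. johnson": 1, "social security": 2})
merged = drop_nested_phrases(example_counts)
print(merged)
```

Note that this discards the shorter phrase entirely, even if it also occurs on its own elsewhere in the text. Working from token spans instead of counts would avoid that loss.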

Can I use phrasemachine with spaCy or CoreNLP?

Yep! By default, phrasemachine depends on NLTK for part-of-speech tagging. But it can also be used with the higher accuracy spaCy tagger, or with Stanford CoreNLP. Here is an example with spaCy:

>>> import spacy
>>> import phrasemachine
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = nlp(u"Barack Obama supports expanding social security.")
>>> tokens = [token.text for token in doc]
>>> pos = [token.pos_ for token in doc]
>>> print(tokens)
['Barack', 'Obama', 'supports', 'expanding', 'social', 'security', '.']
>>> print(pos)
['PROPN', 'PROPN', 'VERB', 'VERB', 'ADJ', 'NOUN', 'PUNCT']
>>> phrasemachine.get_phrases(tokens=tokens, postags=pos)
{'num_tokens': 7, 'counts': Counter({'barack obama': 1, 'social security': 1})}

Notice that when you use a custom POS tagger from another package, you pass a list of tokens and a list of POS tags to the get_phrases function in phrasemachine.py. If you are comfortable doing the POS tagging yourself, all you really need is the phrasemachine.py file.

What if I want the token indexes for phrases?

Phrasemachine supports this.

>>> tokens = ['Barack', 'Obama', 'supports', 'expanding', 'social', 'security', '.']
>>> pos = ['PROPN', 'PROPN', 'VERB', 'VERB', 'ADJ', 'NOUN', 'PUNCT']
>>> phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
{'num_tokens': 7, 'token_spans': [(0, 2), (4, 6)]}
>>> out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
>>> start,end = out['token_spans'].pop()
>>> tokens[start:end]
['social', 'security']
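If you want the phrase text for every span rather than just one, a minimal sketch like the following can help. It assumes the spans are (start, end) pairs with an exclusive end, as in the output above. The helper name `spans_to_phrases` is hypothetical, and the lowercasing is an assumption chosen to mirror the keys of the counts output.

```python
def spans_to_phrases(tokens, token_spans):
    """Join the tokens in each (start, end) span into a lowercased phrase string.

    Hypothetical helper for illustration; not part of phrasemachine.
    """
    return [" ".join(tokens[start:end]).lower() for start, end in token_spans]


tokens = ['Barack', 'Obama', 'supports', 'expanding', 'social', 'security', '.']
# Spans copied from the example output above.
token_spans = [(0, 2), (4, 6)]

phrases = spans_to_phrases(tokens, token_spans)
print(phrases)  # ['barack obama', 'social security']
```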

What tagsets are supported?

Different POS tagging schemes use different tagsets (i.e., sets of possible POS tags). The Python version of phrasemachine supports the following:

How is phrasemachine different from named-entity recognition?

If you've spent some time working with text data, you've probably heard of named entities. Maybe you've used tools like Stanford CoreNLP or AlchemyAPI to extract entities from text. Phrasemachine is related but a little different. Instead of trying to label just people or places, for example, it tries to extract all of the important noun phrases from documents. This includes names, but also more general concepts like "defense spending," "estate tax," or "car mechanic." The downside is that it doesn't place phrases into categories like "New York"=LOCATION.

If you are familiar with the idea of a "bag of words" you can think of phrasemachine as finding extra phrases to place into this bag. For example, it can be used to find frequently occurring terms in political debates. Mathematically, its output can be used to augment the term-document matrix.
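To make the "augment the term-document matrix" idea concrete, here is a minimal sketch in plain Python. The phrase counts are hand-written stand-ins for the 'counts' value returned by get_phrases, and the helper name `build_term_document_matrix` is hypothetical.

```python
from collections import Counter


def build_term_document_matrix(doc_counts):
    """Return (vocab, matrix), where matrix[i][j] is the count of vocab[j] in document i.

    Hypothetical helper for illustration; not part of phrasemachine.
    """
    vocab = sorted({term for counts in doc_counts for term in counts})
    matrix = [[counts.get(term, 0) for term in vocab] for counts in doc_counts]
    return vocab, matrix


# Unigram counts from naive whitespace tokenization (an assumption for brevity).
unigram_counts = [
    Counter("barack obama supports expanding social security".split()),
    Counter("congress debates social security".split()),
]

# Hand-written phrase counts, shaped like the 'counts' value from get_phrases.
phrase_counts = [
    Counter({"barack obama": 1, "social security": 1}),
    Counter({"social security": 1}),
]

# Place the extra phrases into the same "bag" as the unigrams.
combined = [unigrams + phrases for unigrams, phrases in zip(unigram_counts, phrase_counts)]
vocab, matrix = build_term_document_matrix(combined)

print(vocab)
for row in matrix:
    print(row)
```

Each phrase simply becomes an extra column next to the unigram columns, so downstream models that accept a term-document matrix need no other changes.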

Phrasemachine is an elaboration of work from Justeson and Katz (1995); they found that many technical terms such as "Gaussian distribution" matched a regular expression over the part-of-speech tags for a sequence of words. Researchers have found the approach useful in many different contexts.
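The idea of matching a regular expression over POS tags can be sketched as follows. This is a deliberately simplified pattern (any run of adjectives and nouns ending in a noun), not phrasemachine's actual grammar or matching code. The names `COARSE_TAGS`, `NOUN_PHRASE_PATTERN`, and `match_noun_phrase_spans` are hypothetical, and the `max_len=8` and `min_len=2` defaults are assumptions loosely modeled on the k=8 setting mentioned above.

```python
import re

# Map universal POS tags to one-letter classes; everything else becomes "O".
COARSE_TAGS = {"ADJ": "A", "NOUN": "N", "PROPN": "N"}

# Simplified pattern: zero or more adjectives/nouns followed by a final noun.
NOUN_PHRASE_PATTERN = re.compile(r"[AN]*N")


def match_noun_phrase_spans(postags, max_len=8, min_len=2):
    """Return every (start, end) token span whose tag string fully matches the pattern."""
    tag_string = "".join(COARSE_TAGS.get(tag, "O") for tag in postags)
    spans = []
    for start in range(len(tag_string)):
        last_end = min(start + max_len, len(tag_string))
        for end in range(start + min_len, last_end + 1):
            # Test every candidate span so that nested matches are kept too.
            if NOUN_PHRASE_PATTERN.fullmatch(tag_string[start:end]):
                spans.append((start, end))
    return spans


pos = ['PROPN', 'PROPN', 'VERB', 'VERB', 'ADJ', 'NOUN', 'PUNCT']
spans = match_noun_phrase_spans(pos)
print(spans)  # [(0, 2), (4, 6)]
```

Because every start and end pair is tested, a longer match and the shorter matches inside it are all returned, which is the same behavior that produces nested phrases.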

phrasemachine was written by Abram Handler, Matthew J. Denny, and Brendan O'Connor.

More details can be found in this paper: "Bag of What? Simple Noun Phrase Extraction for Text Analysis," Handler, Denny, Wallach, and O'Connor, 2016, or in this slide deck.

In the future, we will add discussion of the following:

  • Twitter POS tagger
  • normalization (Barack Obama => barack obama)
  • tokenization
  • not just noun phrases (noun-verb? adj phrases, any coordinations, verb groups?)
  • custom regex

Repository structure

  • py/: the Python implementation
  • R/: the R implementation
  • fst/: the OpenFST/pyfst implementation, which is not packaged for use by default. It implements the FullNP grammar as specified in the paper. Because its dependencies can be difficult to set up, the main implementations above use what the paper calls the SimpleNP grammar with the FilterFSA matching method.

Comparing R and Python implementations

The R and Python implementations currently rely on different libraries for POS tagging, so they can give different results on the same raw text. However, given the same input POS tag sequences, both implementations return identical results. To verify this, run the run_POS_to_spans_test.sh shell script in the R/comparison_tests directory. The commands below assume you start in the top-level directory of this repo.

cd R/comparison_tests
bash run_POS_to_spans_test.sh

The script will produce a set of phrase spans using both implementations and print out any mismatches between the two sets of results.

Projects using phrasemachine

Email [email protected] to add your project to the list!

  • Adam Lauretig at Ohio State uses phrasemachine for his project, "Do Casualties Change the Conversation?"
  • A team at Northeastern uses phrasemachine to explore the ideology of journalists.

Acknowledgment

"phrasemachine" is named after Michael Heilman's "phraseomatic" script.