Feature Set based on MPQA Corpus

mpqa_features.pickle is a serialized Pandas DataFrame with about 8000 labeled feature vectors that were derived from the annotated MPQA Corpus to model phrase and sentence level sentiment (polarity) in news and editorial content.

Basic Idea

The idea is to use "subjectivity clues" - i.e. words that typically indicate expression of subjective opinion or judgement - that have been tagged with a prior polarity (i.e. the sentiment they convey in the absence of any context - neutral, positive, negative or both) to identify text passages that are likely to contain subjective expression. The subjectivity clues were taken from the MPQA Subjectivity Lexicon.

The subjectivity clues were looked for in the approx. 700 documents in the extended MPQA Corpus. The corpus annotations were used to determine the contextual polarity for each of occurrence. See Wilson, Wiebe, and Hoffmann (2005).

Data Set

The DataFrame in mpqa_features.pickle has the following structure:

Index

The index consists of tuples of the form (path, document, count) where path and document identify the corpus document from which the record was generated and count is a running number within the document.

Columns

word: int: The subjectivity clue (id in the spaCy vocabulary)
word_: string: The subjectivity clue (i.e. an occurrence of one of the entries in the MPQA Subjectivity Lexicon)
pos: int: Part-of-speech (id as assigned by spaCy as a token's pos attribute)
pos_: string: Part-of-speech tag (abbreviation as assigned by spaCy as a token's pos_ attribute)
before: int: Word preceding the subjectivity clue (vocabulary id)
before_: string: Word preceding the subjectivity clue
after: int: Word following the subjectivity clue (vocabulary id)
after_: string: Word following the subjectivity clue
context: tuple of ints: (before, word, after) as vocabulary ids (this is redundant with the individual tokens but helps for visually exploring the data set).
context_: tuple if strings: (before, word, after)
pre_neg: boolean: True if there is a negating word within four tokens ahead of the subjectivity clue (not counting double negations such as "not only")
post_neg: boolean: True if there is a negating word within four tokens after of the subjectivity clue
pri_pol: string: Prior polarity as given in the MPQA Subjectivity Lexicon (possible values are 'positive', 'negative', 'both', and 'neutral')
rel: string: Reliability class as given in the MPQA Subjectivity Lexicon (possible values are 'strongsubj', 'weaksubj')
c_pol: float: Contextual polarity, derived from the contextual polarity and intensity entries in the MPQA Corpus: uses 1.0 for positive and -1.0 for neagtive polarity (0 for neutral) and multiplies with the intensity annotation (0 for low, 0.75 for medium, 1.5 for high, and 2.0 for extreme).
is_int: boolean: True if the subjectivity clue itself is an intensifier. The list of itensifiers is taken from the MPQA Arguing Lexicon.
prec_int: boolean: True if subjectivity clue is preceded by an intensifier
prec_adj: boolean: True if subjectivity clue is preceded by an adjective
prec_adv: boolean: True if subjectivity clue is preceded by an adverb
topic: string: Topic of the respective article (if available in the corpus annotations)
pword: int: "Packed id" for word (the subjectivity clue). Between the word, before and after vocabulary ids there are 3,691 different ids, taken from a vocabulary with about 300,000 entries. The "packed" values map these ids onto consecutive integers between 0 and 3,690.
pbefore: int: Packed id for before
pbafter: int: Packed id for after

Python Script

The python script mpqa.py was used to construct the labeled feature vectors in mpqa_features.pickle. Run:

python mpqa mkfeat -h

for instructions. Running the script requires (see Licenses for Corpus Content and Annotations):

Third party python packages pandas and spaCy.
The annotated MPQA Corpus.
A 'doclist' file that lists all documents to be included in the data set (concatenate and dedupe the partially overlapping doclists that come with MPQA Corpus or use the doclist.combinedUnique file in this repo).
The MPQA Subjectivity Lexicon (provided as subjclues.tff in this repo).
A list of intensifiers, available in the MPQA Arguing Lexicon (file intensifiers.tff).

To use the mpqa module from within your own script follow this example:

from __future__ import print_function
import pandas as pd
import mpqa

df = pd.DataFrame(columns=mpqa.FEAT_COLS)
for path, fname, topic in mpqa.iter_docs('doclist.combinedUnique'):
    print(path, fname)
    doc = mpqa.Doc(
            mpqa_dir='database.mpqa.2.0',
            path=path,
            fname=fname,
            topic=topic,
            sc_path='subjclues.tff',
            int_path='intensifiers.tff')
    df = df.append(doc.feat_df)

sparse_cols = ['word', 'before', 'after']
pack_cols = mpqa.pack_df(df, sparse_cols)
for c in sparse_cols:
    df['p' + c] = pack_cols[c]

This assumes that you have downloaded and extracted the MPQA Corpus to database.mpqa.2.0. The resulting DataFrame df will be the same as the one that can be obtained by unpickling mpqa_features.pickle.

Licenses for Corpus Content and Annotations

The download site for the MPQA Corpus and annotations states the following licensing terms:

The annotations in this data collection are copyrighted by the MITRE Corporation. User acknowledges and agrees that: (i) as between User and MITRE, MITRE owns all the right, title and interest in the Annotated Content, unless expressly stated otherwise; (ii) nothing in this Agreement shall confer in User any right of ownership in the Annotated Content; and (iii) User is granted a non-exclusive, royalty free, worldwide license (with no right to sublicense) to use the Annotated Content solely for academic and research purposes. This Agreement is governed by the law of the Commonwealth of Massachusetts and User agrees to submit to the exclusive jurisdiction of the Massachusetts courts.

Note: The textual news documents annotated in this corpus have been collected from a wide range of sources and are not copyrighted by the MITRE Corporation. The user acknowledges that the use of these news documents is restricted to research and/or academic purposes only.

The MPQA Subjectivity Lexicon and the MPQA Arguing Lexicon are provided under a GNU General Public License.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.gitignore		.gitignore
README.rst		README.rst
Train_and_Test.ipynb		Train_and_Test.ipynb
doclist.combinedUnique		doclist.combinedUnique
intensifiers.tff		intensifiers.tff
mpqa.py		mpqa.py
mpqa_features.pickle		mpqa_features.pickle
subjclues.tff		subjclues.tff

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Feature Set based on MPQA Corpus

Basic Idea

Data Set

Index

Columns

Python Script

Licenses for Corpus Content and Annotations

About

Releases

Packages

Contributors 2

Languages

ms8r/mpqa

Folders and files

Latest commit

History

Repository files navigation

Feature Set based on MPQA Corpus

Basic Idea

Data Set

Index

Columns

Python Script

Licenses for Corpus Content and Annotations

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages