mpqa_features.pickle is a serialized pandas DataFrame with about 8,000 labeled feature vectors that were derived from the annotated MPQA Corpus to model phrase- and sentence-level sentiment (polarity) in news and editorial content.
The idea is to use "subjectivity clues" - i.e. words that typically indicate expression of subjective opinion or judgement - that have been tagged with a prior polarity (i.e. the sentiment they convey in the absence of any context - neutral, positive, negative or both) to identify text passages that are likely to contain subjective expression. The subjectivity clues were taken from the MPQA Subjectivity Lexicon.
The approx. 700 documents in the extended MPQA Corpus were searched for these subjectivity clues, and the corpus annotations were used to determine the contextual polarity of each occurrence. See Wilson, Wiebe, and Hoffmann (2005).
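
As a rough illustration of the lookup described above, the sketch below scans a tokenized sentence for entries of a toy lexicon and reports each hit with its prior polarity. The lexicon entries and the `find_clues` helper are hypothetical and only meant to show the idea; the actual feature extraction in this repo relies on spaCy and the full MPQA Subjectivity Lexicon.

```python
# Toy lexicon: clue -> prior polarity (entries are made up for illustration).
TOY_LEXICON = {
    "praise": "positive",
    "condemn": "negative",
    "feel": "neutral",
}


def find_clues(tokens, lexicon):
    """Return (position, token, prior polarity) for every token found in the lexicon."""
    hits = []
    for i, tok in enumerate(tokens):
        polarity = lexicon.get(tok.lower())
        if polarity is not None:
            hits.append((i, tok, polarity))
    return hits


# Whitespace tokenization is a simplification for this example.
tokens = "Critics condemn the plan but praise its intent".split()
clues = find_clues(tokens, TOY_LEXICON)
print(clues)
```

Each hit only carries the prior polarity of the clue. Whether the clue is actually used positively or negatively in its sentence is what the contextual polarity label captures.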
The DataFrame in mpqa_features.pickle has the following structure:

The index consists of tuples of the form (path, document, count), where path and document identify the corpus document from which the record was generated and count is a running number within the document. The columns are:
- word: int - The subjectivity clue (id in the spaCy vocabulary)
- word_: string - The subjectivity clue (i.e. an occurrence of one of the entries in the MPQA Subjectivity Lexicon)
- pos: int - Part-of-speech (id as assigned by spaCy as a token's pos attribute)
- pos_: string - Part-of-speech tag (abbreviation as assigned by spaCy as a token's pos_ attribute)
- before: int - Word preceding the subjectivity clue (vocabulary id)
- before_: string - Word preceding the subjectivity clue
- after: int - Word following the subjectivity clue (vocabulary id)
- after_: string - Word following the subjectivity clue
- context: tuple of ints - (before, word, after) as vocabulary ids (this is redundant with the individual tokens but helps when visually exploring the data set)
- context_: tuple of strings - (before, word, after)
- pre_neg: boolean - True if there is a negating word within the four tokens preceding the subjectivity clue (not counting double negations such as "not only")
- post_neg: boolean - True if there is a negating word within the four tokens following the subjectivity clue
- pri_pol: string - Prior polarity as given in the MPQA Subjectivity Lexicon (possible values are 'positive', 'negative', 'both', and 'neutral')
- rel: string - Reliability class as given in the MPQA Subjectivity Lexicon (possible values are 'strongsubj', 'weaksubj')
- c_pol: float - Contextual polarity, derived from the contextual polarity and intensity entries in the MPQA Corpus: the polarity (1.0 for positive, -1.0 for negative, 0 for neutral) is multiplied by a weight for the intensity annotation (0 for low, 0.75 for medium, 1.5 for high, and 2.0 for extreme)
- is_int: boolean - True if the subjectivity clue itself is an intensifier. The list of intensifiers is taken from the MPQA Arguing Lexicon.
- prec_int: boolean - True if the subjectivity clue is preceded by an intensifier
- prec_adj: boolean - True if the subjectivity clue is preceded by an adjective
- prec_adv: boolean - True if the subjectivity clue is preceded by an adverb
- topic: string - Topic of the respective article (if available in the corpus annotations)
- pword: int - "Packed id" for word (the subjectivity clue). Between the word, before and after vocabulary ids there are 3,691 different ids, taken from a vocabulary with about 300,000 entries. The "packed" values map these ids onto consecutive integers between 0 and 3,690.
- pbefore: int - Packed id for before
- pafter: int - Packed id for after
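
To make two of the derived features more concrete, here is a minimal sketch of how the contextual polarity value and the negation flags could be computed from the descriptions above. The helper names and the `NEGATIONS` word list are assumptions made for this example and may differ from what mpqa.py does; the sketch also ignores the "not only" exception and does not cover clues annotated with polarity 'both'.

```python
# Weights as listed in the description of the contextual polarity column.
SIGN = {"positive": 1.0, "negative": -1.0, "neutral": 0.0}
INTENSITY = {"low": 0.0, "medium": 0.75, "high": 1.5, "extreme": 2.0}

# Hypothetical negation list; the words used by mpqa.py may differ.
NEGATIONS = {"not", "no", "never", "n't"}


def contextual_polarity(polarity, intensity):
    """Signed polarity multiplied by the intensity weight."""
    return SIGN[polarity] * INTENSITY[intensity]


def negated_before(tokens, i, window=4):
    """True if a negating word occurs within `window` tokens before position i."""
    return any(t.lower() in NEGATIONS for t in tokens[max(0, i - window):i])


def negated_after(tokens, i, window=4):
    """True if a negating word occurs within `window` tokens after position i."""
    return any(t.lower() in NEGATIONS for t in tokens[i + 1:i + 1 + window])


tokens = ["They", "did", "not", "really", "support", "the", "plan"]
clue_pos = 4  # position of "support"
print(negated_before(tokens, clue_pos), negated_after(tokens, clue_pos))
print(contextual_polarity("negative", "high"))
```

Note that with the weights as documented, any annotation of low intensity yields a value of 0, the same as a neutral one.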
The Python script mpqa.py was used to construct the labeled feature vectors in mpqa_features.pickle. Run:

    python mpqa.py mkfeat -h

for instructions. Running the script requires (see Licenses for Corpus Content and Annotations):
- Third-party Python packages pandas and spaCy.
- The annotated MPQA Corpus.
- A 'doclist' file that lists all documents to be included in the data set (concatenate and dedupe the partially overlapping doclists that come with the MPQA Corpus or use the doclist.combinedUnique file in this repo).
- The MPQA Subjectivity Lexicon (provided as subjclues.tff in this repo).
- A list of intensifiers, available in the MPQA Arguing Lexicon (file intensifiers.tff).
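
If you prefer to build the combined doclist yourself, the "concatenate and dedupe" step could look roughly like the sketch below. It assumes that each doclist holds one entry per line and treats every line as an opaque string; the `merge_doclists` helper and the sample entries are made up for illustration.

```python
def merge_doclists(doclists):
    """Concatenate line sequences, dropping blanks and duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for lines in doclists:
        for line in lines:
            entry = line.strip()
            if entry and entry not in seen:
                seen.add(entry)
                merged.append(entry)
    return merged


# Made-up entries standing in for the contents of two overlapping doclists.
list_a = ["dir1/doc_a\n", "dir1/doc_b\n"]
list_b = ["dir1/doc_b\n", "dir2/doc_c\n", "\n"]
combined = merge_doclists([list_a, list_b])
print(combined)
```

With real files, each element of `doclists` can be an open file object, since iterating over a file yields its lines.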
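To use the mpqa module from within your own script, follow this example:

    from __future__ import print_function

    import pandas as pd

    import mpqa

    # Collect one feature DataFrame per corpus document listed in the doclist.
    # The empty frame fixes the column order, as in df = pd.DataFrame(columns=...).
    frames = [pd.DataFrame(columns=mpqa.FEAT_COLS)]
    for path, fname, topic in mpqa.iter_docs('doclist.combinedUnique'):
        print(path, fname)
        doc = mpqa.Doc(
            mpqa_dir='database.mpqa.2.0', path=path, fname=fname, topic=topic,
            sc_path='subjclues.tff', int_path='intensifiers.tff')
        frames.append(doc.feat_df)

    # pd.concat is used because DataFrame.append is not available in pandas 2.0 and later.
    df = pd.concat(frames)

    # Add the packed id columns pword, pbefore and pafter.
    sparse_cols = ['word', 'before', 'after']
    pack_cols = mpqa.pack_df(df, sparse_cols)
    for c in sparse_cols:
        df['p' + c] = pack_cols[c]
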
This assumes that you have downloaded and extracted the MPQA Corpus to database.mpqa.2.0. The resulting DataFrame df will be the same as the one that can be obtained by unpickling mpqa_features.pickle.
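
The packing step at the end of the example can be pictured as follows. This is only a sketch of the idea and not the implementation of `mpqa.pack_df`: the helper `pack_ids` is hypothetical, it works on plain lists rather than a DataFrame, and assigning packed values in sorted order of the original ids is an assumption.

```python
def pack_ids(columns):
    """Map the ids that occur in any column onto consecutive integers 0..n-1.

    `columns` is a dict of column name -> list of sparse ids. All columns share
    one mapping, so the same id gets the same packed value everywhere.
    """
    all_ids = sorted({i for ids in columns.values() for i in ids})
    mapping = {sparse_id: packed_id for packed_id, sparse_id in enumerate(all_ids)}
    packed = {name: [mapping[i] for i in ids] for name, ids in columns.items()}
    return packed, mapping


# Made-up vocabulary ids for two records.
sparse = {
    "word": [5012, 774],
    "before": [774, 290113],
    "after": [12, 5012],
}
packed, mapping = pack_ids(sparse)
print(packed)
```

Packed ids of this kind are convenient wherever a dense index is needed, for example as row numbers of an embedding matrix or as positions in a one-hot encoding.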
The download site for the MPQA Corpus and annotations states the following licensing terms:
The annotations in this data collection are copyrighted by the MITRE Corporation. User acknowledges and agrees that: (i) as between User and MITRE, MITRE owns all the right, title and interest in the Annotated Content, unless expressly stated otherwise; (ii) nothing in this Agreement shall confer in User any right of ownership in the Annotated Content; and (iii) User is granted a non-exclusive, royalty free, worldwide license (with no right to sublicense) to use the Annotated Content solely for academic and research purposes. This Agreement is governed by the law of the Commonwealth of Massachusetts and User agrees to submit to the exclusive jurisdiction of the Massachusetts courts.
Note: The textual news documents annotated in this corpus have been collected from a wide range of sources and are not copyrighted by the MITRE Corporation. The user acknowledges that the use of these news documents is restricted to research and/or academic purposes only.
The MPQA Subjectivity Lexicon and the MPQA Arguing Lexicon are provided under a GNU General Public License.