# AuthorProfiling

NLP project: classification of time period and author age in fiction. Repository: https://github.com/lizzij/AuthorProfiling

## Reports

## Data Preprocessing

To get clean data directly, skip down to the links under Data.

### Navigating Gutenberg

- Get text files in the English language (ISO code `en`):

  ```sh
  wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
  ```
- Get catalog data. The complete Project Gutenberg catalog is available in RDF/XML format. This file is a tar archive containing one RDF file per book; the RDF is based on the DCMI recommendation. Since the file is too large for git/GitHub, the catalog must be downloaded separately from Project Gutenberg.

  - For instance, here is an extract from the catalog; note that in the first row the text ID is 1342:

    ```xml
    <pgterms:etext rdf:ID="etext1342">
      <dc:publisher>&pg;</dc:publisher>
      <dc:title rdf:parseType="Literal">Pride and Prejudice</dc:title>
      <dc:creator rdf:parseType="Literal">Austen, Jane, 1775-1817</dc:creator>
      <pgterms:friendlytitle rdf:parseType="Literal">Pride and Prejudice by Jane Austen</pgterms:friendlytitle>
      <dc:language><dcterms:ISO639-2><rdf:value>en</rdf:value></dcterms:ISO639-2></dc:language>
      <dc:subject>
        <rdf:Bag>
          <rdf:li><dcterms:LCSH><rdf:value>Young women -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>England -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>Domestic fiction</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>Love stories</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>Sisters -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>Social classes -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>Courtship -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
        </rdf:Bag>
      </dc:subject>
      <dc:subject><dcterms:LCC><rdf:value>PR</rdf:value></dcterms:LCC></dc:subject>
      <dc:created><dcterms:W3CDTF><rdf:value>1998-06-01</rdf:value></dcterms:W3CDTF></dc:created>
      <pgterms:downloads><xsd:nonNegativeInteger><rdf:value>38933</rdf:value></xsd:nonNegativeInteger></pgterms:downloads>
      <dc:rights rdf:resource="&lic;" />
    </pgterms:etext>
    ```

  - To fetch the book from this catalog entry, use:

    ```sh
    wget "http://www.gutenberg.org/files/1342/1342-0.txt"
    ```
- Canonical URLs to the txt format of the books
- Canonical URLs for authors
- Note that audio and non-English documents should be excluded
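The catalog entries above can be mined for text IDs and languages before downloading anything. Below is a minimal sketch of that idea: it extracts the numeric ID from `rdf:ID`, keeps only English entries, and builds the canonical txt URL used above. The sample XML is a simplified stand-in for the real catalog files (which nest the language inside `dcterms:ISO639-2`), and the namespace URIs are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URIs are illustrative; the real catalog files declare their own.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pgterms": "http://www.gutenberg.org/rdfterms/",
}

# A simplified, self-contained version of the catalog entry shown above.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:dc="http://purl.org/dc/elements/1.1/"
                     xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1342">
    <dc:title>Pride and Prejudice</dc:title>
    <dc:creator>Austen, Jane, 1775-1817</dc:creator>
    <dc:language>en</dc:language>
  </pgterms:etext>
</rdf:RDF>"""

def english_book_urls(catalog_xml):
    """Yield (text_id, title, canonical txt URL) for English entries only."""
    root = ET.fromstring(catalog_xml)
    for etext in root.findall("pgterms:etext", NS):
        # rdf:ID looks like "etext1342"; strip the prefix to get the numeric ID.
        raw_id = etext.get("{%s}ID" % NS["rdf"], "")
        match = re.search(r"\d+", raw_id)
        lang = etext.findtext("dc:language", default="", namespaces=NS)
        if match and lang.strip() == "en":
            text_id = match.group()
            title = etext.findtext("dc:title", default="", namespaces=NS)
            url = "http://www.gutenberg.org/files/%s/%s-0.txt" % (text_id, text_id)
            yield text_id, title, url

for text_id, title, url in english_book_urls(SAMPLE):
    print(text_id, title, url)
```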

### Generate Standard Corpus and Metadata

- Clone the pgcorpus/gutenberg repository to generate the corpus locally:

  ```sh
  git clone https://github.com/pgcorpus/gutenberg.git
  ```

- Enter the newly created gutenberg directory:

  ```sh
  cd gutenberg
  ```

- To install any missing dependencies, run:

  ```sh
  pip install -r requirements.txt
  ```

- To get a local copy of the PG data, run:

  ```sh
  python get_data.py
  ```

This will download a copy of all UTF-8 books in PG and create a CSV file with metadata (e.g. author, title, year, ...).

- To process all the data in the raw/ directory, run:

  ```sh
  python process_data.py
  ```

This will fill in the `text/`, `tokens/` and `counts/` folders.
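The pgcorpus pipeline's exact tokenizer lives in its repository; as a rough sketch of what `tokens/` and `counts/` conceptually hold (token lists and word-frequency counts per book), one could write something like the following. The regex tokenizer here is an illustrative simplification, not the project's actual one.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; the real pipeline's tokenizer may differ."""
    return re.findall(r"[a-z']+", text.lower())

def count_tokens(text):
    """Word-frequency counts, analogous to the counts/ output."""
    return Counter(tokenize(text))

sample = "It is a truth universally acknowledged, that a single man ..."
tokens = tokenize(sample)   # -> goes to tokens/
counts = count_tokens(sample)  # -> goes to counts/
print(tokens[:5])
print(counts.most_common(2))
```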

### Get Age and Time Period from Wikipedia

- Begin by installing the wikipedia package:

  ```sh
  pip install wikipedia
  ```

- Run wiki.py:

  ```sh
  python wiki.py
  ```

  - When prompted for a query, type it and hit Enter (for instance, try "jane austen").
  - This returns a summary from the Wikipedia page containing the birth and publication data of the main authors.
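The exact logic in wiki.py is not shown here, but the core step, pulling a birth year out of a Wikipedia summary, can be sketched as below. The function name and regex are illustrative, and the live `wikipedia.summary` call is left in a comment since it requires network access.

```python
import re

def extract_birth_year(summary):
    """Return the first plausible year in a Wikipedia-style summary.

    Assumes the lead sentence contains dates like
    "(16 December 1775 - 18 July 1817)"; returns None if no year is found.
    """
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", summary)
    return int(match.group(1)) if match else None

# With network access, the summary would come from the wikipedia package:
#   import wikipedia
#   summary = wikipedia.summary("Jane Austen")
summary = ("Jane Austen (16 December 1775 - 18 July 1817) was an English "
           "novelist known primarily for her six major novels.")
print(extract_birth_year(summary))  # first year in the summary: 1775
```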

### Clean Data, Tag Categories

- Move text/ into the same directory and rename metadata.csv to clean_all.csv.
- Run clean.ipynb.
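The category scheme used by clean.ipynb is not spelled out here; as a hypothetical sketch, tagging could combine the metadata years into a time-period bucket and an author age at publication. The period boundaries below are invented for illustration.

```python
def time_period(pub_year):
    """Bucket a publication year into a coarse period (boundaries illustrative)."""
    if pub_year < 1800:
        return "18th century or earlier"
    if pub_year < 1900:
        return "19th century"
    return "20th century or later"

def author_age(birth_year, pub_year):
    """Author's approximate age when the work was published."""
    return pub_year - birth_year

# Pride and Prejudice: Austen born 1775, published 1813.
print(time_period(1813))       # 19th century
print(author_age(1775, 1813))  # 38
```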

## Data

## Running the Code

- Select one of the above datasets (larger datasets take significantly longer to run) and download it.
- The code to extract features, build a random forest model, and evaluate it is located in feature_tagging.ipynb.
- If you are using a dataset and CSV located somewhere other than the clean_data directory (i.e., one of the downloaded datasets), change the root_dir and csv_name variables. After this, all cells in the notebook can be run in sequence without altering any variables.
- After generating the feature vectors, the resulting data can be dumped into CSV files for later examination. This code is in the sixth cell, right before the random forest model is run.
- More specific information about each cell and each function can be found in comments throughout the notebook.
- Note that this code requires the Stanford POS Tagger and NER extractor in order to work; these can be found on our git page.
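The real feature vectors come from the Stanford tools inside feature_tagging.ipynb; the random-forest step itself can be sketched with scikit-learn as below. The feature values and class labels here are invented stand-ins purely for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Invented feature vectors, e.g. [avg sentence length, noun ratio, verb ratio];
# the labels stand in for time-period classes.
X = [
    [12.0, 0.30, 0.20], [13.5, 0.32, 0.19], [11.8, 0.29, 0.21], [12.7, 0.31, 0.18],
    [25.0, 0.40, 0.10], [27.3, 0.42, 0.09], [24.1, 0.39, 0.11], [26.5, 0.41, 0.12],
]
y = ["modern", "modern", "modern", "modern",
     "victorian", "victorian", "victorian", "victorian"]

# Hold out a quarter of the examples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```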