# AuthorProfiling

NLP project: classification of time period and author age in fiction. Repository: https://github.com/lizzij/AuthorProfiling

## Reports

## Data Preprocessing

To get clean data directly, skip down to the links under Data.

### Navigating Gutenberg

- Get text files in the English language (ISO code `en`):

  ```sh
  wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
  ```
- Get catalog data. The complete Project Gutenberg catalog is available in RDF/XML format. This file is a tar archive containing one RDF file per book; the RDF is based on the DCMI recommendation. Since the file is too large for git/GitHub, the catalog must be downloaded separately from Project Gutenberg.

  - For instance, here is an extract from the catalog; note that in the first row the text ID is 1342:

    ```xml
    <pgterms:etext rdf:ID="etext1342">
      <dc:publisher>&pg;</dc:publisher>
      <dc:title rdf:parseType="Literal">Pride and Prejudice</dc:title>
      <dc:creator rdf:parseType="Literal">Austen, Jane, 1775-1817</dc:creator>
      <pgterms:friendlytitle rdf:parseType="Literal">Pride and Prejudice by Jane Austen</pgterms:friendlytitle>
      <dc:language><dcterms:ISO639-2><rdf:value>en</rdf:value></dcterms:ISO639-2></dc:language>
      <dc:subject>
        <rdf:Bag>
          <rdf:li><dcterms:LCSH><rdf:value>Young women -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>England -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>Domestic fiction</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>Love stories</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>Sisters -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>Social classes -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
          <rdf:li><dcterms:LCSH><rdf:value>Courtship -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
        </rdf:Bag>
      </dc:subject>
      <dc:subject><dcterms:LCC><rdf:value>PR</rdf:value></dcterms:LCC></dc:subject>
      <dc:created><dcterms:W3CDTF><rdf:value>1998-06-01</rdf:value></dcterms:W3CDTF></dc:created>
      <pgterms:downloads><xsd:nonNegativeInteger><rdf:value>38933</rdf:value></xsd:nonNegativeInteger></pgterms:downloads>
      <dc:rights rdf:resource="&lic;" />
    </pgterms:etext>
    ```

  - To fetch the book from this catalog entry, use:

    ```sh
    wget "http://www.gutenberg.org/files/1342/1342-0.txt"
    ```
- Canonical URLs to the txt format of the books
- Canonical URLs for authors
- Note that audio and non-English documents should be excluded
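The catalog entries above can be mined for text IDs and languages before downloading anything. Below is a minimal sketch of that idea: it extracts the numeric ID from `rdf:ID`, keeps only English entries, and builds the canonical txt URL used above. The sample XML is a simplified stand-in for the real catalog files (which nest the language inside `dcterms:ISO639-2`), and the namespace URIs are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URIs are illustrative; the real catalog files declare their own.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pgterms": "http://www.gutenberg.org/rdfterms/",
}

# A simplified, self-contained version of the catalog entry shown above.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:dc="http://purl.org/dc/elements/1.1/"
                     xmlns:pgterms="http://www.gutenberg.org/rdfterms/">
  <pgterms:etext rdf:ID="etext1342">
    <dc:title>Pride and Prejudice</dc:title>
    <dc:creator>Austen, Jane, 1775-1817</dc:creator>
    <dc:language>en</dc:language>
  </pgterms:etext>
</rdf:RDF>"""

def english_book_urls(catalog_xml):
    """Yield (text_id, title, canonical txt URL) for English entries only."""
    root = ET.fromstring(catalog_xml)
    for etext in root.findall("pgterms:etext", NS):
        # rdf:ID looks like "etext1342"; strip the prefix to get the numeric ID.
        raw_id = etext.get("{%s}ID" % NS["rdf"], "")
        match = re.search(r"\d+", raw_id)
        lang = etext.findtext("dc:language", default="", namespaces=NS)
        if match and lang.strip() == "en":
            text_id = match.group()
            title = etext.findtext("dc:title", default="", namespaces=NS)
            url = "http://www.gutenberg.org/files/%s/%s-0.txt" % (text_id, text_id)
            yield text_id, title, url

for text_id, title, url in english_book_urls(SAMPLE):
    print(text_id, title, url)
```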

### Generate Standard Corpus and Metadata

- Clone the pgcorpus/gutenberg repository to generate the corpus locally:

  ```sh
  git clone https://github.com/pgcorpus/gutenberg.git
  ```

- Enter the newly created gutenberg directory:

  ```sh
  cd gutenberg
  ```

- To install any missing dependencies, run:

  ```sh
  pip install -r requirements.txt
  ```

- To get a local copy of the PG data, run:

  ```sh
  python get_data.py
  ```

This will download a copy of all UTF-8 books in PG and create a CSV file with metadata (e.g. author, title, year, ...).

- To process all the data in the raw/ directory, run:

  ```sh
  python process_data.py
  ```

This will fill in the `text/`, `tokens/` and `counts/` folders.
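The pgcorpus pipeline's exact tokenizer lives in its repository; as a rough sketch of what `tokens/` and `counts/` conceptually hold (token lists and word-frequency counts per book), one could write something like the following. The regex tokenizer here is an illustrative simplification, not the project's actual one.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; the real pipeline's tokenizer may differ."""
    return re.findall(r"[a-z']+", text.lower())

def count_tokens(text):
    """Word-frequency counts, analogous to the counts/ output."""
    return Counter(tokenize(text))

sample = "It is a truth universally acknowledged, that a single man ..."
tokens = tokenize(sample)   # -> goes to tokens/
counts = count_tokens(sample)  # -> goes to counts/
print(tokens[:5])
print(counts.most_common(2))
```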

### Get Age and Time Period from Wikipedia

- Begin by installing the wikipedia package:

  ```sh
  pip install wikipedia
  ```

- Run wiki.py:

  ```sh
  python wiki.py
  ```

  - When prompted for a query, type it and hit Enter (for instance, try "jane austen").
  - This returns a summary from the Wikipedia page containing the birth and publication data of the main authors.
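The exact logic in wiki.py is not shown here, but the core step, pulling a birth year out of a Wikipedia summary, can be sketched as below. The function name and regex are illustrative, and the live `wikipedia.summary` call is left in a comment since it requires network access.

```python
import re

def extract_birth_year(summary):
    """Return the first plausible year in a Wikipedia-style summary.

    Assumes the lead sentence contains dates like
    "(16 December 1775 - 18 July 1817)"; returns None if no year is found.
    """
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", summary)
    return int(match.group(1)) if match else None

# With network access, the summary would come from the wikipedia package:
#   import wikipedia
#   summary = wikipedia.summary("Jane Austen")
summary = ("Jane Austen (16 December 1775 - 18 July 1817) was an English "
           "novelist known primarily for her six major novels.")
print(extract_birth_year(summary))  # first year in the summary: 1775
```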

### Clean Data, Tag Categories

- Move text/ into the same directory and rename metadata.csv to clean_all.csv.
- Run clean.ipynb.
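The category scheme used by clean.ipynb is not spelled out here; as a hypothetical sketch, tagging could combine the metadata years into a time-period bucket and an author age at publication. The period boundaries below are invented for illustration.

```python
def time_period(pub_year):
    """Bucket a publication year into a coarse period (boundaries illustrative)."""
    if pub_year < 1800:
        return "18th century or earlier"
    if pub_year < 1900:
        return "19th century"
    return "20th century or later"

def author_age(birth_year, pub_year):
    """Author's approximate age when the work was published."""
    return pub_year - birth_year

# Pride and Prejudice: Austen born 1775, published 1813.
print(time_period(1813))       # 19th century
print(author_age(1775, 1813))  # 38
```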

## Data

## Running the Code

- Select one of the above datasets (larger datasets take significantly longer to run) and download it.
- The code to extract features, build a random forest model, and evaluate it is located in feature_tagging.ipynb.
- If you are using a dataset and CSV located somewhere other than the clean_data directory (i.e., one of the downloaded datasets), change the root_dir and csv_name variables. After this, all cells in the notebook can be run in sequence without altering any variables.
- After generating the feature vectors, the resulting data can be dumped into CSV files for later examination. This code is in the sixth cell, right before the random forest model is run.
- More specific information about each cell and each function can be found in comments throughout the notebook.
- Note that this code requires the Stanford POS Tagger and NER extractor in order to work; these can be found on our git page.
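The real feature vectors come from the Stanford tools inside feature_tagging.ipynb; the random-forest step itself can be sketched with scikit-learn as below. The feature values and class labels here are invented stand-ins purely for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Invented feature vectors, e.g. [avg sentence length, noun ratio, verb ratio];
# the labels stand in for time-period classes.
X = [
    [12.0, 0.30, 0.20], [13.5, 0.32, 0.19], [11.8, 0.29, 0.21], [12.7, 0.31, 0.18],
    [25.0, 0.40, 0.10], [27.3, 0.42, 0.09], [24.1, 0.39, 0.11], [26.5, 0.41, 0.12],
]
y = ["modern", "modern", "modern", "modern",
     "victorian", "victorian", "victorian", "victorian"]

# Hold out a quarter of the examples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```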