# NLP project: classification of time period and author age in fiction

Git repo: https://github.com/lizzij/AuthorProfiling
- Get text files in the English language (ISO 639-1 code `en`):

  ```bash
  wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
  ```

  Here `-w 2` waits two seconds between requests, `-m` mirrors the pages recursively, and `-H` allows spanning to the host that actually serves the files.
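For reference, here is a rough Python equivalent of that crawl. It is only a sketch: it assumes the harvest page is HTML whose zip links and "Next Page" link can be found with simple regexes, which may not match the real page layout.

```python
import re
import time
import urllib.parse
import urllib.request

url = "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
while url:
    page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    # Download every zip linked from the current harvest page.
    for zip_url in re.findall(r'href="([^"]+\.zip)"', page):
        urllib.request.urlretrieve(zip_url, zip_url.rsplit("/", 1)[-1])
        time.sleep(2)  # mirror wget's -w 2 politeness delay
    # Follow the "Next Page" link, if any (layout assumption).
    nxt = re.search(r'href="([^"]+)"[^>]*>\s*Next\s*Page', page)
    url = urllib.parse.urljoin(url, nxt.group(1)) if nxt else None
```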
- Get catalog data: the complete Project Gutenberg catalog is available in RDF/XML format. This file is a tar archive that contains one RDF file for each book, and the RDF is based on the DCMI recommendation. Since the file is too large for git/GitHub, here's the link to download the catalog.
- For instance, here's an entry from the catalog; note that in the first line, the text ID is 1342:
  ```xml
  <pgterms:etext rdf:ID="etext1342">
    <dc:publisher>&pg;</dc:publisher>
    <dc:title rdf:parseType="Literal">Pride and Prejudice</dc:title>
    <dc:creator rdf:parseType="Literal">Austen, Jane, 1775-1817</dc:creator>
    <pgterms:friendlytitle rdf:parseType="Literal">Pride and Prejudice by Jane Austen</pgterms:friendlytitle>
    <dc:language><dcterms:ISO639-2><rdf:value>en</rdf:value></dcterms:ISO639-2></dc:language>
    <dc:subject>
      <rdf:Bag>
        <rdf:li><dcterms:LCSH><rdf:value>Young women -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>England -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>Domestic fiction</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>Love stories</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>Sisters -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>Social classes -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>Courtship -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
      </rdf:Bag>
    </dc:subject>
    <dc:subject><dcterms:LCC><rdf:value>PR</rdf:value></dcterms:LCC></dc:subject>
    <dc:created><dcterms:W3CDTF><rdf:value>1998-06-01</rdf:value></dcterms:W3CDTF></dc:created>
    <pgterms:downloads><xsd:nonNegativeInteger><rdf:value>38933</rdf:value></xsd:nonNegativeInteger></pgterms:downloads>
    <dc:rights rdf:resource="&lic;" />
  </pgterms:etext>
  ```
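A minimal sketch (not the repo's actual code) of pulling the text ID, title, author, and language out of such an entry. Matching on local tag names sidesteps the exact RDF namespace URIs (the `urn:` URIs below are dummies), and the undefined `&pg;`/`&lic;` entities are stripped so `ElementTree` can parse the fragment.

```python
import re
import xml.etree.ElementTree as ET

# Dummy namespace URIs: we only match on local names, so they never matter.
NS_WRAPPER = ('<wrap xmlns:pgterms="urn:a" xmlns:dc="urn:b" xmlns:dcterms="urn:c"'
              ' xmlns:rdf="urn:d" xmlns:xsd="urn:e">{}</wrap>')

def parse_entry(fragment):
    # Drop entities like &pg; and &lic; that are defined in the catalog's DTD.
    fragment = re.sub(r"&(?!amp;|lt;|gt;|quot;|apos;)\w+;", "", fragment)
    etext = ET.fromstring(NS_WRAPPER.format(fragment))[0]
    local = lambda tag: tag.split("}")[-1]
    entry = {"id": next(v for k, v in etext.attrib.items()
                        if local(k) == "ID").replace("etext", "")}
    for el in etext.iter():
        name = local(el.tag)
        if name in ("title", "creator"):
            entry[name] = el.text
        elif name == "language":
            entry[name] = "".join(el.itertext()).strip()
    return entry

# parse_entry(entry_xml) on the example above gives:
# {'id': '1342', 'title': 'Pride and Prejudice',
#  'creator': 'Austen, Jane, 1775-1817', 'language': 'en'}
```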
- To get the book from the catalog entry above, use:

  ```bash
  wget "http://www.gutenberg.org/files/1342/1342-0.txt"
  ```
- Canonical URLs to the txt format of the books
- Canonical URLs for authors
- Note that audio and non-English documents should be excluded, e.g. by filtering the catalog metadata as sketched below.
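One way to apply that filter, assuming the catalog has been flattened into a CSV with `language` and `type` columns (both column names, and the `Sound` value for audio books, are assumptions; adjust to the actual export):

```python
import pandas as pd

meta = pd.read_csv("metadata.csv")
keep = (meta["language"] == "en") & (meta["type"] != "Sound")  # drop audio books
meta[keep].to_csv("metadata_en_text.csv", index=False)
```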
- To generate the corpus locally, clone the pgcorpus repository:

  ```bash
  git clone https://github.com/pgcorpus/gutenberg.git
  ```

- Enter the newly created `gutenberg` directory:

  ```bash
  cd gutenberg
  ```
- To install any missing dependencies, run:

  ```bash
  pip install -r requirements.txt
  ```
- To get a local copy of the PG data, run:

  ```bash
  python get_data.py
  ```

  This will download a copy of all UTF-8 books in PG and create a CSV file with metadata (e.g. author, title, year, ...).
- To process all the data in the `raw/` directory, run:

  ```bash
  python process_data.py
  ```

  This will fill in the `text/`, `tokens/`, and `counts/` folders.
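To work with the output, a loader along the following lines may be useful. It assumes each counts file holds one `word count` pair per line and that files are named like `PG1342_counts.txt`; check the pgcorpus repo for the exact layout.

```python
from collections import Counter

def load_counts(path):
    """Read a counts/ file into a Counter (file format is an assumption)."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                word, n = line.rsplit(maxsplit=1)
                counts[word] = int(n)
    return counts

# e.g. load_counts("counts/PG1342_counts.txt").most_common(10)
```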
- Begin by installing the `wikipedia` package:

  ```bash
  pip install wikipedia
  ```
- Run `wiki.py`.
- When prompted for a query, type it and hit Enter (for instance, try "jane austen").
- This returns a summary from the Wikipedia page containing the birth and publication data of the main authors; a minimal sketch of the core call follows below.
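The core of `wiki.py` boils down to a call like the following (a sketch, not the script's actual code; the year regex is purely illustrative):

```python
import re
import wikipedia

query = input("Input query: ")       # e.g. "jane austen"
summary = wikipedia.summary(query)   # lead section of the Wikipedia page
print(summary)

# Illustrative only: grab the first plausible year as a rough birth-year guess.
match = re.search(r"\b1[5-9][0-9]{2}\b", summary)
if match:
    print("first year mentioned:", match.group(0))
```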
- Move `text/` into the same directory, and rename `metadata.csv` to `clean_all.csv` (a Python sketch follows below).
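In Python, that step might look like this (the working-directory path is an assumption; use whatever directory `clean.ipynb` reads from):

```python
import shutil

WORK_DIR = "path/to/working_dir"  # assumed: wherever clean.ipynb expects its input

shutil.move("text", f"{WORK_DIR}/text")
shutil.move("metadata.csv", f"{WORK_DIR}/clean_all.csv")
```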
- Run `clean.ipynb`.
- first 200 cleaned (57 books)
- first 600 cleaned (133 books)
- first 1200 cleaned (238 books)
- first 1200, 5000-6500 (328 books)
- first 1200, 5000-7500 (380 books)
- first 1200, 5000-8500 (412 books)
- first 3000, 5000-6500 (582 books)
- first 3500, 5000-8500 (621 books)
- first 8500 (711 books)
- Select one of the above datasets and download it (larger datasets will take significantly longer to run).
- The code to extract features, build a random forest model, and evaluate that model is located in `feature_tagging.ipynb`; a minimal sketch of the pipeline appears after this list.
- If you are using a dataset and CSV located somewhere other than the `clean_data` directory (e.g., one of the downloaded datasets), change the `root_dir` and `csv_name` variables. After this, all cells in the notebook can be run in sequence without altering any variables.
- After generating the feature vectors, the resulting data can be dumped into CSV files for later examination. This code can be found in the sixth cell, right before running the random forest model.
- More specific information about each cell and each function can be found in comments throughout the notebook.
- Note that this code requires the Stanford POS Tagger and NER extractor in order to work. These can be found in our git repo.
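For orientation, here is a minimal sketch of that pipeline, not the notebook's actual code. The `filename` and `label` columns, the POS-frequency feature set, and the Stanford model paths are all assumptions; `feature_tagging.ipynb` is the authoritative version.

```python
from collections import Counter

import pandas as pd
from nltk.tag import StanfordPOSTagger
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

root_dir = "clean_data/"    # change these two if using a downloaded dataset
csv_name = "clean_all.csv"

# Paths to the tagger model and jar are illustrative.
tagger = StanfordPOSTagger("models/english-bidirectional-distsim.tagger",
                           "stanford-postagger.jar")

def pos_features(text, tags=("NN", "VB", "JJ", "RB")):
    """Relative frequencies of a few POS tags -- a stand-in feature set."""
    counts = Counter(tag for _, tag in tagger.tag(text.split()))
    total = max(sum(counts.values()), 1)
    return [counts[t] / total for t in tags]

meta = pd.read_csv(root_dir + csv_name)
X = [pos_features(open(root_dir + "text/" + f, encoding="utf-8").read())
     for f in meta["filename"]]          # slow on full books
y = meta["label"]                        # e.g. time period or author-age bucket

# Dump the feature vectors to CSV for later examination (cf. the sixth cell).
pd.DataFrame(X).to_csv("features.csv", index=False)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
```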