# NLP project: classification of time period and author age in fiction

Git repo: https://github.com/lizzij/AuthorProfiling
- Get text files in the English language (ISO 639-1 code `en`):

  ```bash
  wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
  ```

  Here `-w 2` waits two seconds between requests, `-m` mirrors the pages recursively, and `-H` allows spanning to the host that actually serves the files.
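For reference, here is a rough Python equivalent of that crawl. It is only a sketch: it assumes the harvest page is HTML whose zip links and "Next Page" link can be found with simple regexes, which may not match the real page layout.

```python
import re
import time
import urllib.parse
import urllib.request

url = "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
while url:
    page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    # Download every zip linked from the current harvest page.
    for zip_url in re.findall(r'href="([^"]+\.zip)"', page):
        urllib.request.urlretrieve(zip_url, zip_url.rsplit("/", 1)[-1])
        time.sleep(2)  # mirror wget's -w 2 politeness delay
    # Follow the "Next Page" link, if any (layout assumption).
    nxt = re.search(r'href="([^"]+)"[^>]*>\s*Next\s*Page', page)
    url = urllib.parse.urljoin(url, nxt.group(1)) if nxt else None
```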
- Get catalog data: the complete Project Gutenberg catalog is available in RDF/XML format. This file is a tar archive that contains one RDF file for each book, and the RDF is based on the DCMI recommendation. Since the file is too large for git/GitHub, here's the link to download the catalog.
- For instance, here's an entry from the catalog; note that in the first line, the text ID is 1342:
  ```xml
  <pgterms:etext rdf:ID="etext1342">
    <dc:publisher>&pg;</dc:publisher>
    <dc:title rdf:parseType="Literal">Pride and Prejudice</dc:title>
    <dc:creator rdf:parseType="Literal">Austen, Jane, 1775-1817</dc:creator>
    <pgterms:friendlytitle rdf:parseType="Literal">Pride and Prejudice by Jane Austen</pgterms:friendlytitle>
    <dc:language><dcterms:ISO639-2><rdf:value>en</rdf:value></dcterms:ISO639-2></dc:language>
    <dc:subject>
      <rdf:Bag>
        <rdf:li><dcterms:LCSH><rdf:value>Young women -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>England -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>Domestic fiction</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>Love stories</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>Sisters -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>Social classes -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
        <rdf:li><dcterms:LCSH><rdf:value>Courtship -- Fiction</rdf:value></dcterms:LCSH></rdf:li>
      </rdf:Bag>
    </dc:subject>
    <dc:subject><dcterms:LCC><rdf:value>PR</rdf:value></dcterms:LCC></dc:subject>
    <dc:created><dcterms:W3CDTF><rdf:value>1998-06-01</rdf:value></dcterms:W3CDTF></dc:created>
    <pgterms:downloads><xsd:nonNegativeInteger><rdf:value>38933</rdf:value></xsd:nonNegativeInteger></pgterms:downloads>
    <dc:rights rdf:resource="&lic;" />
  </pgterms:etext>
  ```
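A minimal sketch (not the repo's actual code) of pulling the text ID, title, author, and language out of such an entry. Matching on local tag names sidesteps the exact RDF namespace URIs (the `urn:` URIs below are dummies), and the undefined `&pg;`/`&lic;` entities are stripped so `ElementTree` can parse the fragment.

```python
import re
import xml.etree.ElementTree as ET

# Dummy namespace URIs: we only match on local names, so they never matter.
NS_WRAPPER = ('<wrap xmlns:pgterms="urn:a" xmlns:dc="urn:b" xmlns:dcterms="urn:c"'
              ' xmlns:rdf="urn:d" xmlns:xsd="urn:e">{}</wrap>')

def parse_entry(fragment):
    # Drop entities like &pg; and &lic; that are defined in the catalog's DTD.
    fragment = re.sub(r"&(?!amp;|lt;|gt;|quot;|apos;)\w+;", "", fragment)
    etext = ET.fromstring(NS_WRAPPER.format(fragment))[0]
    local = lambda tag: tag.split("}")[-1]
    entry = {"id": next(v for k, v in etext.attrib.items()
                        if local(k) == "ID").replace("etext", "")}
    for el in etext.iter():
        name = local(el.tag)
        if name in ("title", "creator"):
            entry[name] = el.text
        elif name == "language":
            entry[name] = "".join(el.itertext()).strip()
    return entry

# parse_entry(entry_xml) on the example above gives:
# {'id': '1342', 'title': 'Pride and Prejudice',
#  'creator': 'Austen, Jane, 1775-1817', 'language': 'en'}
```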
- To get the book from the catalog entry above, use:

  ```bash
  wget "http://www.gutenberg.org/files/1342/1342-0.txt"
  ```
- Canonical URLs to the txt format of the books
- Canonical URLs for authors
- Note that audio and non-English documents should be excluded, e.g. by filtering the catalog metadata as sketched below.
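One way to apply that filter, assuming the catalog has been flattened into a CSV with `language` and `type` columns (both column names, and the `Sound` value for audio books, are assumptions; adjust to the actual export):

```python
import pandas as pd

meta = pd.read_csv("metadata.csv")
keep = (meta["language"] == "en") & (meta["type"] != "Sound")  # drop audio books
meta[keep].to_csv("metadata_en_text.csv", index=False)
```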
- To generate the corpus locally, clone the pgcorpus repository:

  ```bash
  git clone https://github.com/pgcorpus/gutenberg.git
  ```

- Enter the newly created `gutenberg` directory:

  ```bash
  cd gutenberg
  ```
- To install any missing dependencies, run:

  ```bash
  pip install -r requirements.txt
  ```
- To get a local copy of the PG data, run:

  ```bash
  python get_data.py
  ```

  This will download a copy of all UTF-8 books in PG and create a CSV file with metadata (e.g. author, title, year, ...).
- To process all the data in the `raw/` directory, run:

  ```bash
  python process_data.py
  ```

  This will fill in the `text/`, `tokens/`, and `counts/` folders.
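To work with the output, a loader along the following lines may be useful. It assumes each counts file holds one `word count` pair per line and that files are named like `PG1342_counts.txt`; check the pgcorpus repo for the exact layout.

```python
from collections import Counter

def load_counts(path):
    """Read a counts/ file into a Counter (file format is an assumption)."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                word, n = line.rsplit(maxsplit=1)
                counts[word] = int(n)
    return counts

# e.g. load_counts("counts/PG1342_counts.txt").most_common(10)
```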
- Begin by installing the `wikipedia` package:

  ```bash
  pip install wikipedia
  ```
- Run `wiki.py`.
- When prompted for a query, type it and hit Enter (for instance, try "jane austen").
- This returns a summary from the Wikipedia page containing the birth and publication data of the main authors; a minimal sketch of the core call follows below.
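The core of `wiki.py` boils down to a call like the following (a sketch, not the script's actual code; the year regex is purely illustrative):

```python
import re
import wikipedia

query = input("Input query: ")       # e.g. "jane austen"
summary = wikipedia.summary(query)   # lead section of the Wikipedia page
print(summary)

# Illustrative only: grab the first plausible year as a rough birth-year guess.
match = re.search(r"\b1[5-9][0-9]{2}\b", summary)
if match:
    print("first year mentioned:", match.group(0))
```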
- Move `text/` into the same directory, and rename `metadata.csv` to `clean_all.csv` (a Python sketch follows below).
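In Python, that step might look like this (the working-directory path is an assumption; use whatever directory `clean.ipynb` reads from):

```python
import shutil

WORK_DIR = "path/to/working_dir"  # assumed: wherever clean.ipynb expects its input

shutil.move("text", f"{WORK_DIR}/text")
shutil.move("metadata.csv", f"{WORK_DIR}/clean_all.csv")
```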
- Run `clean.ipynb`.
- first 200 cleaned (57 books)
- first 600 cleaned (133 books)
- first 1200 cleaned (238 books)
- first 1200, 5000-6500 (328 books)
- first 1200, 5000-7500 (380 books)
- first 1200, 5000-8500 (412 books)
- first 3000, 5000-6500 (582 books)
- first 3500, 5000-8500 (621 books)
- first 8500 (711 books)
- Select one of the above datasets and download it (larger datasets will take significantly longer to run).
- The code to extract features, build a random forest model, and evaluate that model is located in `feature_tagging.ipynb`; a minimal sketch of the pipeline appears after this list.
- If you are using a dataset and CSV located somewhere other than the `clean_data` directory (e.g., one of the downloaded datasets), change the `root_dir` and `csv_name` variables. After this, all cells in the notebook can be run in sequence without altering any variables.
- After generating the feature vectors, the resulting data can be dumped into CSV files for later examination. This code can be found in the sixth cell, right before running the random forest model.
- More specific information about each cell and each function can be found in comments throughout the notebook.
- Note that this code requires the Stanford POS Tagger and NER extractor in order to work. These can be found in our git repo.
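For orientation, here is a minimal sketch of that pipeline, not the notebook's actual code. The `filename` and `label` columns, the POS-frequency feature set, and the Stanford model paths are all assumptions; `feature_tagging.ipynb` is the authoritative version.

```python
from collections import Counter

import pandas as pd
from nltk.tag import StanfordPOSTagger
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

root_dir = "clean_data/"    # change these two if using a downloaded dataset
csv_name = "clean_all.csv"

# Paths to the tagger model and jar are illustrative.
tagger = StanfordPOSTagger("models/english-bidirectional-distsim.tagger",
                           "stanford-postagger.jar")

def pos_features(text, tags=("NN", "VB", "JJ", "RB")):
    """Relative frequencies of a few POS tags -- a stand-in feature set."""
    counts = Counter(tag for _, tag in tagger.tag(text.split()))
    total = max(sum(counts.values()), 1)
    return [counts[t] / total for t in tags]

meta = pd.read_csv(root_dir + csv_name)
X = [pos_features(open(root_dir + "text/" + f, encoding="utf-8").read())
     for f in meta["filename"]]          # slow on full books
y = meta["label"]                        # e.g. time period or author-age bucket

# Dump the feature vectors to CSV for later examination (cf. the sixth cell).
pd.DataFrame(X).to_csv("features.csv", index=False)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
```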