
broadwell/intertext

 
 


Intertext

Detect and visualize text reuse.

Intertext combines machine learning with interactive data visualizations to surface intertextual patterns in large text collections. Text processing is based on minhashing vectorized strings, and the web viewer is built from interactive JavaScript components.
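To give a feel for what minhashing does, here is a minimal Python sketch that uses only the standard library. It is an illustration of the general technique, not the code Intertext runs: the function names, the three-word shingles, and the use of salted BLAKE2 hashes are all assumptions made for this example.

```python
import hashlib

def shingles(text, size=3):
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def minhash_signature(items, n_permutations=64):
    """For each seeded hash function, keep the smallest hash over all items.

    `items` must be non-empty, otherwise min() raises ValueError.
    """
    signature = []
    for seed in range(n_permutations):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = shingles("the curfew tolls the knell of parting day")
b = shingles("the curfew tolls the knell of passing day")
# The exact Jaccard similarity of these two shingle sets is 4/8 = 0.5;
# the printed estimate should land near that value.
print(estimated_similarity(minhash_signature(a), minhash_signature(b)))
```

Using more hash functions tightens the estimate at the cost of more computation, which is the same trade-off the n_permutations setting describes.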

App preview

Dependencies

This application uses MongoDB as a database. You can install and start MongoDB on OSX with the following:

brew install mongodb
brew services start mongodb

This app also uses Node.js as a web server. You can install Node on OSX with the following command:

brew install node

Quickstart

Once the dependencies outlined above are installed, you can run:

# clone the application source code
git clone https://github.com/YaleDHLab/intertext

# install the Python dependencies
cd intertext && pip install -r requirements.txt --user

# install the node dependencies
npm install --no-optional

# detect reuse in the included sample documents
npm run detect-reuse

# start the web server
npm run production

If you open a web browser to localhost:7092, you will be able to browse discovered intertexts.
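If the page does not load, it can help to check whether anything is listening on the port at all. The following Python sketch does that with a plain TCP connection; the function name is hypothetical, and the defaults simply mirror the localhost:7092 address mentioned above.

```python
import socket

def port_is_open(host="localhost", port=7092, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False

print("server reachable:", port_is_open())
```

A False result only says that no process accepted the connection; it does not say why the server is not running.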

Processing New Data

To process new data, you need to install the app dependencies, then replace the files in data/texts with your text files and replace the metadata file in data/metadata with a new metadata file. Make sure your new text files and metadata files are in the same format as the sample text and metadata files.
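Before running the pipeline on new data, a quick consistency check between the text files and the metadata file can save a failed run. The sketch below is a hypothetical helper, not part of Intertext. It assumes the metadata keys are bare file names such as 34360.txt, as in the sample metadata, and that the texts use the .txt extension.

```python
import json
from pathlib import Path

def find_metadata_mismatches(text_dir, metadata_path):
    """Compare text file names against the top-level keys of the metadata file."""
    text_names = {path.name for path in Path(text_dir).glob("*.txt")}
    with open(metadata_path, encoding="utf-8") as handle:
        metadata_keys = set(json.load(handle))
    return {
        # text files with no metadata entry
        "missing_metadata": sorted(text_names - metadata_keys),
        # metadata entries with no matching text file
        "missing_text": sorted(metadata_keys - text_names),
    }

# Example call from the repository root, using the paths from the sample config:
# print(find_metadata_mismatches("data/texts", "data/metadata/metadata.json"))
```

Two empty lists mean every text file has a metadata entry and every entry points at a file that exists.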

Once your files are in place, you can identify intertexts in the data by running:

npm run detect-reuse

After processing your texts, you can examine the discovered text reuse by running:

npm run production

Then navigate to localhost:7092 and search for an author or text of interest.

config.json

The following values within config.json control the way Intertext discovers text reuse:

| Field | Remarks |
| --- | --- |
| infiles | A glob path to the files to be searched for text reuse |
| metadata | A path to the metadata file describing each input file |
| xml_tag | The XML node from which to extract input text (if applicable) |
| max_cores | The maximum number of CPU cores to use during processing |
| step | Words to skip when sliding each window |
| window_size | Increasing this lowers recall but finds more significant matches |
| *n_permutations | Increasing this raises recall but lowers speed |
| *hashband_length | Increasing this lowers recall but raises speed |
| *min_similarity | Increasing this raises precision but lowers recall |

* = essential analytic parameter
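The interaction between step and window_size is easier to see in code. The following Python sketch shows one plausible reading of those two fields, in which window_size is the number of words per window and step is the stride between window starts. That reading, and the function name, are assumptions and not taken from the Intertext source.

```python
def sliding_windows(words, window_size=14, step=4):
    """Yield (start_index, window) pairs, moving `step` words forward each time.

    Texts shorter than window_size yield a single, shorter window.
    Trailing words that do not fit a full stride are not covered.
    """
    last_start = max(len(words) - window_size, 0)
    for start in range(0, last_start + 1, step):
        yield start, words[start:start + window_size]

words = "the curfew tolls the knell of parting day".split() * 4  # 32 words
for start, window in sliding_windows(words):
    print(start, " ".join(window))
```

Under this reading, a smaller step produces more overlapping windows to compare, and a larger window_size demands longer passages before two windows can look alike.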

Sample config.json file:

{
  "infiles": "data/texts/*.txt",
  "metadata": "data/metadata/metadata.json",
  "xml_tag": false,
  "max_cores": 8,
  "step": 4,
  "window_size": 14,
  "n_permutations": 256,
  "hashband_length": 3,
  "min_similarity": 0.65
}
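The three starred parameters map onto a common pattern: locality-sensitive hashing over minhash signatures, followed by a similarity cutoff. The Python sketch below illustrates that pattern under stated assumptions. The function names are hypothetical, signatures are plain lists of integers, and leftover values that do not fill a whole band are dropped (with the sample values, 256 permutations and a band length of 3 give 85 bands and one unused value). How Intertext itself handles these details is not documented here.

```python
from collections import defaultdict
from itertools import combinations

def hashbands(signature, hashband_length=3):
    """Split a signature into (band_index, values) tuples of hashband_length values."""
    n_bands = len(signature) // hashband_length
    return [
        (i, tuple(signature[i * hashband_length:(i + 1) * hashband_length]))
        for i in range(n_bands)
    ]

def candidate_pairs(signatures, hashband_length=3):
    """Return pairs of ids whose signatures share at least one identical band."""
    buckets = defaultdict(set)
    for key, signature in signatures.items():
        for band in hashbands(signature, hashband_length):
            buckets[band].add(key)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def filter_pairs(pairs, signatures, min_similarity=0.65):
    """Keep candidate pairs whose share of matching signature positions meets the cutoff."""
    kept = []
    for a, b in sorted(pairs):
        sig_a, sig_b = signatures[a], signatures[b]
        similarity = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
        if similarity >= min_similarity:
            kept.append((a, b, similarity))
    return kept

signatures = {"a": [1, 2, 3, 4, 5, 6], "b": [1, 2, 3, 9, 9, 9], "c": [7, 7, 7, 8, 8, 8]}
pairs = candidate_pairs(signatures)
print(pairs)
print(filter_pairs(pairs, signatures, min_similarity=0.5))
```

Longer bands make accidental bucket collisions rarer, so fewer pairs reach the comparison step. That is faster but misses more true matches, which matches the trade-off listed for hashband_length.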

metadata.json

Each corpus must also have a metadata.json file that details metadata for each input file. Each input file should have one top-level key in the metadata file, and each of those keys can have any or all of the following optional attributes (example below):

| Field | Remarks |
| --- | --- |
| author | Author of the text |
| title | Title of the text |
| year | Year in which text was published |
| url | Deeplink to a remote server with the text (or related materials) |
| image | Image of the author in src/assets/images/authors or on remote server |

All metadata fields are optional, though each one is displayed somewhere in the browser interface.

Sample metadata.json file:

{
  "34360.txt": {
    "author": "Thomas Gray",
    "title": "An Elegy wrote in a Country Churchyard.",
    "year": 1751,
    "url": "http://spenserians.cath.vt.edu/TextRecord.php?action=GET&textsid=34360",
    "image": "http://www.poemofquotes.com/thomasgray/thomas-gray.jpg"
  },
  "37519.txt": {
    "author": "Anonymous",
    "title": "Elegy written in Saint Bride's Church-Yard.",
    "year": 1769,
    "url": "http://spenserians.cath.vt.edu/TextRecord.php?action=GET&textsid=37519",
    "image": "src/assets/images/authors/default-headshot.jpg"
  }
}
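Because every attribute is optional, a misspelled attribute name will not raise an obvious error. The small Python sketch below flags attribute names outside the five documented ones. It is a hypothetical helper, and it assumes that unrecognized names are most likely typos, which the source does not state.

```python
KNOWN_FIELDS = {"author", "title", "year", "url", "image"}

def unknown_fields(metadata):
    """Map each file key to any attribute names outside the documented five."""
    problems = {}
    for filename, attributes in metadata.items():
        extra = sorted(set(attributes) - KNOWN_FIELDS)
        if extra:
            problems[filename] = extra
    return problems

metadata = {
    "34360.txt": {"author": "Thomas Gray", "year": 1751},
    "37519.txt": {"autor": "Anonymous"},  # deliberate typo for the example
}
print(unknown_fields(metadata))
```

In practice the dictionary would come from json.load on the metadata file instead of being written inline.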

Deploying on AWS

The following covers steps you can take to deploy this application on an Amazon Linux AMI on AWS.

While creating the instance, add the following Custom TCP Ports to the default security settings:

| Port Range | Source | Description |
| --- | --- | --- |
| 80 | 0.0.0.0/0, ::/0 | HTTP |
| 443 | 0.0.0.0/0, ::/0 | HTTPS |
| 27017 | 0.0.0.0/0, ::/0 | MongoDB |

After creating and ssh-ing to the instance, you can install all application dependencies, process the sample data, and start the web server with the following commands.

sudo yum update -y
sudo yum groupinstall "Development Tools" -y

##
# Node
##

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.32.0/install.sh | bash
. ~/.nvm/nvm.sh
nvm install 6.10.0
node -v

##
# Mongo
##

# write the MongoDB 3.4 yum repository definition
sudo tee /etc/yum.repos.d/mongodb-org-3.4.repo > /dev/null <<'EOF'
[mongodb-org-3.4]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/amazon/2013.03/mongodb-org/3.4/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-3.4.asc
EOF

sudo yum install -y mongodb-org
sudo service mongod start
sudo chkconfig mongod on

##
# Python dependencies
##

sudo yum install libxml2-devel libxslt-devel python-devel -y
wget https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
bash Anaconda2-4.1.1-Linux-x86_64.sh

# accept the license agreement and default install location
source ~/.bashrc
which conda
rm Anaconda2-4.1.1-Linux-x86_64.sh

# create a virtual environment for your Python dependencies
conda create --name 3.5 python=3.5
source activate 3.5

# obtain app source and install Python dependencies
git clone https://github.com/YaleDHLab/intertext
cd intertext
pip install -r requirements.txt --user

##
# Intertext
##

# install node dependencies
npm install

# process texts
npm run detect-reuse

# start the server
npm run production

After running these steps (phew!), you should be able to see the application at http://YOUR_INSTANCE_IP:7092. To make the service run on a different port, specify a different port in server/config.json.

To forward requests for http://YOUR_INSTANCE_IP to port 7092, run:

sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 7092

Users can then reach your application at http://YOUR_INSTANCE_IP without specifying a port.
