Intertext

Detect and visualize text reuse.

Intertext combines machine learning with interactive data visualizations to surface intertextual patterns in large text collections. The text processing is based on minhashing vectorized strings, and the web viewer is built from interactive JavaScript components.
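To make the minhashing idea concrete, here is a minimal sketch in plain Python. It is not Intertext's own code: the function names (`shingles`, `minhash`, `estimate_similarity`), the three-word shingles, and the use of salted `blake2b` hashes in place of random permutations are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in a string (hypothetical helper)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash(items, n_permutations=64):
    """Return a signature: for each salt, the minimum hash value over all items."""
    signature = []
    for seed in range(n_permutations):
        # Each salt stands in for one random permutation of the hash space.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = minhash(shingles("the curfew tolls the knell of parting day"))
b = minhash(shingles("the curfew tolls the knell of parting day"))
print(estimate_similarity(a, b))  # identical inputs give 1.0
```

Two passages that share many shingles agree at many signature positions, so comparing short signatures approximates comparing the full shingle sets.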

Dependencies

This application uses MongoDB as a database. You can install and start MongoDB on OSX with the following:

brew install mongodb
brew services start mongodb

This app also uses Node.js as a web server. You can install Node on OSX with the following command:

brew install node

Quickstart

Once the dependencies outlined above are installed, you can run:

# clone the application source code
git clone https://github.com/YaleDHLab/intertext

# install the Python dependencies
cd intertext && pip install -r requirements.txt --user

# install the node dependencies
npm install --no-optional

# detect reuse in the included sample documents
npm run detect-reuse

# start the web server
npm run production

If you open a web browser to localhost:7092, you will be able to browse discovered intertexts.

Processing New Data

To process new data, first install the app dependencies. Then replace the files in data/texts with your text files, and replace the metadata file in data/metadata with a new metadata file. Your new text and metadata files must use the same format as the sample text and metadata files.
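Before running the detection step, it may help to confirm that the text files and the metadata file agree. The sketch below is not part of Intertext; `check_corpus` is a hypothetical helper, and it assumes that metadata keys are bare file names (as in the sample metadata file) and that the paths match the sample config.json.

```python
import glob
import json
import os


def check_corpus(infiles_glob, metadata_path):
    """Return (text files missing from metadata, metadata keys with no text file)."""
    with open(metadata_path, encoding="utf8") as f:
        metadata = json.load(f)
    # Assumption: each top-level metadata key is the base name of an input file.
    filenames = {os.path.basename(path) for path in glob.glob(infiles_glob)}
    missing = sorted(filenames - set(metadata))
    orphaned = sorted(set(metadata) - filenames)
    return missing, orphaned


# Example call, using the paths from the sample config.json:
# missing, orphaned = check_corpus("data/texts/*.txt", "data/metadata/metadata.json")
```

Both returned lists are empty when every text file has a metadata entry and every entry has a file.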

Once your files are in place, you can identify intertexts in the data by running:

npm run detect-reuse

After processing your texts, you can examine the discovered text reuse by running:

npm run production

Then navigate to localhost:7092 and search for an author or text of interest.

config.json

The following values within config.json control the way Intertext discovers text reuse:

Field	Remarks
infiles	A glob path to the files to be searched for text reuse
metadata	A path to the metadata file describing each input file
xml_tag	The XML node from which to extract input text (if applicable)
max_cores	The maximum number of cpu cores to use during processing
step	Words to skip when sliding each window
window_size	Words in each window; increasing this lowers recall but finds more significant matches
*n_permutations	Increasing this raises recall but lowers speed
*hashband_length	Increasing this lowers recall but raises speed
*min_similarity	Increasing this raises precision but lowers recall
* = essential analytic parameter
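The trade-offs listed for n_permutations and hashband_length match the behavior of standard MinHash banding, where a signature is cut into bands and two windows become candidates if any band is identical. Whether Intertext bands its signatures in exactly this way is an assumption here, and `candidate_probability` is a hypothetical helper, not a function in the app.

```python
def candidate_probability(similarity, n_permutations, hashband_length):
    """Probability that two windows share at least one identical band.

    Assumes standard MinHash LSH banding: the signature of n_permutations
    values is split into bands of hashband_length values each.
    """
    # Leftover hashes that do not fill a whole band are ignored in this sketch.
    n_bands = n_permutations // hashband_length
    return 1 - (1 - similarity ** hashband_length) ** n_bands


# More permutations mean more bands, so more pairs become candidates.
print(candidate_probability(0.3, 256, 3))
print(candidate_probability(0.3, 512, 3))
# Longer bands are harder to match exactly, so fewer pairs become candidates.
print(candidate_probability(0.3, 256, 6))
```

Under this model, raising n_permutations increases the chance that a weakly similar pair is compared at all (higher recall, more work), while raising hashband_length decreases it (lower recall, less work).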

Sample config.json file:

{
  "infiles": "data/texts/*.txt",
  "metadata": "data/metadata/metadata.json",
  "xml_tag": false,
  "max_cores": 8,
  "step": 4,
  "window_size": 14,
  "n_permutations": 256,
  "hashband_length": 3,
  "min_similarity": 0.65
}
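As a rough illustration of how step and window_size interact, the sketch below cuts a text into overlapping word windows using the sample values above. It assumes that each window holds window_size words and that consecutive windows start step words apart; `sliding_windows` is a hypothetical name, and Intertext's own windowing may differ in detail.

```python
def sliding_windows(text, window_size=14, step=4):
    """Return windows of window_size words whose starts are step words apart."""
    words = text.split()
    # Texts shorter than one window still yield a single, shorter window.
    last_start = max(len(words) - window_size, 0)
    return [
        " ".join(words[i:i + window_size])
        for i in range(0, last_start + 1, step)
    ]


text = " ".join(f"w{i}" for i in range(22))
print(len(sliding_windows(text)))  # 3 windows, starting at words 0, 4 and 8
```

A smaller step produces more windows per text and so more comparisons, while a larger window_size requires a longer shared passage before two windows look alike.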

metadata.json

Each corpus must also have a metadata.json file that details metadata for each input file. Each input file should have one top-level key in the metadata file, and each of those keys can have any or all of the following optional attributes (example below):

Field	Remarks
author	Author of the text
title	Title of the text
year	Year in which text was published
url	Deeplink to a remote server with the text (or related materials)
image	Image of the author in `src/assets/images/authors` or on remote server

All metadata fields are optional, though each one is displayed somewhere in the browser interface.

Sample metadata.json file:

{
  "34360.txt": {
    "author": "Thomas Gray",
    "title": "An Elegy wrote in a Country Churchyard.",
    "year": 1751,
    "url": "http://spenserians.cath.vt.edu/TextRecord.php?action=GET&textsid=34360",
    "image": "http://www.poemofquotes.com/thomasgray/thomas-gray.jpg"
  },
  "37519.txt": {
    "author": "Anonymous",
    "title": "Elegy written in Saint Bride's Church-Yard.",
    "year": 1769,
    "url": "http://spenserians.cath.vt.edu/TextRecord.php?action=GET&textsid=37519",
    "image": "src/assets/images/authors/default-headshot.jpg"
  }
}
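A small check like the following can catch misspelled field names before processing. It is only a sketch: `find_metadata_problems` is a hypothetical helper, the field list is taken from the table above, and the requirement that year be an integer is an assumption based on the sample file.

```python
KNOWN_FIELDS = {"author", "title", "year", "url", "image"}


def find_metadata_problems(metadata):
    """Return a list of readable problems found in a parsed metadata.json dict."""
    problems = []
    for filename, record in metadata.items():
        if not isinstance(record, dict):
            problems.append(f"{filename}: expected an object of fields")
            continue
        # Fields outside the documented set are most likely typos.
        for field in sorted(set(record) - KNOWN_FIELDS):
            problems.append(f"{filename}: unrecognized field '{field}'")
        # Assumption: year is an integer, as in the sample file.
        year = record.get("year")
        if year is not None and not isinstance(year, int):
            problems.append(f"{filename}: year should be an integer like 1751")
    return problems


# Example call on a metadata file loaded with the json module:
# import json
# with open("data/metadata/metadata.json", encoding="utf8") as f:
#     print(find_metadata_problems(json.load(f)))
```

An empty list means every entry uses only the documented fields.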

Deploying on AWS

The following steps deploy this application on an Amazon Linux AMI on AWS.

While creating the instance, add the following Custom TCP Ports to the default security settings:

Port Range	Source	Description
80	0.0.0.0/0, ::/0	HTTP
443	0.0.0.0/0, ::/0	HTTPS
27017	0.0.0.0/0, ::/0	MongoDB

After creating and ssh-ing to the instance, you can install all application dependencies, process the sample data, and start the web server with the following commands.

sudo yum update -y
sudo yum groupinstall "Development Tools" -y

##
# Node
##

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.32.0/install.sh | bash
. ~/.nvm/nvm.sh
nvm install 6.10.0
node -v

##
# Mongo
##

sudo touch /etc/yum.repos.d/mongodb-org-3.4.repo
sudo vim /etc/yum.repos.d/mongodb-org-3.4.repo

# paste the following:
[mongodb-org-3.4]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/amazon/2013.03/mongodb-org/3.4/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-3.4.asc

sudo yum install -y mongodb-org
sudo service mongod start
sudo chkconfig mongod on

##
# Python dependencies
##

sudo yum install libxml2-devel libxslt-devel python-devel -y
wget https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
bash Anaconda2-4.1.1-Linux-x86_64.sh

# accept the license agreement and default install location
source ~/.bashrc
which conda
rm Anaconda2-4.1.1-Linux-x86_64.sh

# create a virtual environment for your Python dependencies
conda create --name 3.5 python=3.5
source activate 3.5

# obtain app source and install Python dependencies
git clone https://github.com/YaleDHLab/intertext
cd intertext
pip install -r requirements.txt --user

##
# Intertext
##

# install node dependencies
npm install

# process texts
npm run detect-reuse

# start the server
npm run production

After running these steps (phew!), you should be able to see the application at http://YOUR_INSTANCE_IP:7092. To make the service run on a different port, specify a different port in server/config.json.

To forward requests for http://YOUR_INSTANCE_IP to port 7092, run:

sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 7092

Then users can see your application at http://YOUR_INSTANCE_IP without having to state a port.
