Parser

Run setup.sh to install dependencies and build the parser.

The input must contain one document per line, where each document is a JSON object with a key field and a content field. In the example below, the key field is item_id and the content field is content.

{ "item_id":"doc1", "content":"Here is the content of my document.\nAnd here's another line." }
{ "item_id":"doc2", "content":"Here's another document." }
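
If your documents are not yet in this format, a short script can produce it. The following is a minimal sketch, not part of this package: the helper names to_input_lines and check_input_line are hypothetical, and the field names item_id and content are taken from the example above. It relies on the JSON encoder escaping embedded newlines so that each document stays on a single line.

```python
import json


def to_input_lines(docs, key_field="item_id", value_field="content"):
    """Yield one JSON object per document, each serialized on a single line.

    `docs` is an iterable of (id, text) pairs. The field names are assumptions
    copied from the README example.
    """
    for doc_id, text in docs:
        # json.dumps escapes embedded newlines as \n, so the output has no raw line breaks.
        yield json.dumps({key_field: doc_id, value_field: text}, ensure_ascii=False)


def check_input_line(line, key_field="item_id", value_field="content"):
    """Return True if `line` is a JSON object that contains both fields."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and key_field in obj and value_field in obj


if __name__ == "__main__":
    docs = [
        ("doc1", "Here is the content of my document.\nAnd here's another line."),
        ("doc2", "Here's another document."),
    ]
    # Write the documents to input.json, one JSON object per line.
    with open("input.json", "w", encoding="utf-8") as f:
        for line in to_input_lines(docs):
            f.write(line + "\n")
```

Running the script writes input.json to the current directory. check_input_line can be applied to each line of an existing file to catch malformed records before they reach the pipeline.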

You can run the NLP pipeline on 1 core as follows:

cat input.json | ./run.sh -i json -k "item_id" -v "content" > output.tsv

You can run the NLP pipeline on 16 cores as follows:

./run_parallel.sh -in="input.json" --parallelism=16 -i json -k "item_id" -v "content"

You can run the NLP pipeline as a REST service as follows:

./run.sh -p 8080

The output consists of TSV files that you can load directly into the database.

Setup

This package requires Java 8.