Add Mordecai docs and tests (closes #93) #98

Merged · 5 commits · Jul 6, 2016
40 changes: 33 additions & 7 deletions README.md
@@ -6,17 +6,43 @@ phoenix_pipeline

Turning news into events since 2014.

This system links a series of Python programs to convert the files which have been
downloaded by a [web scraper](https://github.com/openeventdata/scraper) to coded event data which is uploaded to a web site
designated in the config file. The system processes a single day of information, but this
can be derived from multiple text files. The pipeline also implements a filter for
source URLs as defined by the keys in the `source_keys.txt` file. These keys
correspond to the `source` field in the MongoDB instance.
This system links a series of Python programs to convert the files which have
been downloaded by a [web scraper](https://github.com/openeventdata/scraper) to
coded event data which is uploaded to a web site designated in the config file.
The system processes a single day of information, but this can be derived from
multiple text files. The pipeline also implements a filter for source URLs as
defined by the keys in the `source_keys.txt` file. These keys correspond to the
`source` field in the MongoDB instance.
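
As a rough sketch (not the pipeline's actual code), a filter of this kind could be expressed against the MongoDB `source` field roughly as follows; the database and collection names below are assumptions:

```python
# Illustrative only: restrict a Mongo query to sources whitelisted in
# source_keys.txt. The database and collection names here are assumptions.
from pymongo import MongoClient

with open("source_keys.txt") as f:
    source_keys = [line.strip() for line in f if line.strip()]

stories = MongoClient()["event_scrape"]["stories"]
whitelisted = stories.find({"source": {"$in": source_keys}})
```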

For more information please visit the [documentation](http://phoenix-pipeline.readthedocs.org/en/latest/).

## Requirements

The pipeline requires either
[Petrarch](https://github.com/openeventdata/petrarch) or
[Petrarch2](https://github.com/openeventdata/petrarch2) to be installed. Both
are Python programs and can be installed from GitHub using pip.
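For example, a pip install straight from GitHub typically looks something like
`pip install git+https://github.com/openeventdata/petrarch2.git` (substitute
the Petrarch repository for the original version).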

The pipeline assumes that stories are stored in a MongoDB in a particular
format. This format is the one used by the OEDA news RSS scraper. See [the
code](https://github.com/openeventdata/scraper/blob/master/mongo_connection.py)
for details on how it structures stories in Mongo. Using this pipeline with
differently formatted databases will require changing field names throughout
the code. The pipeline also requires that stories have been parsed with
Stanford CoreNLP. See the [simple and
stable](https://github.com/openeventdata/stanford_pipeline) way to do this, or
the [experimental distributed](https://github.com/oudalab/biryani) approach.
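
Purely for orientation, a story document of roughly the expected shape is sketched below. Apart from the `source` field mentioned above, the field names are assumptions; the scraper's `mongo_connection.py` is the authoritative reference for the real schema.

```python
import datetime

# Hypothetical story document; apart from "source", the field names here are
# assumptions rather than the scraper's actual schema.
story = {
    "source": "bbc_world",                    # must match a key in source_keys.txt
    "url": "http://www.example.com/article",
    "date_added": datetime.datetime.utcnow(),
    "content": "Full text of the scraped article ...",
    "parsed_sents": ["(ROOT (S (NP ...) (VP ...)))"],  # CoreNLP output added downstream
}
```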

The pipeline requires one of two geocoding systems to be running: CLIFF-CLAVIN
or Mordecai. For CLIFF, see a VM version
[here](https://github.com/ahalterman/CLIFF-up) or a Docker container version
[here](https://github.com/openeventdata/cliff_container). For Mordecai, see the
setup instructions [here](https://github.com/openeventdata/mordecai). The
version of the pipeline deployed in production currently uses CLIFF/CLAVIN, but
future development will focus on improvements to Mordecai.
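
Which geocoder the pipeline talks to is controlled through `PHOX_config.ini`. The five key names below are taken from `tests/test_geolocation.py` in this pull request; the section name and the host/port values are illustrative assumptions:

```ini
; Sketch of the geolocation settings in PHOX_config.ini (section name and values assumed)
[Geolocation]
geo_service = CLIFF
cliff_host = localhost
cliff_port = 8080
mordecai_host = localhost
mordecai_port = 5000
```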

## Running

To run the program:

python pipeline.py
`python pipeline.py`
13 changes: 13 additions & 0 deletions tests/test_geolocation.py
@@ -0,0 +1,13 @@
from bson.objectid import ObjectId
import datetime
import sys
import os
sys.path.append(os.path.dirname(os.path.realpath(__file__)) + "/../")
import geolocation
import utilities

def test_geo_config():
    # Check that the geolocation settings in PHOX_config.ini parse into the
    # expected set of keys.
    server_details, geo_details, file_details, petrarch_version = utilities.parse_config('PHOX_config.ini')
    geo_keys = geo_details._asdict().keys()
    assert geo_keys == ['geo_service', 'cliff_host', 'cliff_port', 'mordecai_host', 'mordecai_port']
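
One small caveat: comparing `geo_keys` directly against a list assumes Python 2, where `_asdict().keys()` returns a plain list in field order; under Python 3 the same check would need `list(geo_keys)` (or a comparison against a set).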