The `preprocess.py` script

The preprocess.py script is available at https://github.com/whatevery1says/preprocessing/blob/master/preprocess.py.

This script is intended only for preprocessing from the command line. It performs the following steps:

Reads the JSON manifest(s) into a spaCy nlp object.
Removes properties from the manifest, if specified.
Generates a table of spaCy nlp features, sorts it, and adds it without indexes to the manifest. The structure is a list of lists.
Creates a bag of terms dict (not including punctuation and line breaks) and adds it to the manifest.
Adds any additional specified properties (e.g. stems or ngrams) as lists to the manifest.
Adds a list of the document's readability scores to the manifest.
Adds the total word count (skipping punctuation and line breaks) to the manifest.
Adds the language model metadata.
Saves the new manifest over the old one.
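
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in preprocess.py: a regular expression stands in for the spaCy tokenizer so the example runs without a language model, and the function name `preprocess_manifest` and the property names `bag_of_words` and `word_count` are assumptions made for the example. The features table, readability scores, and language model metadata are left out.

```python
import json
import re
from collections import Counter

# Stand-in tokenizer. The real script uses spaCy; a regex is used here only
# so that the sketch runs without installing a language model.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def preprocess_manifest(manifest, source_property, remove_properties=()):
    """Return a copy of the manifest with a bag of terms and a word count added."""
    result = dict(manifest)
    # Read the text before removing anything, in case the source property
    # is itself listed for removal.
    text = result.get(source_property, "")
    for name in remove_properties:
        result.pop(name, None)
    # Punctuation and line breaks never match the regex, so they are skipped.
    tokens = [token.lower() for token in TOKEN_RE.findall(text)]
    result["bag_of_words"] = dict(Counter(tokens))  # hypothetical property name
    result["word_count"] = len(tokens)              # hypothetical property name
    return result


sample = {
    "name": "doc1",
    "content": "The cat sat.\nThe cat slept!",
    "content_scrubbed": "The cat sat.",
}
processed = preprocess_manifest(sample, "content", remove_properties=["content_scrubbed"])
print(json.dumps(processed, indent=2))
```

The function returns a new dict rather than changing its input; the real script saves the new manifest over the old file.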

The entire process took 3-4 seconds for 11 files on my laptop.

The command line arguments are as follows:

--path (required): The file path to the directory containing the JSON manifest file. The script should walk through subdirectories.
--filename (required for a single file): The name of the JSON manifest file, including the .json extension. Omit it to preprocess every manifest in the directory.
--property (required): The name of the JSON property to be preprocessed.
--add-properties (optional): A comma-separated list of properties to be added to the manifest file.
--remove-properties (optional): A comma-separated list of properties to be removed from the manifest file.
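
For illustration, the arguments above could be declared with `argparse` along these lines. This is a sketch under assumptions, not the parser in preprocess.py: the helper names `split_commas` and `build_parser` are hypothetical, and `--filename` is treated as optional here because the directory sample commands omit it.

```python
import argparse


def split_commas(value):
    """Turn 'stems, ngrams' into ['stems', 'ngrams'], dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser():
    """Hypothetical parser mirroring the documented command line arguments."""
    parser = argparse.ArgumentParser(description="Preprocess JSON manifests.")
    parser.add_argument("--path", required=True,
                        help="Directory containing the JSON manifest file(s).")
    parser.add_argument("--filename", default=None,
                        help="Manifest file name with extension (assumed optional).")
    parser.add_argument("--property", required=True, dest="source_property",
                        help="Name of the JSON property to preprocess.")
    parser.add_argument("--add-properties", type=split_commas, default=[],
                        help="Comma-separated properties to add.")
    parser.add_argument("--remove-properties", type=split_commas, default=[],
                        help="Comma-separated properties to remove.")
    return parser


# Parse the same flags used in the sample commands.
args = build_parser().parse_args(
    ["--path=data", "--property=content", "--remove-properties=content_scrubbed"]
)
print(args.path, args.source_property, args.remove_properties)
```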

Preprocessing a single file

Sample commands

python preprocess.py --path=data --filename=2010_10_humanities_student_major_5_askreddit.json --property=content_scrubbed

python preprocess.py --path=data --filename=2010_10_humanities_student_major_5_askreddit.json --property=content --remove-properties=content_scrubbed

Preprocessing a directory of files

Sample commands

python preprocess.py --path=data --property=content_scrubbed

python preprocess.py --path=data --property=content --remove-properties=content_scrubbed
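
One way to collect the manifests for a directory run is to walk the path and keep every .json file, as sketched below. The function name `find_manifests` is hypothetical, and the assumption that every .json file under the path is a manifest may not match what preprocess.py actually does.

```python
import os
import tempfile


def find_manifests(path, filename=None):
    """Return sorted paths of JSON files under path, including subdirectories.

    If filename is given, only files with exactly that name are returned.
    """
    matches = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            if not name.endswith(".json"):
                continue
            if filename is None or name == filename:
                matches.append(os.path.join(root, name))
    return sorted(matches)


# Build a throwaway directory tree to show the behaviour.
with tempfile.TemporaryDirectory() as data_dir:
    os.makedirs(os.path.join(data_dir, "sub"))
    for rel in ("a.json", os.path.join("sub", "b.json"), "notes.txt"):
        with open(os.path.join(data_dir, rel), "w", encoding="utf-8") as handle:
            handle.write("{}")
    found = [os.path.relpath(p, data_dir) for p in find_manifests(data_dir)]
    single = [os.path.relpath(p, data_dir) for p in find_manifests(data_dir, "a.json")]

print(found)
print(single)
```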

To Do

Some fine tuning may be needed for the language model.
Try switching to the larger spaCy language models.
WE1S windowed ngrams need to be added. Right now only normal ngrams work.

Notes:

To use the large language model, first install it from the command line with python -m spacy download en_core_web_lg. I haven't done this because it would take up a lot of space on my laptop for about a 1% improvement in accuracy, but we could do it on the server. Once it is installed, change the model configuration to 'en_core_web_lg' in processing.py.
