- Set up a conda environment:

  ```bash
  # OPTION 1 (recommended): using env.yml
  conda env create -f env.yml

  # OPTION 2: using requirements.txt
  conda create --name amalgum python=3.7
  conda activate amalgum
  pip install -r requirements.txt
  ```
- (Optional) If you have CUDA-capable hardware, add CUDA support:

  ```bash
  conda install "pytorch<1.6" torchvision cudatoolkit -c pytorch
  ```
- Download `punkt`:

  ```bash
  python -c "import nltk; nltk.download('punkt')"
  ```
- (Windows only) Download and install the 64-bit JRE, and set the `JAVA_HOME` environment variable to the location of this JRE (typically under `C:\Program Files\Java\<your jre folder>`)
- Invoke `nlp_controller.py` on the tiny subset to ensure the pipeline is working properly:

  ```bash
  python nlp_controller.py target -i out_tiny
  ```
- Make a new file in `nlp_modules`
- Make a subclass of `NLPModule`. You will need to implement the methods:
  - `__init__`, the constructor
  - `test_dependencies`, which should be used to download any static files (e.g. data, serialized models) that are required for your module's operation
  - `run`, which is the method that the controller will use to invoke your module
- In addition, you will also need to use the class attributes `requires` and `provides` to declare what kinds of NLP processing your module will expect and what kind of processing it will provide, respectively, expressed using values of the `PipelineDep` enum. (For instance, for a POS tagger, `requires = (PipelineDep.TOKENIZE,)` and `provides = PipelineDep.POS_TAG`.)
- The remaining methods, `process_files` and `process_files_multiformat`, are convenience functions that you should consider using in your implementation of `run`.
- See the TreeTagger POS tagging module for a small example.
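Putting the pieces together, a new module might be sketched like this. Note that the `NLPModule` base class and `PipelineDep` enum below are minimal hypothetical stand-ins written for illustration only; the real definitions live in `nlp_modules`, and their actual interfaces may differ (see the TreeTagger module for the authoritative example):

```python
from enum import Enum


class PipelineDep(Enum):
    # Hypothetical subset of the real PipelineDep enum, for illustration only
    TOKENIZE = "tokenize"
    POS_TAG = "pos_tag"


class NLPModule:
    # Stand-in for the real base class in nlp_modules; the actual class
    # also provides process_files and process_files_multiformat helpers
    requires = ()
    provides = None


class MyPOSTagger(NLPModule):
    # Declare pipeline dependencies: this module consumes tokenized text
    # and provides POS tags
    requires = (PipelineDep.TOKENIZE,)
    provides = PipelineDep.POS_TAG

    def __init__(self, config=None):
        # Constructor: store any configuration the controller passes in
        self.config = config

    def test_dependencies(self):
        # Download or verify any static files (data, serialized models)
        # required for this module's operation
        pass

    def run(self, input_dir, output_dir):
        # Entry point the controller uses to invoke the module; a real
        # implementation would likely delegate to a process_files helper
        pass
```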
- Register your module in `nlp_controller.py`:
  - Depending on what's appropriate, either add your module to the default value of `nlp_controller.py`'s `--modules` flag, or invoke `nlp_controller.py` with your module included in `--modules`.
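For example, the second option might look like the following (the module name `my_pos_tagger` is illustrative, and the exact `--modules` list syntax may differ; check the flag's help text):

```bash
# Run the pipeline with a hypothetical new module included in --modules
python nlp_controller.py target -i out_tiny --modules my_pos_tagger
```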