Classifying
Once TEES has been set up with configure.py and the models have been installed, events and relations can be predicted. Assuming the preprocessing tools have also been installed, unparsed text can be used as input. Events and relations are predicted with the program classify.py. It takes several arguments; the mandatory ones are the following:
Argument | Description |
---|---|
-i, --input | The input data, which can be an interaction XML file, a txt ASCII file or an archive of BioNLP Shared Task documents |
-m, --model | The TEES model file. This can be either a user-built file, or one of the default models provided with TEES |
-o, --output | The stem of the output files |
Classification is very simple with a built-in model. To predict events corresponding to the BioNLP'11 GENIA task for some text, classify.py would be called with the following arguments:
python classify.py -i [INPUT] -m GE11 -o [OUTPUT]
where "GE11" refers to the built-in "GE11-test" model. TEES will determine whether preprocessing is needed and, if it is, will use the preprocessing pipeline to prepare the analyses required for GE11-type event extraction. The output consists of several files, usually at least a log file "OUTPUT-log.txt" and an interaction XML file "OUTPUT-pred.xml.gz". If preprocessing was needed, there is also a preprocessor output file "OUTPUT-preprocessed.xml.gz", and if the model is for one of the BioNLP Shared Tasks, a Shared Task format file "OUTPUT-events.tar.gz". If intermediate files need to be saved, a directory for them can be defined with the workdir argument "-w".
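The output file naming described above can be sketched as follows. This is a minimal illustration of the naming scheme, not TEES code; the exact set of files depends on the model and the input.

```python
# Illustrative sketch: the files classify.py typically produces for a
# given output stem, per the description above (helper is hypothetical).
def expected_outputs(stem, preprocessed=True, shared_task=True):
    files = [stem + "-log.txt", stem + "-pred.xml.gz"]
    if preprocessed:
        files.append(stem + "-preprocessed.xml.gz")
    if shared_task:
        files.append(stem + "-events.tar.gz")
    return files

print(expected_outputs("OUTPUT"))
```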
Normally preprocessing is disabled only when classifying a built-in corpus file. If your input is already parsed (or otherwise contains whatever analyses the model file requires), preprocessing can be omitted with the argument
--omitSteps PREPROCESS
The classification program also has a demonstration mode where a single PubMed abstract can be downloaded on the fly and classified. If an integer is passed as the input argument, classify.py attempts to download and classify the article abstract with that PubMed id.
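The way the input argument can be interpreted is sketched below. The dispatch logic and the PubMed id are purely illustrative assumptions; the accepted input types are the ones listed in the argument table above.

```python
# Hypothetical sketch of classify.py input dispatch (not TEES internals):
# an all-digit argument is treated as a PubMed id (demo mode), otherwise
# the file extension suggests the input type.
def input_type(value):
    if value.isdigit():
        return "pubmed-id"            # demo mode: download this abstract
    if value.endswith(".tar.gz"):
        return "shared-task-archive"  # BioNLP Shared Task documents
    if value.endswith(".txt"):
        return "ascii-text"           # unparsed plain text
    return "interaction-xml"          # interaction XML file

print(input_type("19008416"))         # an all-digit id triggers demo mode
```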
The following table lists the models provided with TEES 2.0. They can be used with the -m parameter of classify.py, optionally with the suffix "-devel" or "-test". If no suffix is given, the "-test" model is assumed. Performance may vary slightly depending on the environment (e.g. 32- vs 64-bit) in which the program is used.
Model | Shared Task | Description | Devel F-score | Test F-score |
---|---|---|---|---|
GE | BioNLP'11 | GENIA Event Extraction | 55.20 / 52.81 / 38.01 (b) | 52.57 / ? / 28.12 (b) |
EPI | BioNLP'11 | Epigenetics and Post-translational Modifications | 56.11 | 54.16 |
ID | BioNLP'11 | Infectious Diseases | 50.98 | 53.37 |
BB | BioNLP'11 | Bacteria Biotopes | 35.84 | ? (a) |
BI (c) | BioNLP'11 | Bacteria Gene Interactions | 77.24 | ? (a) |
CO | BioNLP'11 | Protein/Gene Coreference | 29.89 | ? (a) |
REL | BioNLP'11 | Entity Relations | ? (a) | ? (a) |
REN | BioNLP'11 | Bacteria Gene Renaming | 85.04 | ? (a) |
GE09 | BioNLP'09 | GENIA Event Extraction | 49.88 | ? (a) |
DDI (c) | DDI'11 | Drug-drug Interactions | 60.38 | 62.58 |
(a) Most BioNLP'11 hidden test set online evaluation services are currently down for server relocation, so test model performance cannot be measured
(b) GENIA results are Approximate Span/Approximate Recursive criterion F-scores for tasks 1/2/3. The task 2 Approximate Span/Approximate Recursive F-score, used in the Shared Task, is not provided by the current online evaluation system and is thus unavailable for the test set.
(c) In the DDI11 and BI11 tasks, the given named entities include entity types other than "Protein" that cannot be detected with BANNER. For these tasks there are additional, experimental DDI11-FULL and BI11-FULL models which enable DDI11 and BI11 relation extraction from unannotated text.
TEES can be used to process large amounts of data and has been successfully used to extract events for GE11, EPI11 and REL11 tasks from all PubMed abstracts and all PMC full text articles. The resulting event dataset can be used through the EVEX database. TEES 2.0 includes the batch system used for this large scale text mining, enabling the same approach to be used by other researchers.
The batch processing tool batch.py is a simple program that walks a tree of input files and launches processes for those matching defined criteria. It is not specifically tied to any TEES program, but will most commonly be used with classify.py for large scale text mining. In such work, it is recommended to divide the input data into small, manageable batches (e.g. the size of a BioNLP'11 training set would be good for many systems) and then parallelize the processing in a cluster environment. After such an input tree is constructed, batch.py can automate the rest of the processing.
The batch program has a command argument which is the template for the program to be run for each matching input file. For example, we could have the following file tree which needs to be processed with TEES:
dir1
|- input1.tar.gz
|- otherfile.txt
|- dir2
   |- input2.tar.gz
   |- input3.tar.gz
The files input1-3 are archives, each containing a number of documents stored as ASCII txt files. To predict GE-type events for these input files, the following command could be used:
python batch.py -i /dir1 -r '.*input[1-3]*.tar.gz$' -n SLURM --limit 100 -c "python /ABSPATH_TO_TEES/classify.py -i %a -o %a -m GE"
With this command, batch.py walks through the directory dir1 and its subdirectories, and treats as input all files and subdirectories matching the regular expression given with "-r". In this example the cluster uses the SLURM job scheduler, and batch.py is told to use it with the connection argument "-n" (other supported job managers are "PBS", "LSF" and "Unix", for clusters with no specialized job scheduler). The limit argument defines the maximum number of jobs that may run in parallel; when this limit is reached, batch.py waits until a job has finished before submitting the next one.
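The selection made by the -r pattern can be checked quickly in Python. This is a standalone check of the regular expression against the example tree, not batch.py itself:

```python
import re

# The -r pattern from the batch.py command above. Note that [1-3]*
# matches zero or more of the digits 1-3, and the unescaped dots
# match any single character.
pattern = re.compile(r'.*input[1-3]*.tar.gz$')

files = ["input1.tar.gz", "otherfile.txt", "input2.tar.gz", "input3.tar.gz"]
matches = [f for f in files if pattern.match(f)]
print(matches)   # input1-3 match; otherfile.txt is skipped
```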
After the input and job scheduling arguments, the actual command is given with the command argument "-c". This command will be run in a shell, in the directory where the input file is located. For this reason it is very important to use absolute paths; for example, the classify.py program is given with a full path. The absolute path marker %a is used for both the classify.py input file and its output file stem, so the classify.py output files appear alongside the input files in the same directory. The command template supports the following markers, which are replaced with values relevant for the input file batch.py is currently processing:
Marker | Description |
---|---|
%i | The current input file relative to the batch.py input directory |
%a | An absolute path to the current input file |
%b | The filename of the current input file |
%j | A job tag, if defined in batch.py |
%o | The current output directory, if defined in batch.py |
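The marker semantics above can be sketched as a simple template expansion. The expansion code itself is an assumption for illustration, not TEES internals, and %a is assumed to already be an absolute path:

```python
import posixpath  # POSIX path semantics, matching the cluster examples

# Hypothetical expansion of the batch.py command-template markers for
# one input file, following the marker table above.
def expand(template, input_dir, input_file, job_tag="", output_dir=""):
    subs = {
        "%i": posixpath.relpath(input_file, input_dir),  # relative path
        "%a": input_file,                                # absolute path
        "%b": posixpath.basename(input_file),            # file name only
        "%j": job_tag,                                   # job tag
        "%o": output_dir,                                # output directory
    }
    for marker, value in subs.items():
        template = template.replace(marker, value)
    return template

cmd = expand("classify.py -i %a -o %o/%b-%j -m %j",
             "/dir1", "/dir1/dir2/input2.tar.gz",
             job_tag="GE11", output_dir="/outdir/dir2")
print(cmd)
```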
The batch processing program can also be given an optional output argument. If it is defined, the %o marker refers to an output directory corresponding to the input directory. For example, if in the preceding example the -o switch of batch.py was used to define an output directory /outdir, the output files would appear in a mirrored directory structure:
dir1 .................................. outdir
|- input1.tar.gz ...................... |- input1.tar.gz-OUTPUTFILE
|- otherfile.txt ...................... |- (skipped by regex)
|- dir2 ............................... |- dir2
   |- input2.tar.gz ...................    |- input2.tar.gz-OUTPUTFILE
   |- input3.tar.gz ...................    |- input3.tar.gz-OUTPUTFILE
In this case, the batch processing command would be of the form:
python batch.py -i /dir1 -o /outdir -r '.*input[1-3]*.tar.gz$' -n SLURM --limit 100 -c "python /ABSPATH_TO_TEES/classify.py -i %a -o %o/%b -m GE11"
When using a separate output directory, classify.py is given the output argument "%o/%b", which expands to the current output directory plus the input file name, producing the mirrored directory structure shown above.
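How the mirrored output directory (%o) can be derived from the input tree is sketched below. This is an illustration of the behaviour described above under POSIX path semantics, not batch.py's own code:

```python
import posixpath

# For an input file under input_root, build the corresponding output
# directory under output_root, preserving the subdirectory structure.
def mirrored_output_dir(input_root, output_root, input_file):
    rel = posixpath.relpath(posixpath.dirname(input_file), input_root)
    return posixpath.normpath(posixpath.join(output_root, rel))

# A file in a subdirectory maps to the same subdirectory under /outdir:
print(mirrored_output_dir("/dir1", "/outdir", "/dir1/dir2/input2.tar.gz"))
# A file at the top level maps to /outdir itself:
print(mirrored_output_dir("/dir1", "/outdir", "/dir1/input1.tar.gz"))
```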
Sometimes multiple tasks need to be predicted for the same input files. If these tasks use the same preprocessing settings, a lot of time can be saved by re-using the preprocessor output. For example, it might be necessary to also predict the EPI and REL tasks for the example used in the previous section. First, the GE task is predicted and the output is directed into a new directory:
python batch.py -i /dir1 -o /outdir -r '.*input[1-3]*.tar.gz$' -n SLURM --limit 100 -c "python /ABSPATH_TO_TEES/classify.py -i %a -o %o/%b-%j -m %j" -j GE11
Here the job tag switch ("-j") is used to add a unique tag to the job name. In the command template this tag is available through the marker "%j", and is used both to select the task model ("-m %j") and to add the tag to the output files ("-o %o/%b-%j").
Once processing finishes, the output directory will contain the output files, including the preprocessor output, named "outdir/inputfilename-GE11-preprocessed.xml.gz". Since the GE11 preprocessor output is compatible with the EPI11 and REL11 tasks, these files can be re-used as input for the EPI11 and REL11 classifications, which are run with the following commands:
python batch.py -i outdir -n SLURM --limit 100 -r '.*preprocessed.xml.gz$' -c "python /ABSPATH_TO_TEES/classify.py -i %a -o %a-%j -m %j --omitSteps PREPROCESS" -j EPI11
and
python batch.py -i outdir -n SLURM --limit 100 -r '.*preprocessed.xml.gz$' -c "python /ABSPATH_TO_TEES/classify.py -i %a -o %a-%j -m %j --omitSteps PREPROCESS" -j REL11
In these commands, the output directory from the GE11 classification is used as the input directory. Since a new output directory is not defined, the EPI11 and REL11 output files will appear next to the GE11 output files. The regular expression matches only the preprocessor output, ignoring the numerous other GE11 output files. When running the first (GE11) classification, the job tag was useful for marking the output files as belonging to that task; when running multiple classifications for the same input files, the job tag is mandatory. TEES names the jobs by input file name plus job tag, so without a job tag it could not tell which task an input file has already been processed for. Finally, since the preprocessor output is by definition already preprocessed, the omitSteps argument is used to stop classify.py from rerunning the preprocessor.
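The need for the job tag can be made concrete with a small sketch. The naming scheme below follows the description above (input file name plus job tag); the helper itself is hypothetical:

```python
# Hypothetical job naming: input file name, optionally suffixed with
# the job tag, as described above.
def job_name(input_file, job_tag=""):
    return input_file + ("-" + job_tag if job_tag else "")

# Without tags, EPI11 and REL11 jobs for one file would share a name,
# so the scheduler could not tell the two tasks apart:
print(job_name("input1.tar.gz"))
# With tags, each task gets a distinct job name:
print(job_name("input1.tar.gz", "EPI11"))
print(job_name("input1.tar.gz", "REL11"))
```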