Classifying
Once TEES has been set up with configure.py and the models have been installed, events and relations can be predicted. Assuming the preprocessing tools have also been installed, unparsed text can be used as input. Events and relations are predicted with the program classify.py. It takes several arguments; the mandatory ones are the following:
Argument | Description |
---|---|
-i, --input | The input data, which can be an interaction XML file, a txt ASCII file or an archive of BioNLP Shared Task documents |
-m, --model | The TEES model file. This can be either a user-built file, or one of the default models provided with TEES |
-o, --output | The stem of the output files |
Classification is very simple with a built-in model. To predict events corresponding to the BioNLP'11 GENIA task for some text, classify.py would be called with the following arguments:
python classify.py -i [INPUT] -m GE11 -o [OUTPUT]
where "GE11" refers to the built-in "GE11-test" model. TEES will determine whether preprocessing is needed and, if it is, will use the preprocessing pipeline to prepare the analyses required for GE11-type event extraction. The output consists of several files, usually at least a log file "OUTPUT-log.txt" and an interaction XML file "OUTPUT-pred.xml.gz". If preprocessing was needed, there is also a preprocessor output file "OUTPUT-preprocessed.xml.gz", and if the model is for one of the BioNLP Shared Tasks, a Shared Task format file "OUTPUT-events.tar.gz". If intermediate files need to be saved, a directory for them can be defined with the workdir argument "-w".
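The output file naming described above can be sketched as follows. This is a minimal illustration of the naming scheme, not TEES code; the exact set of files depends on the model and the input.

```python
# Illustrative sketch: the files classify.py typically produces for a
# given output stem, per the description above (helper is hypothetical).
def expected_outputs(stem, preprocessed=True, shared_task=True):
    files = [stem + "-log.txt", stem + "-pred.xml.gz"]
    if preprocessed:
        files.append(stem + "-preprocessed.xml.gz")
    if shared_task:
        files.append(stem + "-events.tar.gz")
    return files

print(expected_outputs("OUTPUT"))
```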
Normally preprocessing is disabled only when classifying a built-in corpus file. If your input is already parsed (or otherwise contains whatever analyses the model file requires), preprocessing can be omitted with the argument
--omitSteps PREPROCESS
The classification program also has a demonstration mode where a single PubMed abstract can be downloaded on the fly and classified. If an integer is passed as the input argument, classify.py attempts to download and classify the article abstract with that PubMed id.
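The way the input argument can be interpreted is sketched below. The dispatch logic and the PubMed id are purely illustrative assumptions; the accepted input types are the ones listed in the argument table above.

```python
# Hypothetical sketch of classify.py input dispatch (not TEES internals):
# an all-digit argument is treated as a PubMed id (demo mode), otherwise
# the file extension suggests the input type.
def input_type(value):
    if value.isdigit():
        return "pubmed-id"            # demo mode: download this abstract
    if value.endswith(".tar.gz"):
        return "shared-task-archive"  # BioNLP Shared Task documents
    if value.endswith(".txt"):
        return "ascii-text"           # unparsed plain text
    return "interaction-xml"          # interaction XML file

print(input_type("19008416"))         # an all-digit id triggers demo mode
```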
The following table lists the models provided with TEES 2.0. They can be used with the -m parameter of classify.py, optionally with the suffix "-devel" or "-test". If no suffix is given, the "-test" model is assumed. Performance may vary slightly depending on the environment (e.g. 32- vs 64-bit) in which the program is used.
Model | Shared Task | Description | Devel F-score | Test F-score |
---|---|---|---|---|
GE | BioNLP'11 | GENIA Event Extraction | 55.20 / 52.81 / 38.01 (b) | 52.57 / ? / 28.12 (b) |
EPI | BioNLP'11 | Epigenetics and Post-translational Modifications | 56.11 | 54.16 |
ID | BioNLP'11 | Infectious Diseases | 50.98 | 53.37 |
BB | BioNLP'11 | Bacteria Biotopes | 35.84 | ? (a) |
BI (c) | BioNLP'11 | Bacteria Gene Interactions | 77.24 | ? (a) |
CO | BioNLP'11 | Protein/Gene Coreference | 29.89 | ? (a) |
REL | BioNLP'11 | Entity Relations | ? (a) | ? (a) |
REN | BioNLP'11 | Bacteria Gene Renaming | 85.04 | ? (a) |
GE09 | BioNLP'09 | GENIA Event Extraction | 49.88 | ? (a) |
DDI (c) | DDI'11 | Drug-drug Interactions | 60.38 | 62.58 |
(a) Most BioNLP'11 hidden test set online evaluation services are currently down for server relocation, so test model performance cannot be measured
(b) GENIA results are Approximate Span/Approximate Recursive criterion F-scores for tasks 1/2/3. The task 2 Approximate Span/Approximate Recursive F-score, used in the Shared Task, is not provided by the current online evaluation system and is thus unavailable for the test set.
(c) In the DDI11 and BI11 tasks, the given named entities include entity types other than "Protein" that cannot be detected with BANNER. For these tasks there are additional, experimental DDI11-FULL and BI11-FULL models which enable DDI11 and BI11 relation extraction from unannotated text.
TEES can be used to process large amounts of data and has been successfully used to extract events for GE11, EPI11 and REL11 tasks from all PubMed abstracts and all PMC full text articles. The resulting event dataset can be used through the EVEX database. TEES 2.0 includes the batch system used for this large scale text mining, enabling the same approach to be used by other researchers.
The batch processing tool batch.py is a simple program that walks a tree of input files and launches processes for those matching defined criteria. It is not specifically tied to any TEES program, but will most commonly be used with classify.py for large scale text mining. In such work, it is recommended to divide the input data into small, manageable batches (e.g. the size of a BioNLP'11 training set would be good for many systems) and then parallelize the processing in a cluster environment. After such an input tree is constructed, batch.py can automate the rest of the processing.
The batch program has a command argument which is the template for the program to be run for each matching input file. For example, we could have the following file tree which needs to be processed with TEES:
dir1
|- input1.tar.gz
|- otherfile.txt
|- dir2
   |- input2.tar.gz
   |- input3.tar.gz
The files input1-3 are archives, each containing a number of documents stored as ASCII txt files. To predict GE-type events for these input files, the following command could be used:
python batch.py -i /dir1 -r '.*input[1-3]*.tar.gz$' -n SLURM --limit 100 -c "python /ABSPATH_TO_TEES/classify.py -i %a -o %a -m GE"
With this command, batch.py walks through the directory dir1 and its subdirectories, and treats as input all files and subdirectories matching the regular expression given with "-r". In this example the cluster uses the SLURM job scheduler, and batch.py is told to use it with the connection argument "-n" (other supported job managers are "PBS", "LSF" and "Unix", for clusters with no specialized job scheduler). The limit argument defines the maximum number of jobs that may run in parallel; when this limit is reached, batch.py waits until a job has finished before submitting the next one.
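The selection made by the -r pattern can be checked quickly in Python. This is a standalone check of the regular expression against the example tree, not batch.py itself:

```python
import re

# The -r pattern from the batch.py command above. Note that [1-3]*
# matches zero or more of the digits 1-3, and the unescaped dots
# match any single character.
pattern = re.compile(r'.*input[1-3]*.tar.gz$')

files = ["input1.tar.gz", "otherfile.txt", "input2.tar.gz", "input3.tar.gz"]
matches = [f for f in files if pattern.match(f)]
print(matches)   # input1-3 match; otherfile.txt is skipped
```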
After the input and job scheduling arguments, the actual command is given with the command argument "-c". This command will be run in a shell, in the directory where the input file is located. For this reason it is very important to use absolute paths; for example, the classify.py program is given with a full path. The absolute path marker %a is used for both the classify.py input file and its output file stem, so the classify.py output files appear alongside the input files in the same directory. The command template supports the following markers, which are replaced with values relevant for the input file batch.py is currently processing:
Marker | Description |
---|---|
%i | The current input file relative to the batch.py input directory |
%a | An absolute path to the current input file |
%b | The filename of the current input file |
%j | A job tag, if defined in batch.py |
%o | The current output directory, if defined in batch.py |
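The marker semantics above can be sketched as a simple template expansion. The expansion code itself is an assumption for illustration, not TEES internals, and %a is assumed to already be an absolute path:

```python
import posixpath  # POSIX path semantics, matching the cluster examples

# Hypothetical expansion of the batch.py command-template markers for
# one input file, following the marker table above.
def expand(template, input_dir, input_file, job_tag="", output_dir=""):
    subs = {
        "%i": posixpath.relpath(input_file, input_dir),  # relative path
        "%a": input_file,                                # absolute path
        "%b": posixpath.basename(input_file),            # file name only
        "%j": job_tag,                                   # job tag
        "%o": output_dir,                                # output directory
    }
    for marker, value in subs.items():
        template = template.replace(marker, value)
    return template

cmd = expand("classify.py -i %a -o %o/%b-%j -m %j",
             "/dir1", "/dir1/dir2/input2.tar.gz",
             job_tag="GE11", output_dir="/outdir/dir2")
print(cmd)
```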
The batch processing program can also be given an optional output argument. If it is defined, the %o marker refers to an output directory corresponding to the input directory. For example, if in the preceding example the -o switch of batch.py was used to define an output directory /outdir, the output files would appear in a mirrored directory structure:
dir1 .................................. outdir
|- input1.tar.gz ...................... |- input1.tar.gz-OUTPUTFILE
|- otherfile.txt ...................... |- (skipped by regex)
|- dir2 ............................... |- dir2
   |- input2.tar.gz ...................    |- input2.tar.gz-OUTPUTFILE
   |- input3.tar.gz ...................    |- input3.tar.gz-OUTPUTFILE
In this case, the batch processing command would be of the form:
python batch.py -i /dir1 -o /outdir -r '.*input[1-3]*.tar.gz$' -n SLURM --limit 100 -c "python /ABSPATH_TO_TEES/classify.py -i %a -o %o/%b -m GE11"
When using a separate output directory, classify.py is given the output argument "%o/%b", which expands to the current output directory plus the input file name, producing the mirrored directory structure shown above.
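How the mirrored output directory (%o) can be derived from the input tree is sketched below. This is an illustration of the behaviour described above under POSIX path semantics, not batch.py's own code:

```python
import posixpath

# For an input file under input_root, build the corresponding output
# directory under output_root, preserving the subdirectory structure.
def mirrored_output_dir(input_root, output_root, input_file):
    rel = posixpath.relpath(posixpath.dirname(input_file), input_root)
    return posixpath.normpath(posixpath.join(output_root, rel))

# A file in a subdirectory maps to the same subdirectory under /outdir:
print(mirrored_output_dir("/dir1", "/outdir", "/dir1/dir2/input2.tar.gz"))
# A file at the top level maps to /outdir itself:
print(mirrored_output_dir("/dir1", "/outdir", "/dir1/input1.tar.gz"))
```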
Sometimes multiple tasks need to be predicted for the same input files. If these tasks use the same preprocessing settings, a lot of time can be saved by re-using the preprocessor output. For example, it might be necessary to also predict the EPI and REL tasks for the example used in the previous section. First, the GE task is predicted and the output is directed into a new directory:
python batch.py -i /dir1 -o /outdir -r '.*input[1-3]*.tar.gz$' -n SLURM --limit 100 -c "python /ABSPATH_TO_TEES/classify.py -i %a -o %o/%b-%j -m %j" -j GE11
Here the job tag switch ("-j") is used to add a unique tag to the job name. In the command template this tag is available through the marker "%j", and is used both to select the task model ("-m %j") and to add the tag to the output files ("-o %o/%b-%j").
Once processing finishes, the output directory will contain the output files, including the preprocessor output, named "outdir/inputfilename-GE11-preprocessed.xml.gz". Since the GE11 preprocessor output is compatible with the EPI11 and REL11 tasks, these files can be re-used as input for the EPI11 and REL11 classifications, which are run with the following commands:
python batch.py -i outdir -n SLURM --limit 100 -r '.*preprocessed.xml.gz$' -c "python /ABSPATH_TO_TEES/classify.py -i %a -o %a-%j -m %j --omitSteps PREPROCESS" -j EPI11
and
python batch.py -i outdir -n SLURM --limit 100 -r '.*preprocessed.xml.gz$' -c "python /ABSPATH_TO_TEES/classify.py -i %a -o %a-%j -m %j --omitSteps PREPROCESS" -j REL11
In these commands, the output directory from the GE11 classification is used as the input directory. Since a new output directory is not defined, the EPI11 and REL11 output files will appear next to the GE11 output files. The regular expression matches only the preprocessor output, ignoring the numerous other GE11 output files. When running the first (GE11) classification, the job tag was useful for marking the output files as belonging to that task; when running multiple classifications for the same input files, the job tag is mandatory. TEES names the jobs by input file name plus job tag, so without a job tag it could not tell which task an input file has already been processed for. Finally, since the preprocessor output is by definition already preprocessed, the omitSteps argument is used to stop classify.py from rerunning the preprocessor.
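The need for the job tag can be made concrete with a small sketch. The naming scheme below follows the description above (input file name plus job tag); the helper itself is hypothetical:

```python
# Hypothetical job naming: input file name, optionally suffixed with
# the job tag, as described above.
def job_name(input_file, job_tag=""):
    return input_file + ("-" + job_tag if job_tag else "")

# Without tags, EPI11 and REL11 jobs for one file would share a name,
# so the scheduler could not tell the two tasks apart:
print(job_name("input1.tar.gz"))
# With tags, each task gets a distinct job name:
print(job_name("input1.tar.gz", "EPI11"))
print(job_name("input1.tar.gz", "REL11"))
```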