Building SVM Examples
Most TEES components can also be used outside the main classify/train interface. One such component is the ExampleBuilder, the class that converts Interaction XML into machine learning examples and feature vectors. Using an ExampleBuilder on its own is helpful when TEES feature vectors are to be used as part of an external system.
The ExampleBuilder is the base class for specialized example builders, such as the EntityExampleBuilder and the EdgeExampleBuilder. All of the classes derived from ExampleBuilder can be called independently via the ExampleBuilders/ExampleBuilder.py interface. For example, to produce edge examples for a given interaction XML file, the following command can be used:
python ExampleBuilder.py -b EdgeExampleBuilder -i input.xml -o /tmp/edge-examples -c /tmp/edge-ids.classes -f /tmp/edge-ids.features -p McCC --addIds
The EdgeExampleBuilder is called to produce SVM examples for the interaction XML file "input.xml". A named parse element must be defined with the "-p" switch, just like when using "train.py". The examples produced go to the /tmp directory (in real applications you'll probably save them somewhere else). The example file follows the SVM-multiclass format: each line starts with an integer class id, followed by the feature vector (listed as ordered id:value pairs) and finally, after a "#" separator, the comment section, which contains key:value pairs relevant to each example builder.
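To make the line layout concrete, here is a minimal sketch of a reader for one example line. It is not part of TEES. The sample line and the comment keys ("id", "type") are made up for illustration, and the sketch assumes that the comment pairs are separated by whitespace, which the description above does not state.

```python
def parse_example_line(line):
    """Split one example line into (class_id, features, comment)."""
    # Everything after the first "#" is treated as the comment section.
    body, _, comment_text = line.partition("#")
    tokens = body.split()
    class_id = int(tokens[0])

    # The feature vector is a run of id:value pairs.
    features = {}
    for token in tokens[1:]:
        feature_id, value = token.split(":", 1)
        features[int(feature_id)] = float(value)

    # Assumption: comment pairs are whitespace-separated key:value tokens.
    # Split on the first ":" only, in case a value contains a colon itself.
    comment = {}
    for token in comment_text.split():
        key, sep, value = token.partition(":")
        if sep:
            comment[key] = value
    return class_id, features, comment


# Hypothetical line, not taken from real TEES output.
line = "2 3:1 17:0.5 204:1 # id:example.x0 type:demo"
class_id, features, comment = parse_example_line(line)
```

A reader like this is enough to load the examples into an external system; the integer ids only become meaningful once they are looked up in the class and feature name files.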
The class names (corresponding to the class ids) and the feature names (corresponding to the feature ids) are stored in their own files, defined by the -c and -f options. If these files already exist, the ExampleBuilder will use an already defined id (if it exists) for a class or a feature, making new examples consistent with e.g. a previously trained model. The updated class and feature id sets are saved over the existing files only if the "--addIds" option is used.
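The id reuse described above can be sketched as a simple lookup-or-allocate step. This is an illustration only, not the TEES implementation: the function name, the class names and the rule for numbering new ids (next integer after the largest existing one) are all assumptions.

```python
def get_id(ids, name):
    """Return the id for name, reusing an existing id when there is one."""
    if name in ids:
        # Already defined, e.g. by a previously trained model: reuse it.
        return ids[name]
    # Assumption: a new name gets the next free integer.
    new_id = max(ids.values(), default=0) + 1
    ids[name] = new_id
    return new_id


# Hypothetical class ids, as if loaded from an existing class name file.
class_ids = {"neg": 1, "Binding": 2}

reused = get_id(class_ids, "Binding")      # known name keeps its id
added = get_id(class_ids, "Regulation")    # unknown name gets a new id
again = get_id(class_ids, "Regulation")    # and keeps it on later lookups
```

In this picture, the in-memory mapping grows as new names are seen, and writing it back to disk corresponds to what "--addIds" enables.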
If the "--structure" option is not defined, the input corpus is analysed by the StructureAnalyzer to determine example generation targets and limitations. An existing structure analysis file, generated by Detectors/StructureAnalyzer.py, can be used instead, when given as the argument of this option.
To demonstrate the use of a different example builder, the EntityExampleBuilder can be used for the same "input.xml" file as in the previous example, with the command:
python ExampleBuilder.py -b EntityExampleBuilder -i input.xml -o /tmp/entity-examples -c /tmp/entity-ids.classes -f /tmp/entity-ids.features -p McCC --addIds
Parameters can be passed to the ExampleBuilder class with the -x option. For example, by default the EntityExampleBuilder generates examples only for word tokens that are not part of given entities, and only for sentences with at least one given entity. This is because its default function is to detect trigger words: interaction cues that occur only in sentences with a marked protein/gene name (a given entity) and that do not overlap with such given entities. To build entity examples for all tokens, ignoring all existing entity annotations, pass the EntityExampleBuilder two parameters: "build_for_nameless", to process each sentence even if it has no given entities, and "names", to process each token even if it is part of a given entity. These parameters are given with the -x switch, extending the previous example as follows:
python ExampleBuilder.py -b EntityExampleBuilder -i input.xml -o /tmp/entity-examples -c /tmp/entity-ids.classes -f /tmp/entity-ids.features -p McCC --addIds -x build_for_nameless:names
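When the builder is driven from a script, the command line can be assembled programmatically. The helper below is a hypothetical sketch, not part of TEES. It uses only the switches shown in the commands above, and it joins the -x parameters with ":" because that is how the two parameters appear in the example; whether that separator holds for every parameter is an assumption.

```python
import shlex


def build_command(builder, input_xml, output, class_ids, feature_ids,
                  parse, params=(), add_ids=True):
    """Assemble the ExampleBuilder.py argument list for one run."""
    argv = ["python", "ExampleBuilder.py",
            "-b", builder,
            "-i", input_xml,
            "-o", output,
            "-c", class_ids,
            "-f", feature_ids,
            "-p", parse]
    if add_ids:
        argv.append("--addIds")
    if params:
        # Assumption: parameters are joined with ":" as in the example.
        argv += ["-x", ":".join(params)]
    return argv


argv = build_command("EntityExampleBuilder", "input.xml",
                     "/tmp/entity-examples",
                     "/tmp/entity-ids.classes",
                     "/tmp/entity-ids.features",
                     "McCC",
                     params=["build_for_nameless", "names"])

# Shell-quoted form, suitable for logging or pasting into a terminal.
command = " ".join(shlex.quote(arg) for arg in argv)
```

The list form can be passed directly to `subprocess.run(argv)` from the directory that contains ExampleBuilder.py, which avoids shell quoting problems with unusual file names.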