Incremental distance tree directory
The data produced by incremental distance tree building must persist.
They are stored in a specially organized directory, referred to as an incremental distance tree directory, which contains the following files and subdirectories.
- Biological parameters
  - variance: parameters to be placed after makeDistTree -variance. For example, to implement the variance function L^4.5 this file must contain: pow -variance_power 4.5;
  - hybridness_min: minimum hybridness, > 1, or 0 meaning no hybrid identification;
  - dissim_boundary: point of discontinuity in the dissimilarity distribution (= genospecies barrier), > 0 or NAN;
  - genogroup_barrier: dissimilarity barrier for genogroup identification to find genogroup outliers, > 0, or NAN meaning no genogroup outliers; must be less than or equal to dissim_boundary;
  - delete_criterion_outliers: the presence of this file means that all criterion outliers will be deleted from the tree;
  - good: list of objects which should not be removed as outliers;
  - phen/: link to the directory with "phenotype" attributes of objects, see Taxonomy miscongruence;
- Database
  - server: SQL server name;
  - database: database on the SQL server;
  - bulk: local directory for bulk inserts;
  - bulk_remote: path in Universal Naming Convention to the local bulk directory.
- Grid engine
  - pairs2dissim.grid: minimum number of dissimilarity requests to be processed on a grid, > 0;
  - object2dissim.grid: minimum number of invocations of object2closest.sh to be processed on a grid, > 0;
  - object2closest.sql: the presence of this file means that request_closest.sh queries an SQL database and the number of concurrent connections must be restricted to 30;
  - nogrid: the presence of this file means that threads on the main computer must be used instead of a grid;
- Computer processing
  - large: the presence of this file means that the files in the new/ and phen/ directories are grouped into subdirectories named file2hash <file name>;
  - threads: number of threads; if this file is absent then the parameter -threads 15 will be used;
- version: version number of the files;
- tree: distance tree in an internal format;
- dissim: file with dissimilarities. Format of a line: obj_name1 obj_name2 dissimilarity;
- indiscern: pairs of indiscernible objects;
- new/: directory containing the names (zero-length files) of new objects to be added to the tree and then removed from this directory; this is the reservoir of objects;
- good.expanded: the objects indiscernible from the objects in good;
- outlier_genogroup: optional file with genogroup outliers;
- runlog: start times of the incrementations;
- hist/: directory with the historic versions of the tree files and the temporary files created by distTree_inc.sh, where each file has the extension .version. Files hybrid.version contain the output of makeDistTree -delete_hybrids;
- finished: empty file created if the iterations of the incremental tree building are finished;
- tree.released: link to the latest released tree.
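As a rough illustration of how a driver script might read this directory, the sketch below resolves the thread count (falling back to 15 when the threads file is absent), tests a presence flag such as nogrid, and builds the -variance argument fragment from the variance file. The helper names inc_threads, inc_flag and inc_variance_args are hypothetical and are not part of the distributed scripts; only the file names, the default of 15 and the example variance contents come from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading an incremental distance tree directory.
# Only the file names (threads, variance, nogrid, ...) come from the text above.

# Print the thread count: the contents of "threads", or 15 if the file is absent.
inc_threads() {
  local inc=$1
  if [ -e "$inc/threads" ]; then
    cat "$inc/threads"
  else
    echo 15
  fi
}

# Succeed if a presence-flag file (e.g. large, nogrid, finished) exists.
inc_flag() {
  [ -e "$1/$2" ]
}

# Print the argument fragment to be placed on the makeDistTree command line.
inc_variance_args() {
  echo "-variance $(cat "$1/variance")"
}

# Demo on a throw-away directory.
inc=$(mktemp -d)
echo "pow -variance_power 4.5" > "$inc/variance"   # variance function L^4.5
touch "$inc/nogrid"

echo "threads: $(inc_threads "$inc")"
if inc_flag "$inc" nogrid; then echo "use local threads"; else echo "use grid"; fi
echo "makeDistTree arguments: $(inc_variance_args "$inc")"
```

Keeping each parameter in its own small file means a setting can be changed between incrementations with an ordinary echo, touch or rm, without editing any script.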
The presence of the following files acts as a set of Boolean flags controlling the work of distTree_inc.sh:
- stop: distTree_inc.sh must stop;
- skip: incrementations must be skipped and the next steps must start.
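A minimal sketch of how these flags might be polled between incrementations is shown below. The function name next_action and the precedence of stop over skip when both files exist are assumptions made for illustration; the description above does not say in which order distTree_inc.sh checks them.

```shell
# Hypothetical helper: decide what to do next from the flag files in directory $1.
# Assumption: "stop" takes precedence over "skip" if both are present.
next_action() {
  local inc=$1
  if [ -e "$inc/stop" ]; then
    echo stop
  elif [ -e "$inc/skip" ]; then
    echo skip
  else
    echo increment
  fi
}

# Demo on a throw-away directory.
inc=$(mktemp -d)
next_action "$inc"      # no flag files yet
touch "$inc/skip"
next_action "$inc"      # skip is set
touch "$inc/stop"
next_action "$inc"      # stop and skip are both set
```

Under this scheme, requesting a stop from another terminal is just a matter of creating an empty file named stop in the directory.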
The scripts listed below act like virtual procedures in object-oriented programming. The status of each object of the reservoir (outlier, in the tree, or not yet processed) is stored in a "database", which most conveniently can be a relational database. The database also stores the genogroup partitioning of the objects. The code for database interaction is not provided.
Scripts should finish with exit code 0.
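Since the implementation of these scripts is left to the user, a caller can rely only on the exit-code convention. The sketch below shows one way a driver might wrap such a call; run_hook is a hypothetical name, and the commands true and false stand in for real user scripts.

```shell
# Hypothetical wrapper: run a user-supplied script with its arguments and
# report a non-zero exit code on stderr.
run_hook() {
  "$@"
  local code=$?
  if [ "$code" -ne 0 ]; then
    echo "$1 failed with exit code $code" >&2
  fi
  return "$code"
}

# "true" and "false" stand in for user scripts here.
if run_hook true; then echo "hook succeeded"; fi
if ! run_hook false; then echo "hook failed, stopping"; fi
```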
- genogroup2db.sh: input is file genogroup_table; output is file outlier-genogroup. Update genogroup information in the database, find genogroup outliers and print them in outlier-genogroup;
- objects_in_tree.sh: tell the database whether a list of objects is or is not in the tree. Parameters:
  - list of objects
  - in_tree: 0/1
- object2closest.sh: find approximately 100 closest objects which are in the tree for an input object. Parameter: input_obj_name. Printed output: file with 100 lines where each line has format: found_obj_name;
- outlier2db.sh
- pairs2dissim.sh: compute dissimilarities for requested pairs of objects. This script is invoked in parallel. Parameters:
  - input file of pairs of objects where each line has format: obj_name1 obj_name2
  - new object file or '': for placement of a new object
  - output file where each line has format: obj_name1 obj_name2 dissimilarity, where dissimilarity is a non-negative number, inf or nan.
  - output log file for error messages. These files are created in the temporary directories search/ and dr.out/.
- qc.sh: quality control: check whether the tree file, the subdirectory new/ and the database agree.
- qc_object.sh: quality control of one object.
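To make the pairs2dissim.sh contract more concrete, here is a minimal skeleton written as a shell function that takes the four parameters listed above. It is only a sketch: compute_dissim is a hypothetical placeholder that always prints nan, where a real implementation would compare the two objects, and the fields are assumed to be separated by whitespace.

```shell
# Hypothetical placeholder: a real implementation would compute the
# dissimilarity of objects $1 and $2 (using new object file $3 if non-empty).
compute_dissim() {
  echo nan
}

# Skeleton with the four parameters described above:
#   $1 input file of pairs, $2 new object file or '', $3 output file, $4 log file
pairs2dissim() {
  local pairs=$1 new_obj=$2 out=$3 log=$4
  local a b d
  : > "$out"
  : > "$log"
  while read -r a b; do
    [ -n "$a" ] || continue            # skip empty lines
    if d=$(compute_dissim "$a" "$b" "$new_obj" 2>> "$log"); then
      echo "$a $b $d" >> "$out"        # obj_name1 obj_name2 dissimilarity
    else
      echo "cannot compute dissimilarity for $a $b" >> "$log"
      return 1
    fi
  done < "$pairs"
}

# Demo on a throw-away directory.
tmp=$(mktemp -d)
printf 'obj1 obj2\nobj1 obj3\n' > "$tmp/pairs"
pairs2dissim "$tmp/pairs" '' "$tmp/out" "$tmp/log" && cat "$tmp/out"
```

The function returns 0 only when every requested pair was written to the output file, which matches the requirement that scripts finish with exit code 0.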