Incremental distance tree directory

Vyacheslav Brover edited this page Oct 15, 2024 · 7 revisions

The data produced by incremental distance tree building must persist.

They are stored in a specially organized directory, referred to as an incremental distance tree directory, which contains the following files and subdirectories.

Constant files

  • Biological parameters

    • variance: parameters to be placed after makeDistTree -variance. For example, to implement the variance function L^4.5, this file must contain: pow -variance_power 4.5;

    • hybridness_min: minimum hybridness, > 1, or 0 meaning no hybrid identification;

    • dissim_boundary: point of discontinuity in the dissimilarity distribution (= genospecies barrier), > 0 or NAN;

    • genogroup_barrier: dissimilarity barrier for genogroup identification, used to find genogroup outliers, > 0, or NAN meaning no genogroup outliers; must be less than or equal to dissim_boundary;

    • delete_criterion_outliers: the presence of this file means that all criterion outliers will be deleted from the tree;

    • good: list of objects which should not be removed as outliers;

    • phen/: link to the directory with "phenotype" attributes of objects, see Taxonomy miscongruence;

  • Database

    • server: SQL server name;

    • database: database on the SQL server;

    • bulk: local directory for bulk inserts;

    • bulk_remote: path in Universal Naming Convention to the local bulk directory.

  • Grid engine

    • pairs2dissim.grid: min. number of dissimilarity requests to be processed on a grid, > 0;

    • object2dissim.grid: min. number of invocations of object2closest.sh to be processed on a grid, > 0;

    • object2closest.sql: the presence of this file means that request_closest.sh queries an SQL database and the number of concurrent connections must be restricted to 30;

    • nogrid: the presence of this file means that threads on the main computer must be used instead of a grid;

  • Computer processing

    • large: the presence of this file means that the files in the new/ and phen/ directories are grouped into subdirectories named file2hash <file name>;

    • threads: number of threads; if this file is absent then the parameter -threads 15 will be used;
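
As an illustration of how some of the constant files above might be read, here is a minimal Python sketch. The function name `read_inc_settings` and the keys of the returned dictionary are assumptions made for this example and are not part of distTree_inc.sh; only the file names and the default of 15 threads are taken from the list above.

```python
from pathlib import Path


def read_inc_settings(inc_dir):
    """Read a few constant files from an incremental distance tree directory.

    Hypothetical helper: the name and the returned keys are illustrative only.
    """
    inc = Path(inc_dir)

    def read_text(name, default=None):
        # Return the stripped content of a parameter file, or a default if it is absent.
        path = inc / name
        return path.read_text().strip() if path.is_file() else default

    return {
        # Parameters to be placed after "makeDistTree -variance".
        "variance": read_text("variance"),
        # If the file "threads" is absent, 15 threads are used.
        "threads": int(read_text("threads", "15")),
        # Flag files: only their presence matters, not their content.
        "nogrid": (inc / "nogrid").exists(),
        "large": (inc / "large").exists(),
        "delete_criterion_outliers": (inc / "delete_criterion_outliers").exists(),
    }
```

Treating flag files and value files differently keeps the distinction made above explicit: a flag is tested with `exists()`, whereas a value file is read and parsed.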

Automatically updated files and directories

  • version: version number of the files;

  • tree: distance tree in an internal format;

  • dissim: file with dissimilarities. Format of a line: obj_name1 obj_name2 dissimilarity;

  • indiscern: pairs of indiscernible objects;

  • new/: directory containing the names (zero-length files) of new objects to be added to the tree and then to be removed from this directory, the reservoir of objects;

  • good.expanded: the objects indiscernible with the objects in good;

  • outlier_genogroup: optional file with genogroup outliers;

  • runlog: start times of the incrementations;

  • hist/: directory with the historic versions of the tree files and the temporary files created by distTree_inc.sh, where each file has the extension .version. Files hybrid.version contain the output of makeDistTree -delete_hybrids;

  • finished: empty file created when the iterations of the incremental tree building are finished;

  • tree.released: link to the latest released tree.
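
The line format of the dissim file can be illustrated with a small parser. This is a sketch under assumptions: the function name `parse_dissim_line` is hypothetical, fields are assumed to be separated by whitespace, and the value is assumed to be a non-negative number, inf or nan, as specified for the output of pairs2dissim.sh.

```python
import math


def parse_dissim_line(line):
    """Split one line of the form 'obj_name1 obj_name2 dissimilarity'.

    Hypothetical helper; assumes whitespace-separated fields.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    name1, name2, value = fields
    # float() accepts "inf" and "nan" in addition to ordinary numbers.
    dissim = float(value)
    # nan compares False with everything, so it passes this check.
    if dissim < 0:
        raise ValueError(f"negative dissimilarity: {line!r}")
    return name1, name2, dissim
```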

Manually updated files

The presence of these files acts as a Boolean flag controlling the work of distTree_inc.sh:

  • stop: distTree_inc.sh must stop;

  • skip: incrementations must be skipped and the next steps must start.
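
A driver could poll these flag files between incrementations. The following Python sketch shows one way to do so; the function name `control_action` and the returned strings are hypothetical, and giving stop precedence over skip when both files exist is an assumption, since the behavior of distTree_inc.sh in that case is not described here.

```python
from pathlib import Path


def control_action(inc_dir):
    """Return 'stop', 'skip' or 'continue' depending on which flag files exist.

    Assumption: 'stop' takes precedence if both flag files are present.
    """
    inc = Path(inc_dir)
    if (inc / "stop").exists():
        return "stop"
    if (inc / "skip").exists():
        return "skip"
    return "continue"
```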

Scripts

These scripts act like virtual procedures in object-oriented programming. The status of each object of the reservoir (an outlier, in the tree, or not yet processed) is stored in a "database", which is most conveniently a relational database. The database also stores the genogroup partitioning of the objects. The code for database interaction is not provided.

Scripts should finish with exit code 0.

  • genogroup2db.sh: input is file genogroup_table; output is file outlier-genogroup. Update genogroup information in the database, find genogroup outliers and print them in outlier-genogroup;

  • objects_in_tree.sh: Tell the database whether a list of objects is or is not in the tree. Parameters:

    • list of objects

    • in_tree: 0/1

  • object2closest.sh: find approximately 100 closest objects which are in the tree for an input object. Parameter: input_obj_name. Printed output: file with 100 lines where each line has format: found_obj_name;

  • outlier2db.sh

  • pairs2dissim.sh: Compute dissimilarities for requested pairs of objects. This script is invoked in parallel. Parameters:

    • input file of pairs of objects where each line has format: obj_name1 obj_name2

    • new object file or '': for placement of a new object

    • output file where each line has format: obj_name1 obj_name2 dissimilarity, where dissimilarity is a non-negative number, inf or nan.

    • output log file for error messages. These files are created in the temporary directories search/ and dr.out/.

  • qc.sh: quality control: check whether the tree file, the subdirectory new/ and the database agree.

  • qc_object.sh: quality control of one object.
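
The input/output contract of pairs2dissim.sh described above can be sketched in Python. This is not the actual script: `compute_dissim` is a hypothetical callback standing in for the real dissimilarity computation, the "new object file" parameter is omitted for brevity, and replacing an invalid value with nan is a design choice made only for this example.

```python
import math


def pairs2dissim(pairs_path, out_path, log_path, compute_dissim):
    """Write 'obj_name1 obj_name2 dissimilarity' for each requested pair.

    compute_dissim is a hypothetical callback: (name1, name2) -> float.
    """
    with open(pairs_path) as pairs, open(out_path, "w") as out, open(log_path, "w") as log:
        for line in pairs:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) != 2:
                log.write(f"bad pair line: {line!r}\n")
                continue
            name1, name2 = fields
            value = compute_dissim(name1, name2)
            # Only a non-negative number, inf or nan is allowed in the output.
            if not (math.isnan(value) or value >= 0):
                log.write(f"invalid dissimilarity {value} for {name1} {name2}\n")
                value = math.nan
            out.write(f"{name1} {name2} {value}\n")
```

Errors are written to the log file rather than raised, so that a wrapper script built around this function can still finish with exit code 0.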

Temporary files created and deleted by distTree_inc.sh

  • finished: file is created when all incrementations have been done.