doc/Quick_Start.txt


                            CoMet Quick Start Guide
                            -----------------------

1. How to build
----------------

The following is an example of how to build CoMet on the OLCF Summit system:

export OLCF_PROJECT=stf006  #---replace stf006 with your OLCF account id.
cd $MEMBERWORK/$OLCF_PROJECT
mkdir comet_work
cd comet_work
module load git
git clone https://code.ornl.gov/wjd/genomics_gpu.git
# OPTIONAL:
#   export COMET_REPO_DIR=$PWD/genomics_gpu
#   export COMET_BUILDS_DIR=$PWD
#   export COMET_INSTALLS_DIR=$PWD/installs
./genomics_gpu/scripts/configure_all.sh
./genomics_gpu/scripts/make_all.sh

One can also optionally run the tester for this case as follows:

OLCF_PROJECT=stf006  #---replace stf006 with your OLCF account id.
bsub -P $OLCF_PROJECT -Is -nnodes 2 -W 120 -alloc_flags gpumps $SHELL
cd $MEMBERWORK/$OLCF_PROJECT/comet_work
./genomics_gpu/scripts/test_all.sh

NOTES:

- By default, the build process builds several code versions
(single vs. double precision, release vs. test/debug version, non-MPI vs. MPI
version).

- The choice of single vs. double precision impacts what form of arithmetic
is used for the Czekanowski method.

- Using single precision for CCC and DUO has little impact on the numerics
of the calculation but enables certain performance optimizations
such as metrics compression.

- The code uses an out-of-tree build system. By default the build directory
is placed in the working directory where the configure and build scripts are run.


2. Methods
-----------

CoMet computes comparisons of all pairs (or triples) of vectors in a given
set of vectors in order to identify correlations between similar vectors.
The structure of the computation is similar to a large distributed symmetric
matrix-matrix product (2-way) or tensor product (3-way). It supports several
comparison metrics:

- the Proportional Similarity (Czekanowski) metric, which takes real-valued
vectors as input and produces a single real-valued number for each vector
comparison;

- the CCC and DUO metrics, whose inputs are vectors of 2-bit entries and
whose outputs are a 2X2 or 2X2X2 table of real-valued entries for each
vector comparison.
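
As an illustration of the first metric, the following bash sketch computes a
2-way Proportional Similarity value with awk. It assumes the usual definition
2*sum(min(x_i,y_i)) / sum(x_i+y_i), which is not stated in this guide but
reproduces the values in the 2-way Czekanowski sample output of Section 5.
The function name czekanowski2 is hypothetical and is not part of CoMet.

```shell
# czekanowski2 "x1 x2 ..." "y1 y2 ...": prints the assumed 2-way metric
#   2 * sum_i min(x_i, y_i) / sum_i (x_i + y_i)
# for two vectors given as space-separated lists of equal length.
czekanowski2() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, " ")
    split(b, y, " ")
    for (i = 1; i <= n; i++) {
      xi = x[i] + 0; yi = y[i] + 0        # force numeric comparison
      num += (xi < yi) ? xi : yi
      den += xi + yi
    }
    printf "%.6f\n", 2 * num / den
  }'
}

czekanowski2 "1 2" "3 4"   # synthetic vectors 0 and 1 of the small test case
czekanowski2 "3 4" "4 3"   # synthetic vectors 1 and 3
```

The two calls print 0.600000 and 0.857143, which agree with elements (0,1)
and (1,3) of the small 2-way Czekanowski example in Section 5.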

CoMet is parallelized using MPI and accelerated on modern GPUs. It supports
features such as staging to compute partial results over a series of runs,
thresholding to write results only for highly correlated vectors, and handling
of problems with incomplete data.

For further information please refer to the following:

W. Joubert, J. Nance, D. Weighill, D. Jacobson,
"Parallel Accelerated Vector Similarity Calculations for Genomics Applications,"
Parallel Computing, vol. 75, July 2018, pp. 130-145,
https://www.sciencedirect.com/science/article/pii/S016781911830084X,
https://arxiv.org/abs/1705.08210.

W. Joubert, J. Nance, S. Climer, D. Weighill, D. Jacobson,
"Parallel Accelerated Custom Correlation Coefficient Calculations
for Genomics Applications," Parallel Computing 84 (2019), 15-23,
https://www.sciencedirect.com/science/article/pii/S0167819118301431,
https://arxiv.org/abs/1705.08213

Wayne Joubert, Deborah Weighill, David Kainer, Sharlee Climer, Amy Justice,
Kjiersten Fagnan, Daniel Jacobson, "Attacking the Opioid Epidemic:
Determining the Epistatic and Pleiotropic Genetic Architectures
for Chronic Pain and Opioid Addiction," SC18 Gordon Bell Award paper,
https://dl.acm.org/citation.cfm?id=3291732

"GPU-enabled comparative genomics calculations on leadership-class HPC
systems," http://on-demand.gputechconf.com/gtc/2017/presentation/s7156-wayne-joubert-comparative.pdf

"CoMet: An HPC application for comparative genomics calculations,"
https://www.olcf.ornl.gov/wp-content/uploads/2017/11/2018UM-Day1-Joubert.pdf


3. Options
-----------

genomics_metric: calculation of comparison metrics from genomics data

Usage:

    genomics_metric <option> ...

Options:

    --num_field <value>
        the total number of elements in each vector. Either num_field or
        num_field_local must be specified.

    --num_field_local <value>
        the number of elements in each vector on each process (MPI rank).
        Either num_field or num_field_local must be specified.

    --num_vector <value>
        the total number of vectors. Either num_vector or num_vector_local
        must be specified.

    --num_vector_local <value>
        the number of vectors on each process (MPI rank). Either num_vector
        or num_vector_local must be specified.

    --metric_type <value>
        metric type to compute (czekanowski=Czekanowski (default),
        ccc=CCC, duo=DUO)

    --ccc_multiplier <value>
        front multiplier value used to calculate the CCC metric
        (default floating point value is 4.5 for CCC).

    --duo_multiplier <value>
        front multiplier value used to calculate the DUO metric
        (default floating point value is 4.0 for DUO).

    --ccc_param <value>
        fixed coefficient value used to calculate the CCC or DUO metric
        (default floating point value is 2/3).

    --sparse <value>
        for CCC and DUO metric, interpret each vector entry set to binary
        "10" as a missing data element (yes=yes, no=no (default))

    --num_way <value>
        dimension of metric to compute (2=2-way (default), 3=3-way)

    --all2all <value>
        whether to perform global all-to-all rather than computing
        on each processor separately (yes=yes, no=no (default))

    --compute_method <value>
        manner of computing the result (CPU=cpu, GPU=gpu (default),
        REF=reference implementation (slower, computed on CPU))

    --tc <value>
        for CCC and DUO, perform computation using a standard GEMM computation
        that employs special hardware such as GPU tensor cores when available
        (0=no (default), 1=fp16/fp32, 2=int8/int32, 3=fp32, 4=auto,
        5=int1/int32, 6=int4/int32)

    --num_tc_steps <value>
        for tc methods, tuning parameter to reduce memory usage
        by breaking GEMM into multiple steps (default 1)

    --num_proc_vector <value>
        blocking factor to denote number of blocks used to decompose
        the total number of vectors across processes (MPI ranks)
        (default is the total number of procs requested)

    --num_proc_field <value>
        blocking factor to denote number of blocks used to decompose
        each vector across processes (MPI ranks) (default is 1)

    --num_proc_repl <value>
        process replication factor.  For each block along the vector
        and field axes, this number of processes (MPI ranks) is applied to
        computations for the block (default is 1)

    --num_stage <value>
        the number of stages the computation is divided into, for breaking
        the run campaign into smaller parts and reducing the memory footprint
        (default is 1) (available for 3-way case only)

    --stage_min <value>
        the lowest stage number of the sequence of stages to be computed
        for this run (0-based, default is 0)

    --stage_max <value>
        the highest stage number of the sequence of stages to be computed
        for this run (0-based, default is num_stage-1)

    --num_phase <value>
        the number of phases the computation is divided into, for breaking
        the run campaign into smaller parts and reducing the memory footprint
        (default is 1)

    --phase_min <value>
        the lowest phase number of the sequence of phases to be computed
        for this run (0-based, default is 0)

    --phase_max <value>
        the highest phase number of the sequence of phases to be computed
        for this run (0-based, default is num_phase-1)

    --input_file <value>
        string denoting the filename or file pathname of the binary file
        containing all input vectors.  If this option is not present,
        a synthetic test case is run.

    --problem_type <value>
        the kind of synthetic test case to run. Allowed choices are
        analytic (default) or random.

    --output_file_stub <value>
        string denoting the filename or pathname stub of files
        used to store result metrics.  Metric values are written to files
        whose names are formed by appending a unique identifier
        (e.g., process number) to the end of this string.  If this
        option is absent, no output files are written.

    --histograms_file <value>
        string denoting the filename or pathname of text file
        used to store histograms of metrics entries, used for scoping runs
        to determine metric value thresholds.  Note all computed metrics
        entries are histogrammed irrespective of thresholding.  Only available
        for CCC and DUO metrics.

    --threshold <value>
        output each metric result value only if its magnitude is greater than
        this threshold.  If set negative, no thresholding is done
        (default -1)

    --threshold <valueLL>,<valueLH>,<valueHH>,<valueLLHH>
    --threshold <valueLLL>,<valueLLH>,<valueLHH>,<valueHHH>,<valueLLLHHH>
        alternate threshold option for CCC and DUO methods (2-way, 3-way forms,
        respectively), used to specify individual thresholds for different
        table entries.  For example, for 3-way, <valueLLH> is the threshold
        for table entries (0,0,1) and also table entries with equivalent index
        permutations (0,1,0), (1,0,0).  Setting <valueLLHH> causes the
        additional output of table entries (0,0) and (1,1) (output as individual
        values) if both entries are positive and if both summed together
        exceed this threshold; <valueLLLHHH> is analogous for 3-way.
        All thresholds can be disabled by being set negative; otherwise
        all thresholds must be nonnegative.

    --metrics_shrink <value>
        anticipated reduction factor in the number of metric entries
        stored due to thresholding, used to reduce CPU memory footprint
        to allow larger problems to be solved.  For example, a value of
        10 specifies that memory will be allocated assuming no more than
        1/10 of metric entries pass threshold for any stage, phase or
        process. (default 1.0)

    --checksum <value>
        compute checksum of the metrics results that pass threshold
        (yes=yes (default), no=no)

    --verbosity <value>
        verbosity level of output (0=none, 1=some (default), 2,3=more)
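
As an illustration of the phase options above, the following bash sketch
prints one command line per phase, so that each run computes a single phase
of the whole campaign. Nothing is executed; on Summit each line would be
launched through jsrun as in Section 5. The helper name print_phase_commands,
the problem sizes and the out/phase_ stub are hypothetical. Using a distinct
--output_file_stub per run is a precaution assumed here, not a documented
requirement.

```shell
# Print one genomics_metric command line per phase (0-based), each run
# restricted to a single phase via --phase_min and --phase_max.
print_phase_commands() {
  local num_phase="$1" phase=0
  while [ "$phase" -lt "$num_phase" ]; do
    echo "genomics_metric --num_field 20000 --num_vector 150000" \
         "--metric_type czekanowski --num_way 2 --all2all yes" \
         "--num_phase $num_phase --phase_min $phase --phase_max $phase" \
         "--output_file_stub out/phase_$phase"
    phase=$(( phase + 1 ))
  done
}

print_phase_commands 4
```

The same pattern applies to --num_stage, --stage_min and --stage_max for
3-way runs.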


4. File Formats
----------------

All CoMet I/O makes use of binary files for speed and ease of indexing.
See the tools/ directory of the CoMet repository for tools to convert
between binary and human-readable text formats.

For both input and output, the endianness of integer and floating point
values in the files matches that of the system on which the code is run.

4.1 Input File Formats
-----------------------

CoMet input is stored as a single binary file; each process reads the part
of the file it needs.  Elements of the matrix of vectors are stored in
lexicographical order in the file, with the field dimension varying
fastest.  Raw values are stored in a packed fashion with no indexing
data; dimensions are supplied via command line arguments.

For the Czekanowski metric, values are stored packed in sequence as
4-byte floats or 8-byte doubles, depending on the code version being
used.
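
Because the file carries no header, its size and the position of any element
follow directly from the dimensions. The bash sketch below shows the
arithmetic for the Czekanowski case. The function names are hypothetical, and
the offset formula is an interpretation of "field dimension varying fastest".

```shell
# Expected size in bytes of a Czekanowski input file: one packed value per
# (vector, field) pair and no header. $3 is 4 for float, 8 for double.
czek_file_bytes() {
  echo $(( $1 * $2 * $3 ))            # num_vector * num_field * bytes
}

# Byte offset of the element at vector $1, field $2 (both 0-based), given
# num_field $3 and bytes per value $4. The field index varies fastest.
czek_offset() {
  echo $(( ($1 * $3 + $2) * $4 ))
}

czek_file_bytes 4 2 4     # 4 vectors, 2 fields, float
czek_offset 1 0 2 4       # first field of vector 1
```

These print 32 and 8. A quick sanity check before a run is to compare the
first number with the output of "wc -c < FILE" for the input file.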

CCC/DUO 2-bit values are packed 4 per byte, starting at the least significant
bit of the byte.  For the 2 bits, the higher order bit is considered the
"first" bit, the lower order bit is considered to be "second," thus
for "01" or "(0,1)" the first bit is "0" and the second bit is "1".
Note for the sparse case "(1,0)" is the marker for a missing vector entry.
The last byte of each vector is padded with zeros for the high-order bits
before the next vector in the file is started.
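
The packing rule can be sketched in bash as follows. The function names
pack_byte and ccc_vector_bytes are hypothetical and are not part of CoMet or
its tools/ directory; the sketch only illustrates the bit layout described
above.

```shell
# Pack up to four 2-bit entries, written as 00, 01, 10 or 11, into one byte.
# The first entry occupies the two least significant bits; entries that are
# not supplied leave zero padding in the high-order bits.
pack_byte() {
  local byte=0 pos=0 entry value
  for entry in "$@"; do
    case $entry in
      00) value=0 ;; 01) value=1 ;; 10) value=2 ;; 11) value=3 ;;
      *) echo "bad entry: $entry" >&2; return 1 ;;
    esac
    byte=$(( byte | (value << pos) ))
    pos=$(( pos + 2 ))
  done
  echo "$byte"
}

# Bytes occupied by one vector of $1 fields: 4 entries per byte, rounded up.
ccc_vector_bytes() {
  echo $(( ($1 + 3) / 4 ))
}

pack_byte 00 01 10 11   # prints 228, i.e. binary 11100100
pack_byte 00 01         # prints 4; the unused high-order bits stay zero
ccc_vector_bytes 5      # prints 2
```

The expected size of a CCC/DUO input file is then num_vector times the value
printed by ccc_vector_bytes for num_field.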

4.2 Output File Formats
------------------------

Output files are written one file per process.  The files are written
in binary format as a packed series of results, written in no particular
order.  Each value is stored as two (for 2-way) or three (for 3-way)
4-byte unsigned integer indices, followed by a 4-byte floating point
metric value.

For the Czekanowski metric, each integer index denotes the (0-based) vector
number relevant to the metric value.

For CCC and DUO, the lowest order bit of the integer denotes
the 0/1 index into the relevant 2X2 (or 2X2X2) table entry being written,
and all other bits of the integer denote the (0-based) vector number.
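
A minimal bash sketch of how an output record could be decoded follows. All
function names are hypothetical. The od options used (-A n, -v, -j, -N, -t)
are standard POSIX ones, and od reads values in the native byte order, which
matches the endianness rule stated at the start of this section.

```shell
# Size in bytes of one output record: $1 indices of 4 bytes each (2 for
# 2-way, 3 for 3-way) plus one 4-byte metric value.
record_bytes() {
  echo $(( 4 * $1 + 4 ))
}

# Split a CCC/DUO index into "vector_number table_index": the lowest order
# bit is the 0/1 table index, the remaining bits are the vector number.
decode_ccc_index() {
  echo "$(( $1 >> 1 )) $(( $1 & 1 ))"
}

# Print record number $2 (0-based) of the 2-way output file $1: first the
# two unsigned indices, then the floating point metric value.
dump_2way_record() {
  local file="$1" off=$(( $2 * 12 ))
  od -A n -v -j "$off" -N 8 -t u4 "$file"
  od -A n -v -j "$(( off + 8 ))" -N 4 -t f4 "$file"
}

record_bytes 2       # prints 12
record_bytes 3       # prints 16
decode_ccc_index 7   # prints "3 1": vector 3, table index 1
```

The number of records in a file is its size in bytes divided by the value of
record_bytes. For the Czekanowski metric the indices printed by
dump_2way_record are the vector numbers themselves; for CCC and DUO each
index would be passed through decode_ccc_index.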


5. Execution Examples
----------------------

# The following test runs assume:

OLCF_PROJECT=stf006  #---replace stf006 with your OLCF account id.
bsub -P $OLCF_PROJECT -Is -nnodes 2 -W 120 $SHELL
cd $MEMBERWORK/$OLCF_PROJECT/comet_work

# The following code also assumes bash shell.

#--------------------
# Small case, synthetic test problem, 2-way Czekanowski metric.
#--------------------

AR_PERFORMANCE_FLAGS="PAMI_IBV_ENABLE_DCT=1 PAMI_ENABLE_STRIPING=1 PAMI_IBV_ADAPTER_AFFINITY=0 PAMI_IBV_QP_SERVICE_LEVEL=8 PAMI_IBV_ENABLE_OOO_AR=1"
EXECUTABLE="./install_release_summit/bin/genomics_metric"

env OMP_NUM_THREADS=7 $AR_PERFORMANCE_FLAGS \
jsrun --nrs 1 --rs_per_host 1 \
  --tasks_per_rs 1 --cpu_per_rs 7 --bind packed:7 --gpu_per_rs 1 -X 1 \
  $EXECUTABLE \
    --num_field 2 --num_vector 4 --num_proc_vector 1 \
    --metric_type czekanowski --num_way 2 \
    --compute_method GPU --all2all yes --verbosity 3

vec_proc 0 vec 0 field_proc 0 field 0 value 1.000000e+00
vec_proc 0 vec 0 field_proc 0 field 1 value 2.000000e+00
vec_proc 0 vec 1 field_proc 0 field 0 value 3.000000e+00
vec_proc 0 vec 1 field_proc 0 field 1 value 4.000000e+00
vec_proc 0 vec 2 field_proc 0 field 0 value 2.000000e+00
vec_proc 0 vec 2 field_proc 0 field 1 value 1.000000e+00
vec_proc 0 vec 3 field_proc 0 field 0 value 4.000000e+00
vec_proc 0 vec 3 field_proc 0 field 1 value 3.000000e+00
element (0,1): value: 5.99999999999999978e-01
element (0,2): value: 6.66666666666666630e-01
element (1,2): value: 5.99999999999999978e-01
element (0,3): value: 5.99999999999999978e-01
element (1,3): value: 8.57142857142857095e-01
element (2,3): value: 5.99999999999999978e-01
metrics checksum 0-82898256547-645082804690974176 ctime 0.003131 ops 8.000000e+01 ops_rate 2.554971e+04 ops_rate/proc 2.554971e+04 vcmp 6.000000e+00 cmp 1.200000e+01 ecmp 1.200000e+01 ecmp_rate 3.832456e+03 ecmp_rate/proc 3.832456e+03 vctime 0.000022 mctime 0.000023 cktime 0.000039 intime 0.000069 outtime 0.000044 cpumem 6.720000e+02 gpumem 4.480000e+02 tottime 0.003433

#--------------------
# Larger case, multiple GPUs.
#--------------------

env OMP_NUM_THREADS=7 $AR_PERFORMANCE_FLAGS \
jsrun --nrs 12 --rs_per_host 6 \
  --tasks_per_rs 1 --cpu_per_rs 7 --bind packed:7 --gpu_per_rs 1 -X 1 \
  $EXECUTABLE \
    --num_field 20000 --num_vector 150000 --num_proc_vector 10 \
    --metric_type czekanowski --num_way 2 \
    --compute_method GPU --all2all yes --verbosity 1

metrics checksum 95-731337953569731265-108597638988176688 ctime 27.617216 ops 4.950330e+14 ops_rate 1.792480e+13 ops_rate/proc 1.792480e+12 vcmp 1.124992e+10 cmp 2.249985e+14 ecmp 2.249985e+14 ecmp_rate 8.147038e+12 ecmp_rate/proc 8.147038e+11 vctime 0.015438 mctime 0.676193 cktime 50.526713 intime 1.086587 outtime 0.000027 cpumem 3.300012e+10 gpumem 1.080000e+10 tottime 79.922458
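
The vcmp field is not defined in this guide, but in the runs shown in this
section its value is consistent with the number of distinct vector pairs,
n(n-1)/2, for 2-way runs, and with the number of distinct triples,
n(n-1)(n-2)/6, for 3-way runs. That reading is an assumption. A small bash
check, with hypothetical function names:

```shell
# Number of distinct pairs and triples among $1 vectors.
num_pairs() {
  echo $(( $1 * ($1 - 1) / 2 ))
}
num_triples() {
  echo $(( $1 * ($1 - 1) * ($1 - 2) / 6 ))
}

num_pairs 150000   # prints 11249925000; compare vcmp 1.124992e+10 above
num_pairs 4        # prints 6; compare vcmp in the small 2-way cases
num_triples 4      # prints 4; compare vcmp in the small 3-way cases
```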

#--------------------
# 3-way Czekanowski metric, small case.
#--------------------

env OMP_NUM_THREADS=7 $AR_PERFORMANCE_FLAGS \
jsrun --nrs 1 --rs_per_host 1 \
  --tasks_per_rs 1 --cpu_per_rs 7 --bind packed:7 --gpu_per_rs 1 -X 1 \
  $EXECUTABLE \
    --num_field 2 --num_vector 4 --num_proc_vector 1 \
    --metric_type czekanowski --num_way 3 \
    --compute_method GPU --all2all yes --verbosity 3

vec_proc 0 vec 0 field_proc 0 field 0 value 1.000000e+00
vec_proc 0 vec 0 field_proc 0 field 1 value 2.000000e+00
vec_proc 0 vec 1 field_proc 0 field 0 value 3.000000e+00
vec_proc 0 vec 1 field_proc 0 field 1 value 4.000000e+00
vec_proc 0 vec 2 field_proc 0 field 0 value 2.000000e+00
vec_proc 0 vec 2 field_proc 0 field 1 value 1.000000e+00
vec_proc 0 vec 3 field_proc 0 field 0 value 4.000000e+00
vec_proc 0 vec 3 field_proc 0 field 1 value 3.000000e+00
element (0,1,2): value: 6.92307692307692291e-01
element (0,1,3): value: 7.94117647058823484e-01
element (0,2,3): value: 6.92307692307692291e-01
element (1,2,3): value: 7.94117647058823484e-01
metrics checksum 0-84749404949-5538720434861296 ctime 0.003335 ops 1.280000e+02 ops_rate 3.838082e+04 ops_rate/proc 3.838082e+04 vcmp 4.000000e+00 cmp 8.000000e+00 ecmp 8.000000e+00 ecmp_rate 2.398801e+03 ecmp_rate/proc 2.398801e+03 vctime 0.000021 mctime 0.000037 cktime 0.000044 intime 0.000066 outtime 0.000040 cpumem 1.440000e+03 gpumem 9.600000e+02 tottime 0.003641

#--------------------
# 2-way CCC metric, small case.
#--------------------

env OMP_NUM_THREADS=7 $AR_PERFORMANCE_FLAGS \
jsrun --nrs 1 --rs_per_host 1 \
  --tasks_per_rs 1 --cpu_per_rs 7 --bind packed:7 --gpu_per_rs 1 -X 1 \
  $EXECUTABLE \
    --num_field 2 --num_vector 4 --num_proc_vector 1 \
    --metric_type ccc --num_way 2 \
    --compute_method GPU --all2all yes --verbosity 3

vec_proc 0 vec 0 field_proc 0 field 0 value 00
vec_proc 0 vec 0 field_proc 0 field 1 value 01
vec_proc 0 vec 1 field_proc 0 field 0 value 10
vec_proc 0 vec 1 field_proc 0 field 1 value 11
vec_proc 0 vec 2 field_proc 0 field 0 value 01
vec_proc 0 vec 2 field_proc 0 field 1 value 00
vec_proc 0 vec 3 field_proc 0 field 0 value 11
vec_proc 0 vec 3 field_proc 0 field 1 value 10
element (0,1): values: 0 0 4.68750000000000000e-01 0 1 5.62500000000000000e-01 1 0 0.00000000000000000e+00 1 1 4.68750000000000000e-01
element (0,2): values: 0 0 5.62500000000000000e-01 0 1 4.68750000000000000e-01 1 0 4.68750000000000000e-01 1 1 0.00000000000000000e+00
element (1,2): values: 0 0 2.34375000000000000e-01 0 1 3.90625000000000000e-01 1 0 7.03125000000000000e-01 1 1 2.34375000000000000e-01
element (0,3): values: 0 0 2.34375000000000000e-01 0 1 7.03125000000000000e-01 1 0 3.90625000000000000e-01 1 1 2.34375000000000000e-01
element (1,3): values: 0 0 0.00000000000000000e+00 0 1 4.68750000000000000e-01 1 0 4.68750000000000000e-01 1 1 5.62500000000000000e-01
element (2,3): values: 0 0 4.68750000000000000e-01 0 1 5.62500000000000000e-01 1 0 0.00000000000000000e+00 1 1 4.68750000000000000e-01
metrics checksum 0-245201878478-801640733671948288 ctime 0.003012 ops 0.000000e+00 ops_rate 0.000000e+00 ops_rate/proc 0.000000e+00 vcmp 6.000000e+00 cmp 4.800000e+01 ecmp 1.200000e+01 ecmp_rate 3.984141e+03 ecmp_rate/proc 3.984141e+03 vctime 0.000022 mctime 0.000021 cktime 0.000030 intime 0.000071 outtime 0.000115 cpumem 1.024000e+03 gpumem 7.040000e+02 tottime 0.003367

#--------------------
# Larger case, using tensor cores.
#--------------------

env OMP_NUM_THREADS=7 $AR_PERFORMANCE_FLAGS \
jsrun --nrs 12 --rs_per_host 6 \
  --tasks_per_rs 1 --cpu_per_rs 7 --bind packed:7 --gpu_per_rs 1 -X 1 \
  $EXECUTABLE \
    --num_field 400000 --num_vector 100000 --num_proc_vector 10 \
    --metric_type ccc --num_way 2 \
    --compute_method GPU --tc 1 --num_tc_steps 4 --all2all yes --verbosity 1

metrics checksum 138-217645116196906322-630503947831869440 ctime 23.484884 ops 1.760000e+16 ops_rate 7.494182e+14 ops_rate/proc 7.494182e+13 vcmp 4.999950e+09 cmp 7.999920e+15 ecmp 1.999980e+15 ecmp_rate 8.516031e+13 ecmp_rate/proc 8.516031e+12 vctime 0.209915 mctime 0.434914 cktime 150.661019 intime 15.280343 outtime 0.000036 cpumem 2.480000e+10 gpumem 1.420256e+10 tottime 190.071566

#--------------------
# 3-way CCC metric, small case.
#--------------------

env OMP_NUM_THREADS=7 $AR_PERFORMANCE_FLAGS \
jsrun --nrs 1 --rs_per_host 1 \
  --tasks_per_rs 1 --cpu_per_rs 7 --bind packed:7 --gpu_per_rs 1 -X 1 \
  $EXECUTABLE \
    --num_field 2 --num_vector 4 --num_proc_vector 1 \
    --metric_type ccc --num_way 3 \
    --compute_method GPU --all2all yes --verbosity 3

vec_proc 0 vec 0 field_proc 0 field 0 value 00
vec_proc 0 vec 0 field_proc 0 field 1 value 01
vec_proc 0 vec 1 field_proc 0 field 0 value 10
vec_proc 0 vec 1 field_proc 0 field 1 value 11
vec_proc 0 vec 2 field_proc 0 field 0 value 01
vec_proc 0 vec 2 field_proc 0 field 1 value 00
vec_proc 0 vec 3 field_proc 0 field 0 value 11
vec_proc 0 vec 3 field_proc 0 field 1 value 10
element (0,1,2): values: 0 0 0 1.17187500000000000e-01 0 0 1 1.95312500000000000e-01 0 1 0 2.10937500000000000e-01 0 1 1 1.17187500000000000e-01 1 0 0 0.00000000000000000e+00 1 0 1 0.00000000000000000e+00 1 1 0 2.34375000000000000e-01 1 1 1 0.00000000000000000e+00
element (0,1,3): values: 0 0 0 0.00000000000000000e+00 0 0 1 2.34375000000000000e-01 0 1 0 1.17187500000000000e-01 0 1 1 2.10937500000000000e-01 1 0 0 0.00000000000000000e+00 1 0 1 0.00000000000000000e+00 1 1 0 1.95312500000000000e-01 1 1 1 1.17187500000000000e-01
element (0,2,3): values: 0 0 0 1.17187500000000000e-01 0 0 1 2.10937500000000000e-01 0 1 0 0.00000000000000000e+00 0 1 1 2.34375000000000000e-01 1 0 0 1.95312500000000000e-01 1 0 1 1.17187500000000000e-01 1 1 0 0.00000000000000000e+00 1 1 1 0.00000000000000000e+00
element (1,2,3): values: 0 0 0 0.00000000000000000e+00 0 0 1 1.17187500000000000e-01 0 1 0 0.00000000000000000e+00 0 1 1 1.95312500000000000e-01 1 0 0 2.34375000000000000e-01 1 0 1 2.10937500000000000e-01 1 1 0 0.00000000000000000e+00 1 1 1 1.17187500000000000e-01
metrics checksum 0-20352394550-918734323983581184 ctime 0.003430 ops 0.000000e+00 ops_rate 0.000000e+00 ops_rate/proc 0.000000e+00 vcmp 4.000000e+00 cmp 6.400000e+01 ecmp 8.000000e+00 ecmp_rate 2.332274e+03 ecmp_rate/proc 2.332274e+03 vctime 0.000021 mctime 0.000041 cktime 0.000054 intime 0.000062 outtime 0.000133 cpumem 1.472000e+03 gpumem 8.320000e+02 tottime 0.003840