Subcommand: multiplicity

Edit the multiplicities of queries in jplace files.

Usage: gappa edit multiplicity [options]

Options

Input
`--jplace-path`	Required. `TEXT:PATH(existing)=[] ...` List of jplace files or directories to process. For directories, only files with the extension `.jplace[.gz]` are processed.
`--multiplicity-file`	`TEXT:FILE Excludes: --fasta-path --write-multiplicity-file` File containing a tab-separated list of [sample name,] query name, and multiplicity.
`--fasta-path`	`TEXT:PATH(existing)=[] ... Excludes: --multiplicity-file --write-multiplicity-file` List of fasta files or directories to process. For directories, only files with the extension `.(fasta|fas|fsa|fna|ffn|faa|frn)[.gz]` are processed.
`--keep-full-label`	`FLAG Needs: --fasta-path` If fasta files are used, keep their whole label as the name for jplace pqueries, instead of removing the abundance annotation.
Output
`--write-multiplicity-file`	`FLAG Excludes: --multiplicity-file --fasta-path` Do not change the existing multiplicities, but instead produce a file that lists them.
`--out-dir`	`TEXT=.` Directory to write output files to.
`--file-prefix`	`TEXT` File prefix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--file-suffix`	`TEXT` File suffix for output files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--compress`	`FLAG` If set, compress the output files using gzip. Output file extensions are automatically extended by `.gz`.
Global Options
`--allow-file-overwriting`	`FLAG` Allow existing output files to be overwritten instead of aborting the command.
`--verbose`	`FLAG` Produce more verbose output.
`--threads`	`UINT` Number of threads to use for calculations.
`--log-file`	`TEXT` Write all output to a log file, in addition to standard output to the terminal.

Description

The command edits the multiplicities of pqueries in jplace files and sets them to values given as input. It takes one or more jplace files, as well as an input that lists the new multiplicity for each pquery in those files. The new multiplicities can be provided in two ways:

--multiplicity-file: A simple tab-separated list for each pquery.
--fasta-path: A set of fasta files, from which the header information is used.

See below for the expected format of each. A file suitable for --multiplicity-file can be produced with the --write-multiplicity-file flag, as explained below.

Details

As defined in the specification of the jplace standard, each query (pquery) in a jplace file can have multiple names associated with it. This is useful, for example, if the data contains duplicate sequences that have different names in the original fasta file: if the sequences are identical, so are their placements. It thus makes sense to store the placement positions once, together with the list of names of these duplicates, instead of repeating all placements for every name.

Furthermore, each such name can have a so-called multiplicity, which can be understood as a weight for the name. This is useful, for example, if duplicate sequences in the original data also share the same name (e.g., the hash of the sequence). In this case, not only their placements are identical, but also their names. To keep track of how often the sequence appeared in the original data, its multiplicity can be set accordingly in the jplace file.
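To make the distinction between names and multiplicities concrete, here is a minimal Python sketch that reads them from a jplace document. It assumes the layout of the jplace standard, where a pquery lists plain names under "n" or [name, multiplicity] pairs under "nm"; the tiny document, the read names, and the helper `name_multiplicities` are hypothetical and only serve as illustration. Treating names under "n" as having multiplicity 1 is also an assumption of this sketch.

```python
import json

# Hypothetical minimal jplace document, for illustration only.
JPLACE_TEXT = """
{
  "version": 3,
  "tree": "((A:0.1{0},B:0.2{1}):0.05{2},C:0.3{3});",
  "fields": ["edge_num", "likelihood", "like_weight_ratio",
             "distal_length", "pendant_length"],
  "placements": [
    {"p": [[0, -1234.5, 0.9, 0.01, 0.02]], "n": ["read_a", "read_b"]},
    {"p": [[1, -1300.0, 1.0, 0.03, 0.04]], "nm": [["read_c", 24.0]]}
  ],
  "metadata": {}
}
"""

def name_multiplicities(pquery):
    """Return (name, multiplicity) pairs for one pquery entry."""
    if "nm" in pquery:
        return [(name, float(mult)) for name, mult in pquery["nm"]]
    # Assumption: names listed under "n" count once each.
    return [(name, 1.0) for name in pquery.get("n", [])]

doc = json.loads(JPLACE_TEXT)
pairs = [pair for pq in doc["placements"] for pair in name_multiplicities(pq)]
total = sum(mult for _, mult in pairs)
print(pairs)
print(total)
```

In this example, the first pquery carries two distinct names for identical sequences, while the second carries one name that stands for 24 copies of the same sequence.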

The command edits the multiplicities of pqueries by setting them to the given values. No other data in the input jplace files is changed. The files are not edited in place; instead, new files are written to --out-dir, with file names optionally modified by --file-prefix and --file-suffix.

`--multiplicity-file`

The simplest way to provide new multiplicities is via a list. This tab-separated list file can be given in two formats: with two columns, or with three columns.

Two columns are interpreted as "pquery name" and "new multiplicity". This also works when multiple jplace files are provided, but in that case it might be better to use the three-column format to avoid accidental duplicates across samples.

Three columns are interpreted as "sample name", "pquery name", and "new multiplicity". The sample name is the file name of the jplace file without the .jplace extension:

p1z1r2	FUM0LCO01BV7G2	24
p1z1r2	FUM0LCO01DOIHD	31
p1z1r2	FUM0LCO01CKWR0	5
...

Entries in the table can be wrapped in double quotation marks ("...") if they contain tab characters themselves. If duplicates occur, a warning is printed, and the last multiplicity value for a given pquery name is used. The provided multiplicities can be floating point numbers (e.g., 3.14).
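The rules above (two or three tab-separated columns, optional double quotes, last value wins on duplicates, floating point values allowed) can be sketched in Python as follows. This is not gappa's implementation; the function `read_multiplicities` and the in-memory example are hypothetical, and using `None` as the sample name for two-column rows is a choice made for this sketch.

```python
import csv
import io
import warnings

def read_multiplicities(handle):
    """Map (sample name, pquery name) to multiplicity.

    Two-column rows get None as their sample name.
    """
    table = {}
    for row in csv.reader(handle, delimiter="\t", quotechar='"'):
        if not row:
            continue  # skip empty lines
        if len(row) == 2:
            key, value = (None, row[0]), row[1]
        elif len(row) == 3:
            key, value = (row[0], row[1]), row[2]
        else:
            raise ValueError(f"expected 2 or 3 columns, got {len(row)}")
        if key in table:
            warnings.warn(f"duplicate entry for {key}; keeping the last value")
        table[key] = float(value)
    return table

example = io.StringIO(
    "p1z1r2\tFUM0LCO01BV7G2\t24\n"
    "p1z1r2\tFUM0LCO01DOIHD\t31\n"
    "p1z1r2\tFUM0LCO01BV7G2\t3.14\n"  # duplicate: this value replaces 24
)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the duplicate warning in this demo
    table = read_multiplicities(example)
print(table)
```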

`--fasta-path`

In many pre-processing pipelines, identical sequences are deduplicated prior to analyses to reduce overhead. See for example vsearch for a tool to achieve this. Such tools can annotate the resulting reduced files in order to keep track of the original number of identical sequences (their "abundance"). One popular way is to annotate the sequence label in its fasta file like this:

>FUM0LCO01BV7G2;size=24;
ACGT
>FUM0LCO01DOIHD;size=31;
GATACA
>FUM0LCO01CKWR0;size=5;
CATTAG
...

This information can be used here to set multiplicities. The command expects the base name of the fasta files (that is, without the .fasta or .fasta.gz extension) to be identical to the base name of the corresponding .jplace (or .jplace.gz) file, in order to know which multiplicities to use for which sample.
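The base-name matching can be illustrated with a short Python sketch. The file names and the helper `sample_name` are hypothetical, and treating every extension accepted by --fasta-path the same way as .fasta is an assumption of this sketch, not a statement about gappa's behavior.

```python
from pathlib import Path

# Extensions listed for --fasta-path, plus jplace.
# Assumption: all of them are stripped in the same way as .fasta.
KNOWN_EXTS = {"fasta", "fas", "fsa", "fna", "ffn", "faa", "frn", "jplace"}

def sample_name(path):
    """Strip the directory, an optional .gz, and one known extension."""
    name = Path(path).name
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    stem, dot, ext = name.rpartition(".")
    return stem if dot and ext in KNOWN_EXTS else name

# Hypothetical file names, for illustration only.
jplace_files = ["placed/p1z1r2.jplace", "placed/p1z1r3.jplace.gz"]
fasta_files = ["reads/p1z1r2.fasta.gz", "reads/p1z1r3.fasta", "reads/other.fasta"]

# Pair each jplace file with the fasta file that shares its base name.
fasta_by_sample = {sample_name(f): f for f in fasta_files}
matches = {j: fasta_by_sample.get(sample_name(j)) for j in jplace_files}
print(matches)
```

A jplace file without a matching fasta file maps to `None` here, which makes missing or misnamed inputs easy to spot before running the command.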

The following annotation formats are supported:

Via the >name;size=123; annotation.
Via the >name;weight=3.14; annotation.
Via underscore at the end of the label: >name_123

The first and the last option are common annotations; see swarm for a popular OTU clustering tool that supports both of them. Both expect integers. To also support floating point numbers, the weight annotation shown above is additionally allowed. Note that if both size and weight are provided, they are multiplied to get the final multiplicity for the pquery.

By default, the pquery name is assumed to be just the first part of the fasta label, that is, the above annotations (and, if present, other semicolon-separated attributes) are removed. However, typical placement programs do not remove this information, but rather name the pquery using the full fasta label. Hence, the pquery name in a jplace file might be FUM0LCO01BV7G2;size=24;. In order to use this full label for finding pqueries, set the --keep-full-label flag.
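The label handling described above can be sketched in Python. This is an illustration under stated assumptions, not gappa's parser: the function `parse_label` is hypothetical, a label without any annotation is assumed to get multiplicity 1, and the underscore form is only tried when neither size nor weight is present.

```python
import re

def parse_label(label, keep_full_label=False):
    """Return (pquery name, multiplicity) for a fasta label without the '>'."""
    parts = [p for p in label.split(";") if p]
    name = parts[0]
    # Collect semicolon-separated key=value attributes after the name.
    attrs = dict(p.split("=", 1) for p in parts[1:] if "=" in p)

    multiplicity = 1.0
    annotated = False
    for key in ("size", "weight"):
        if key in attrs:
            multiplicity *= float(attrs[key])  # size and weight are multiplied
            annotated = True

    if not annotated:
        # Assumption: fall back to the trailing "_123" form.
        match = re.fullmatch(r"(.+)_(\d+)", name)
        if match:
            name, multiplicity = match.group(1), float(match.group(2))

    # With keep_full_label, the whole label is used to look up the pquery.
    return (label if keep_full_label else name), multiplicity

print(parse_label("FUM0LCO01BV7G2;size=24;"))
print(parse_label("name;size=2;weight=1.5;"))
print(parse_label("name_123"))
print(parse_label("FUM0LCO01BV7G2;size=24;", keep_full_label=True))
```

The last call mirrors the --keep-full-label case: the multiplicity is still taken from the annotation, but the pquery is looked up under the complete label.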

`--write-multiplicity-file`

If set, a file listing the current multiplicities of the pqueries in the input jplace files is written, and no new jplace files are produced. The values in the file can then be changed as needed, and the file can be used as input to --multiplicity-file to actually change the multiplicities in the jplace files. The file always uses the three-column format explained above; it is named multiplicities.csv, with the name optionally modified by --file-prefix and --file-suffix.
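As an example of changing the values between the two steps, the following Python sketch rescales every multiplicity in a three-column table. The function `rescale` and the in-memory data are hypothetical; in practice, the input would be the written multiplicities file and the output a new file passed to --multiplicity-file.

```python
import csv
import io

def rescale(src, dst, factor):
    """Copy a three-column multiplicity table, multiplying each value by factor."""
    reader = csv.reader(src, delimiter="\t", quotechar='"')
    writer = csv.writer(dst, delimiter="\t", quotechar='"', lineterminator="\n")
    for sample, name, mult in reader:
        writer.writerow([sample, name, float(mult) * factor])

# In-memory stand-ins for the written file and the edited copy.
src = io.StringIO("p1z1r2\tFUM0LCO01BV7G2\t24\np1z1r2\tFUM0LCO01DOIHD\t31\n")
dst = io.StringIO()
rescale(src, dst, 0.5)
print(dst.getvalue())
```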

Citation

When using this method, please do not forget to cite:

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070