docs: Reorganise and update them.

wwood · Apr 26, 2024 · 5f6c542 · 5f6c542
1 parent 0e5f187
commit 5f6c542
Showing 15 changed files with 374 additions and 158 deletions.
diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -0,0 +1,24 @@
+# FAQ
+#### Can you target the 16S rRNA gene instead of the default set of single copy marker genes with SingleM?
+Yes. By default, SingleM builds OTU tables from protein genes rather than 16S because this in general gives more strain-level resolution due to redundancy in the genetic code. If you are really keen on using 16S, then you can use SingleM with a 16S SingleM package (spkg). There is a [repository of auxiliary packages](https://github.com/wwood/singlem_extra_packages) which includes a 16S package that is suitable for this purpose. The resolution won't be as high taxonomically, and there are issues around copy number variation, but it could be useful to use 16S for various reasons e.g. linking it to an amplicon study or using the GreenGenes taxonomy. For now there's no 16S spkg that gets installed by default, so you have to use the `--singlem-packages` flag in `pipe` mode, pointing to a separately downloaded package - see [https://github.com/wwood/singlem_extra_packages](https://github.com/wwood/singlem_extra_packages). Searching for 16S reads is also much slower than searching for protein-encoding reads.
+
+#### How should SingleM be run on multiple samples?
+There are two ways. It is possible to specify multiple input files to the `singlem pipe` subcommand directly by space separating them. Alternatively, `singlem pipe` can be run on each sample and the OTU tables combined using `singlem summarise`. The results should be identical, though there are some performance trade-offs. For large numbers of metagenomes (>100) it is probably preferable to run each sample individually or in smaller groups.
+
+Note that the performance of a single `pipe` run on many genomes improved drastically in version 0.17.0, and it is now sensible to run up to 10,000 genomes at a time.
+
+#### What is the difference between the `num_hits` and `coverage` columns in the OTU table and taxonomic profiles generated by the pipe mode?
+`num_hits` is the number of reads found from the sample in that OTU. The
+`coverage` is the expected coverage of a genome with that OTU sequence i.e. the
+average number of bases covering each position in a genome after read mapping.
+This is calculated from `num_hits`. In particular, `num_hits` is treated as the 'kmer
+coverage' used by genome assembly programs, and so `coverage` is
+calculated according to the following formula, adapted from the one given in
+the Velvet assembler's
+[manual](https://raw.githubusercontent.com/dzerbino/velvet/master/Manual.pdf):
+
+```
+coverage = num_hits * L / (L - k + 1)
+```
+
+Where `L` is the length of a read and `k` is the length of the OTU sequence including inserts and gaps (usually `60` bp).
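As a worked illustration of the FAQ formula above, here is a minimal Python sketch. The function name `coverage_from_hits` and the 151 bp read length are assumptions chosen for the example; this is not SingleM's own code, only the arithmetic from the formula.

```python
def coverage_from_hits(num_hits: int, read_length: int, window_length: int = 60) -> float:
    """Apply coverage = num_hits * L / (L - k + 1).

    read_length is L and window_length is k (usually 60 bp).
    Hypothetical helper for illustration only.
    """
    # Number of distinct positions at which a window of length k fits in a read.
    windows_per_read = read_length - window_length + 1
    if windows_per_read <= 0:
        raise ValueError("read_length must be at least window_length")
    return num_hits * read_length / windows_per_read


# One hit from an assumed 151 bp read: 1 * 151 / 92
print(round(coverage_from_hits(1, 151), 2))  # prints 1.64
```

The correction factor `L / (L - k + 1)` is always at least 1 and grows as reads get shorter, because fewer positions in a short read can span the whole window.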
diff --git a/docs/Glossary.md b/docs/Glossary.md
@@ -0,0 +1,55 @@
+# Glossary
+
+## **Taxonomic profile**
+A tab-separated table containing the estimated abundances of GTDB taxons in a metagenome. It is in TSV format with 3 columns, with each row corresponding to a taxon. A taxonomic profile may also be called a **condensed profile**, since it is the output of the `condense` algorithm within the main `pipe` workflow. Taxonomic profiles can be converted to other formats using `singlem summarise`. Columns:
+  1. sample name. A taxonomic profile can consist of more than one sample. Usually all the taxons in the first sample are listed, and then the taxons in the second sample, and so on.
+  2. coverage of that taxon. This is an approximation of the total read coverage of all genomes from this taxon. However, note that this coverage does not include the coverage of sub-taxons. For instance, the coverage of a species is not included in the coverage shown for its genus.
+  3. taxonomy string of the taxon
+```
+sample	coverage	taxonomy
+ERR1914274	0	Root
+ERR1914274	3.16	Root; d__Bacteria
+ERR1914274	0	Root; d__Bacteria; p__Pseudomonadota
+ERR1914274	0.06	Root; d__Bacteria; p__Pseudomonadota; c__Gammaproteobacteria
+ERR1914274	0	Root; d__Bacteria; p__Bacillota_A
+ERR1914274	0.61	Root; d__Bacteria; p__Bacillota_A; c__Clostridia
+ERR1914274	0	Root; d__Bacteria; p__Bacteroidota
+ERR1914274	0.39	Root; d__Bacteria; p__Bacteroidota; c__Bacteroidia
+ERR1914274	0	Root; d__Bacteria; p__Bacillota
+...
+```
+
+## **OTU table**
+A table containing window sequences per metagenome/contig and marker gene. It may be in default form (a TSV with 6 columns, like below), or an extended form with more detail in further columns. The default OTU table output from [pipe](/tools/pipe), [renew](/tools/renew) and [summarise](/tools/summarise) subcommands has 6 columns, with one sequence per row. The extended form OTU table and archive OTU tables have further information (see below). Columns of a default OTU table:
+  1. marker name
+  2. sample name
+  3. sequence of the OTU
+  4. number of reads detected from that OTU
+  5. estimated coverage of a genome from this OTU
+  6. "median" taxonomic classification of each of the reads in the OTU i.e. the most specific taxonomy that 50%+ of the reads agree with.
+```
+gene    sample  sequence        num_hits        coverage        taxonomy
+4.21.ribosomal_protein_S19_rpsS my_sequences  TGGTCGCGCCGTTCGACGGTCACTCCGGACTTCATCGGCCTACAGTTCGCCGTGCACATC    1       1.64    Root; d__Bacteria; p__Proteobacteria; c__Deltaproteobacteria; o__Desulfuromonadales
+4.21.ribosomal_protein_S19_rpsS my_sequences  TGGTCGCGGCGCTCAACCATTCTGCCCGAGTTCGTCGGCCACACCGTGGCCGTTCACAAC    1       1.64    Root; d__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae; g__Candidatus_Solibacter; s__Candidatus_Solibacter_usitatus
+```
+
+## **OTU table (extended form)**
+The extended OTU table form, generated with the `--output-extras` option to the [pipe](/tools/pipe), [renew](/tools/renew) and [summarise](/tools/summarise) subcommands, has all the columns of a regular OTU table, plus several additional columns which contain more information about each OTU:
+  1. read_names - the names of the reads which encode the OTU sequence
+  2. nucleotides_aligned - the number of nucleotides which aligned to the window (usually 60, but can be more or less if there are gaps or inserts)
+  3. taxonomy_by_known? - whether the taxonomy of the OTU was determined by known genomes (TRUE) or by the reads themselves (FALSE). Currently this is a disused column and is always marked FALSE.
+  4. read_unaligned_sequences - the raw sequences of the reads which encode the OTU sequence
+  5. equal_best_hit_taxonomies - the taxonomies of the best hits to the OTU sequence, if there are multiple equally good hits. This is a JSON array of strings.
+
+
+## **Archive OTU table**
+Similar to an extended form OTU table, but in JSON form for machine readability and with the formatting version recorded. The [renew](/tools/renew) subcommand, which re-analyses a dataset, requires this format of OTU table rather than the default tab-separated OTU table format. The canonical file extension for archive OTU tables is `.json`.
+
+## **SingleM package (spkg)** 
+Reference data for one particular marker gene and its window position. The canonical file extension for SingleM packages is `.spkg`.
+
+## **SingleM metapackage (smpkg)** 
+A collection of SingleM packages, with additional indices. The canonical file extension for SingleM metapackages is `.smpkg`.
+
+## **SingleM database** 
+An OTU table which has been converted to SQLite3 format and sequence similarity search indexes. Canonically SingleM databases are named with the `.sdb` extension, but this is not enforced. SingleM databases are created with the [makedb](/advanced/makedb) subcommand, and queried with the [query](/advanced/query) subcommand.
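The glossary entry for taxonomic profiles notes that the coverage of a taxon excludes its sub-taxons. The sketch below shows one way to derive clade-inclusive totals from the 3-column TSV, using the rows of the glossary's example (which is truncated with `...`, so the totals cover only that excerpt). The function name `clade_coverage` is hypothetical and this is not part of SingleM.

```python
import csv
import io
from collections import defaultdict

# Excerpt of the example taxonomic profile from the glossary (tab-separated).
PROFILE = """sample\tcoverage\ttaxonomy
ERR1914274\t0\tRoot
ERR1914274\t3.16\tRoot; d__Bacteria
ERR1914274\t0\tRoot; d__Bacteria; p__Pseudomonadota
ERR1914274\t0.06\tRoot; d__Bacteria; p__Pseudomonadota; c__Gammaproteobacteria
ERR1914274\t0\tRoot; d__Bacteria; p__Bacillota_A
ERR1914274\t0.61\tRoot; d__Bacteria; p__Bacillota_A; c__Clostridia
ERR1914274\t0\tRoot; d__Bacteria; p__Bacteroidota
ERR1914274\t0.39\tRoot; d__Bacteria; p__Bacteroidota; c__Bacteroidia
ERR1914274\t0\tRoot; d__Bacteria; p__Bacillota
"""


def clade_coverage(profile_text: str) -> dict:
    """Sum each taxon's own coverage into itself and all of its ancestors.

    Returns a dict keyed by (sample, taxonomy string).
    """
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(profile_text), delimiter="\t")
    for row in reader:
        lineage = row["taxonomy"].split("; ")
        own_coverage = float(row["coverage"])
        # Credit this row's coverage to every prefix of its lineage.
        for depth in range(1, len(lineage) + 1):
            ancestor = "; ".join(lineage[:depth])
            totals[(row["sample"], ancestor)] += own_coverage
    return dict(totals)


for (sample, taxon), total in clade_coverage(PROFILE).items():
    print(sample, round(total, 2), taxon, sep="\t")
```

Keying on the sample name as well as the taxonomy string keeps the totals separate when a profile contains more than one sample.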
diff --git a/docs/Installation.md b/docs/Installation.md
@@ -84,4 +84,8 @@ export PATH=$PWD:$PATH
 singlem -h
 ```
 
-After this, you'll also need to procure the reference data (the "metapackage"). See [singlem data](/tools/data).
+After this, you'll also need to procure the reference data (the "metapackage"). See [singlem data](/tools/data).
+
+# Containerised SingleM installation examples
+
+To ensure that the instructions here work, they have been tested in containerised environments. Logs of this procedure are available at https://github.com/wwood/singlem-installation.
diff --git a/docs/README.md b/docs/README.md
@@ -29,76 +29,6 @@ And more specialised / expert modes:
 ## Help
 If you have any questions or comments, raise a [GitHub issue](https://github.com/wwood/singlem/issues) or just send us an [email](https://research.qut.edu.au/cmr/team/ben-woodcroft/).
 
-### Glossary
-
-* **Taxonomic profile** - A tab-separated table containing the estimated abundances of GTDB taxons in a metagenome. It is in TSV format with 3 columns, with each row corresponding to a taxon. A taxonomic profile may also be called a **condensed profile**, since it is the output of the `condense` algorithm within the main `pipe` workflow. Taxonomic profiles can be converted to other formats using `singlem summarise`. Columns:
-  1. sample name. A taxonomic profile can consist of more than one sample. Usually all the taxons in the first sample are listed, and then the taxons in the second sample, and so on.
-  2. coverage of that taxon. This is an approximation of the total read coverage of all genomes from this taxon. However, note that this coverage does not include the coverage of sub-taxons. For instance, the coverage of a species is not included in the coverage shown for its genus.
-  3. taxonomy string of the taxon
-```
-sample	coverage	taxonomy
-ERR1914274	0	Root
-ERR1914274	3.16	Root; d__Bacteria
-ERR1914274	0	Root; d__Bacteria; p__Pseudomonadota
-ERR1914274	0.06	Root; d__Bacteria; p__Pseudomonadota; c__Gammaproteobacteria
-ERR1914274	0	Root; d__Bacteria; p__Bacillota_A
-ERR1914274	0.61	Root; d__Bacteria; p__Bacillota_A; c__Clostridia
-ERR1914274	0	Root; d__Bacteria; p__Bacteroidota
-ERR1914274	0.39	Root; d__Bacteria; p__Bacteroidota; c__Bacteroidia
-ERR1914274	0	Root; d__Bacteria; p__Bacillota
-...
-```
-
-* **OTU table** - A table containing window sequences per metagenome/contig and marker gene. It may be in default form (a TSV with 6 columns, like below), or an extended form with more detail in further columns. The default OTU table output from [pipe](/tools/pipe), [renew](/tools/renew) and [summarise](/tools/summarise) subcommands has 6 columns, with one sequence per row. The extended form OTU table and archive OTU tables have further information (see below). Columns of a default OTU table:
-  1. marker name
-  2. sample name
-  3. sequence of the OTU
-  4. number of reads detected from that OTU
-  5. estimated coverage of a genome from this OTU
-  6. "median" taxonomic classification of each of the reads in the OTU i.e. the most specific taxonomy that 50%+ of the reads agree with.
-```
-gene    sample  sequence        num_hits        coverage        taxonomy
-4.21.ribosomal_protein_S19_rpsS my_sequences  TGGTCGCGCCGTTCGACGGTCACTCCGGACTTCATCGGCCTACAGTTCGCCGTGCACATC    1       1.64    Root; d__Bacteria; p__Proteobacteria; c__Deltaproteobacteria; o__Desulfuromonadales
-4.21.ribosomal_protein_S19_rpsS my_sequences  TGGTCGCGGCGCTCAACCATTCTGCCCGAGTTCGTCGGCCACACCGTGGCCGTTCACAAC    1       1.64    Root; d__Bacteria; p__Acidobacteria; c__Solibacteres; o__Solibacterales; f__Solibacteraceae; g__Candidatus_Solibacter; s__Candidatus_Solibacter_usitatus
-```
-
-* **OTU table (extended form)** The extended OTU table form generated with the `--output-extras` option to the [pipe](/tools/pipe), [renew](/tools/renew) and [summarise](/tools/summarise) subcommands, has all the columns of a regular OTU table, but with several additional columns which contain more information about each OTU:
-  1. read_names - the names of the reads which encode the OTU sequence
-  2. nucleotides_aligned - the number of nucleotides which aligned to the window (usually 60, but can be more or less if there are gaps or inserts)
-  3. taxonomy_by_known? - whether the taxonomy of the OTU was determined by known genomes (TRUE) or by the reads themselves (FALSE). Currently this is a disused column and is always marked FALSE.
-  4. read_unaligned_sequences - the raw sequences of the reads which encode the OTU sequence
-  5. equal_best_hit_taxonomies - the taxonomies of the best hits to the OTU sequence, if there are multiple equally good hits. This is a JSON array of strings.
-
-
-* **Archive OTU table** - Similar to an extended form OTU table, but in JSON form for machine readability and with formatting version recorded. The [renew](/tools/renew) subcommand which re-analyses a dataset requires this format of OTU table rather than the default tab-separated OTU table format. The canonical file extension for SingleM packages is `.json`.
-* **SingleM package (spkg)** - Reference data for one particular marker gene and its window position. The canonical file extension for SingleM packages is `.spkg`.
-* **SingleM metapackage** - A collection of SingleM packages, with additional indices. The canonical file extension for SingleM metapackages is `.smpkg`.
-* **SingleM database** - An OTU table which has been converted to SQLite3 format and sequence similarity search indexes. Canonically SingleM databases are named with the `.sdb` extension, but this is not enforced. SingleM databases are created with the [makedb](/advanced/makedb) subcommand, and queried with the [query](/advanced/query) subcommand.
-
-### FAQ
-#### Can you target the 16S rRNA gene instead of the default set of single copy marker genes with SingleM?
-Yes. By default, SingleM builds OTU tables from protein genes rather than 16S because this in general gives more strain-level resolution due to redundancy in the genetic code. If you are really keen on using 16S, then you can use SingleM with a 16S SingleM package (spkg). There is a [repository of auxiliary packages](https://github.com/wwood/singlem_extra_packages) at which includes a 16S package that is suitable for this purpose. The resolution won't be as high taxonomically, and there are issues around copy number variation, but it could be useful to use 16S for various reasons e.g. linking it to an amplicon study or using the GreenGenes taxonomy. For now there's no 16S spkg that gets installed by default, you have to use the `--singlem-packages` flag in `pipe` mode pointing to a separately downloaded package - see [https://github.com/wwood/singlem_extra_packages](https://github.com/wwood/singlem_extra_packages). Searching for 16S reads is also much slower than searching for protein-encoding reads.
-
-#### How should SingleM be run on multiple samples?
-There are two ways. It is possible to specify multiple input files to the `singlem pipe` subcommand directly by space separating them. Alternatively `singlem pipe` can be run on each sample and OTU tables combined using `singlem summarise`. The results should be identical, though there are some performance trade-offs. For large numbers of samples (>100) it is probably preferable to run each sample individually or in smaller groups.
-
-#### What is the difference between the num_hits and coverage columns in the OTU table generated by the pipe mode?
-`num_hits` is the number of reads found from the sample in that OTU. The
-`coverage` is the expected coverage of a genome with that OTU sequence i.e. the
-average number of bases covering each position in a genome after read mapping.
-This is calculated from `num_hits`. In particular, `num_hits` is the 'kmer
-coverage' formula used by genome assembly programs, and so `coverage` is
-calculated according to the following formula, adapted from the one given in
-the Velvet assembler's
-[manual](https://raw.githubusercontent.com/dzerbino/velvet/master/Manual.pdf):
-
-```
-coverage = num_hits * L / (L - k + 1)
-```
-
-Where `L` is the length of a read and `k` is the length of the OTU sequence including inserts and gaps (usually `60` bp).
-
-
 ## License
 SingleM is developed by the [Woodcroft lab](https://research.qut.edu.au/cmr/team/ben-woodcroft/) at the [Centre for Microbiome Research](https://research.qut.edu.au/cmr), School of Biomedical Sciences, QUT, with contributions from many people, including [Samuel Aroney](https://github.com/AroneyS) and [Rossen Zhao](https://github.com/rzhao-2). It is licensed under [GPL3 or later](https://gnu.org/licenses/gpl.html).
 

diff --git a/docs/advanced/condense.md b/docs/advanced/condense.md
@@ -7,7 +7,9 @@ DESCRIPTION
 ===========
 
 Combine OTU tables across different markers into a single taxonomic
-profile.
+profile. Note that while this mode can be run independently, it is often
+more straightforward to invoke its methodology by specifying -p /
+\--taxonomic-profile when running pipe mode.
 
 OPTIONS
 =======

diff --git a/docs/advanced/create.md b/docs/advanced/create.md
@@ -30,7 +30,10 @@ OPTIONS
 **\--hmm-position** INTEGER
 
   Position in the GraftM alignment HMM where the SingleM window
-    starts. To choose the best position, use \'singlem seqs\'.
+    starts. To choose the best position, use \'singlem seqs\'. Note that
+    this position (both the one output by \'seqs\' and the one specified
+    here) is a 1-based index, but the position is stored within the
+    SingleM package as a 0-based index.
 
 **\--window-size** INTEGER
 

diff --git a/docs/advanced/metapackage.md b/docs/advanced/metapackage.md
@@ -11,6 +11,10 @@ Create or describe a metapackage (i.e. set of SingleM packages)
 OPTIONS
 =======
 
+**\--metapackage** *METAPACKAGE*
+
+  Path to write generated metapackage to
+
 **\--singlem-packages** *SINGLEM_PACKAGES* [*SINGLEM_PACKAGES* \...]
 
   Input packages
@@ -31,9 +35,26 @@ OPTIONS
 
   Skip taxon genome lengths
 
-**\--metapackage** *METAPACKAGE*
+**\--taxonomy-database-name** *TAXONOMY_DATABASE_NAME*
 
-  Path to write generated metapackage to
+  Name of the taxonomy database to use [default:
+    custom_taxonomy_database]
+
+**\--taxonomy-database-version** *TAXONOMY_DATABASE_VERSION*
+
+  Version of the taxonomy database to use [default: unspecified]
+
+**\--diamond-prefilter-performance-parameters** *DIAMOND_PREFILTER_PERFORMANCE_PARAMETERS*
+
+  Performance-type arguments to use when calling \'diamond blastx\'
+    during the prefiltering. [default: \'\--block-size 0.5
+    \--target-indexed -c1\']
+
+**\--diamond-taxonomy-assignment-performance-parameters** *DIAMOND_TAXONOMY_ASSIGNMENT_PERFORMANCE_PARAMETERS*
+
+  Performance-type arguments to use when calling \'diamond blastx\'
+    during the taxonomy assignment. [default: \'\--block-size 0.5
+    \--target-indexed -c1\']
 
 **\--describe**
 
@@ -52,6 +73,11 @@ OPTIONS
   Dereplicated DIAMOND db for prefilter to use [default: dereplicate
     from input SingleM packages]
 
+**\--makeidx-sensitivity-params** PARAMS
+
+  DIAMOND sensitivity parameters to use when indexing the prefilter
+    DIAMOND db. [default: None]
+
 OTHER GENERAL OPTIONS
 =====================
 

diff --git a/docs/advanced/seqs.md b/docs/advanced/seqs.md
@@ -28,6 +28,11 @@ OPTIONS
 
   Number of nucleotides to use in continuous window [default: 60]
 
+**\--hmm** *HMM*
+
+  HMM file used to generate alignment, used here to rank windows
+    according to their information content.
+
 OTHER GENERAL OPTIONS
 =====================