Merge branch 'master' into dependabot/github_actions/jlumbroso/free-d…

…isk-space-1.3.1
IKIM-Essen · Apr 9, 2024 · 3c0250b
2 parents 511d4fd + 20a8034
commit 3c0250b
Showing 15 changed files with 532 additions and 179 deletions.
diff --git a/.github/workflows/check-todos.yml b/.github/workflows/check-todos.yml
@@ -11,4 +11,4 @@ jobs:
     steps:
       - uses: "actions/checkout@master"
       - name: "TODO to Issue"
-        uses: "alstr/todo-to-issue-action@v4.12"
+        uses: "alstr/todo-to-issue-action@v4.13"
diff --git a/docs/about.md b/docs/about.md
diff --git a/docs/assets/Report_panel.png b/docs/assets/Report_panel.png
diff --git a/docs/assets/UnCoVar_favicon.png b/docs/assets/UnCoVar_favicon.png
diff --git a/docs/assets/UnCoVar_virus_white_no_shadow.png b/docs/assets/UnCoVar_virus_white_no_shadow.png
diff --git a/docs/assets/UnCoVar_wf_new.png b/docs/assets/UnCoVar_wf_new.png
diff --git a/docs/configuration.md b/docs/configuration.md
@@ -1,8 +1,13 @@
-# Configuration
+# Advanced Configuration
+
+The config file, found under `config/config.yaml`, can be used to adapt your analysis.
 
 ## Execution Mode
 
-Accepted values: `patient`, `environment`. Defaults to `patient`.
+```yaml
+# execution mode. Can be either "patient" or "environment"
+mode: environment
+```
 
 Defines the execution mode of UnCoVar.
 
@@ -15,73 +20,124 @@ environment (e.g. wastewater) and to contain different SARS-CoV-2 strains.
 The parts of the workflow responsible for creating and analysing individual
 genomes (e.g. assembly, lineage calling via Pangolin) are disabled.
 
-## Adapters
+## Sending lab number
 
-There are three ways to transfer adapter sequences to UnCoVar to remove them
-from the raw data.
+UnCoVar automatically generates a multi-FASTA file and a corresponding `.csv` for
+ all samples with a `1` flag for `include_in_high_genome_summary` in the sample sheet
+ that match the given `quality-criteria` (see below). The reporting format and the
+ quality criteria are inspired by the [requirements for SARS-CoV-2 genome submission](https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/DESH/Qualitaetskriterien.pdf?__blob=publicationFile)
+ to the [Robert Koch Institute, Germany](https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/nCoV.html).
+ The sending lab number will be included in the `.csv` file.
 
-### Config File
+## Data handling
 
-The adapter sequences used can be specified in the config file under
-`preprocessing` -> `kit adapters`.
+With the root of the UnCoVar workflow as working directory, we recommend
+ using the following folder structure:
 
-For **paired-end data**, the adapters can be detected by per-read overlap
-analysis, which seeks the overlap for each pair of reads. The adapter sequences
-can be specified for read one by `—adapter_sequence` and for
-read two by`—adapter_sequence_r2`. An example for [Illuminas TruSeq library] (<https://www.illumina.com/products/by-type/sequencing-kits/library-prep-kits/truseq-rna-v2.html>)
-is shown below:
-
-```yaml
-"--adapter_sequence = AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
---adapter_sequence_r2 = AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT”
+```text
+├── archive
+├── incoming
+└── uncovar
+    └── data
+        └── 2023-12-24
 ```
 
-Adapters for **single-end data** can be specified only using the
-`—adapter_sequence` option.
+The structure can be adjusted via the config under `data-handling`:
 
 ```yaml
-"--adapter_sequence = AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
+data-handling:
+  # flag for using the following data-handling structure
+  # True: data-handling structure is used as shown below
+  # False: only the sample sheet needs to be updated (manually)
+  use-data-handling: True
+  # flag for archiving data
+  # True: data is archived in path defined below
+  # False: data is not archived
+  archive-data: False
+  # path of incoming data, which is moved to the
+  # data directory by the preprocessing script
+  incoming: ../incoming/
+  # path to store data within the workflow
+  data: data/
+  # path to archive data from incoming and
+  # the results from the latest run to
+  archive: ../archive/
 ```
 
-### Sample Sheet
-
-The second way to remove adapter sequences is to specify the adapter sequence
-per sample in the sample sheet. The adapters must be entered in a column
-called `adapters`. For paired-end and single-end format, see above. Here is
-an exemplary samples sheet:
-
-| sample_name | fq1         | fq2         | date       | is_amplicon_data | technology | adapters                                           |
-| ----------- | ----------- | ----------- | ---------- | ---------------- | ---------- | -------------------------------------------------- |
-| example-1   | PATH/TO/fq1 | PATH/TO/fq2 | 1970-01-01 | 1                | illumina   | --adapter_sequence=ACGT --adapter_sequence_r2=TGCA |
-| example-2   | PATH/TO/fq  |             | 1970-01-01 | 1                | ion        | --adapter_sequence=ACGT                            |
-
-If an adapter sequence is specified for a sample in the sample sheet, this
-adapter sequence is used to trim the sequences of only this sample. For
-empty entries, UnCoVar uses the adapter sequence from the config file.
-
-### Pre-Defined Adapters
-
-UnCoVar supports two different sequencing kits and their respective adapters,
-namely:
+## Quality criteria
 
-1. [Revelo RNA-Seq library preparation kit](https://lifesciences.tecan.com/revelo-rna-seq-library-prep-kit?p=tab--5)
-1. [EasySeq RC-PCR SARS CoV-2 Whole Genome Sequencing kit](https://www.nimagen.com/shop/products/rc-cov096/easyseq-sars-cov-2-novel-coronavirus-whole-genome-sequencing-kit)
+The quality criteria can be adjusted to your individual needs. By default, they match
+ the quality criteria required for submission to the RKI (see **Sending lab number**
+ above).
 
-The `adapters` column in the sample sheet is used to trim the adapter sequences
-of these kits. Revelo adapters are trimmed by specifying
-`revelo-rna-seq` in the column per sample, while the Nimagen adapters are
-removed by specifying `nimagen-easy-seq`. A short example:
+```yaml
+quality-criteria:
+  illumina:
+    # minimal length of acceptable reads
+    min-length-reads: 30
+    # average quality of acceptable reads (PHRED)
+    min-PHRED: 20
+  ont:
+    # minimal length of acceptable reads
+    min-length-reads: 200
+    # average quality of acceptable reads (PHRED)
+    min-PHRED: 10
+  # identity to virus reference genome (see above) of reconstructed sequence
+  min-identity: 0.9
+  # share of N in the reconstructed sequence
+  max-n: 0.05
+  # minimum local sequencing depth without filtering of PCR duplicates
+  min-depth-with-PCR-duplicates: 20
+  # minimum local sequencing depth after filtering PCR duplicates
+  min-depth-without-PCR-duplicates: 10
+  # minimum informative allele frequency
+  min-allele: 0.9
+```
 
-| sample_name | fq1         | fq2         | date       | is_amplicon_data | technology | adapters         |
-| ----------- | ----------- | ----------- | ---------- | ---------------- | ---------- | ---------------- |
-| example-1   | PATH/TO/fq1 | PATH/TO/fq2 | 1970-01-01 | 0                | illumina   | revelo-rna-seq   |
-| example-2   | PATH/TO/fq  | PATH/TO/fq2 | 1970-01-01 | 1                | illumina   | nimagen-easy-seq |
+## Preprocessing
 
-### Customized Primer Removal
+Here, different preprocessing options can be adjusted. By default, the standard Illumina adapters
+ are trimmed. For samples prepared with an amplicon sequencing approach, you can
+ define the path to the primer file in `.bed` format. If you are processing Nanopore
+ samples, you can also define the primer version by changing the `artic-primer-version` number.
 
 The default primer file is a bed file from the [ARTIC network](https://github.com/artic-network/artic-ncov2019/tree/master/primer_schemes/nCoV-2019/V3).
 However, the primers for clipping can be customized. First, the custom primers must
 be saved in bed format. Next, the path to this file must be changed in the config.
 Go to the config folder and open config.yaml. In the "preprocessing" subcategory,
 change the path after "amplicon-primers" to the path where your primer file
 can be found.
+
+```yaml
+preprocessing:
+  # only for *non* Oxford Nanopore data. Adapters to trim.
+  # see: https://www.nimagen.com/shop/products/rc-cov096/easyseq-sars-cov-2-novel-coronavirus-whole-genome-sequencing-kit
+  kit-adapters: "--adapter_sequence GCGAATTTCGACGATCGTTGCATTAACTCGCGAA --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
+  # only for Oxford Nanopore data.
+  # ARTIC primer version to clip from reads. See
+  # https://github.com/artic-network/artic-ncov2019/tree/master/primer_schemes/nCoV-2019/V4
+  # for more information
+  artic-primer-version: 4
+  # path to amplicon primers in bed format for hard-clipping on paired end files (illumina) or url to file that should be downloaded
+  amplicon-primers: "resources/SARS-CoV-2-artic-v4_1.primer.bed"
+  # GenBank accession of reference sequence of the amplicon primers
+  amplicon-reference: "MN908947"
+```
+
+## Assembly
+
+In this section you define which assembler you want to use for the genome reconstruction.
+ UnCoVar uses MEGAHIT and metaSPAdes by default, as those achieved the best results
+ in a benchmarking comparison. The assembly options can be changed independently.
+
+There are several other options available:
+
+- megahit-std
+- megahit-meta-large
+- megahit-meta-sensitive
+- trinity
+- velvet
+- metaspades
+- coronaspades
+- spades
+- rnaviralspades
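Note on the `quality-criteria` block added to `docs/configuration.md` above: as a rough illustration of what two of its thresholds mean, the following Python sketch checks a reconstructed sequence against `max-n` and `min-identity`. It is not UnCoVar code. The function names, the hard-coded dictionary, and the choice to treat both thresholds as inclusive are assumptions made for this example only.

```python
# Hypothetical helper, NOT part of UnCoVar: illustrates two thresholds
# from the `quality-criteria` config block (values copied from the docs).
QUALITY_CRITERIA = {"min-identity": 0.9, "max-n": 0.05}


def share_of_n(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are N (case-insensitive)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sequence.upper().count("N") / len(sequence)


def passes_quality(sequence: str, identity: float, criteria: dict = QUALITY_CRITERIA) -> bool:
    """Return True if the N share and the reference identity meet the thresholds.

    Assumption: both thresholds are inclusive (<= max-n, >= min-identity).
    """
    n_ok = share_of_n(sequence) <= criteria["max-n"]
    identity_ok = identity >= criteria["min-identity"]
    return n_ok and identity_ok
```

For example, `passes_quality("N" * 10 + "A" * 90, identity=0.95)` is rejected because 10% of the sequence is N, which exceeds the default `max-n` of 0.05.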
diff --git a/docs/faq.md b/docs/faq.md
diff --git a/docs/index.md b/docs/index.md
@@ -1,30 +1,38 @@
-# UnCoVar -- an open, extensible framework for virus genome analysis
+<h1>
+Workflow for Transparent and Robust Virus Variant Calling, Genome Reconstruction
+ and Lineage Assignment
+</h1>
 
-<picture>
-  <source media="(prefers-color-scheme: dark)" srcset="https://github.com/IKIM-Essen/uncovar/assets/77535027/8e17c6fc-ff7a-4c25-afc9-7888036d693e">
-  <source media="(prefers-color-scheme: light)" srcset="https://github.com/IKIM-Essen/uncovar/assets/77535027/c99f5a94-749b-422e-b319-1e3700d40a8e">
-  <img alt="UnCoVar Logo dark/light">
-</picture>
-
-[![Snakemake](https://img.shields.io/badge/snakemake-≥6.3.0-brightgreen.svg)](https://snakemake.bitbucket.io)
+[![Snakemake](https://img.shields.io/badge/snakemake-≥7.32.4-brightgreen.svg)](https://snakemake.bitbucket.io)
 [![GitHub actions status](https://github.com/koesterlab/snakemake-workflow-sars-cov2/workflows/Tests/badge.svg?branch=master)](https://github.com/koesterlab/snakemake-workflow-sars-cov2/actions?query=branch%3Amaster+workflow%3ATests)
-[![Docker Repository on Quay](https://quay.io/repository/uncovar/uncovar/status "Docker Repository on Quay")](https://quay.io/repository/uncovar/uncovar)
 
-A Reproducible and Scalable Workflow for Transparent and Robust Virus Variant Calling and Lineage Assignment using SARS-CoV-2 as an example.
+## Workflow Overview
 
-- Using state of the art tools, easily extended for other viruses
+<img src="./assets/UnCoVar_wf_new.png" alt="UnCoVar workflow" width="90%"/>
 
-![UnCoVar tools](./assets/tools.png)
+## Highlights
 
-- Tools and database updates for critical components via Conda
+- Using state of the art tools, easily extended for other viruses
 
-- Built using modern design patterns with Conda and SnakeMake
+- Tool and database updates for critical components via Conda
+
+- Built using modern design patterns with Conda and Snakemake
 
 - Extensible and easy to customize
 
+- Submission Ready Genomes
+
 - Customizable reporting with comprehensive visualization
 
-![UnCoVar visuals](./assets/uncovar-displays.png)
+![UnCoVar visuals](./assets/Report_panel.png)
 
-- Submission Ready Genomes
+Four different example elements of the results generated by UnCoVar:
+
+- a: The genome coverage of the aligned reads, visualized for multiple samples
+
+- b: An evaluation of known protein alterations from VOCs for one sample
+
+- c: A pileup of reads at the position of one protein alteration, showing the
+ mutations observed for multiple reads (grey bars) for a single sample, here in the S gene
 
+- d: The lineage assignments inferred for single reads for one sample