Merge pull request #5 from phac-nml/update/clair3_params
Update Clair3 Params to be less strict
DarianHole authored Jan 20, 2025
2 parents 69dd02f + 3c5566d commit 1969556
Showing 24 changed files with 291 additions and 138 deletions.
12 changes: 6 additions & 6 deletions .github/test-data/nanopore/input.csv
@@ -1,6 +1,6 @@
sample,reads
sample1,.github/test-data/nanopore/fastq_pass/barcode01
sample2,.github/test-data/nanopore/fastq_pass/barcode02
100_reads,.github/test-data/nanopore/fastq_pass/barcode03
25_reads,.github/test-data/nanopore/fastq_pass/barcode24
negative-ctrl,.github/test-data/nanopore/fastq_pass/barcode94
sample,fastq_1
sample1,.github/test-data/nanopore/fastq_pass/barcode01/barcode01.fastq.gz
sample2,.github/test-data/nanopore/fastq_pass/barcode02/barcode02.fastq.gz
100_reads,.github/test-data/nanopore/fastq_pass/barcode03/barcode03.fastq.gz
25_reads,.github/test-data/nanopore/fastq_pass/barcode24/barcode24.fastq.gz
negative-ctrl,.github/test-data/nanopore/fastq_pass/barcode94/barcode94.fastq.gz
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,23 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v1.1.0 - [Unreleased]

### `Added`
- Input schema JSON and validation
- FORMAT_INPUT workflow
- Handles the input data now
- `[email protected]` plugin

### `Changed`
- `--input SAMPLESHEET_CSV` header
- Went from `reads` with path to barcode directories to `fastq_1` with path to fastq files
- Fixed bug so that SNPEff will now work with given gff files
- Issue was typo related in the build module
- Fixed bug with `calc_bam_variation` caused by genome case
- Log and error statements
- Fixed the cache directory statements

## v1.0.0 - [2024-03-22]

Initial release of `phac-nml/viralassembly`, created from combining the [nf-core](https://nf-co.re/) template with the artic steps.
14 changes: 7 additions & 7 deletions README.md
@@ -11,7 +11,7 @@ Some of the goals of this pipeline are:
2. Allow the pipeline to be used on other viruses with or without amplicon schemes
- Due to the QC steps there is unfortunately a current limitation at working with segmented viruses
- The pipeline will automatically exit after assembly and not generate QC and Reports for these at this time
- This will be fully implemented at some point
- This will hopefully be fully implemented at some point in the future
3. Provide `Run` level and `Sample` level final reports

## Index
@@ -32,7 +32,7 @@ Some of the goals of this pipeline are:
- Conda command: `conda create -n nextflow -c conda-forge -c bioconda nextflow`
2. Install with the instructions at https://www.nextflow.io/

2. Run the pipeline with a profile to handle dependencies:
2. Run the pipeline with one of the following profiles to handle dependencies (or use your own profile if you have one!):
- `conda`
- `mamba`
- `singularity`
@@ -43,8 +43,8 @@ Simple commands to run input data. Input data can be done in three different ways
1. Passing `--fastq_pass </PATH/TO/fastq_pass>` where `fastq_pass` is a directory containing `barcode##` subdirectories with fastq files
2. Passing `--fastq_pass </PATH/TO/fastqs>` where `fastqs` is a directory containing `.fastq*` files
3. Passing `--input <samplesheet.csv>` where `samplesheet.csv` is a CSV file with two columns
1. `sample` - The name of the sample
2. `reads` - Path to a directory containing reads for the sample in `.fastq*` format
1. `sample` - The name of the sample
2. `fastq_1` - Path to one fastq file per sample in `.fastq*` format

The basic examples will show how to run the pipeline using the `--fastq_pass` input but it could be subbed in for the `--input` CSV file if wanted.

@@ -103,7 +103,7 @@ nextflow run /PATH/TO/artic-generic-nf/main.nf \
Medaka model information [can be found here](https://github.com/nanoporetech/medaka#models)

### Nanopore - Nanopolish
Running the pipeline with [nanopolish](https://github.com/jts/nanopolish) for variant calls requires fastq files, fast5 files, and the sequencing summary file. When running, the pipeline will look for subdirectories off of the input directory called `barcode##` to be used in the pipeline.
Running the pipeline with [nanopolish](https://github.com/jts/nanopolish) for variant calls requires fastq files, fast5 files, and the sequencing summary file instead of providing a model. As such, nanopolish requires that the read ids in the fastq files are linked by the sequencing summary file to their signal-level data in the fast5 files. This makes it **a lot** easier to run using barcoded directories, but it can be run with individual read files.

See the [nanopolish section](./docs/usage.md#nanopolish) of the usage docs for more information

@@ -173,8 +173,8 @@ Contributions are welcome through creating PRs or Issues
## Legal
Copyright 2023 Government of Canada

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0
https://opensource.org/license/mit/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
25 changes: 25 additions & 0 deletions assets/schema_input.json
@@ -0,0 +1,25 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/nf-schema/example/master/assets/schema_input.json",
"title": "nf-schema example - params.input schema",
"description": "Schema for the file provided with params.input",
"type": "array",
"items": {
"type": "object",
"properties": {
"sample": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample name must be provided and cannot contain spaces",
"meta": ["id"]
},
"fastq_1": {
"type": "string",
"pattern": "^\\S+\\.f(ast)?q(\\.gz)?$",
"format": "file-path",
"errorMessage": "FastQ file for reads must be provided, cannot contain spaces, and must have extension '.fq(.gz)' or '.fastq(.gz)'"
}
},
"required": ["sample", "fastq_1"]
}
}
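The new schema enforces two regex patterns on each samplesheet row. As a rough illustration (not part of the pipeline itself), the same checks can be reproduced with Python's `re` module; `validate_row` is a hypothetical helper:

```python
import re

# Patterns copied from assets/schema_input.json
SAMPLE_RE = re.compile(r"^\S+$")
FASTQ_RE = re.compile(r"^\S+\.f(ast)?q(\.gz)?$")

def validate_row(row: dict) -> list:
    """Return a list of error messages for one samplesheet row (sketch only)."""
    errors = []
    if not SAMPLE_RE.match(row.get("sample", "")):
        errors.append("Sample name must be provided and cannot contain spaces")
    if not FASTQ_RE.match(row.get("fastq_1", "")):
        errors.append("FastQ file must have extension '.fq(.gz)' or '.fastq(.gz)' and contain no spaces")
    return errors

print(validate_row({"sample": "sample1", "fastq_1": "barcode01.fastq.gz"}))  # prints []
```

Note how the old `reads` column (a barcode directory) would now fail the `fastq_1` pattern, which is the point of the header change.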
2 changes: 2 additions & 0 deletions bin/calc_bam_variation.py
@@ -86,6 +86,8 @@ def create_ref_dict(ref_fasta) -> dict:
ref_dict = {}
ref_fasta = pysam.FastaFile(ref_fasta)
for idx, base in enumerate(ref_fasta.fetch(ref_fasta.references[0])):
# Allows lowercase and uppercase refs, fixes small bug
base = base.upper()
ref_dict[idx] = base

return ref_dict
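The two added lines normalize each reference base to uppercase before it is stored. A minimal sketch of the same idea without `pysam`, assuming the reference sequence is already in hand as a plain string:

```python
def create_ref_dict(seq: str) -> dict:
    """Map 0-based position -> uppercase reference base.

    Soft-masked references use lowercase (acgt); comparing those against
    uppercase read bases without normalizing would flag every masked
    position as a mismatch, which is the small bug this change fixes.
    """
    return {idx: base.upper() for idx, base in enumerate(seq)}

print(create_ref_dict("acGT"))  # prints {0: 'A', 1: 'C', 2: 'G', 3: 'T'}
```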
8 changes: 6 additions & 2 deletions bin/cs_vcf_filter.py
@@ -72,8 +72,12 @@ def check_filter(self, v):
# These seem to not be being pulled out though so check if the qual is none to fail them
if qual == None:
return False
# Temp filter of qual 15, will need to re-evaluate with more data
if qual < 15:
# 2 is the default for clair3 so bump slightly up to 3
if qual < 3:
return False

# Allele fraction >= 0.75 required
if len(v.samples) != 1 or v.samples[0].data.AF < 0.75:
return False

if self.no_frameshifts and not in_frame(v):
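The relaxed thresholds can be summarized without the surrounding VCF objects. A hedged sketch of the site-level logic after this change (`passes_basic_filters` is a hypothetical helper, not the module's actual interface):

```python
def passes_basic_filters(qual, allele_fractions) -> bool:
    """Sketch of the relaxed Clair3 site filter from this change.

    qual: variant QUAL (None means missing); allele_fractions: per-sample AF
    values, expected to hold exactly one sample. Thresholds mirror the diff:
    QUAL >= 3 (Clair3's own default cutoff is 2, the old temp filter was 15)
    and allele fraction >= 0.75.
    """
    if qual is None or qual < 3:
        return False
    if len(allele_fractions) != 1 or allele_fractions[0] < 0.75:
        return False
    return True

print(passes_basic_filters(14.2, [0.9]))  # prints True; would have failed the old QUAL >= 15 cutoff
```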
8 changes: 5 additions & 3 deletions conf/base.config
@@ -26,12 +26,12 @@ process {
withLabel: process_single {
cpus = { check_max( 1 , 'cpus' ) }
memory = { check_max( 4.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
time = { check_max( 6.h * task.attempt, 'time' ) }
}
withLabel: process_low {
cpus = { check_max( 2 * task.attempt, 'cpus' ) }
memory = { check_max( 8.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
time = { check_max( 6.h * task.attempt, 'time' ) }
}
withLabel: process_medium {
cpus = { check_max( 4 * task.attempt, 'cpus' ) }
@@ -47,7 +47,9 @@
time = { check_max( 20.h * task.attempt, 'time' ) }
}
withLabel: process_high_memory {
memory = { check_max( 90.GB * task.attempt, 'memory' ) }
cpus = { check_max( 2 * task.attempt, 'cpus' ) }
memory = { check_max( 32.GB * task.attempt, 'memory' ) }
time = { check_max( 8.h * task.attempt, 'time' ) }
}
withLabel: error_ignore {
errorStrategy = 'ignore'
6 changes: 4 additions & 2 deletions conf/nml.config
@@ -13,16 +13,18 @@ params {
max_retries = 3
max_jobs = 100
}
executor {
name = 'slurm'
queueSize = params.max_jobs
}
process {
// Base process
executor = "slurm"
queue = "${params.partition}"

// Allow up to given retries per process - error strat is same as nf-core base config
errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
maxRetries = params.max_retries

// Don't spam the cluster
queueSize = params.max_jobs
submitRateLimit = "10sec"
}
22 changes: 11 additions & 11 deletions docs/example_files/inputs/input.csv
@@ -1,11 +1,11 @@
sample,reads
negative-ctrl-1,barcode11
run-ntc-1,barcode94
pos-ctrl-1,barcode95
cov-1,barcode5
cov-2,barcode01
cov-3,barcode12
cov-4,barcode19
cov-5,barcode40
cov-6,barcode61
cov-7,barcode75
sample,fastq_1
negative-ctrl-1,fastqs/negative_ctrl.fastq
run-ntc-1,fastqs/ntc-1.fastq
pos-ctrl-1,fastqs/pos-cov-1.fastq
cov-1,fastqs/sample1.fastq
cov-2,fastqs/sample2.fastq
cov-3,fastqs/sample3.fastq
cov-4,fastqs/sample4.fastq
cov-5,fastqs/sample5.fastq
cov-6,fastqs/sample6.fastq
cov-7,fastqs/sample7.fastq
33 changes: 18 additions & 15 deletions docs/usage.md
@@ -1,15 +1,15 @@
# phac-nml/viralassembly: Usage

## Introduction
This pipeline is intended to be run on either Nanopore Amplicon Sequencing data or Basic Nanopore NGS Sequencing data that can utilize a reference genome for mapping variant calling, and other downstream analyses. It generates variant calls, consensus sequences, and quality control information based on the reference. To do this, there are three different variant callers that can be utilized which includes: `clair3`, `medaka`, and `nanopolish` (For R9.4.1 flowcells and below only).
This pipeline is intended to be run on either Nanopore Amplicon Sequencing data or Basic Nanopore NGS Sequencing data that can utilize a reference genome for mapping, variant calling, and other downstream analyses. It generates variant calls, consensus sequences, and quality control information based on the reference. To do this, there are three different variant callers that can be utilized: `clair3`, `medaka`, and `nanopolish` (which is for R9.4.1 flowcells and below only!).

For Amplicon Sequencing data it is at minimum required to:
1. Specify a path to the reads/input file
2. Specify the scheme name
3. Specify the scheme version
4. Pick a variant caller and caller model

For Basic NGS sequencing data it is required to:
For Basic NGS Sequencing data it is at minimum required to:
1. Specify a path to the reads/input file
2. Specify a path to the reference genome
3. Pick a variant caller and caller model
@@ -37,7 +37,7 @@ For Basic NGS sequencing data it is required to:
- [Core Nextflow Arguments](#core-nextflow-arguments)

## Profiles
Profiles are used to specify dependency installation, resources, and how to handle pipeline jobs. You can specify more than one profile but avoid passing in more than one dependency managment profiles. They can be passed with `-profile <PROFILE>`
Profiles are used to specify dependency installation, resources, and how to handle pipeline jobs. You can specify more than one profile but *avoid* passing in more than one dependency management profile. They can be passed with `-profile <PROFILE>`

Available:
- `conda`: Utilize conda to install dependencies and environment management
@@ -49,7 +49,7 @@ Available:
Two options for fastq data input: `--fastq_pass <FASTQ_PASS/>` or `--input <INPUT.csv>`

### Fastq Pass Directory (--fastq_pass)
Specify fastq data to input based on a given directory. The directory can either be barcoded, as would be seen after demultiplexing, or it could be a flat input of fastq files. The barcoded fastq data will be output with the barcode number but can be renamed with a [metadata csv]() input. The flat fastq files will keep their basename (separated out at the first `.`).
Specify fastq data to input based on a given directory. The directory can either contain barcoded directories (barcodexx), as would be seen after demultiplexing, or it could contain sample fastq files (one fastq per sample). The barcoded fastq data will be output with the barcode number but can be renamed with a [metadata tsv](#metadata) file input. The flat fastq files will keep their basename (separated out at the first `.`). Example:

Barcoded:
```
(barcode## subdirectory tree elided in this capture)
```

@@ -75,18 +75,20 @@

Flat:
```
(per-sample fastq files elided in this capture)
```
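A rough sketch of how the two layouts described above could be distinguished (hypothetical helper, not the pipeline's actual code):

```python
from pathlib import Path

def collect_fastq_inputs(fastq_pass: str) -> dict:
    """Map sample name -> list of fastq paths for the two --fastq_pass layouts.

    barcode## subdirectories keep the barcode number as the sample name;
    flat per-sample files use the basename up to the first '.'.
    """
    root = Path(fastq_pass)
    samples = {}
    for barcode_dir in sorted(root.glob("barcode*")):
        if barcode_dir.is_dir():
            samples[barcode_dir.name] = sorted(barcode_dir.glob("*.fastq*"))
    if not samples:  # fall back to a flat directory of per-sample fastqs
        for fq in sorted(root.glob("*.fastq*")):
            samples[fq.name.split(".")[0]] = [fq]
    return samples
```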

### Input CSV (--input)
You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to pass in an input CSV file containing 2 columns, `sample`, and `reads` where:
You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to pass in an input CSV file containing 2 columns, `sample`, and `fastq_1` where:
- `sample` is the sample name to use
- `reads` is the path to the barcode directory containing reads
- `fastq_1` is the path to one fastq file per sample in `.fastq*` format

Ex.
| sample | reads |
| sample | fastq_1 |
| - | - |
| sample1 | /path/to/barcode01 |
| ntc | /path/to/barcode02 |
| pos | /path/to/barcode03 |
| sample1 | /path/to/sample.fastq |
| sample2 | /path/to/sample2-1.fastq |
| sample2 | /path/to/sample2-2.fastq |
| ntc | /path/to/control.fastq |
| pos | /path/to/pos.fastq |

This will be expanded upon in future releases to allow more varied inputs for the input sheet.
A sample can be given multiple fastq files if it was resequenced or needed a top up run. If there are multiple fastq files for a sample they will be concatenated and gzipped. If not, the input fastq file will just be gzipped (if it isn't already).
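As a sketch of that behaviour (hypothetical helpers, not the pipeline's implementation), grouping samplesheet rows by sample and merging each group into one gzipped file could look like:

```python
import csv
import gzip
import shutil
from collections import defaultdict
from pathlib import Path

def group_samplesheet(samplesheet: str) -> dict:
    """Group fastq_1 paths by sample name (one samplesheet row per file)."""
    groups = defaultdict(list)
    with open(samplesheet) as handle:
        for row in csv.DictReader(handle):
            groups[row["sample"]].append(Path(row["fastq_1"]))
    return dict(groups)

def merge_sample(fastqs, out_path) -> None:
    """Concatenate one sample's fastqs into a single gzipped file.

    Already-gzipped inputs are decompressed first so the output is one
    clean gzip stream; plain fastqs are copied through as-is.
    """
    with gzip.open(out_path, "wb") as out:
        for fq in fastqs:
            opener = gzip.open if fq.suffix == ".gz" else open
            with opener(fq, "rb") as handle:
                shutil.copyfileobj(handle, out)
```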

## Variant Callers
Three different variant callers are available with slightly different options regarding running with them. For the most accurate results when running with `clair3` or `medaka` pick a model that best matches the input data!!
@@ -116,19 +118,19 @@ And has the optional parameters of:
Medaka models come built in with the tool itself with the default set to `r941_min_hac_g507` which can be changed with `--medaka_model <MODEL>` parameter shown above. More information on models [can be found here](https://github.com/nanoporetech/medaka#models). Remember to pick a model that best represents the data!

### [Nanopolish](https://github.com/jts/nanopolish)
Nanopolish is a software package for signal-level analysis of Oxford Nanopore sequencing data. It does not presently support the R10.4 flowcells so as a variant caller it should only be used with R9.4 flowcells. It also requires that the fastq data is in barcoded directories to work correctly.
Nanopolish is a software package for signal-level analysis of Oxford Nanopore sequencing data. It *does not presently support the R10.4 flowcells* so as a variant caller it should only be used with R9.4 flowcells.

Running with `nanopolish` requires the following parameters:
- `--variant_caller nanopolish`
- `--fast5_pass <FAST5_PASS/>`
- `--sequencing_summary <SEQ_SUM.txt>`

Nanopolish requires the fast5 directory along with the sequencing summary file to be used as input instead of a model.
Nanopolish requires the fast5 directory along with the sequencing summary file to be used as input instead of a model. As such, nanopolish requires that the read ids in the fastq files are linked by the sequencing summary file to their signal-level data in the fast5 files. This makes it **a lot** easier to run using barcoded directories, but it can be done with individual read files.
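A quick way to sanity-check that fastq-to-fast5 link before running (hypothetical helper; assumes the summary is a tab-separated file with a `read_id` column, which is the standard ONT layout):

```python
import csv

def missing_read_ids(fastq_read_ids, sequencing_summary_path):
    """Return fastq read ids that have no row in the sequencing summary.

    Any id returned here would have no path back to its signal-level fast5
    data, which is what nanopolish needs to call variants.
    """
    with open(sequencing_summary_path) as handle:
        summarized = {row["read_id"] for row in csv.DictReader(handle, delimiter="\t")}
    return sorted(set(fastq_read_ids) - summarized)
```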

## Running the pipeline

### Amplicon
The typical command for running the pipeline with an amplicon scheme using medaka and a different medaka model is as follows:
The typical command for running the pipeline with an [amplicon scheme](#schemes-and-reference) using medaka and a different medaka model is as follows:

```bash
nextflow run phac-nml/viralassembly \
@@ -223,12 +225,13 @@ Use `--version` to see version information
### All Parameters
| Parameter | Description | Type | Default | Notes |
| - | - | - | - | - |
| --fastq_pass | Path to directory containing `barcode##` subdirectories OR Path to directory containing `*.fastq*` files | Path | null | [Option for input params](#input-parameters) |
| --input | Path to samplesheet with information about the samples you would like to analyse | Path | null | [Option for input params](#input-parameters) |
| --variant_caller | Pick from the 3 variant callers: 'clair3', 'medaka', 'nanopolish' | Choice | '' | Details above |
| --clair3_model | Clair3 base model to be used in the pipeline | Str | 'r941_prom_sup_g5014' | Default model will not work the best for all inputs. [See clair3 docs](https://github.com/HKU-BAL/Clair3#pre-trained-models) for additional info |
| --clair3_user_variant_model | Path to clair3 additional model directory to use instead of a base model | Path | '' | Default model will not work the best for all inputs. [See clair3 docs](https://github.com/HKU-BAL/Clair3#pre-trained-models) for additional info |
| --clair3_no_pool_split | Do not split reads into separate pools | Bool | False | Clair3 amplicon sequencing only |
| --medaka_model | Medaka model to be used in the pipeline | Str | 'r941_min_hac_g507' | Default model will not work the best for all inputs. [See medaka docs](https://github.com/nanoporetech/medaka#models) for additional info |
| --fastq_pass | Path to directory containing `barcode##` subdirectories | Path | null | |
| --fast5_pass | Path to directory containing `barcode##` fast5 subdirectories | Path | null | Only for nanopolish |
| --sequencing_summary | Path to run `sequencing_summary*.txt` file | Path | null | Only for nanopolish |
| --min_length | Minimum read length to be kept | Int | 200 | For artic guppyplex |
4 changes: 2 additions & 2 deletions lib/NfcoreSchema.groovy
@@ -117,7 +117,7 @@ class NfcoreSchema {
for (specifiedParam in params.keySet()) {
// nextflow params
if (nf_params.contains(specifiedParam)) {
log.error "ERROR: You used a core Nextflow option with two hyphens: '--${specifiedParam}'. Please resubmit with '-${specifiedParam}'"
log.error "You used a core Nextflow option with two hyphens: '--${specifiedParam}'. Please resubmit with '-${specifiedParam}'"
has_error = true
}
// unexpected params
@@ -156,7 +156,7 @@
schema.validate(params_json)
} catch (ValidationException e) {
println ''
log.error 'ERROR: Validation of pipeline parameters failed!'
log.error 'Validation of pipeline parameters failed!'
JSONObject exceptionJSON = e.toJSON()
printExceptions(exceptionJSON, params_json, log, enums)
println ''