Merge pull request #5 from phac-nml/update/clair3_params
Update Clair3 Params to be less strict
DarianHole authored Jan 20, 2025
2 parents 69dd02f + 3c5566d commit 1969556
Showing 24 changed files with 291 additions and 138 deletions.
12 changes: 6 additions & 6 deletions .github/test-data/nanopore/input.csv
@@ -1,6 +1,6 @@
sample,reads
sample1,.github/test-data/nanopore/fastq_pass/barcode01
sample2,.github/test-data/nanopore/fastq_pass/barcode02
100_reads,.github/test-data/nanopore/fastq_pass/barcode03
25_reads,.github/test-data/nanopore/fastq_pass/barcode24
negative-ctrl,.github/test-data/nanopore/fastq_pass/barcode94
sample,fastq_1
sample1,.github/test-data/nanopore/fastq_pass/barcode01/barcode01.fastq.gz
sample2,.github/test-data/nanopore/fastq_pass/barcode02/barcode02.fastq.gz
100_reads,.github/test-data/nanopore/fastq_pass/barcode03/barcode03.fastq.gz
25_reads,.github/test-data/nanopore/fastq_pass/barcode24/barcode24.fastq.gz
negative-ctrl,.github/test-data/nanopore/fastq_pass/barcode94/barcode94.fastq.gz
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,23 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v1.1.0 - [Unreleased]

### `Added`
- Input schema JSON and validation
- FORMAT_INPUT workflow
- Handles the input data now
- `[email protected]` plugin

### `Changed`
- `--input SAMPLESHEET_CSV` header
- Went from `reads` with path to barcode directories to `fastq_1` with path to fastq files
- Fixed bug so that SNPEff will now work with given gff files
- Issue was typo related in the build module
- Fixed bug with `calc_bam_variation` caused by genome case
- Log and error statements
- Fixed the cache directory statements

## v1.0.0 - [2024-03-22]

Initial release of `phac-nml/viralassembly`, created from combining the [nf-core](https://nf-co.re/) template with the artic steps.
14 changes: 7 additions & 7 deletions README.md
@@ -11,7 +11,7 @@ Some of the goals of this pipeline are:
2. Allow the pipeline to be used on other viruses with or without amplicon schemes
- Due to the QC steps there is unfortunately a current limitation at working with segmented viruses
- The pipeline will automatically exit after assembly and not generate QC and Reports for these at this time
- This will be fully implemented at some point
- This will hopefully be fully implemented at some point in the future
3. Provide `Run` level and `Sample` level final reports

## Index
@@ -32,7 +32,7 @@ Some of the goals of this pipeline are:
- Conda command: `conda create -n nextflow -c conda-forge -c bioconda nextflow`
2. Install with the instructions at https://www.nextflow.io/

2. Run the pipeline with a profile to handle dependencies:
2. Run the pipeline with one of the following profiles to handle dependencies (or use your own profile if you have one!):
- `conda`
- `mamba`
- `singularity`
@@ -43,8 +43,8 @@ Simple commands to run input data. Input data can be done in three different ways
1. Passing `--fastq_pass </PATH/TO/fastq_pass>` where `fastq_pass` is a directory containing `barcode##` subdirectories with fastq files
2. Passing `--fastq_pass </PATH/TO/fastqs>` where `fastqs` is a directory containing `.fastq*` files
3. Passing `--input <samplesheet.csv>` where `samplesheet.csv` is a CSV file with two columns
1. `sample` - The name of the sample
2. `reads` - Path to a directory containing reads for the sample in `.fastq*` format
1. `sample` - The name of the sample
2. `fastq_1` - Path to one fastq file per sample in `.fastq*` format

The basic examples will show how to run the pipeline using the `--fastq_pass` input but it could be subbed in for the `--input` CSV file if wanted.

@@ -103,7 +103,7 @@ nextflow run /PATH/TO/artic-generic-nf/main.nf \
Medaka model information [can be found here](https://github.com/nanoporetech/medaka#models)

### Nanopore - Nanopolish
Running the pipeline with [nanopolish](https://github.com/jts/nanopolish) for variant calls requires fastq files, fast5 files, and the sequencing summary file. When running, the pipeline will look for subdirectories off of the input directory called `barcode##` to be used in the pipeline.
Running the pipeline with [nanopolish](https://github.com/jts/nanopolish) for variant calls requires fastq files, fast5 files, and the sequencing summary file instead of providing a model. As such, nanopolish requires that the read ids in the fastq files are linked by the sequencing summary file to their signal-level data in the fast5 files. This makes it **a lot** easier to run using barcoded directories, but it can be run with individual read files.

See the [nanopolish section](./docs/usage.md#nanopolish) of the usage docs for more information

@@ -173,8 +173,8 @@ Contributions are welcome through creating PRs or Issues
## Legal
Copyright 2023 Government of Canada

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

http://www.apache.org/licenses/LICENSE-2.0
https://opensource.org/license/mit/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
25 changes: 25 additions & 0 deletions assets/schema_input.json
@@ -0,0 +1,25 @@
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"$id": "https://raw.githubusercontent.com/nf-schema/example/master/assets/schema_input.json",
"title": "nf-schema example - params.input schema",
"description": "Schema for the file provided with params.input",
"type": "array",
"items": {
"type": "object",
"properties": {
"sample": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample name must be provided and cannot contain spaces",
"meta": ["id"]
},
"fastq_1": {
"type": "string",
"pattern": "^\\S+\\.f(ast)?q(\\.gz)?$",
"format": "file-path",
"errorMessage": "FastQ file for reads must be provided, cannot contain spaces, and must have extension '.fq(.gz)' or '.fastq(.gz)'"
}
},
"required": ["sample", "fastq_1"]
}
}
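The new schema enforces two regex patterns on each samplesheet row. As a rough illustration (not part of the pipeline itself), the same checks can be reproduced with Python's `re` module; `validate_row` is a hypothetical helper:

```python
import re

# Patterns copied from assets/schema_input.json
SAMPLE_RE = re.compile(r"^\S+$")
FASTQ_RE = re.compile(r"^\S+\.f(ast)?q(\.gz)?$")

def validate_row(row: dict) -> list:
    """Return a list of error messages for one samplesheet row (sketch only)."""
    errors = []
    if not SAMPLE_RE.match(row.get("sample", "")):
        errors.append("Sample name must be provided and cannot contain spaces")
    if not FASTQ_RE.match(row.get("fastq_1", "")):
        errors.append("FastQ file must have extension '.fq(.gz)' or '.fastq(.gz)' and contain no spaces")
    return errors

print(validate_row({"sample": "sample1", "fastq_1": "barcode01.fastq.gz"}))  # prints []
```

Note how the old `reads` column (a barcode directory) would now fail the `fastq_1` pattern, which is the point of the header change.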
2 changes: 2 additions & 0 deletions bin/calc_bam_variation.py
@@ -86,6 +86,8 @@ def create_ref_dict(ref_fasta) -> dict:
ref_dict = {}
ref_fasta = pysam.FastaFile(ref_fasta)
for idx, base in enumerate(ref_fasta.fetch(ref_fasta.references[0])):
# Allows lowercase and uppercase refs, fixes small bug
base = base.upper()
ref_dict[idx] = base

return ref_dict
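The two added lines normalize each reference base to uppercase before it is stored. A minimal sketch of the same idea without `pysam`, assuming the reference sequence is already in hand as a plain string:

```python
def create_ref_dict(seq: str) -> dict:
    """Map 0-based position -> uppercase reference base.

    Soft-masked references use lowercase (acgt); comparing those against
    uppercase read bases without normalizing would flag every masked
    position as a mismatch, which is the small bug this change fixes.
    """
    return {idx: base.upper() for idx, base in enumerate(seq)}

print(create_ref_dict("acGT"))  # prints {0: 'A', 1: 'C', 2: 'G', 3: 'T'}
```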
8 changes: 6 additions & 2 deletions bin/cs_vcf_filter.py
@@ -72,8 +72,12 @@ def check_filter(self, v):
# These seem to not be being pulled out though so check if the qual is none to fail them
if qual == None:
return False
# Temp filter of qual 15, will need to re-evaluate with more data
if qual < 15:
# 2 is the default for clair3 so bump slightly up to 3
if qual < 3:
return False

# Allele fraction >= 0.75 required
if len(v.samples) != 1 or v.samples[0].data.AF < 0.75:
return False

if self.no_frameshifts and not in_frame(v):
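The relaxed thresholds can be summarized without the surrounding VCF objects. A hedged sketch of the site-level logic after this change (`passes_basic_filters` is a hypothetical helper, not the module's actual interface):

```python
def passes_basic_filters(qual, allele_fractions) -> bool:
    """Sketch of the relaxed Clair3 site filter from this change.

    qual: variant QUAL (None means missing); allele_fractions: per-sample AF
    values, expected to hold exactly one sample. Thresholds mirror the diff:
    QUAL >= 3 (Clair3's own default cutoff is 2, the old temp filter was 15)
    and allele fraction >= 0.75.
    """
    if qual is None or qual < 3:
        return False
    if len(allele_fractions) != 1 or allele_fractions[0] < 0.75:
        return False
    return True

print(passes_basic_filters(14.2, [0.9]))  # prints True; would have failed the old QUAL >= 15 cutoff
```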
8 changes: 5 additions & 3 deletions conf/base.config
@@ -26,12 +26,12 @@ process {
withLabel: process_single {
cpus = { check_max( 1 , 'cpus' ) }
memory = { check_max( 4.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
time = { check_max( 6.h * task.attempt, 'time' ) }
}
withLabel: process_low {
cpus = { check_max( 2 * task.attempt, 'cpus' ) }
memory = { check_max( 8.GB * task.attempt, 'memory' ) }
time = { check_max( 4.h * task.attempt, 'time' ) }
time = { check_max( 6.h * task.attempt, 'time' ) }
}
withLabel: process_medium {
cpus = { check_max( 4 * task.attempt, 'cpus' ) }
@@ -47,7 +47,9 @@
time = { check_max( 20.h * task.attempt, 'time' ) }
}
withLabel: process_high_memory {
memory = { check_max( 90.GB * task.attempt, 'memory' ) }
cpus = { check_max( 2 * task.attempt, 'cpus' ) }
memory = { check_max( 32.GB * task.attempt, 'memory' ) }
time = { check_max( 8.h * task.attempt, 'time' ) }
}
withLabel: error_ignore {
errorStrategy = 'ignore'
6 changes: 4 additions & 2 deletions conf/nml.config
@@ -13,16 +13,18 @@ params {
max_retries = 3
max_jobs = 100
}
executor {
name = 'slurm'
queueSize = params.max_jobs
}
process {
// Base process
executor = "slurm"
queue = "${params.partition}"

// Allow up to given retries per process - error strat is same as nf-core base config
errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
maxRetries = params.max_retries

// Don't spam the cluster
queueSize = params.max_jobs
submitRateLimit = "10sec"
}
22 changes: 11 additions & 11 deletions docs/example_files/inputs/input.csv
@@ -1,11 +1,11 @@
sample,reads
negative-ctrl-1,barcode11
run-ntc-1,barcode94
pos-ctrl-1,barcode95
cov-1,barcode5
cov-2,barcode01
cov-3,barcode12
cov-4,barcode19
cov-5,barcode40
cov-6,barcode61
cov-7,barcode75
sample,fastq_1
negative-ctrl-1,fastqs/negative_ctrl.fastq
run-ntc-1,fastqs/ntc-1.fastq
pos-ctrl-1,fastqs/pos-cov-1.fastq
cov-1,fastqs/sample1.fastq
cov-2,fastqs/sample2.fastq
cov-3,fastqs/sample3.fastq
cov-4,fastqs/sample4.fastq
cov-5,fastqs/sample5.fastq
cov-6,fastqs/sample6.fastq
cov-7,fastqs/sample7.fastq
33 changes: 18 additions & 15 deletions docs/usage.md
@@ -1,15 +1,15 @@
# phac-nml/viralassembly: Usage

## Introduction
This pipeline is intended to be run on either Nanopore Amplicon Sequencing data or Basic Nanopore NGS Sequencing data that can utilize a reference genome for mapping variant calling, and other downstream analyses. It generates variant calls, consensus sequences, and quality control information based on the reference. To do this, there are three different variant callers that can be utilized which includes: `clair3`, `medaka`, and `nanopolish` (For R9.4.1 flowcells and below only).
This pipeline is intended to be run on either Nanopore Amplicon Sequencing data or Basic Nanopore NGS Sequencing data that can utilize a reference genome for mapping, variant calling, and other downstream analyses. It generates variant calls, consensus sequences, and quality control information based on the reference. To do this, there are three different variant callers that can be utilized: `clair3`, `medaka`, and `nanopolish` (which is for R9.4.1 flowcells and below only!).

For Amplicon Sequencing data it is at minimum required to:
1. Specify a path to the reads/input file
2. Specify the scheme name
3. Specify the scheme version
4. Pick a variant caller and caller model

For Basic NGS sequencing data it is required to:
For Basic NGS Sequencing data it is at minimum required to:
1. Specify a path to the reads/input file
2. Specify a path to the reference genome
3. Pick a variant caller and caller model
@@ -37,7 +37,7 @@ For Basic NGS sequencing data it is required to:
- [Core Nextflow Arguments](#core-nextflow-arguments)

## Profiles
Profiles are used to specify dependency installation, resources, and how to handle pipeline jobs. You can specify more than one profile but avoid passing in more than one dependency managment profiles. They can be passed with `-profile <PROFILE>`
Profiles are used to specify dependency installation, resources, and how to handle pipeline jobs. You can specify more than one profile but *avoid* passing in more than one dependency management profile. They can be passed with `-profile <PROFILE>`

Available:
- `conda`: Utilize conda to install dependencies and environment management
@@ -49,7 +49,7 @@ Available:
Two options for fastq data input: `--fastq_pass <FASTQ_PASS/>` or `--input <INPUT.csv>`

### Fastq Pass Directory (--fastq_pass)
Specify fastq data to input based on a given directory. The directory can either be barcoded, as would be seen after demultiplexing, or it could be a flat input of fastq files. The barcoded fastq data will be output with the barcode number but can be renamed with a [metadata csv]() input. The flat fastq files will keep their basename (separated out at the first `.`).
Specify fastq data to input based on a given directory. The directory can either contain barcoded directories (barcodexx), as would be seen after demultiplexing, or it could contain sample fastq files (one fastq per sample). The barcoded fastq data will be output with the barcode number but can be renamed with a [metadata tsv](#metadata) file input. The flat fastq files will keep their basename (separated out at the first `.`). Example:

Barcoded:
```
(barcode## subdirectory tree elided in this capture)
```

@@ -75,18 +75,20 @@

Flat:
```
(per-sample fastq files elided in this capture)
```
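A rough sketch of how the two layouts described above could be distinguished (hypothetical helper, not the pipeline's actual code):

```python
from pathlib import Path

def collect_fastq_inputs(fastq_pass: str) -> dict:
    """Map sample name -> list of fastq paths for the two --fastq_pass layouts.

    barcode## subdirectories keep the barcode number as the sample name;
    flat per-sample files use the basename up to the first '.'.
    """
    root = Path(fastq_pass)
    samples = {}
    for barcode_dir in sorted(root.glob("barcode*")):
        if barcode_dir.is_dir():
            samples[barcode_dir.name] = sorted(barcode_dir.glob("*.fastq*"))
    if not samples:  # fall back to a flat directory of per-sample fastqs
        for fq in sorted(root.glob("*.fastq*")):
            samples[fq.name.split(".")[0]] = [fq]
    return samples
```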

### Input CSV (--input)
You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to pass in an input CSV file containing 2 columns, `sample`, and `reads` where:
You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to pass in an input CSV file containing 2 columns, `sample`, and `fastq_1` where:
- `sample` is the sample name to use
- `reads` is the path to the barcode directory containing reads
- `fastq_1` is the path to one fastq file per sample in `.fastq*` format

Ex.
| sample | reads |
| sample | fastq_1 |
| - | - |
| sample1 | /path/to/barcode01 |
| ntc | /path/to/barcode02 |
| pos | /path/to/barcode03 |
| sample1 | /path/to/sample.fastq |
| sample2 | /path/to/sample2-1.fastq |
| sample2 | /path/to/sample2-2.fastq |
| ntc | /path/to/control.fastq |
| pos | /path/to/pos.fastq |

This will be expanded upon in future releases to allow more varied inputs for the input sheet.
A sample can be given multiple fastq files if it was resequenced or needed a top up run. If there are multiple fastq files for a sample they will be concatenated and gzipped. If not, the input fastq file will just be gzipped (if it isn't already).
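As a sketch of that behaviour (hypothetical helpers, not the pipeline's implementation), grouping samplesheet rows by sample and merging each group into one gzipped file could look like:

```python
import csv
import gzip
import shutil
from collections import defaultdict
from pathlib import Path

def group_samplesheet(samplesheet: str) -> dict:
    """Group fastq_1 paths by sample name (one samplesheet row per file)."""
    groups = defaultdict(list)
    with open(samplesheet) as handle:
        for row in csv.DictReader(handle):
            groups[row["sample"]].append(Path(row["fastq_1"]))
    return dict(groups)

def merge_sample(fastqs, out_path) -> None:
    """Concatenate one sample's fastqs into a single gzipped file.

    Already-gzipped inputs are decompressed first so the output is one
    clean gzip stream; plain fastqs are copied through as-is.
    """
    with gzip.open(out_path, "wb") as out:
        for fq in fastqs:
            opener = gzip.open if fq.suffix == ".gz" else open
            with opener(fq, "rb") as handle:
                shutil.copyfileobj(handle, out)
```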

## Variant Callers
Three different variant callers are available with slightly different options regarding running with them. For the most accurate results when running with `clair3` or `medaka` pick a model that best matches the input data!!
@@ -116,19 +118,19 @@ And has the optional parameters of:
Medaka models come built in with the tool itself with the default set to `r941_min_hac_g507` which can be changed with `--medaka_model <MODEL>` parameter shown above. More information on models [can be found here](https://github.com/nanoporetech/medaka#models). Remember to pick a model that best represents the data!

### [Nanopolish](https://github.com/jts/nanopolish)
Nanopolish is a software package for signal-level analysis of Oxford Nanopore sequencing data. It does not presently support the R10.4 flowcells so as a variant caller it should only be used with R9.4 flowcells. It also requires that the fastq data is in barcoded directories to work correctly.
Nanopolish is a software package for signal-level analysis of Oxford Nanopore sequencing data. It *does not presently support the R10.4 flowcells* so as a variant caller it should only be used with R9.4 flowcells.

Running with `nanopolish` requires the following parameters:
- `--variant_caller nanopolish`
- `--fast5_pass <FAST5_PASS/>`
- `--sequencing_summary <SEQ_SUM.txt>`

Nanopolish requires the fast5 directory along with the sequencing summary file to be used as input instead of a model.
Nanopolish requires the fast5 directory along with the sequencing summary file to be used as input instead of a model. As such, nanopolish requires that the read ids in the fastq files are linked by the sequencing summary file to their signal-level data in the fast5 files. This makes it **a lot** easier to run using barcoded directories, but it can be done with individual read files.
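A quick way to sanity-check that fastq-to-fast5 link before running (hypothetical helper; assumes the summary is a tab-separated file with a `read_id` column, which is the standard ONT layout):

```python
import csv

def missing_read_ids(fastq_read_ids, sequencing_summary_path):
    """Return fastq read ids that have no row in the sequencing summary.

    Any id returned here would have no path back to its signal-level fast5
    data, which is what nanopolish needs to call variants.
    """
    with open(sequencing_summary_path) as handle:
        summarized = {row["read_id"] for row in csv.DictReader(handle, delimiter="\t")}
    return sorted(set(fastq_read_ids) - summarized)
```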

## Running the pipeline

### Amplicon
The typical command for running the pipeline with an amplicon scheme using medaka and a different medaka model is as follows:
The typical command for running the pipeline with an [amplicon scheme](#schemes-and-reference) using medaka and a different medaka model is as follows:

```bash
nextflow run phac-nml/viralassembly \
@@ -223,12 +225,13 @@ Use `--version` to see version information
### All Parameters
| Parameter | Description | Type | Default | Notes |
| - | - | - | - | - |
| --fastq_pass | Path to directory containing `barcode##` subdirectories OR Path to directory containing `*.fastq*` files | Path | null | [Option for input params](#input-parameters) |
| --input | Path to samplesheet with information about the samples you would like to analyse | Path | null | [Option for input params](#input-parameters) |
| --variant_caller | Pick from the 3 variant callers: 'clair3', 'medaka', 'nanopolish' | Choice | '' | Details above |
| --clair3_model | Clair3 base model to be used in the pipeline | Str | 'r941_prom_sup_g5014' | Default model will not work the best for all inputs. [See clair3 docs](https://github.com/HKU-BAL/Clair3#pre-trained-models) for additional info |
| --clair3_user_variant_model | Path to clair3 additional model directory to use instead of a base model | Path | '' | Default model will not work the best for all inputs. [See clair3 docs](https://github.com/HKU-BAL/Clair3#pre-trained-models) for additional info |
| --clair3_no_pool_split | Do not split reads into separate pools | Bool | False | Clair3 amplicon sequencing only |
| --medaka_model | Medaka model to be used in the pipeline | Str | 'r941_min_hac_g507' | Default model will not work the best for all inputs. [See medaka docs](https://github.com/nanoporetech/medaka#models) for additional info |
| --fastq_pass | Path to directory containing `barcode##` subdirectories | Path | null | |
| --fast5_pass | Path to directory containing `barcode##` fast5 subdirectories | Path | null | Only for nanopolish |
| --sequencing_summary | Path to run `sequencing_summary*.txt` file | Path | null | Only for nanopolish |
| --min_length | Minimum read length to be kept | Int | 200 | For artic guppyplex |
4 changes: 2 additions & 2 deletions lib/NfcoreSchema.groovy
@@ -117,7 +117,7 @@ class NfcoreSchema {
for (specifiedParam in params.keySet()) {
// nextflow params
if (nf_params.contains(specifiedParam)) {
log.error "ERROR: You used a core Nextflow option with two hyphens: '--${specifiedParam}'. Please resubmit with '-${specifiedParam}'"
log.error "You used a core Nextflow option with two hyphens: '--${specifiedParam}'. Please resubmit with '-${specifiedParam}'"
has_error = true
}
// unexpected params
@@ -156,7 +156,7 @@
schema.validate(params_json)
} catch (ValidationException e) {
println ''
log.error 'ERROR: Validation of pipeline parameters failed!'
log.error 'Validation of pipeline parameters failed!'
JSONObject exceptionJSON = e.toJSON()
printExceptions(exceptionJSON, params_json, log, enums)
println ''