diff --git a/.gitignore b/.gitignore
index 89228c39..c0b23863 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,3 +42,6 @@ binned*
 
 # snakemake stuff
 .snakemake/
+
+# python stuff
+__pycache__/
\ No newline at end of file
diff --git a/README.md b/README.md
index 9bcc6780..0f9aff8a 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ Read Assignment, Mapping, and Phylogenetic Analysis in Real Time.
 
 RAMPART runs concurrently with MinKNOW and shows you demuxing / mapping results in real time.
 
-![](docs/images/main.png)
+![](docs/img/main.png)
 
 
 ## Motivation
@@ -13,29 +13,12 @@ Furthermore, the small size of many pathogens mean that insightful sequence data is obtained in a matter of minutes.
 RAMPART run concurrently with MinION sequencing of such pathogens.
 It provides a real-time overview of genome coverage and reference matching for each barcode.
 
-RAMPART was originally designed to work with amplicon-based primer schemes (e.g. for [ebola](https://github.com/artic-network/primer-schemes)), but this isn't a requirement.
-
-
+This version of RAMPART is designed for ...
 
 ## Documentation
 
-* [Installation](docs/installation.md)
-* [Running an example dataset & understanding the visualisations](docs/examples.md)
+* [Installation](docs/installation.md)
 * [Setting up for your own run](docs/setting-up.md)
 * [Configuring RAMPART using protocols](docs/protocols.md)
-* [Debugging when things don't work](docs/debugging.md)
-* [Notes relating to RAMPART development](docs/developing.md)
-
-
-
-
-## Status
-
-RAMPART is in development with a publication forthcoming.
-Please [get in contact](https://twitter.com/hamesjadfield) if you have any issues, questions or comments.
-
-
-## RAMPART has been deployed to sequence:
+* [Covid strand matching pipeline](docs/barcode_strand_match.md)
-
-* [Yellow Fever Virus in Brazil](https://twitter.com/Hill_SarahC/status/1149372404260593664)
-* [ARTIC workshop in Accra, Ghana](https://twitter.com/george_l/status/1073245364197711874)
diff --git a/default_protocol/pipelines.json b/default_protocol/pipelines.json
index 4f0cfb6b..7932cd5f 100644
--- a/default_protocol/pipelines.json
+++ b/default_protocol/pipelines.json
@@ -1,7 +1,7 @@
 {
   "annotation": {
     "name": "Annotate reads",
-    "path": "pipelines/demux_map",
+    "path": "../pipelines/default_pipeline/demux_map",
     "config_file": "config.yaml",
     "requires": [
       {
@@ -12,7 +12,7 @@
   },
   "export_reads": {
     "name": "Export reads",
-    "path": "pipelines/bin_to_fastq",
+    "path": "../pipelines/default_pipeline/bin_to_fastq",
     "config_file": "config.yaml",
     "run_per_sample": true
   }
diff --git a/docs/barcode_strand_match.md b/docs/barcode_strand_match.md
new file mode 100644
index 00000000..8d10d7dd
--- /dev/null
+++ b/docs/barcode_strand_match.md
@@ -0,0 +1,92 @@
+# Covid strand matching pipeline
+
+This pipeline looks for new `.csv` files that RAMPART's annotation pipeline has created (in the `/annotations` folder) since the last time this pipeline was triggered.
+
+It then looks for the matching `.fastq` files and creates `.bam` files from them one by one, using `minimap2`.
+
+The `.bam` files are no longer needed once our Python script has processed them, so they are deleted afterwards.
+
+The first output file of this pipeline is located in `/annotations/base_count/count.csv` and contains the count of each base (A, C, G or T) at each position of the reference genome, for each barcode.
+Each line of the file contains:
+
+* the position in the reference genome
+* the barcode name
+* the count of A's mapped to this position in the reference genome
+* the count of C's
+* the count of G's
+* the count of T's
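+
+For illustration, a few lines of `count.csv` might look like this (the values here are made up; the columns follow the order listed above):
+
+```
+3266,BC01,0,12,0,0
+3267,BC01,0,1,0,14
+3268,BC01,11,0,1,0
+```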
+
+The next time you trigger the pipeline, it will use this output file as one of its inputs: the new counts will be added to those from the existing `count.csv` file and a new file will be created as output.
+
+The next step of the pipeline is determining the variants, based on a provided `.txt` file of a specific format (see the mutations file section below) which lists the changes in the reference genome that are specific to known variants of SARS-CoV-2.
+
+By default our pipeline uses one of the `.txt` files we have created. All the `.txt` files are located in `/covid_protocol/pipelines/run_python_scripts/rules/mut_files/`.
+You may also want to provide your own. To do this, create a `.txt` file in the directory mentioned above, and then replace the name of the file to be used in `covid_protocol/pipelines/run_python_scripts/config.yaml` with your own.
+
+You can also set your own threshold value, which determines the minimal number of reads that must be mapped to a position in the reference genome before our Python script will classify a mutation there as significant enough to support that the barcode sample corresponds to a variant.
+
+These are the default settings:
+
+```
+###mutations###
+coverage_threshold: 10
+mutations_file: mutbb.txt
+```
+
+At the end, the `annotations/results` folder should contain a `mutations.json` file with the mutations we matched to the barcodes.
+
+Once the JSON file is available, it will be loaded into RAMPART and the results will be shown.
+
+## The mutations file
+This file specifies the variants of SARS-CoV-2 to look for and the mutations that are specific to each variant.
+Each line starts with the label of a variant,
+followed by exactly one space and then the number of mutations that must match before we can say that a barcode corresponds to this variant.
+Then come the mutations that are typical for the variant, separated by spaces.
+Lines starting with `#` are comments and are ignored when parsing the file.
+
+### Example
+```
+UK 5 C3267T C5388A ... G28280C A28281T T28282A
+```
+You can see this line in our default file.
+
+It starts with "UK", which is our label for this variant.
+
+The label is followed by a number, 5, which says "if at least 5 of the following mutations are present in the sample, classify the sample as this variant".
+
+The number is followed by the mutations that are typical for this variant (for example, C is changed to T at position 3267 of the reference genome), separated by spaces.
+
+### Tree-like structure
+You can also provide further variants in a tree-like structure, using the keywords `start_sub` and `end_sub` on separate lines:
+```
+#UK variant
+UK 5 C3267T ... T28282A
+
+#more specific variants for UK
+start_sub
+UK-subvariant_1 1 A17615G
+
+#subvariants for UK-subvariant_1
+start_sub
+.
+.
+UK-subvariant_1-Poland 4 C5301T C7420T C9693T G23811T C25350T C28677T G29348T
+UK-subvariant_1-Gambia 3 T6916C T18083C G22132A C23929T
+end_sub
+
+end_sub
+
+#CZ variant
+CZ 3 G12988T G15598A G18028T T24910C T26972C
+```
+
+This means that we will look for the UK variant, and if we find at least 5 mutations from the list provided in the UK line, we will also continue searching for the other, more specific variants.
+
+For example, if the UK variant is matched, we will check whether the mutation A17615G is also present;
+
+if it is, we will then check whether some of the mutations specified in its subsection (UK-subvariant_1-Poland or UK-subvariant_1-Gambia) are present.
+
+We stop searching at the point where no more subsections are specified, or where fewer than the required count of mutations were found for a sample.
+
+We will look for the CZ variant too. This one has no subvariants specified in this example file, so no further search would be made.
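+
+To make the matching rule concrete, here is a minimal Python sketch of the classification step described above. It is illustrative only: the function names are ours, not the pipeline's, and we assume a mutation counts as present when its position has at least `coverage_threshold` reads and the alternative base is the most frequent call there.
+
+```python
+# Illustrative sketch only -- the names and the exact "is this mutation
+# present?" rule are our assumptions, not the pipeline's actual code.
+def mutation_present(counts, mutation, coverage_threshold=10):
+    """counts maps position -> {'A': n, 'C': n, 'G': n, 'T': n} for one barcode.
+    A mutation such as 'C3267T' is treated as present when the position has
+    enough coverage and the alternative base is the most frequent call."""
+    alt = mutation[-1]          # e.g. 'T'
+    pos = int(mutation[1:-1])   # e.g. 3267
+    bases = counts.get(pos, {})
+    if sum(bases.values()) < coverage_threshold:
+        return False
+    return bases.get(alt, 0) == max(bases.values(), default=0)
+
+def variant_matches(counts, required, mutations, coverage_threshold=10):
+    """A line such as 'UK 5 C3267T ...' matches when at least `required`
+    of its listed mutations are present."""
+    hits = sum(mutation_present(counts, m, coverage_threshold) for m in mutations)
+    return hits >= required
+```
+
+Subvariant sections would then be checked in the same way, but only for samples whose parent variant already matched.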
diff --git a/docs/img/main.png b/docs/img/main.png
new file mode 100644
index 00000000..cb643f63
Binary files /dev/null and b/docs/img/main.png differ
diff --git a/docs/img/main_2.png b/docs/img/main_2.png
new file mode 100644
index 00000000..794515a9
Binary files /dev/null and b/docs/img/main_2.png differ
diff --git a/docs/img/old/main.png b/docs/img/old/main.png
new file mode 100644
index 00000000..fa0fcfde
Binary files /dev/null and b/docs/img/old/main.png differ
diff --git a/docs/img/old/main_2.png b/docs/img/old/main_2.png
new file mode 100644
index 00000000..36c66734
Binary files /dev/null and b/docs/img/old/main_2.png differ
diff --git a/docs/img/old/main_3.png b/docs/img/old/main_3.png
new file mode 100644
index 00000000..8e0ae292
Binary files /dev/null and b/docs/img/old/main_3.png differ
diff --git a/docs/img/s1.png b/docs/img/s1.png
new file mode 100644
index 00000000..bd5abd4f
Binary files /dev/null and b/docs/img/s1.png differ
diff --git a/docs/img/s2.png b/docs/img/s2.png
new file mode 100644
index 00000000..f8bd8bfc
Binary files /dev/null and b/docs/img/s2.png differ
diff --git a/docs/img/s3.png b/docs/img/s3.png
new file mode 100644
index 00000000..563b5254
Binary files /dev/null and b/docs/img/s3.png differ
diff --git a/docs/img/s4.png b/docs/img/s4.png
new file mode 100644
index 00000000..b4e705d8
Binary files /dev/null and b/docs/img/s4.png differ
diff --git a/docs/img/s5.png b/docs/img/s5.png
new file mode 100644
index 00000000..e1c56448
Binary files /dev/null and b/docs/img/s5.png differ
diff --git a/docs/installation.md b/docs/installation.md
index 8cbb8b6f..5992c158 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -4,84 +4,25 @@
 
 These instructions assume that you have installed [MinKNOW](https://community.nanoporetech.com/downloads) and are able to run it.
 
-## Install from conda
-
 We also assume that you are using conda -- See [instructions here](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) to install conda on your machine.
 
-### Step 1: Create a new conda environment or install nodejs into your current conda environment
-
-Create a new conda environment and activate it via:
-
-```bash
-conda create -n artic-rampart -y nodejs=12 # any version >10 should be fine
-conda activate artic-rampart
-```
-
-Or install NodeJS into your currently activated environment via:
-
-```bash
-conda install -y nodejs=12 # any version >10 should be fine
-```
-
-### Step 2: Install RAMPART
-
-```bash
-conda install -y artic-network::rampart=1.1.0
-```
-
-### Step 3: Install dependencies
-
-Note that you may already have some or all of these in your environment, in which case they can be skipped.
-Additionally, some are only needed for certain analyses and can also be skipped as desired.
-
-> If you are installing RAMPART into the [artic-ncov2019](https://github.com/artic-network/artic-ncov2019) conda environment, you will already have all of these dependencies.
-
-
-Python, biopython, snakemake and minimap2 are required
-
-```bash
-conda install -y "python>=3.6"
-conda install -y anaconda::biopython
-conda install -y -c conda-forge -c bioconda "snakemake<5.11" # snakemake 5.11 will not work currently
-conda install -y bioconda::minimap2=2.17
-```
-
-If you are using guppy to demux samples, you don't need Porechop,
-however if you require RAMPART to perform demuxing then you must install the ARTIC fork of Porechop:
-
-```bash
-python -m pip install git+https://github.com/artic-network/Porechop.git@v0.3.2pre
-```
-
-If you wish to use the post-processing functionality available in RAMPART to bin reads, then you'll need `binlorry`:
-
-```bash
-python -m pip install binlorry==1.3.0_alpha1
-```
-
-### Step 4: Check that it works
-
-```
-rampart --help
-```
-
----
 
 ## Install from source
 
 (1) Clone the Github repo
 
 ```bash
-git clone https://github.com/artic-network/rampart.git
+git clone https://github.com/fmfi-compbio/rampart.git
 cd rampart
 ```
 
-(2) Create an activate the conda environment with the required dependencies.
-You can either follow steps 1 & 3 above, or use the provided `environment.yml` file via
+(2) Create and activate the conda environment with the required dependencies using the provided `environment.yml` file via
+
+*Note: we are using a modified version of Porechop, in which we fixed a bug that caused the first 12 barcodes of the 96 PCR barcode set to go missing in the original version of RAMPART.*
 
 ```bash
 conda env create -f environment.yml
-conda activate artic-rampart
+conda activate covid-artic-rampart
 ```
 
 (3) Install dependencies using `npm`
 
 ```bash
 npm install
 ```
 
 (4) Build the RAMPART client bundle
 
+*Note: you will have to run this command any time you pull a new version from GitHub.*
+
 ```bash
 npm run build
 ```
 
 (5) (optional, but recommended) install rampart globally within the conda environment
 so that it is available via the `rampart` command
 
 ```bash
-npm install --global .
+npm install --global
 ```
 
 Check that things work by running `rampart --help`
diff --git a/docs/protocols.md b/docs/protocols.md
index 9696206d..c581aec6 100644
--- a/docs/protocols.md
+++ b/docs/protocols.md
@@ -7,7 +7,7 @@ For this reason a protocol is typically virus-specific, with the run-specific information overlaid.
 
 #### A protocol is composed of 5 JSON files
 
-A protocol is composed of 5 JSON fileswhich control various configuration options.
+A protocol is composed of 5 JSON files which control different configuration options.
 RAMPART will search for each of these JSON files in a cascading fashion, building up information from various sources (see "How RAMPART finds configuration files to build up the protocol" below).
 This allows us to always start with RAMPARTs "default" protocol, and then add in run-specific information from different JSONs, and potentially modify these via command line arguments.
@@ -17,122 +17,83 @@
 Each file is described in more detail below, but briefly the five files are:
 * `protocol.json` Description of the protocol's purpose
 * `genome.json` describes the reference genome of what's being sequenced
 * `primers.json` describes the position of primers across the genome
 * `pipelines.json` describes the pipelines used for data processing and analysis by RAMPART
-* `run_configuration.json` contains information about the current run, including
+* `run_configuration.json` contains information about the current run
 
 
 Typically, you would provide RAMPART with a virus-specific protocol directory containing the first four files, and the run-specific information would be either in a JSON in the current working directory or specified via command line args.
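+
+For orientation, a protocol directory for this fork might contain files along these lines (an illustrative sketch only; `references.fasta` is the reference file required by the pipelines defined below):
+
+```
+my_protocol/
+├── protocol.json
+├── genome.json
+├── primers.json
+├── pipelines.json
+└── references.fasta
+```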
+You can find more information about the `protocol.json`, `genome.json`, `primers.json` and `run_configuration.json` files in the documentation of the original version of RAMPART.
 
 ---
-## How RAMPART finds configuration files to build up the protocol
-
-RAMPART searches a number of folders in a cascading manner in order to build up the protocol for the run.
-Each time it finds a matching JSON it adds it into the overall protocol, overriding previous options as necessary _(technically we're doing a shallow merge of the JSONs)_. Folders searched are, in order:
-
-1. RAMPART's default protocol. See below for what defaults this sets.
-2. Run specific protocol directory. This is set either with `--protocol <path>` or via the `RAMPART_PROTOCOL` environment variable.
-3. The current working directory.
-
----
-## Protocol file: `protocol.json`
-
-This JSON format file contains some basic information about the protocol.
-For instance, this is the protocol description for the provided EBOV example:
-
-```json
-{
-  "name": "ARTIC Ebola virus protocol v1.1",
-  "description": "Amplicon based sequencing of Ebola virus (Zaire species).",
-  "url": "http://artic.network/"
-}
-```
-
-You may also set `annotationOptions` and `displayOptions` in this file (see below for more info).
+## Protocol file: `pipelines.json`
+Each pipeline defined here is an analysis toolkit available for RAMPART.
+It's necessary to provide an `annotation` pipeline to parse and interpret basecalled FASTQ files.
+Other pipelines are surfaced to the UI and can be triggered by the user.
+For instance, you could define a "generate consensus genome" pipeline and then trigger it for a given sample when you are happy with the coverage statistics presented by RAMPART.
 
----
-## Protocol file: `genome.json`
-
-This JSON format file describes the structure of the genome (positions of all the genes) and is used by RAMPART to visualize the coverage across the genome.
-
-```json
-{
-  "label": "Ebola virus (EBOV)",
-  "length": 18959,
-  "genes": {
-    "<gene name>": {
-      "start": 469,
-      "end": 2689,
-      "strand": 1
-    },
-    ...
-  },
-  "reference": {
-    "label": "<label>",
-    "accession": "<accession>",
-    "sequence": "atcg..."
-  }
-}
-```
+Each pipeline defined in the JSON must indicate a directory containing a `Snakemake` file which RAMPART will call. This call to snakemake is configured with an optional `config.yaml` file (see below), user defined `--config` options, and dynamically generated `--config` options RAMPART will add to indicate the "state of the world".
+
+```json
+{
+  "pipeline_name": {...},
+  ...
+}
+```
+
+You might also want to use our `barcode_strand_match` pipeline, so that you can use the core modification we added to the original version of RAMPART.
+You can find more about the `barcode_strand_match` pipeline [here](barcode_strand_match.md).
 
 ---
-## Protocol file: `primers.json`
-
-
-The primer scheme description is defined via the `primers.json` file in the protocol directory.
-
-This JSON format files describes the locations of the amplicons. The coordinates are in reference to the genome description in the `genome.json` file and will be used by RAMPART to draw the the amplicons in the coverage plots. If it is not present then no amplicons will be shown in RAMPART.
-
-> These data are only used for display, not analysis.
-
-```json
-{
-  "name": "EBOV primer scheme v1.0",
-  "amplicons": [
-    [32, 1057],
-    [907, 1881],
-    .
-    .
-    .
-    [17182, 18183],
-    [17941, 18916]
-  ]
-}
-
-```
+Our pipelines are specified as follows:
+
+```json
+{
+  "annotation": {
+    "name": "Annotate reads",
+    "path": "pipelines/demux_map_covid",
+    "config_file": "config.yaml",
+    "requires": [
+      {
+        "file": "references.fasta",
+        "config_key": "references_file"
+      }
+    ]
+  },
+  "barcode_strand_match": {
+    "name": "Match strands to barcodes",
+    "path": "pipelines/run_python_scripts",
+    "config_file": "config.yaml",
+    "requires": [
+      {
+        "file": "references.fasta",
+        "config_key": "references_file"
+      }
+    ]
+  },
+  "export_reads": {
+    "name": "Export reads",
+    "path": "pipelines/bin_to_fastq",
+    "config_file": "config.yaml",
+    "run_per_sample": true
+  }
+}
+```
 
----
-## Protocol file: `pipelines.json`
-
-Each pipeline defined here is an analysis toolkit available for RAMPART.
-It's necessary to provide an `annotation` pipeline to parse and interpret basecalled FASTQ files.
-Other pipelines are surfaced to the UI and can be triggered by the user.
-For instance, you could define a "generate consensus genome" pipeline and then trigger it for a given sample when you are happy with the coverage statistics presented by RAMPART.
-
-
-Each pipeline defined in the JSON must indicate a directory containing a `Snakemake` file which RAMPART will call. This call to snakemake is configured with an optional `config.yaml` file (see below), user defined `--config` options, and dynamicaly generated `--config` options RAMPART will add to indicate the "state of the world".
-
-```json
-{
-  "pipeline_name": {...},
-  ...
-}
-```
-
 **Required properties**
 * `"path" {string}` -- the path to the pipeline, relative to the "protocol" directory. There _must_ be a `Snakemake` file in this directory
 
 **Optional properties**
 * `"name" {string}` -- the name of the pipeline. Default: the JSON key name.
-* `"run_per_sample" {bool}` -- is the pipeline able to be run for an individual sample? If this is set, then the pipeline will show up as a triggerable entry in the menu of each sample panel.
+* `"run_per_sample" {bool}` -- is the pipeline to be run for an individual sample? If this is set, then the pipeline will show up as a triggerable entry in the menu of each sample panel.
 * `"run_for_all_samples" {bool}` _NOT YET IMPLEMENTED_
 * `"config_file" {string}` -- the name of a config file (e.g. `config.yaml`) in the pipeline directory. This will be supplied to Snakemake via `--configfile`.
 * `"configOptions" {object}` -- options here will be supplied to snakemake via the `--config` argument. The format of these options is `key=value`, and strings are quoted if needed. If `value` is an empty string, then the key is reported alone. If `value` is an array, then the entries are joined using a `,` character. If `value` is a dictionary, then the keys & values of that are joined via `:`. The values of a dict / array must be strings. For instance, `"configOptions": {"a": "", "b": "B", "c": ["C", "CC"], "d": {"D": "", "DD": "DDD"}}` will get turned into `--config a b=B c=C,CC d=D,DD:DDD`.
-* `"requires" {object}` _only usable by the annotation pipeline. see below._
+* `"requires" {object}` _only used by the annotation and barcode_strand_match pipelines. see below._
 
 
 #### RAMPART injected `--config` information
 
 When a pipeline is triggered, RAMPART will supply various information about the current state to the snakemake pipeline via `--config`. These will override any user settings with the same key names.
 
 * `sample_name` the sample name for which the pipeline has been triggered. (Currently pipelines only work per-sample).
 * `barcodes` a list of barcodes linked to the sample
 * `annotated_path` absolute path to the files produced by the annotation pipeline
 * `basecalled_path` absolute path to the basecalled FASTQs
 * `output_path` absolute path to where the output of the pipeline should be saved
 
 If filtering is enabled, then the following options are presented to the pipeline as applicable:
 
 * `references` a list of references which the display is filtered to (i.e. only reads with a top hit to one of these references are being displayed)
 * `maxReadLength` reads longer than this are being filtered out
 * `minReadLength` reads shorter than this are being filtered out
 * `minRefSimilarity` Lower cutoff (%) for read match similarity
 * `maxRefSimilarity` Upper cutoff (%) for read match similarity
 
 > More options will be added in the future.
 
 
 #### The annotation pipeline
 
-The `annotation` pipeline is a special pipeline that will process reads as they are basecalled by MinKNOW. The default pipeline is in the `default_protocol` directory in the RAMPART directory but it can be overridden in a `protocol` directory to provide customised behaviour.
+The `annotation` pipeline is a special pipeline that will process reads as they are basecalled by MinKNOW. The default pipeline is in the `covid_protocol` directory in the RAMPART directory but it can be overridden in a `protocol` directory to provide customised behaviour.
 
 Unlike other pipelines, additional configuration can be provided here (see "Annotation options" below).
 
-Furthermore, the annotation pipeline can use a `requires` property which specifies files to be handed to Snakemake via the `--config` parameter.
+Furthermore, the annotation pipeline can use the `requires` property which specifies files to be handed to Snakemake via the `--config` parameter.
 
 
 #### RAMPART's default annotation pipeline
 
 The default pipeline will watch for new `.fastq` files appearing in the `basecalled_path` (usually the `fastq/pass` folder in the MinKNOW run's data folder). Each `.fastq` file will contain 4000 reads by default. The pipeline will then de-multiplex the reads looking for any barcodes that were used when creating the sequencing library. It then maps each read to a set of reference genomes, provided by the `protocol` or by the user, recording the closest reference genome and the mapping coordinates of the read on that reference. This information is then recorded for each read in a comma-separated `.csv` text file with the same file name stem as the original `.fastq` file. It is this file which is then read by RAMPART and the data visualised.
 
-For **de-multiplexing**, The `annotation` pipeline currently uses a customised version of `porechop` that was installed by `conda` when RAMPART was installed. `porechop` is an adapter trimming and demultiplexing package written by Ryan Wick. It's original source can be found at [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop). For RAMPART we have modified it to focus on demultiplexing, making it faster. The forked, modified version can be found at [https://github.com/artic-network/Porechop](https://github.com/artic-network/Porechop).
+For **de-multiplexing**, the `annotation` pipeline currently uses a customised version of `porechop` that was installed from git when RAMPART was installed. `porechop` is an adapter trimming and demultiplexing package written by Ryan Wick. Its original source can be found at [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop). For RAMPART the authors have modified it to focus on demultiplexing, making it faster. The forked, modified version can be found at [https://github.com/artic-network/Porechop](https://github.com/artic-network/Porechop).
+For this version of RAMPART we have modified it further, to fix a problem with the 96 PCR barcode set. Our forked, modified version can be found at [https://github.com/fmfi-compbio/Porechop](https://github.com/fmfi-compbio/Porechop).
 
 If demuxing has been done by guppy, then the **de-multiplexing** step is skipped.
 
-**Reference mapping** is done using `minimap2` ([https://minimap2.org]()). This step requires a `FASTA` file containing at least one reference genome (or sub-genomic region if that is being targetted). The choice of reference sequences will depend on aim of the sequencing task. The reference genome panel could span a range of genotypes or completely different viruses if a metagenomic protocol is being used. The relatively high per-read error rate will probably mean that very close variants cannot be easily distinguished at this stage.
+**Reference mapping** is done using `minimap2` ([https://minimap2.org](https://minimap2.org)). This step requires a `FASTA` file containing at least one reference genome (or sub-genomic region if that is being targeted). The choice of reference sequences will depend on the aim of the sequencing task. The reference genome panel could span a range of genotypes or completely different viruses if a metagenomic protocol is being used. The relatively high per-read error rate will probably mean that very close variants cannot be easily distinguished at this stage.
 
-The mapping coordinates will be recorded based on the closest mapped reference but RAMPART will scale to a single coordinate system based on the reference genome provided the `genome.json` file.
+The mapping coordinates will be recorded based on the closest mapped reference but RAMPART will scale to a single coordinate system based on the reference genome provided in the `genome.json` file.
 
 
 #### Annotation options
 
-The default `annotation` pipeline has a number of options that can be specified, primarily to control the demultiplexing step. These options can be specified in the `protocol.json` --- to provide the options that are most appropriate for the lab protocol --- or in the `run_configuration.json` for customization for a particular run. They can also be specified on the command line when RAMPART is started via `--annotationOptions`.
+The default `annotation` pipeline has a number of options that can be specified, primarily to control the demultiplexing step. These options can be specified in the `protocol.json` --- to provide the options that are most appropriate for the lab protocol --- or in the `run_configuration.json` for customization for a particular run. They can also be specified from the command line when RAMPART is started via `--annotationOptions`.
 If demuxing has been performed by guppy, then these options have no effect!
 
 - `require_two_barcodes` (default true)
-    > When true this option requires there to be the same barcode on both ends of the reads to ensure accurate demultiplexing.
+    > When true this option requires that there is the same barcode on both ends of the reads to ensure accurate demultiplexing.
 
 - `barcode_threshold <threshold>` (default 80)
     > How good does the % identity to the best match barcode have to be to assign that barcode to the read?
 
@@ -200,7 +162,7 @@ If demuxing has been performed by guppy, then these options have no effect!
 - `limit_barcodes_to [BC01, BC02, ...]` (default no limits)
     > Specify a list of barcodes that were used in the sequencing and limit demultiplexing to these (any others will be put in the unassigned category). The digits at the end of the barcode names are used to designate the barcodes and refer to the barcodes in the barcode set being used.
 
-In `protocol.json` or `run_configuration.json` you can sepecify the annotation pipeline options with a section labelled `annotationOptions`:
+In `protocol.json` or `run_configuration.json` you can specify the annotation pipeline options with a section labelled `annotationOptions`:
 ```json
 annotationOptions: {
     "require_two_barcodes": "false",
     "barcode_set": "rapid",
     "limit_barcodes_to": "BC04,BC05"
 }
 ```
diff --git a/docs/setting-up.md b/docs/setting-up.md
index 00db6a5a..1f992df3 100644
--- a/docs/setting-up.md
+++ b/docs/setting-up.md
@@ -1,73 +1,28 @@
 # Setting up your own run
 
-This will walk through how to set up your own run -- including describing the config files which RAMPART uses.
-If you haven't already [installed rampart](installation) or [run an example dataset](examples) then please do that now.
+This will walk you through how to set up your own run -- including describing the config files which RAMPART uses.
+If you haven't already [installed rampart](installation) then please do that now.
 
 ## Run configuration
 
-By default a sub-directory called `annotations` will be created in the runtime directory contain the data files generated by the processing pipeline.
+By default a sub-directory called `annotations` will be created in the runtime directory containing the data files generated by the processing pipeline.
 These will be in the format of comma-separated-value (CSV) files with one for each FASTQ file created by the basecaller (they will have the same filename as the FASTQ files but with the extension `.csv` instead of `.fastq`).
 
 The runtime directory can also contain a configuration file called `run_configuration.json` to provide details about the MinION run being performed.
+This file can specify the basecalled read path (as an alternative to the command line `--basecalledPath`), and a title for the run.
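+
+As an illustration, a minimal `run_configuration.json` for this version might look like this (the title and path are placeholders to adjust for your own run):
+
+```json
+{
+  "title": "My SARS-CoV-2 run",
+  "basecalledPath": "fastq/pass"
+}
+```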
-
-Here's the example of (most of) the Ebola example `run_configuration.json`:
-
-```json
-{
-  "title": "EBOV Validation Run",
-  "basecalledPath": "fastq/pass",
-  "samples": [
-    {
-      "name": "Mayinga",
-      "description": "",
-      "barcodes": [ "BC01" ]
-    },
-    {
-      "name": "Kikwit",
-      "description": "",
-      "barcodes": [ "BC03" ]
-    },
-    {
-      "name": "Makona",
-      "description": "",
-      "barcodes": [ "BC04" ]
-    },
-    {
-      "name": "Negative Control",
-      "description": "",
-      "barcodes": [ "BC02" ]
-    }
-  ]
-}
-```
-
-This file can specify the basecalled read path (as an alternative to the command line `--basecalledPath`), a title for the run and also a list of samples, their names and the barcodes that are being used to distinguish them.
-If barcodes are specified in this way then only these barcodes will be used and visualized in RAMPART.
-These options can also be specified on the command line (`--title`, `--basecalledPath`, `--annotatedPath` & `--barcodeNames`) and will override the options in the JSON file. However the `run_configuration.json` is useful as a way of recording the samples and barcodes used in a run and to help if the run needs to be restarted.
+
+*In the original version of RAMPART, you can specify a list of samples, their names and the barcodes that are being used to distinguish them. However, this feature will not work properly with this version of RAMPART - you will not see the variants matched to the renamed barcodes.*
+
+These options can also be specified from the command line (`--title`, `--basecalledPath`, `--annotatedPath`) and will override the options in the JSON file. However the `run_configuration.json` is useful as a way of recording the samples and barcodes used in a run and to help if the run needs to be restarted.
 
 ## Configuration files define a protocol.
 
-To get richer, more informative real-time analysis a folder of configuration files called a `protocol` directory can be provided.
-This is a directory of files with specified names and formats that tell RAMPART about what is being sequenced and allows it to visualize the sequencing more appropriately.
-It can also contain custom scripts to alter the behaviour or processing of the data.
-
-Normally the protocol directory is virus-specific, not run-specific.
-
-### Installing a Protocol
-
-A protocol directory, a complete set of configuration files and, potentially, custom pipelines can be downloaded as a package. For example the standard [ARTIC Network](http://artic.network/) Ebola virus protocol for RAMPART is available from [GitHub](https://github.com/artic-network/artic-ebov/). It can be downloaded and installed as follows:
-
-```
-git clone https://github.com/artic-network/artic-ebov.git
-```
-
-Then to use this protocol, RAMPART should be run with the `--protocol` command line option:
-
-```
-node <path to rampart>/rampart.js --protocol <path to>/artic-ebov/rampart
-```
+*To get richer, more informative real-time analysis, a folder of configuration files called a `protocol` directory can be provided.
+This is a directory of files with specified names and formats that tell RAMPART about what is being sequenced and allows it to visualize the sequencing more appropriately.
+It can also contain custom scripts to alter the behaviour or processing of the data.*
+
+*Normally, the protocol directory is virus-specific, not run-specific.*
 
 ### Define your own protocol
 
diff --git a/environment.yml b/environment.yml
index c1d80907..120f98e5 100644
--- a/environment.yml
+++ b/environment.yml
@@ -1,4 +1,4 @@
-name: artic-rampart
+name: covid-artic-rampart
 channels:
   - bioconda
   - conda-forge
@@ -11,5 +11,6 @@
   - biopython=1.74
   - minimap2=2.17
   - pip:
+    - pysam
     - binlorry==1.3.0_alpha1
-    - git+https://github.com/artic-network/Porechop.git@v0.3.2pre
+    - git+https://github.com/fmfi-compbio/Porechop.git@master
diff --git a/old/README.md b/old/README.md
new file mode 100644
index 00000000..3a27816e
--- /dev/null
+++ b/old/README.md
@@ -0,0 +1,44 @@
+## NOTE: this folder contains files from the original version of RAMPART; some parts of the documentation and examples in this folder are NOT applicable to this version,
+therefore this part is marked as old.
+
+# RAMPART
+Read Assignment, Mapping, and Phylogenetic Analysis in Real Time.
+
+
+RAMPART runs concurrently with MinKNOW and shows you demuxing / mapping results in real time.
+
+![](docs/images/main.png)
+
+
+## Motivation
+Time is crucial in outbreak analysis, and recent advancements in sequencing prep now mean that sequencing is the bottleneck for many pathogens.
+Furthermore, the small size of many pathogens mean that insightful sequence data is obtained in a matter of minutes.
+RAMPART run concurrently with MinION sequencing of such pathogens.
+It provides a real-time overview of genome coverage and reference matching for each barcode.
+
+RAMPART was originally designed to work with amplicon-based primer schemes (e.g. for [ebola](https://github.com/artic-network/primer-schemes)), but this isn't a requirement.
+
+
+
+## Documentation
+
+* [Installation](docs/installation.md)
+* [Running an example dataset & understanding the visualisations](docs/examples.md)
+* [Setting up for your own run](docs/setting-up.md)
+* [Configuring RAMPART using protocols](docs/protocols.md)
+* [Debugging when things don't work](docs/debugging.md)
+* [Notes relating to RAMPART development](docs/developing.md)
+
+
+
+
+## Status
+
+RAMPART is in development with a publication forthcoming.
+Please [get in contact](https://twitter.com/hamesjadfield) if you have any issues, questions or comments.
+
+
+## RAMPART has been deployed to sequence:
+
+* [Yellow Fever Virus in Brazil](https://twitter.com/Hill_SarahC/status/1149372404260593664)
+* [ARTIC workshop in Accra, Ghana](https://twitter.com/george_l/status/1073245364197711874)
diff --git a/docs/debugging.md b/old/docs/debugging.md
similarity index 100%
rename from docs/debugging.md
rename to old/docs/debugging.md
diff --git a/docs/developing.md b/old/docs/developing.md
similarity index 100%
rename from docs/developing.md
rename to old/docs/developing.md
diff --git a/docs/examples.md b/old/docs/examples.md
similarity index 100%
rename from docs/examples.md
rename to old/docs/examples.md
diff --git a/docs/images/coverage.png b/old/docs/images/coverage.png
similarity index 100%
rename from docs/images/coverage.png
rename to old/docs/images/coverage.png
diff --git a/docs/images/log.png b/old/docs/images/log.png
similarity index 100%
rename from docs/images/log.png
rename to old/docs/images/log.png
diff --git a/docs/images/main.png b/old/docs/images/main.png
similarity index 100%
rename from docs/images/main.png
rename to old/docs/images/main.png
diff --git a/docs/images/rateOverTime.png b/old/docs/images/rateOverTime.png
similarity index 100%
rename from docs/images/rateOverTime.png
rename to old/docs/images/rateOverTime.png
diff --git a/docs/images/readsOverTime.png b/old/docs/images/readsOverTime.png
similarity index 100%
rename from docs/images/readsOverTime.png
rename to old/docs/images/readsOverTime.png
diff --git a/docs/images/readsPerSample.png b/old/docs/images/readsPerSample.png
similarity index 100%
rename from docs/images/readsPerSample.png
rename to old/docs/images/readsPerSample.png
diff --git a/docs/images/referenceHeatMap.png b/old/docs/images/referenceHeatMap.png
similarity index 100%
rename from docs/images/referenceHeatMap.png
rename to old/docs/images/referenceHeatMap.png
diff --git a/docs/images/referenceMatches.png b/old/docs/images/referenceMatches.png
similarity index 100%
rename from docs/images/referenceMatches.png
rename to old/docs/images/referenceMatches.png
diff --git a/docs/images/samplePanel.png b/old/docs/images/samplePanel.png
similarity index 100%
rename from docs/images/samplePanel.png
rename to old/docs/images/samplePanel.png
diff --git a/old/docs/installation.md b/old/docs/installation.md
new file mode 100644
index 00000000..8cbb8b6f
--- /dev/null
+++ b/old/docs/installation.md
@@ -0,0 +1,107 @@
+# Installation
+
+
+These instructions assume that you have installed [MinKNOW](https://community.nanoporetech.com/downloads) and are able to run it.
+
+
+## Install from conda
+
+We also assume that you are using conda -- See [instructions here](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) to install conda on your machine.
+
+### Step 1: Create a new conda environment or install nodejs into your current conda environment
+
+Create a new conda environment and activate it via:
+
+```bash
+conda create -n artic-rampart -y nodejs=12 # any version >10 should be fine
+conda activate artic-rampart
+```
+
+Or install NodeJS into your currently activated environment via:
+
+```bash
+conda install -y nodejs=12 # any version >10 should be fine
+```
+
+### Step 2: Install RAMPART
+
+```bash
+conda install -y artic-network::rampart=1.1.0
+```
+
+### Step 3: Install dependencies
+
+Note that you may already have some or all of these in your environment, in which case they can be skipped.
+Additionally, some are only needed for certain analyses and can also be skipped as desired.
+
+> If you are installing RAMPART into the [artic-ncov2019](https://github.com/artic-network/artic-ncov2019) conda environment, you will already have all of these dependencies.
+
+
+Python, biopython, snakemake and minimap2 are required
+
+```bash
+conda install -y "python>=3.6"
+conda install -y anaconda::biopython
+conda install -y -c conda-forge -c bioconda "snakemake<5.11" # snakemake 5.11 will not work currently
+conda install -y bioconda::minimap2=2.17
+```
+
+If you are using guppy to demux samples, you don't need Porechop,
+however if you require RAMPART to perform demuxing then you must install the ARTIC fork of Porechop:
+
+```bash
+python -m pip install git+https://github.com/artic-network/Porechop.git@v0.3.2pre
+```
+
+If you wish to use the post-processing functionality available in RAMPART to bin reads, then you'll need `binlorry`:
+
+```bash
+python -m pip install binlorry==1.3.0_alpha1
+```
+
+### Step 4: Check that it works
+
+```
+rampart --help
+```
+
+---
+
+## Install from source
+
+(1) Clone the Github repo
+
+```bash
+git clone https://github.com/artic-network/rampart.git
+cd rampart
+```
+
+(2) Create an activate the conda environment with the required dependencies.
+You can either follow steps 1 & 3 above, or use the provided `environment.yml` file via
+
+```bash
+conda env create -f environment.yml
+conda activate artic-rampart
+```
+
+(3) Install dependencies using `npm`
+
+```bash
+npm install
+```
+
+(4) Build the RAMPART client bundle
+
+```bash
+npm run build
+```
+
+(5) (optional, but recommended) install rampart globally within the conda environment
+so that it is available via the `rampart` command
+
+```bash
+npm install --global .
+```
+
+Check that things work by running `rampart --help`
+
diff --git a/docs/out-of-date/minit_installation.md b/old/docs/out-of-date/minit_installation.md
similarity index 100%
rename from docs/out-of-date/minit_installation.md
rename to old/docs/out-of-date/minit_installation.md
diff --git a/docs/out-of-date/quickstart.md b/old/docs/out-of-date/quickstart.md
similarity index 100%
rename from docs/out-of-date/quickstart.md
rename to old/docs/out-of-date/quickstart.md
diff --git a/docs/out-of-date/sequencing.md b/old/docs/out-of-date/sequencing.md
similarity index 100%
rename from docs/out-of-date/sequencing.md
rename to old/docs/out-of-date/sequencing.md
diff --git a/old/docs/protocols.md b/old/docs/protocols.md
new file mode 100644
index 00000000..9696206d
--- /dev/null
+++ b/old/docs/protocols.md
@@ -0,0 +1,274 @@
+# Protocols
+
+The "protocol" is what defines how RAMPART will behave and look.
+It's the primary place where the reference genome(s), primers, analysis pipelines etc are defined.
+For this reason a protocol is typically virus-specific, with the run-specific information overlaid.
+
+
+#### A protocol is composed of 5 JSON files
+
+A protocol is composed of 5 JSON fileswhich control various configuration options.
+RAMPART will search for each of these JSON files in a cascading fashion, building up information from various sources (see "How RAMPART finds configuration files to build up the protocol" below).
+This allows us to always start with RAMPARTs "default" protocol, and then add in run-specific information from different JSONs, and potentially modify these via command line arguments.
+
+
+Each file is described in more detail below, but briefly the five files are:
+* `protocol.json` Description of the protocol's purpose
+* `genome.json` describes the reference genome of what's being sequenced
+* `primers.json` describes the position of primers across the genome
+* `pipelines.json` describes the pipelines used for data processing and analysis by RAMPART
+* `run_configuration.json` contains information about the current run, including
+
+
+Typically, you would provide RAMPART with a virus-specific protocol directory containing the first four files, and the run-specific information would be either in a JSON in the current working directory or specified via command line args.
+
+
+---
+## How RAMPART finds configuration files to build up the protocol
+
+RAMPART searches a number of folders in a cascading manner in order to build up the protocol for the run.
+Each time it finds a matching JSON it adds it into the overall protocol, overriding previous options as necessary _(technically we're doing a shallow merge of the JSONs)_. Folders searched are, in order:
+
+1. RAMPART's default protocol. See below for what defaults this sets.
+2. Run specific protocol directory. This is set either with `--protocol <path>` or via the `RAMPART_PROTOCOL` environment variable.
+3. The current working directory.
+
+---
+## Protocol file: `protocol.json`
+
+This JSON format file contains some basic information about the protocol.
+For instance, this is the protocol description for the provided EBOV example:
+
+```json
+{
+  "name": "ARTIC Ebola virus protocol v1.1",
+  "description": "Amplicon based sequencing of Ebola virus (Zaire species).",
+  "url": "http://artic.network/"
+}
+```
+
+You may also set `annotationOptions` and `displayOptions` in this file (see below for more info).
+
+
+---
+## Protocol file: `genome.json`
+
+This JSON format file describes the structure of the genome (positions of all the genes) and is used by RAMPART to visualize the coverage across the genome.
+
+
+```json
+{
+  "label": "Ebola virus (EBOV)",
+  "length": 18959,
+  "genes": {
+    "<gene name>": {
+      "start": 469,
+      "end": 2689,
+      "strand": 1
+    },
+    ...
+  },
+  "reference": {
+    "label": "<label>",
+    "accession": "<accession>",
+    "sequence": "atcg..."
+  }
+}
+```
+
+
+
+---
+## Protocol file: `primers.json`
+
+
+The primer scheme description is defined via the `primers.json` file in the protocol directory.
+
+This JSON format files describes the locations of the amplicons. The coordinates are in reference to the genome description in the `genome.json` file and will be used by RAMPART to draw the the amplicons in the coverage plots. If it is not present then no amplicons will be shown in RAMPART.
+
+> These data are only used for display, not analysis.
+
+```json
+{
+  "name": "EBOV primer scheme v1.0",
+  "amplicons": [
+    [32, 1057],
+    [907, 1881],
+    .
+    .
+    .
+    [17182, 18183],
+    [17941, 18916]
+  ]
+}
+
+```
+
+---
+## Protocol file: `pipelines.json`
+
+Each pipeline defined here is an analysis toolkit available for RAMPART.
+It's necessary to provide an `annotation` pipeline to parse and interpret basecalled FASTQ files.
+Other pipelines are surfaced to the UI and can be triggered by the user.
+For instance, you could define a "generate consensus genome" pipeline and then trigger it for a given sample when you are happy with the coverage statistics presented by RAMPART.
+
+
+Each pipeline defined in the JSON must indicate a directory containing a `Snakemake` file which RAMPART will call. This call to snakemake is configured with an optional `config.yaml` file (see below), user defined `--config` options, and dynamicaly generated `--config` options RAMPART will add to indicate the "state of the world".
+
+
+```json
+{
+  "pipeline_name": {...},
+  ...
+}
+```
+
+**Required properties**
+* `"path" {string}` -- the path to the pipeline, relative to the "protocol" directory. There _must_ be a `Snakemake` file in this directory
+
+**Optional properties**
+* `"name" {string}` -- the name of the pipeline. Default: the JSON key name.
+* `"run_per_sample" {bool}` -- is the pipeline able to be run for an individual sample? If this is set, then the pipeline will show up as a triggerable entry in the menu of each sample panel.
+* `"run_for_all_samples" {bool}` _NOT YET IMPLEMENTED_
+* `"config_file" {string}` -- the name of a config file (e.g. `config.yaml`) in the pipeline directory. This will be supplied to Snakemake via `--configfile`.
+* `"configOptions" {object}` -- options here will be supplied to snakemake via the `--config` argument. The format of these options is `key=value`, and strings are quoted if needed. If `value` is an empty string, then the key is reported alone. If `value` is an array, then the entries are joined using a `,` charater. If `value` is a dictionary, then the keys & values of that are joined via `:`. The values of a dict / array must be strings. For instance, `"configOptions": {"a": "", "b": "B", "c": ["C", "CC"], "d": {"D": "", "DD": "DDD"}}` will get turned into `--config a b=B c=C,CC d=D,DD:DDD`.
+* `"requires" {object}` _only usable by the annotation pipeline. see below._
+
+
+#### RAMPART injected `--config` information
+
+When a pipeline is triggered, RAMPART will supply various information about the current state to the snakemake pipeline via `--config`. These will override any user settings with the same key names.
+
+* `sample_name` the sample name for which the pipeline has been triggered. (Currently pipelines only work per-sample).
+* `barcodes` a list of barcodes linked to the sample
+* `annotated_path` absolute path to the files produced by the annotation pipeline
+* `basecalled_path` absolute path to the basecalled FASTQs
+* `output_path` absolute path to where the output of the pipeline should be saved
+
+If filtering is enabled, then the following options are presented to the pipeline as applicable:
+
+* `references` a list of references which the display is filtered to (i.e. only reads with a top hit to one of these references are being displayed)
+* `maxReadLength` reads longer than this are being filtered out
+* `minReadLength` reads shorter than this are being filtered out
+* `minRefSimilarity` Lower cutoff (%) for read match similarity
+* `maxRefSimilarity` Upper cutoff (%) for read match similarity
+
+> More options will be added in the future.
+
+
+#### The annotation pipeline
+
+The `annotation` pipeline is a special pipeline that will process reads as they are basecalled by MinKNOW. The default pipeline is in the `default_protocol` directory in the RAMPART directory but it can be overridden in a `protocol` directory to provide customised behaviour.
+
+Unlike other pipelines, additional configuration can be provided here (see "Annotation options" below).
+
+Furthermore, the annotation pipeline can use a `requires` property which specifies files to be handed to Snakemake via the `--config` parameter.
+
+
+#### RAMPART's default annotation pipeline
+
+The default pipeline will watch for new `.fastq` files appearing in the `basecalled_path` (usually the `fastq/pass` folder in the MinKNOW run's data folder). Each `.fastq` file will contain 4000 reads by default. The pipeline will then de-multiplex the reads looking for any barcodes that were used when creating the sequencing library. It then maps each read to a set of reference genomes, provided by the `protocol` or by the user, recording the closest reference genome and the mapping coordinates of the read on that reference. This information is then recorded for each read in a comma-seprated `.csv` text file with the same file name stem as the original `.fastq` file. It is this file which is then read by RAMPART and the data visualised.
+
+For **de-multiplexing**, The `annotation` pipeline currently uses a customised version of `porechop` that was installed by `conda` when RAMPART was installed. `porechop` is an adapter trimming and demultiplexing package written by Ryan Wick. It's original source can be found at [https://github.com/rrwick/Porechop](https://github.com/rrwick/Porechop). For RAMPART we have modified it to focus on demultiplexing, making it faster. The forked, modified version can be found at [https://github.com/artic-network/Porechop](https://github.com/artic-network/Porechop).
+
+If demuxing has been done by guppy, then the **de-multiplexing** step is skipped.
+
+**Reference mapping** is done using `minimap2` ([https://minimap2.org]()). This step requires a `FASTA` file containing at least one reference genome (or sub-genomic region if that is being targetted). The choice of reference sequences will depend on aim of the sequencing task. The reference genome panel could span a range of genotypes or completely different viruses if a metagenomic protocol is being used. The relatively high per-read error rate will probably mean that very close variants cannot be easily distinguished at this stage.
+
+The mapping coordinates will be recorded based on the closest mapped reference but RAMPART will scale to a single coordinate system based on the reference genome provided the `genome.json` file.
+
+
+#### Annotation options
+
+The default `annotation` pipeline has a number of options that can be specified, primarily to control the demultiplexing step. These options can be specified in the `protocol.json` --- to provide the options that are most appropriate for the lab protocol --- or in the `run_configuration.json` for customization for a particular run. They can also be specified on the command line when RAMPART is started via `--annotationOptions`.
+If demuxing has been performed by guppy, then these options have no effect!
+
+- `require_two_barcodes` (default true)
+    > When true this option requires there to be the same barcode on both ends of the reads to ensure accurate demultiplexing.
+
+- `barcode_threshold <threshold>` (default 80)
+    > How good does the % identity to the best match barcode have to be to assign that barcode to the read?
+
+- `barcode_diff <diff>` (default 5)
+    > How much better (in % identity) does the best barcode match have to be compared to the send best match.
+
+- `discard_unassigned` (default false)
+    > With this option on, any reads that are not reliably assigned a barcode (because it fails one of the above criteria) are not processed further and will not appear in RAMPART. By default these reads are processed and will appear in a category called 'unassigned'.
+
+- `barcode_set [native | rapid | pcr | all]` (default native)
+    > Specify which set of barcodes you are using. The `rapid` barcode set only uses a barcode at one end so `require_two_barcodes` should also be set to false when using these.
+
+- `limit_barcodes_to [BC01, BC02, ...]` (default no limits)
+    > Specify a list of barcodes that were used in the sequencing and limit demultiplexing to these (any others will be put in the unassigned category). The digits at the end of the barcode names are used to designate the barcodes and refer to the barcodes in the barcode set being used.
+
+In `protocol.json` or `run_configuration.json` you can sepecify the annotation pipeline options with a section labelled `annotationOptions`:
+```json
+annotationOptions: {
+    "require_two_barcodes": "false",
+    "barcode_set": "rapid",
+    "limit_barcodes_to": "BC04,BC05"
+}
+```
+
+On the command line these options can be specified using the `--annotationOptions` argument with a list of option and value pairs --- `