Merge pull request #18 from nextstrain/add-phylogenetic
Add phylogenetic directory
Showing 20 changed files with 356 additions and 264 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,4 @@ | ||
# CHANGELOG | ||
* 11 January 2024: Use a config file to define hardcoded parameters and file paths, add a change log. [PR #9](https://github.com/nextstrain/measles/pull/9) | ||
* 1 March 2024: Add phylogenetic directory to follow the pathogen-repo-guide, and update the CI workflow to match the new file structure. [PR #18](https://github.com/nextstrain/measles/pull/18) | ||
* 14 February 2024: Add ingest directory from pathogen-repo-guide and make measles-specific modifications. [PR #10](https://github.com/nextstrain/measles/pull/10) | ||
* 11 January 2024: Use a config file to define hardcoded parameters and file paths, and add a change log. [PR #9](https://github.com/nextstrain/measles/pull/9) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,67 +1,25 @@ | ||
# nextstrain.org/measles | ||
# Nextstrain repository for measles virus | ||
|
||
This is the [Nextstrain](https://nextstrain.org) build for measles virus, visible at | ||
[nextstrain.org/measles](https://nextstrain.org/measles). | ||
This repository contains two workflows for the analysis of measles virus data: | ||
|
||
The build encompasses fetching data, preparing it for analysis, doing quality | ||
control, performing analyses, and saving the results in a format suitable for | ||
visualization (with [auspice][]). This involves running components of | ||
Nextstrain such as [augur][]. | ||
- [`ingest/`](./ingest) - Download data from GenBank, clean and curate it | ||
- [`phylogenetic/`](./phylogenetic) - Filter sequences, align, construct phylogeny and export for visualization | ||
|
||
All measles-specific steps and functionality for the Nextstrain pipeline should be | ||
housed in this repository. | ||
Each folder contains a README.md with more information. The results of running both workflows are publicly visible at [nextstrain.org/measles](https://nextstrain.org/measles). | ||
|
||
[](https://github.com/nextstrain/measles/actions/workflows/ci.yaml) | ||
## Installation | ||
|
||
## Usage | ||
Follow the [standard installation instructions](https://docs.nextstrain.org/en/latest/install.html) for Nextstrain's suite of software tools. | ||
|
||
If you're unfamiliar with Nextstrain builds, you may want to follow our | ||
[quickstart guide][] first and then come back here. | ||
## Quickstart | ||
|
||
The easiest way to run this pathogen build is using the [Nextstrain | ||
command-line tool][nextstrain-cli]: | ||
Run the default phylogenetic workflow via: | ||
``` | ||
cd phylogenetic/ | ||
nextstrain build . | ||
nextstrain view . | ||
``` | ||
|
||
nextstrain build . | ||
## Documentation | ||
|
||
See the [nextstrain-cli README][] for how to install the `nextstrain` command. | ||
|
||
Alternatively, you should be able to run the build using `snakemake` within a | ||
suitably-configured local environment. Details of setting that up are not yet | ||
well-documented, but will be in the future. | ||
|
||
Build output goes into the directories `data/`, `results/` and `auspice/`. | ||
|
||
Once you've run the build, you can view the results in auspice: | ||
|
||
nextstrain view auspice/ | ||
|
||
|
||
## Configuration | ||
|
||
Configuration takes place entirely with the `Snakefile`. This can be read top-to-bottom, each rule | ||
specifies its file inputs and output and also its parameters. There is little redirection and each | ||
rule should be able to be reasoned with on its own. | ||
|
||
<!-- | ||
### fauna / RethinkDB credentials | ||
This build starts by pulling sequences from our live [fauna][] database (a RethinkDB instance). This | ||
requires environment variables `RETHINK_HOST` and `RETHINK_AUTH_KEY` to be set. | ||
--> | ||
|
||
If you don't have access to our https endpoints, you can run the build using the | ||
example data provided in this repository. Before running the build, copy the | ||
example sequences into the `data/` directory like so: | ||
|
||
mkdir -p data/ | ||
cp example_data/* data/. | ||
|
||
|
||
[Nextstrain]: https://nextstrain.org | ||
<!-- [fauna]: https://github.com/nextstrain/fauna --> | ||
[augur]: https://github.com/nextstrain/augur | ||
[auspice]: https://github.com/nextstrain/auspice | ||
[snakemake cli]: https://snakemake.readthedocs.io/en/stable/executable.html#all-options | ||
[nextstrain-cli]: https://github.com/nextstrain/cli | ||
[nextstrain-cli README]: https://github.com/nextstrain/cli/blob/master/README.md | ||
[quickstart guide]: https://nextstrain.org/docs/getting-started/quickstart | ||
- [Running a pathogen workflow](https://docs.nextstrain.org/en/latest/tutorials/running-a-workflow.html) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# This is currently an empty file to indicate the top level pathogen repo. | ||
# The inclusion of this file allows the Nextstrain CLI to run the | ||
# `nextstrain build` from any directory regardless of runtime. | ||
# | ||
# See https://github.com/nextstrain/cli/releases/tag/8.2.0 for more details. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
# nextstrain.org/measles | ||
|
||
This is the [Nextstrain](https://nextstrain.org) build for measles, visible at | ||
[nextstrain.org/measles](https://nextstrain.org/measles). | ||
|
||
## Software requirements | ||
|
||
Follow the [standard installation instructions](https://docs.nextstrain.org/en/latest/install.html) | ||
for Nextstrain's suite of software tools. | ||
|
||
## Usage | ||
|
||
If you're unfamiliar with Nextstrain builds, you may want to follow our | ||
[Running a Pathogen Workflow guide](https://docs.nextstrain.org/en/latest/tutorials/running-a-workflow.html) first and then come back here. | ||
|
||
The easiest way to run this pathogen build is using the Nextstrain | ||
command-line tool from within the `phylogenetic/` directory: | ||
|
||
cd phylogenetic/ | ||
nextstrain build . | ||
|
||
Build output goes into the directories `data/`, `results/` and `auspice/`. | ||
|
||
Once you've run the build, you can view the results with: | ||
|
||
nextstrain view . | ||
|
||
## Configuration | ||
|
||
Configuration takes place entirely with the `Snakefile`. This can be read | ||
top-to-bottom: each rule specifies its file inputs and outputs as well as its | ||
parameters. There is little redirection, and each rule should be possible to | ||
reason about on its own. | ||
|
||
### Using GenBank data | ||
|
||
This build starts by pulling preprocessed sequence and metadata files from: | ||
|
||
* https://data.nextstrain.org/files/measles/sequences.fasta.zst | ||
* https://data.nextstrain.org/files/measles/metadata.tsv.zst | ||
|
||
The above datasets have been preprocessed and cleaned from GenBank. | ||
|
||
### Using example data | ||
|
||
Alternatively, you can run the build using the | ||
example data provided in this repository. The following command copies the | ||
example sequences into the `data/` directory and then runs the build: | ||
|
||
nextstrain build . --configfile profiles/ci/profiles_config.yaml |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
configfile: "defaults/config.yaml" | ||
|
||
rule all: | ||
    input: | ||
        auspice_json = "auspice/measles.json", | ||
|
||
include: "rules/prepare_sequences.smk" | ||
include: "rules/construct_phylogeny.smk" | ||
include: "rules/annotate_phylogeny.smk" | ||
include: "rules/export.smk" | ||
|
||
# Include custom rules defined in the config. | ||
if "custom_rules" in config: | ||
    for rule_file in config["custom_rules"]: | ||
|
||
        include: rule_file | ||
|
||
rule clean: | ||
    """Removing directories: {params}""" | ||
    params: | ||
        "results", | ||
        "auspice" | ||
    shell: | ||
        "rm -rfv {params}" |
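Note on the `custom_rules` hook in the Snakefile above: it only fires when an extra config file supplies that key. As a rough illustration, the Python sketch below mimics how a config passed with `--configfile` could be layered over `defaults/config.yaml` before the `if "custom_rules" in config` check runs. The `merge_config` and `custom_rule_files` helpers and the sample `filter` values are hypothetical, and the recursive merge is an assumption about Snakemake's behaviour, not its actual implementation.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` layered on top.

    Nested dictionaries are merged key by key; any other value in
    `overrides` replaces the default outright.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def custom_rule_files(config: dict) -> list:
    """List the extra rule files the Snakefile loop would include."""
    return list(config.get("custom_rules", []))


# Hypothetical stand-ins for defaults/config.yaml and the CI config file.
defaults = {"filter": {"min_length": 5000, "group_by": "country year"}}
ci_overrides = {"custom_rules": ["build-configs/ci/copy_example_data.smk"]}

config = merge_config(defaults, ci_overrides)
print(custom_rule_files(defaults))  # default run: no custom rules
print(custom_rule_files(config))    # CI run: the extra rule file is included
```

Under this reading, a default `nextstrain build .` never loads `copy_example_data.smk`, so the download and decompress rules supply the data instead.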
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# This configuration file contains the custom configuration parameters | ||
# for the CI workflow to run with the example data. | ||
|
||
# Custom rules to run as part of the CI automated workflow | ||
# The paths should be relative to the phylogenetic directory. | ||
custom_rules: | ||
- build-configs/ci/copy_example_data.smk |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
rule copy_example_data: | ||
    input: | ||
        sequences="example_data/sequences.fasta", | ||
        metadata="example_data/metadata.tsv", | ||
    output: | ||
        sequences="data/sequences.fasta", | ||
        metadata="data/metadata.tsv", | ||
    shell: | ||
        """ | ||
        cp -f {input.sequences} {output.sequences} | ||
        cp -f {input.metadata} {output.metadata} | ||
        """ | ||
|
||
# Add a Snakemake ruleorder directive here if you need to resolve ambiguous rules | ||
# that have the same output as the copy_example_data rule. | ||
|
||
ruleorder: copy_example_data > decompress |
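Note on the `ruleorder` directive: both `copy_example_data` and `decompress` declare `data/sequences.fasta` and `data/metadata.tsv` as outputs, so the workflow needs a tie-breaker. The toy model below sketches that idea in Python. The `choose_rule` function is hypothetical and is not how Snakemake resolves rules internally; it only shows the priority logic the directive expresses.

```python
def choose_rule(candidates, ruleorder):
    """Pick which rule should produce a target that several rules can make.

    `candidates` is the set of rule names able to produce the target, and
    `ruleorder` lists rule names from highest to lowest priority.
    """
    if len(candidates) == 1:
        return next(iter(candidates))
    ranked = [name for name in ruleorder if name in candidates]
    if not ranked:
        # No ordering given, so the choice would be ambiguous.
        raise ValueError(f"ambiguous rules: {sorted(candidates)}")
    return ranked[0]


producers = {"copy_example_data", "decompress"}
print(choose_rule(producers, ["copy_example_data", "decompress"]))
```

With the ordering from the diff, the example data wins over the downloaded data whenever the CI config is loaded.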
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
""" | ||
This part of the workflow creates additional annotations for the phylogenetic tree. | ||
See Augur's usage docs for these commands for more details. | ||
""" | ||
|
||
rule ancestral: | ||
    """Reconstructing ancestral sequences and mutations""" | ||
    input: | ||
        tree = "results/tree.nwk", | ||
        alignment = "results/aligned.fasta" | ||
    output: | ||
        node_data = "results/nt_muts.json" | ||
    params: | ||
        inference = config["ancestral"]["inference"] | ||
    shell: | ||
        """ | ||
        augur ancestral \ | ||
            --tree {input.tree} \ | ||
            --alignment {input.alignment} \ | ||
            --output-node-data {output.node_data} \ | ||
            --inference {params.inference} | ||
        """ | ||
|
||
rule translate: | ||
    """Translating amino acid sequences""" | ||
    input: | ||
        tree = "results/tree.nwk", | ||
        node_data = "results/nt_muts.json", | ||
        reference = config["files"]["reference"] | ||
    output: | ||
        node_data = "results/aa_muts.json" | ||
    shell: | ||
        """ | ||
        augur translate \ | ||
            --tree {input.tree} \ | ||
            --ancestral-sequences {input.node_data} \ | ||
            --reference-sequence {input.reference} \ | ||
            --output {output.node_data} | ||
        """ |
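Note on the `translate` rule: `augur translate` works from the reference annotation and the reconstructed node sequences. The Python sketch below shows only the core idea, translating codons with the standard genetic code and listing amino acid differences between a parent and a child sequence. The `translate` and `aa_mutations` helpers are hypothetical and do not reproduce Augur's implementation or its handling of gene coordinates.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate an in-frame coding sequence; unknown codons become X."""
    seq = nucleotides.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[start:start + 3], "X")
        for start in range(0, usable, 3)
    )


def aa_mutations(parent: str, child: str) -> list:
    """List amino acid changes as <parent><1-based position><child>."""
    pairs = zip(translate(parent), translate(child))
    return [f"{a}{pos}{b}" for pos, (a, b) in enumerate(pairs, start=1) if a != b]


print(translate("ATGGCCTAA"))
print(aa_mutations("ATGGCC", "ATGGTC"))
```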
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
""" | ||
This part of the workflow constructs the phylogenetic tree. | ||
See Augur's usage docs for these commands for more details. | ||
""" | ||
|
||
rule tree: | ||
    """Building tree""" | ||
    input: | ||
        alignment = "results/aligned.fasta" | ||
    output: | ||
        tree = "results/tree_raw.nwk" | ||
    shell: | ||
        """ | ||
        augur tree \ | ||
            --alignment {input.alignment} \ | ||
            --output {output.tree} | ||
        """ | ||
|
||
rule refine: | ||
    """ | ||
    Refining tree | ||
      - estimate timetree | ||
      - use {params.coalescent} coalescent timescale | ||
      - estimate {params.date_inference} node dates | ||
      - filter tips more than {params.clock_filter_iqd} IQDs from clock expectation | ||
    """ | ||
    input: | ||
        tree = "results/tree_raw.nwk", | ||
        alignment = "results/aligned.fasta", | ||
        metadata = "data/metadata.tsv" | ||
    output: | ||
        tree = "results/tree.nwk", | ||
        node_data = "results/branch_lengths.json" | ||
    params: | ||
        coalescent = config["refine"]["coalescent"], | ||
        date_inference = config["refine"]["date_inference"], | ||
        clock_filter_iqd = config["refine"]["clock_filter_iqd"] | ||
    shell: | ||
        """ | ||
        augur refine \ | ||
            --tree {input.tree} \ | ||
            --alignment {input.alignment} \ | ||
            --metadata {input.metadata} \ | ||
            --output-tree {output.tree} \ | ||
            --output-node-data {output.node_data} \ | ||
            --timetree \ | ||
            --coalescent {params.coalescent} \ | ||
            --date-confidence \ | ||
            --date-inference {params.date_inference} \ | ||
            --clock-filter-iqd {params.clock_filter_iqd} | ||
        """ | ||
|
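Note on `--clock-filter-iqd`: the `refine` docstring describes filtering tips that lie more than a given number of interquartile distances (IQDs) from the clock expectation. A minimal Python sketch of that idea follows, assuming a simple least-squares regression of root-to-tip divergence on sampling date. The `clock_outliers` function and the sample numbers are hypothetical; the real filtering happens inside `augur refine` and may differ in detail.

```python
from statistics import quantiles


def clock_outliers(dates, divergences, n_iqd):
    """Return indices of tips whose clock residual exceeds n_iqd * IQD."""
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(divergences) / n
    # Least-squares slope of divergence against date (the clock rate).
    sxx = sum((x - mean_x) ** 2 for x in dates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, divergences))
    rate = sxy / sxx
    # Residual of each tip from the fitted clock line.
    residuals = [
        (y - mean_y) - rate * (x - mean_x)
        for x, y in zip(dates, divergences)
    ]
    q1, _, q3 = quantiles(residuals, n=4)
    iqd = q3 - q1
    return [i for i, r in enumerate(residuals) if abs(r) > n_iqd * iqd]


# Made-up tips: a clock-like trend with small scatter and one clear outlier.
dates = [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008]
divergences = [0.001, 0.0, 0.003, 0.002, 0.054, 0.004, 0.007, 0.006, 0.009]

print(clock_outliers(dates, divergences, n_iqd=4))
```

Only the tip sampled in 2004 sits far enough from the fitted line to be flagged in this made-up data set.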
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
""" | ||
This part of the workflow collects the phylogenetic tree and annotations to | ||
export a Nextstrain dataset. | ||
See Augur's usage docs for these commands for more details. | ||
""" | ||
|
||
rule export: | ||
    """Exporting data files for auspice""" | ||
    input: | ||
        tree = "results/tree.nwk", | ||
        metadata = "data/metadata.tsv", | ||
        branch_lengths = "results/branch_lengths.json", | ||
        nt_muts = "results/nt_muts.json", | ||
        aa_muts = "results/aa_muts.json", | ||
        colors = config["files"]["colors"], | ||
        auspice_config = config["files"]["auspice_config"] | ||
    output: | ||
        auspice_json = rules.all.input.auspice_json | ||
    shell: | ||
        """ | ||
        augur export v2 \ | ||
            --tree {input.tree} \ | ||
            --metadata {input.metadata} \ | ||
            --node-data {input.branch_lengths} {input.nt_muts} {input.aa_muts} \ | ||
            --colors {input.colors} \ | ||
            --auspice-config {input.auspice_config} \ | ||
            --include-root-sequence \ | ||
            --output {output.auspice_json} | ||
        """ | ||
|
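Note on `--node-data`: the `export` rule passes three node-data JSON files (branch lengths, nucleotide mutations, amino acid mutations) to a single `augur export v2` call. The Python sketch below illustrates one plausible way such files combine, assuming each has a top-level `nodes` mapping keyed by node name. The `merge_node_data` helper, the node name and the attribute values are illustrative only and are not taken from Augur's code or from real output.

```python
def merge_node_data(*node_data_files):
    """Combine per-node attributes from several node-data dictionaries.

    Later files overwrite earlier ones when they define the same
    attribute for the same node.
    """
    merged = {}
    for node_data in node_data_files:
        for name, attrs in node_data.get("nodes", {}).items():
            merged.setdefault(name, {}).update(attrs)
    return {"nodes": merged}


# Illustrative stand-ins for the three results/*.json inputs.
branch_lengths = {"nodes": {"NODE_1": {"branch_length": 0.002}}}
nt_muts = {"nodes": {"NODE_1": {"muts": ["A123G"]}}}
aa_muts = {"nodes": {"NODE_1": {"aa_muts": {"N": ["K5R"]}}}}

combined = merge_node_data(branch_lengths, nt_muts, aa_muts)
print(combined["nodes"]["NODE_1"])
```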
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
""" | ||
This part of the workflow prepares sequences for constructing the phylogenetic tree. | ||
See Augur's usage docs for these commands for more details. | ||
""" | ||
rule download: | ||
    """Downloading sequences and metadata from data.nextstrain.org""" | ||
    output: | ||
        sequences = "data/sequences.fasta.zst", | ||
        metadata = "data/metadata.tsv.zst" | ||
    params: | ||
        sequences_url = "https://data.nextstrain.org/files/measles/sequences.fasta.zst", | ||
        metadata_url = "https://data.nextstrain.org/files/measles/metadata.tsv.zst" | ||
    shell: | ||
        """ | ||
        curl -fsSL --compressed {params.sequences_url:q} --output {output.sequences} | ||
        curl -fsSL --compressed {params.metadata_url:q} --output {output.metadata} | ||
        """ | ||
|
||
rule decompress: | ||
    """Decompressing sequences and metadata""" | ||
    input: | ||
        sequences = "data/sequences.fasta.zst", | ||
        metadata = "data/metadata.tsv.zst" | ||
    output: | ||
        sequences = "data/sequences.fasta", | ||
        metadata = "data/metadata.tsv" | ||
    shell: | ||
        """ | ||
        zstd -d -c {input.sequences} > {output.sequences} | ||
        zstd -d -c {input.metadata} > {output.metadata} | ||
        """ | ||
|
||
rule filter: | ||
    """ | ||
    Filtering to | ||
      - {params.sequences_per_group} sequence(s) per {params.group_by!s} | ||
      - from {params.min_date} onwards | ||
      - excluding strains in {input.exclude} | ||
      - minimum genome length of {params.min_length} | ||
    """ | ||
    input: | ||
        sequences = "data/sequences.fasta", | ||
        metadata = "data/metadata.tsv", | ||
        exclude = config["files"]["exclude"] | ||
    output: | ||
        sequences = "results/filtered.fasta" | ||
    params: | ||
        group_by = config["filter"]["group_by"], | ||
        sequences_per_group = config["filter"]["sequences_per_group"], | ||
        min_date = config["filter"]["min_date"], | ||
        min_length = config["filter"]["min_length"] | ||
    shell: | ||
        """ | ||
        augur filter \ | ||
            --sequences {input.sequences} \ | ||
            --metadata {input.metadata} \ | ||
            --exclude {input.exclude} \ | ||
            --output {output.sequences} \ | ||
            --group-by {params.group_by} \ | ||
            --sequences-per-group {params.sequences_per_group} \ | ||
            --min-date {params.min_date} \ | ||
            --min-length {params.min_length} | ||
        """ | ||
|
||
rule align: | ||
    """ | ||
    Aligning sequences to {input.reference} | ||
      - filling gaps with N | ||
    """ | ||
    input: | ||
        sequences = "results/filtered.fasta", | ||
        reference = config["files"]["reference"] | ||
    output: | ||
        alignment = "results/aligned.fasta" | ||
    shell: | ||
        """ | ||
        augur align \ | ||
            --sequences {input.sequences} \ | ||
            --reference-sequence {input.reference} \ | ||
            --output {output.alignment} \ | ||
            --fill-gaps \ | ||
            --remove-reference | ||
        """ | ||
|
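Note on the `filter` rule: its docstring lists four criteria (a cap per group, a minimum date, an exclusion list and a minimum genome length). The Python sketch below applies the same criteria to a few made-up metadata records. The `filter_records` function, the record fields and the sample values are hypothetical. It keeps the first records seen in each group and compares complete ISO dates as strings, which is a simplification: `augur filter` has its own rules for choosing within a group and for incomplete dates.

```python
from collections import defaultdict


def filter_records(records, group_by, sequences_per_group,
                   min_date, min_length, exclude=()):
    """Return the strain names that pass all filters, in input order."""
    kept = []
    per_group = defaultdict(int)
    for record in records:
        if record["strain"] in exclude:
            continue
        if record["length"] < min_length:
            continue
        # Complete ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if record["date"] < min_date:
            continue
        key = tuple(record[column] for column in group_by)
        if per_group[key] >= sequences_per_group:
            continue  # this group has already reached its cap
        per_group[key] += 1
        kept.append(record["strain"])
    return kept


# Made-up records for illustration only.
records = [
    {"strain": "A", "country": "USA", "date": "2001-05-01", "length": 15894},
    {"strain": "B", "country": "USA", "date": "2003-02-10", "length": 15894},
    {"strain": "C", "country": "USA", "date": "1940-01-01", "length": 15894},
    {"strain": "D", "country": "Japan", "date": "2010-07-07", "length": 900},
    {"strain": "E", "country": "Japan", "date": "2012-03-03", "length": 15000},
    {"strain": "F", "country": "Japan", "date": "2013-03-03", "length": 15000},
]

print(filter_records(records, group_by=("country",), sequences_per_group=1,
                     min_date="1950-01-01", min_length=5000, exclude={"F"}))
```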