Merge pull request #18 from nextstrain/add-phylogenetic

Add phylogenetic directory
nextstrain · Mar 1, 2024 · 9377ef6
2 parents 2a3b417 + 17beea0
Showing 20 changed files with 356 additions and 264 deletions.
diff --git a/.github/workflows/ci.yaml b/.github/workflows/ci.yaml
@@ -5,5 +5,24 @@ on:
   - pull_request
 
 jobs:
-  ci:
-    uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@master
+  pathogen-ci:
+    strategy:
+      matrix:
+        runtime: [docker, conda]
+    permissions:
+      id-token: write
+    uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
+    secrets: inherit
+    with:
+      runtime: ${{ matrix.runtime }}
+      run: |
+        nextstrain build \
+          phylogenetic \
+          --configfile build-configs/ci/config.yaml
+      artifact-name: output-${{ matrix.runtime }}
+      artifact-paths: |
+        phylogenetic/auspice/
+        phylogenetic/results/
+        phylogenetic/benchmarks/
+        phylogenetic/logs/
+        phylogenetic/.snakemake/log/
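
The `strategy.matrix` added above runs the reusable build workflow once per runtime, and each run uploads its own artifact. As a rough illustration only (this is not how GitHub Actions is implemented, and `expand_matrix` and `artifact_name` are hypothetical helper names), the expansion can be sketched in Python:

```python
from itertools import product


def expand_matrix(matrix):
    """Return one dict per combination of matrix values (simple matrices only)."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


def artifact_name(job):
    """Mirror the `artifact-name: output-${{ matrix.runtime }}` template."""
    return f"output-{job['runtime']}"


# The matrix from the workflow diff above.
jobs = expand_matrix({"runtime": ["docker", "conda"]})
names = [artifact_name(job) for job in jobs]
print(names)
```

Because the artifact name includes the runtime, the Docker and Conda runs do not overwrite each other's uploaded outputs.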
diff --git a/CHANGES.md b/CHANGES.md
@@ -1,3 +1,4 @@
 # CHANGELOG
-* 11 January 2024: Use a config file to define hardcoded parameters and file paths, add a change log. [PR #9](https://github.com/nextstrain/measles/pull/9)
+* 1 March 2024: Add phylogenetic directory to follow the pathogen-repo-guide, and update the CI workflow to match the new file structure. [PR #18](https://github.com/nextstrain/measles/pull/18)
 * 14 February 2024: Add ingest directory from pathogen-repo-guide and make measles-specific modifications. [PR #10](https://github.com/nextstrain/measles/pull/10)
+* 11 January 2024: Use a config file to define hardcoded parameters and file paths, and add a change log. [PR #9](https://github.com/nextstrain/measles/pull/9)
diff --git a/README.md b/README.md
@@ -1,67 +1,25 @@
-# nextstrain.org/measles
+# Nextstrain repository for measles virus
 
-This is the [Nextstrain](https://nextstrain.org) build for measles virus, visible at
-[nextstrain.org/measles](https://nextstrain.org/measles).
+This repository contains two workflows for the analysis of measles virus data:
 
-The build encompasses fetching data, preparing it for analysis, doing quality
-control, performing analyses, and saving the results in a format suitable for
-visualization (with [auspice][]).  This involves running components of
-Nextstrain such as [augur][].
+- [`ingest/`](./ingest) - Download data from GenBank, clean and curate it
+- [`phylogenetic/`](./phylogenetic) - Filter sequences, align, construct phylogeny and export for visualization
 
-All measles-specific steps and functionality for the Nextstrain pipeline should be
-housed in this repository.
+Each folder contains a README.md with more information. The results of running both workflows are publicly visible at [nextstrain.org/measles](https://nextstrain.org/measles).
 
-[![Build Status](https://github.com/nextstrain/measles/actions/workflows/ci.yaml/badge.svg?branch=main)](https://github.com/nextstrain/measles/actions/workflows/ci.yaml)
+## Installation
 
-## Usage
+Follow the [standard installation instructions](https://docs.nextstrain.org/en/latest/install.html) for Nextstrain's suite of software tools.
 
-If you're unfamiliar with Nextstrain builds, you may want to follow our
-[quickstart guide][] first and then come back here.
+## Quickstart
 
-The easiest way to run this pathogen build is using the [Nextstrain
-command-line tool][nextstrain-cli]:
+Run the default phylogenetic workflow via:
+```
+cd phylogenetic/
+nextstrain build .
+nextstrain view .
+```
 
-    nextstrain build .
+## Documentation
 
-See the [nextstrain-cli README][] for how to install the `nextstrain` command.
-
-Alternatively, you should be able to run the build using `snakemake` within a
-suitably-configured local environment.  Details of setting that up are not yet
-well-documented, but will be in the future.
-
-Build output goes into the directories `data/`, `results/` and `auspice/`.
-
-Once you've run the build, you can view the results in auspice:
-
-    nextstrain view auspice/
-
-
-## Configuration
-
-Configuration takes place entirely with the `Snakefile`. This can be read top-to-bottom, each rule
-specifies its file inputs and output and also its parameters. There is little redirection and each
-rule should be able to be reasoned with on its own.
-
-<!--
-### fauna / RethinkDB credentials
-
-This build starts by pulling sequences from our live [fauna][] database (a RethinkDB instance). This
-requires environment variables `RETHINK_HOST` and `RETHINK_AUTH_KEY` to be set.
--->
-
-If you don't have access to our https endpoints, you can run the build using the
-example data provided in this repository.  Before running the build, copy the
-example sequences into the `data/` directory like so:
-
-    mkdir -p data/
-    cp example_data/* data/.
-
-
-[Nextstrain]: https://nextstrain.org
-<!-- [fauna]: https://github.com/nextstrain/fauna -->
-[augur]: https://github.com/nextstrain/augur
-[auspice]: https://github.com/nextstrain/auspice
-[snakemake cli]: https://snakemake.readthedocs.io/en/stable/executable.html#all-options
-[nextstrain-cli]: https://github.com/nextstrain/cli
-[nextstrain-cli README]: https://github.com/nextstrain/cli/blob/master/README.md
-[quickstart guide]: https://nextstrain.org/docs/getting-started/quickstart
+- [Running a pathogen workflow](https://docs.nextstrain.org/en/latest/tutorials/running-a-workflow.html)
diff --git a/Snakefile b/Snakefile
diff --git a/nextstrain-pathogen.yaml b/nextstrain-pathogen.yaml
@@ -0,0 +1,5 @@
+# This is currently an empty file that marks the top level of the pathogen repo.
+# Including this file allows the Nextstrain CLI to run
+# `nextstrain build` from any directory regardless of runtime.
+#
+# See https://github.com/nextstrain/cli/releases/tag/8.2.0 for more details.
diff --git a/phylogenetic/README.md b/phylogenetic/README.md
@@ -0,0 +1,50 @@
+# nextstrain.org/measles
+
+This is the [Nextstrain](https://nextstrain.org) build for measles, visible at
+[nextstrain.org/measles](https://nextstrain.org/measles).
+
+## Software requirements
+
+Follow the [standard installation instructions](https://docs.nextstrain.org/en/latest/install.html)
+for Nextstrain's suite of software tools.
+
+## Usage
+
+If you're unfamiliar with Nextstrain builds, you may want to follow our
+[Running a Pathogen Workflow guide](https://docs.nextstrain.org/en/latest/tutorials/running-a-workflow.html) first and then come back here.
+
+The easiest way to run this pathogen build is using the Nextstrain
+command-line tool from within the `phylogenetic/` directory:
+
+    cd phylogenetic/
+    nextstrain build .
+
+Build output goes into the directories `data/`, `results/` and `auspice/`.
+
+Once you've run the build, you can view the results with:
+
+    nextstrain view .
+
+## Configuration
+
+Default parameters and file paths are defined in `defaults/config.yaml`. The
+`Snakefile` and the rule files it includes from `rules/` can be read
+top-to-bottom: each rule specifies its file inputs, outputs, and parameters.
+There is little redirection, so each rule can be understood on its own.
+
+### Using GenBank data
+
+This build starts by pulling preprocessed sequence and metadata files from:
+
+* https://data.nextstrain.org/files/measles/sequences.fasta.zst
+* https://data.nextstrain.org/files/measles/metadata.tsv.zst
+
+The above datasets have been preprocessed and cleaned from GenBank.
+
+### Using example data
+
+Alternatively, you can run the build using the example data provided in
+this repository. The CI config file adds a rule that copies the example
+sequences and metadata into the `data/` directory before the build runs:
+
+    nextstrain build . --configfile build-configs/ci/config.yaml
diff --git a/phylogenetic/Snakefile b/phylogenetic/Snakefile
@@ -0,0 +1,24 @@
+configfile: "defaults/config.yaml"
+
+rule all:
+    input:
+        auspice_json = "auspice/measles.json",
+
+include: "rules/prepare_sequences.smk"
+include: "rules/construct_phylogeny.smk"
+include: "rules/annotate_phylogeny.smk"
+include: "rules/export.smk"
+
+# Include custom rules defined in the config.
+if "custom_rules" in config:
+    for rule_file in config["custom_rules"]:
+
+        include: rule_file
+
+rule clean:
+    """Removing directories: {params}"""
+    params:
+        "results",
+        "auspice"
+    shell:
+        "rm -rfv {params}"
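
The `custom_rules` block in the Snakefile above lets a config file append extra rule files after the four default includes. A minimal Python sketch of that selection logic, assuming the config is a plain dict (`rule_files_to_include` is a hypothetical name; Snakemake itself performs the actual `include`):

```python
# Default rule files, in the order the Snakefile includes them.
DEFAULT_RULE_FILES = [
    "rules/prepare_sequences.smk",
    "rules/construct_phylogeny.smk",
    "rules/annotate_phylogeny.smk",
    "rules/export.smk",
]


def rule_files_to_include(config):
    """Return the default rule files followed by any config-defined custom rules."""
    files = list(DEFAULT_RULE_FILES)
    if "custom_rules" in config:
        files.extend(config["custom_rules"])
    return files


# Config equivalent to build-configs/ci/config.yaml from this PR.
ci_config = {"custom_rules": ["build-configs/ci/copy_example_data.smk"]}

print(rule_files_to_include({}))         # defaults only
print(rule_files_to_include(ci_config))  # defaults plus the CI rule file
```

Custom rule paths are relative to the `phylogenetic/` directory, as the CI config comments note.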
diff --git a/phylogenetic/build-configs/ci/config.yaml b/phylogenetic/build-configs/ci/config.yaml
@@ -0,0 +1,7 @@
+# This configuration file contains the custom configuration parameters
+# for the CI workflow to run with the example data.
+
+# Custom rules to run as part of the CI automated workflow
+# The paths should be relative to the phylogenetic directory.
+custom_rules:
+  - build-configs/ci/copy_example_data.smk
diff --git a/phylogenetic/build-configs/ci/copy_example_data.smk b/phylogenetic/build-configs/ci/copy_example_data.smk
@@ -0,0 +1,17 @@
+rule copy_example_data:
+    input:
+        sequences="example_data/sequences.fasta",
+        metadata="example_data/metadata.tsv",
+    output:
+        sequences="data/sequences.fasta",
+        metadata="data/metadata.tsv",
+    shell:
+        """
+        cp -f {input.sequences} {output.sequences}
+        cp -f {input.metadata} {output.metadata}
+        """
+
+# Add a Snakemake ruleorder directive here if you need to resolve ambiguous rules
+# that have the same output as the copy_example_data rule.
+
+ruleorder: copy_example_data > decompress
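
Both `copy_example_data` and `decompress` declare `data/sequences.fasta` and `data/metadata.tsv` as outputs, which is why the `ruleorder` directive is needed. The sketch below is a simplified model of that tie-break, not Snakemake's actual resolution algorithm; `pick_producer` is a hypothetical helper:

```python
def pick_producer(target, producers, ruleorder):
    """Pick the rule that should create `target`.

    producers: dict of rule name -> list of output paths
    ruleorder: rule names from highest to lowest priority
    """
    candidates = [rule for rule, outputs in producers.items() if target in outputs]
    if not candidates:
        raise KeyError(f"no rule produces {target}")
    if len(candidates) == 1:
        return candidates[0]
    # More than one rule can make the file: fall back to the declared order.
    for rule in ruleorder:
        if rule in candidates:
            return rule
    raise ValueError(f"ambiguous rules for {target}: {candidates}")


outputs = ["data/sequences.fasta", "data/metadata.tsv"]
default_producers = {"decompress": outputs}
ci_producers = {"decompress": outputs, "copy_example_data": outputs}
order = ["copy_example_data", "decompress"]

print(pick_producer("data/metadata.tsv", default_producers, order))
print(pick_producer("data/metadata.tsv", ci_producers, order))
```

With the CI config loaded, the example data takes precedence, so the `download` and `decompress` rules are never triggered.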
diff --git a/config/auspice_config.json → phylogenetic/defaults/auspice_config.json b/config/auspice_config.json → phylogenetic/defaults/auspice_config.json
diff --git a/config/colors.tsv → phylogenetic/defaults/colors.tsv b/config/colors.tsv → phylogenetic/defaults/colors.tsv
diff --git a/config/config.yaml → phylogenetic/defaults/config.yaml b/config/config.yaml → phylogenetic/defaults/config.yaml
@@ -1,8 +1,8 @@
 files:
-    exclude: "config/dropped_strains.txt"
-    reference: "config/measles_reference.gb"
-    colors: "config/colors.tsv"
-    auspice_config: "config/auspice_config.json"
+    exclude: "defaults/dropped_strains.txt"
+    reference: "defaults/measles_reference.gb"
+    colors: "defaults/colors.tsv"
+    auspice_config: "defaults/auspice_config.json"
 filter: 
     group_by: "country year month"
     sequences_per_group: 20

diff --git a/config/dropped_strains.txt → phylogenetic/defaults/dropped_strains.txt b/config/dropped_strains.txt → phylogenetic/defaults/dropped_strains.txt
diff --git a/config/measles_reference.gb → phylogenetic/defaults/measles_reference.gb b/config/measles_reference.gb → phylogenetic/defaults/measles_reference.gb
diff --git a/example_data/metadata.tsv → phylogenetic/example_data/metadata.tsv b/example_data/metadata.tsv → phylogenetic/example_data/metadata.tsv
diff --git a/example_data/sequences.fasta → phylogenetic/example_data/sequences.fasta b/example_data/sequences.fasta → phylogenetic/example_data/sequences.fasta
diff --git a/phylogenetic/rules/annotate_phylogeny.smk b/phylogenetic/rules/annotate_phylogeny.smk
@@ -0,0 +1,41 @@
+"""
+This part of the workflow creates additional annotations for the phylogenetic tree.
+
+See Augur's usage docs for these commands for more details.
+
+"""
+
+rule ancestral:
+    """Reconstructing ancestral sequences and mutations"""
+    input:
+        tree = "results/tree.nwk",
+        alignment = "results/aligned.fasta"
+    output:
+        node_data = "results/nt_muts.json"
+    params:
+        inference = config["ancestral"]["inference"]
+    shell:
+        """
+        augur ancestral \
+            --tree {input.tree} \
+            --alignment {input.alignment} \
+            --output-node-data {output.node_data} \
+            --inference {params.inference}
+        """
+
+rule translate:
+    """Translating amino acid sequences"""
+    input:
+        tree = "results/tree.nwk",
+        node_data = "results/nt_muts.json",
+        reference = config["files"]["reference"]
+    output:
+        node_data = "results/aa_muts.json"
+    shell:
+        """
+        augur translate \
+            --tree {input.tree} \
+            --ancestral-sequences {input.node_data} \
+            --reference-sequence {input.reference} \
+            --output {output.node_data}
+        """
diff --git a/phylogenetic/rules/construct_phylogeny.smk b/phylogenetic/rules/construct_phylogeny.smk
@@ -0,0 +1,53 @@
+"""
+This part of the workflow constructs the phylogenetic tree.
+
+See Augur's usage docs for these commands for more details.
+"""
+
+rule tree:
+    """Building tree"""
+    input:
+        alignment = "results/aligned.fasta"
+    output:
+        tree = "results/tree_raw.nwk"
+    shell:
+        """
+        augur tree \
+            --alignment {input.alignment} \
+            --output {output.tree}
+        """
+
+rule refine:
+    """
+    Refining tree
+      - estimate timetree
+      - use {params.coalescent} coalescent timescale
+      - estimate {params.date_inference} node dates
+      - filter tips more than {params.clock_filter_iqd} IQDs from clock expectation
+    """
+    input:
+        tree = "results/tree_raw.nwk",
+        alignment = "results/aligned.fasta",
+        metadata = "data/metadata.tsv"
+    output:
+        tree = "results/tree.nwk",
+        node_data = "results/branch_lengths.json"
+    params:
+        coalescent = config["refine"]["coalescent"],
+        date_inference = config["refine"]["date_inference"],
+        clock_filter_iqd = config["refine"]["clock_filter_iqd"]
+    shell:
+        """
+        augur refine \
+            --tree {input.tree} \
+            --alignment {input.alignment} \
+            --metadata {input.metadata} \
+            --output-tree {output.tree} \
+            --output-node-data {output.node_data} \
+            --timetree \
+            --coalescent {params.coalescent} \
+            --date-confidence \
+            --date-inference {params.date_inference} \
+            --clock-filter-iqd {params.clock_filter_iqd}
+        """
+
diff --git a/phylogenetic/rules/export.smk b/phylogenetic/rules/export.smk
@@ -0,0 +1,31 @@
+"""
+This part of the workflow collects the phylogenetic tree and annotations to
+export a Nextstrain dataset.
+
+See Augur's usage docs for these commands for more details.
+"""
+
+rule export:
+    """Exporting data files for auspice"""
+    input:
+        tree = "results/tree.nwk",
+        metadata = "data/metadata.tsv",
+        branch_lengths = "results/branch_lengths.json",
+        nt_muts = "results/nt_muts.json",
+        aa_muts = "results/aa_muts.json",
+        colors = config["files"]["colors"],
+        auspice_config = config["files"]["auspice_config"]
+    output:
+        auspice_json = rules.all.input.auspice_json
+    shell:
+        """
+        augur export v2 \
+            --tree {input.tree} \
+            --metadata {input.metadata} \
+            --node-data {input.branch_lengths} {input.nt_muts} {input.aa_muts} \
+            --colors {input.colors} \
+            --auspice-config {input.auspice_config} \
+            --include-root-sequence \
+            --output {output.auspice_json}
+        """
+
diff --git a/phylogenetic/rules/prepare_sequences.smk b/phylogenetic/rules/prepare_sequences.smk
@@ -0,0 +1,85 @@
+"""
+This part of the workflow prepares sequences for constructing the phylogenetic tree.
+
+See Augur's usage docs for these commands for more details.
+"""
+rule download:
+    """Downloading sequences and metadata from data.nextstrain.org"""
+    output:
+        sequences = "data/sequences.fasta.zst",
+        metadata = "data/metadata.tsv.zst"
+    params:
+        sequences_url = "https://data.nextstrain.org/files/measles/sequences.fasta.zst",
+        metadata_url = "https://data.nextstrain.org/files/measles/metadata.tsv.zst"
+    shell:
+        """
+        curl -fsSL --compressed {params.sequences_url:q} --output {output.sequences}
+        curl -fsSL --compressed {params.metadata_url:q} --output {output.metadata}
+        """
+
+rule decompress:
+    """Decompressing sequences and metadata"""
+    input:
+        sequences = "data/sequences.fasta.zst",
+        metadata = "data/metadata.tsv.zst"
+    output:
+        sequences = "data/sequences.fasta",
+        metadata = "data/metadata.tsv"
+    shell:
+        """
+        zstd -d -c {input.sequences} > {output.sequences}
+        zstd -d -c {input.metadata} > {output.metadata}
+        """
+
+rule filter:
+    """
+    Filtering to
+      - {params.sequences_per_group} sequence(s) per {params.group_by!s}
+      - from {params.min_date} onwards
+      - excluding strains in {input.exclude}
+      - minimum genome length of {params.min_length}
+    """
+    input:
+        sequences = "data/sequences.fasta",
+        metadata = "data/metadata.tsv",
+        exclude = config["files"]["exclude"]
+    output:
+        sequences = "results/filtered.fasta"
+    params:
+        group_by = config["filter"]["group_by"],
+        sequences_per_group = config["filter"]["sequences_per_group"],
+        min_date = config["filter"]["min_date"],
+        min_length = config["filter"]["min_length"]
+    shell:
+        """
+        augur filter \
+            --sequences {input.sequences} \
+            --metadata {input.metadata} \
+            --exclude {input.exclude} \
+            --output {output.sequences} \
+            --group-by {params.group_by} \
+            --sequences-per-group {params.sequences_per_group} \
+            --min-date {params.min_date} \
+            --min-length {params.min_length}
+        """
+
+rule align:
+    """
+    Aligning sequences to {input.reference}
+      - filling gaps with N
+    """
+    input:
+        sequences = "results/filtered.fasta",
+        reference = config["files"]["reference"]
+    output:
+        alignment = "results/aligned.fasta"
+    shell:
+        """
+        augur align \
+            --sequences {input.sequences} \
+            --reference-sequence {input.reference} \
+            --output {output.alignment} \
+            --fill-gaps \
+            --remove-reference
+        """
+
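
Taken together, the four rule files form a single dependency chain from download to export. As an illustrative sketch (Snakemake builds its own DAG; this only mirrors the input and output paths shown in the diffs above, and omits static config-driven inputs such as the reference and colors files), the execution order can be derived with Python's standard `graphlib`:

```python
from graphlib import TopologicalSorter

# Rule name -> generated inputs and outputs, copied from the rule files in this PR.
RULES = {
    "download": {"inputs": [],
                 "outputs": ["data/sequences.fasta.zst", "data/metadata.tsv.zst"]},
    "decompress": {"inputs": ["data/sequences.fasta.zst", "data/metadata.tsv.zst"],
                   "outputs": ["data/sequences.fasta", "data/metadata.tsv"]},
    "filter": {"inputs": ["data/sequences.fasta", "data/metadata.tsv"],
               "outputs": ["results/filtered.fasta"]},
    "align": {"inputs": ["results/filtered.fasta"],
              "outputs": ["results/aligned.fasta"]},
    "tree": {"inputs": ["results/aligned.fasta"],
             "outputs": ["results/tree_raw.nwk"]},
    "refine": {"inputs": ["results/tree_raw.nwk", "results/aligned.fasta", "data/metadata.tsv"],
               "outputs": ["results/tree.nwk", "results/branch_lengths.json"]},
    "ancestral": {"inputs": ["results/tree.nwk", "results/aligned.fasta"],
                  "outputs": ["results/nt_muts.json"]},
    "translate": {"inputs": ["results/tree.nwk", "results/nt_muts.json"],
                  "outputs": ["results/aa_muts.json"]},
    "export": {"inputs": ["results/tree.nwk", "data/metadata.tsv",
                          "results/branch_lengths.json", "results/nt_muts.json",
                          "results/aa_muts.json"],
               "outputs": ["auspice/measles.json"]},
}


def execution_order(rules):
    """Order rules so that every rule runs after the rules producing its inputs."""
    producer = {out: name for name, spec in rules.items() for out in spec["outputs"]}
    graph = {name: {producer[path] for path in spec["inputs"] if path in producer}
             for name, spec in rules.items()}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(RULES)
print(" -> ".join(order))
```

Each rule here depends on the one before it, so the order is fully determined; `rule all` simply requests the final `auspice/measles.json`.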