Merge pull request #126 from nextstrain/james/D1.1-genome
H5N1 D1.1 genome build
jameshadfield authored Feb 23, 2025
2 parents 88d3b1c + dd107cd commit ba8d73c
Showing 106 changed files with 3,905 additions and 16 deletions.
10 changes: 10 additions & 0 deletions README.md
@@ -63,6 +63,16 @@ This should allow any reassortments to be highlighted and will also include outb

> Note that generating any segment-level build here will necessarily build the genome tree, as it's needed to identify the clade of interest in each segment.
## H5N1 D1.1 Cattle outbreak (2025)

> This build is a work in progress and relies on recent improvements to ingest which add GenoFLU constellations to the metadata TSV.

The H5N1-D1.1/genome build uses a similar approach to the H5N1 Cattle Outbreak build above, but restricts the data to samples with a GenoFLU constellation of D1.1.

```bash
snakemake --cores 1 -pf --configfile config/h5n1-d1.1.yaml
```


## Creating a custom build
The easiest way to generate your own custom avian-flu build is to use the quickstart-build as a starting template. Simply clone the quickstart-build, run it with the example data, and edit the Snakefile to customize. This build includes example data and a simplified, heavily annotated Snakefile that goes over the structure of Snakefiles and annotates rules and inputs/outputs that can be modified. This build, with its own README, is available [here](https://github.com/nextstrain/avian-flu/tree/master/quickstart-build).
19 changes: 13 additions & 6 deletions Snakefile
@@ -63,6 +63,12 @@ files = rules.files.params


def subtypes_by_subtype_wildcard(wildcards):

+# TODO XXX - move to configs (started in https://github.com/nextstrain/avian-flu/pull/104, but
+# we should make the entire query config-definable)
+if wildcards.subtype == 'h5n1-d1.1':
+return "genoflu in 'D1.1'"

db = {
'h5nx': ['h5n1', 'h5n2', 'h5n3', 'h5n4', 'h5n5', 'h5n6', 'h5n7', 'h5n8', 'h5n9'],
'h5n1': ['h5n1'],
@@ -72,7 +78,7 @@ def subtypes_by_subtype_wildcard(wildcards):
db['h5n1-cattle-outbreak'] = [*db['h5nx']]
assert wildcards.subtype in db, (f"Subtype {wildcards.subtype!r} is not defined in the snakemake function "
"`subtypes_by_subtype_wildcard` -- is there a typo in the subtype you are targetting?")
-return(db[wildcards.subtype])
+return(f"subtype in [{', '.join([repr(s) for s in db[wildcards.subtype]])}]")

class InvalidConfigError(Exception):
pass
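
With this change `subtypes_by_subtype_wildcard` returns a complete `augur filter --query` expression rather than a list of subtypes, and the two filter rules in this commit interpolate it with `--query {params.subtypes!r}`. The following is a minimal standalone sketch of that behaviour, not the workflow's code: `build_subtype_query` and `survives_repr_quoting` are hypothetical names, the function takes a plain string instead of a Snakemake `wildcards` object, and the `db` table holds only the entries visible in this diff (the collapsed lines are not reproduced).

```python
import shlex


def build_subtype_query(subtype):
    """Return a query string for `augur filter --query` (illustrative sketch)."""
    # The D1.1 build filters on the GenoFLU constellation, not on subtype.
    if subtype == "h5n1-d1.1":
        return "genoflu in 'D1.1'"

    # Assumption: only the entries visible in the diff are listed here.
    db = {
        "h5nx": [f"h5n{i}" for i in range(1, 10)],
        "h5n1": ["h5n1"],
    }
    db["h5n1-cattle-outbreak"] = [*db["h5nx"]]

    if subtype not in db:
        raise ValueError(f"Subtype {subtype!r} is not defined")

    quoted = ", ".join(repr(s) for s in db[subtype])
    return f"subtype in [{quoted}]"


def survives_repr_quoting(query):
    """Check that `--query {query!r}` reaches the shell as exactly two tokens."""
    try:
        return shlex.split(f"--query {query!r}") == ["--query", query]
    except ValueError:
        # Unbalanced quotes after repr(): the shell would reject the command.
        return False


for name in ("h5n1", "h5n1-d1.1"):
    query = build_subtype_query(name)
    print(name, "->", query, "| shell-safe via repr:", survives_repr_quoting(query))
```

Both query shapes used here contain only single quotes, so `repr()` wraps them in double quotes and the shell sees one argument. `repr()` is not a general shell-quoting tool, though: a query containing both quote characters would not survive it, and `shlex.quote` from the standard library is the safer choice if queries ever become fully config-definable.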
@@ -233,7 +239,7 @@ rule filter_sequences_by_subtype:
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
---query "subtype in {params.subtypes!r}" \
+--query {params.subtypes!r} \
--output-sequences {output.sequences}
"""

@@ -248,7 +254,7 @@ rule filter_metadata_by_subtype:
"""
augur filter \
--metadata {input.metadata} \
---query "subtype in {params.subtypes!r}" \
+--query {params.subtypes!r} \
--output-metadata {output.metadata}
"""

@@ -633,9 +639,9 @@ rule auspice_config:
import json
with open(input.auspice_config) as fh:
auspice_config = json.load(fh)
-if wildcards.subtype == "h5n1-cattle-outbreak":
+if wildcards.subtype in ["h5n1-cattle-outbreak", "h5n1-d1.1"]:
if wildcards.segment == "genome":
-auspice_config['display_defaults']['distance_measure'] = "num_date"
+auspice_config['display_defaults']['distance_measure'] = "num_date" if wildcards.subtype == "h5n1-cattle-outbreak" else "div"
division_idx = next((i for i,c in enumerate(auspice_config['colorings']) if c['key']=='division'), None)
assert division_idx!=None, "Auspice config did not have a division coloring!"
auspice_config['colorings'].insert(division_idx+1, {
@@ -709,7 +715,8 @@ def auspice_name_to_wildcard_name(wildcards):
return f"results/{subtype}/{segment}/{time}/auspice-dataset.json"
if len(parts)==2:
[subtype, segment] = parts
-assert subtype=='h5n1-cattle-outbreak', "Only h5n1 builds produce an Auspice dataset without a time component in the filename"
+assert subtype=='h5n1-cattle-outbreak' or subtype=='h5n1-d1.1', \
+"Only h5n1 builds produce an Auspice dataset without a time component in the filename"
return f"results/{subtype}/{segment}/default/auspice-dataset.json"
raise Exception("Auspice JSON filename requested with an unexpected number of (underscore-separated) parts")
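
The mapping from an Auspice dataset name to a results path in `auspice_name_to_wildcard_name` can be sketched on its own as follows. This is an illustrative reimplementation under stated assumptions: `dataset_path` and `NO_TIME_BUILDS` are hypothetical names, and the input is assumed to be the underscore-separated name with any filename prefix or extension already removed (the diff does not show how `parts` is derived).

```python
# Builds that produce a dataset without a time component (from the diff).
NO_TIME_BUILDS = {"h5n1-cattle-outbreak", "h5n1-d1.1"}


def dataset_path(name):
    """Map e.g. 'h5n1_ha_2y' or 'h5n1-d1.1_genome' to a results path (sketch)."""
    parts = name.split("_")
    if len(parts) == 3:
        subtype, segment, time = parts
        return f"results/{subtype}/{segment}/{time}/auspice-dataset.json"
    if len(parts) == 2:
        subtype, segment = parts
        if subtype not in NO_TIME_BUILDS:
            raise ValueError(
                f"{subtype!r} does not produce a dataset without a time component"
            )
        return f"results/{subtype}/{segment}/default/auspice-dataset.json"
    raise ValueError("unexpected number of underscore-separated parts")


print(dataset_path("h5n1_ha_2y"))
print(dataset_path("h5n1-d1.1_genome"))
```

Keeping the time-less builds in a set means a further build only needs one new entry, rather than another `or` clause in the assertion.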

3 changes: 3 additions & 0 deletions config/h5n1-cattle-outbreak.yaml
@@ -69,6 +69,9 @@ filter:
exclude_where:
FALLBACK: host=laboratoryderived host=ferret host=unknown host=other host=host gisaid_clade=3C.2

mask:
min_support: 0 # This lets all positions through regardless of how many sequences have a base


refine:
coalescent: const
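
The `mask.min_support` setting is described in the config comments as a percentage threshold: `0` lets every position through, while the `h5n1-d1.1` config in this commit uses `50` to mask positions where fewer than 50% of sequences have a base. The sketch below illustrates that idea on a toy alignment. It is an assumption-laden illustration, not the rule in `rules/cattle-flu.smk`: `mask_low_support` is a hypothetical function, and treating only `A`, `C`, `G`, `T` as "a base" and writing `N` at masked positions are both assumptions.

```python
def mask_low_support(sequences, min_support):
    """Mask alignment columns where < min_support percent of rows have a base."""
    if not sequences:
        return []
    bases = set("ACGT")  # assumption: gaps and ambiguity codes do not count
    n_rows = len(sequences)
    keep = []
    for column in zip(*sequences):
        with_base = sum(1 for char in column if char.upper() in bases)
        keep.append(100 * with_base / n_rows >= min_support)
    # Assumption: masked columns are replaced by 'N' in every sequence.
    return ["".join(c if k else "N" for c, k in zip(seq, keep)) for seq in sequences]


alignment = ["ACGT", "A-GN", "ANG-"]
print(mask_low_support(alignment, 0))   # nothing is masked
print(mask_low_support(alignment, 50))  # columns 2 and 4 fall below 50% support
```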
133 changes: 133 additions & 0 deletions config/h5n1-d1.1.yaml
@@ -0,0 +1,133 @@
#
# TKTK
#
custom_rules:
- "rules/cattle-flu.smk"


#### Parameters which define which builds to produce via this config ###
builds:
h5n1-d1.1: ''

segments:
- genome



# Input source(s) - See README.md for how to use local files instead and/or add additional inputs
inputs:
- name: ncbi
metadata: s3://nextstrain-data/files/workflows/avian-flu/h5n1/metadata.tsv.zst
sequences: s3://nextstrain-data/files/workflows/avian-flu/h5n1/{segment}/sequences.fasta.zst

#### Parameters which control large overarching aspects of the build
# Set a high target_sequences_per_tree to capture all circulating strains, as they will be pruned down
# as part of the workflow
target_sequences_per_tree: 10_000


#### Config files ####
reference: config/h5n1/reference_h5n1_{segment}.gb # use H5N1 references
genome_reference: config/h5n1-cattle-outbreak/h5_cattle_genome_root.gb # use cattle-flu genome reference TODO XXX
auspice_config: config/{subtype}/auspice_config_{subtype}.json
colors: config/h5n1/colors_h5n1.tsv # use H5N1 colors
lat_longs: config/h5n1/lat_longs_h5n1.tsv # use H5N1 lat-longs
include_strains: config/{subtype}/include_strains_{subtype}.txt
# use build-specific dropped strains (resolved via {subtype}) for segment + genome trees
dropped_strains: config/{subtype}/dropped_strains_{subtype}.txt
clades_file: clade-labeling/h5n1-clades.tsv # use H5N1 clades
description: config/{subtype}/description.md


#### Rule-specific parameters ####
filter:
min_length:
FALLBACK:
pb2: 2100
pb1: 2100
pa: 2000
ha: 1600
np: 1400
na: 1270
mp: 900
ns: 800

min_date:
FALLBACK: 2024

group_by:
FALLBACK: false # no grouping during filter

exclude_where:
FALLBACK: host=laboratoryderived host=ferret host=unknown host=other host=host gisaid_clade=3C.2

mask:
min_support: 50 # This masks any position where <50% of sequences have a base

refine:
coalescent: const
date_inference: marginal

genome_clock_filter_iqd:
FALLBACK: 6
clock_filter_iqd:
FALLBACK: false

root:
FALLBACK: false

# For the genome only we use the closest outgroup as the root
# P.S. Make sure this strain is force included via augur filter --include
# (This isn't needed for the segment builds as we include a large enough time span to root via the clock)
genome_root:
FALLBACK: best

segment_lengths:
FALLBACK:
{'pb2': 2341, 'pb1': 2341, 'pa': 2233, 'ha': 1760, 'np': 1565, 'na': 1458, 'mp': 1027, 'ns': 865}

__clock_std_dev: &clock_std_dev 0.00211 # YAML anchor so we can reference this value below

clock_rates:
FALLBACK:
# The rates for the 8 segments are taken from the GISAID H5N1/2y config
pb2: [0.00287, *clock_std_dev]
pb1: [0.00264, *clock_std_dev]
pa: [0.00248, *clock_std_dev]
ha: [0.00455, *clock_std_dev]
np: [0.00252, *clock_std_dev]
na: [0.00349, *clock_std_dev]
mp: [0.00191, *clock_std_dev]
ns: [0.00249, *clock_std_dev]
# the genome clock rate is calculated by a function in the snakemake pipeline
# using the segment rates weighted by their lengths

ancestral:
inference: joint
root_seq:
FALLBACK: false
genome_root_seq:
FALLBACK: config/h5n1-cattle-outbreak/h5_cattle_genome_root.gb

traits:
# genome build has different parameters...
genome_columns:
FALLBACK: division
genome_sampling_bias_correction:
FALLBACK: 5

# segment builds:
columns:
FALLBACK: region country # same as GISAID H5N1 builds
sampling_bias_correction:
FALLBACK: false

# all builds
confidence:
FALLBACK: true

export:
genome_title:
FALLBACK: false
title:
FALLBACK: false
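
The config comments above state that the genome clock rate is calculated in the pipeline from the segment rates weighted by their lengths. As a worked example of that arithmetic, the sketch below computes a length-weighted mean from the `clock_rates` and `segment_lengths` values in this file. `length_weighted_rate` is a hypothetical helper; the pipeline's own function is not shown in this diff and may differ, for instance in how it treats the standard deviation, which is ignored here.

```python
# Values copied from config/h5n1-d1.1.yaml in this commit.
SEGMENT_LENGTHS = {
    "pb2": 2341, "pb1": 2341, "pa": 2233, "ha": 1760,
    "np": 1565, "na": 1458, "mp": 1027, "ns": 865,
}
CLOCK_RATES = {
    "pb2": 0.00287, "pb1": 0.00264, "pa": 0.00248, "ha": 0.00455,
    "np": 0.00252, "na": 0.00349, "mp": 0.00191, "ns": 0.00249,
}


def length_weighted_rate(rates, lengths):
    """Mean of per-segment rates, weighted by segment length (sketch)."""
    total_length = sum(lengths[segment] for segment in rates)
    weighted_sum = sum(rates[segment] * lengths[segment] for segment in rates)
    return weighted_sum / total_length


genome_rate = length_weighted_rate(CLOCK_RATES, SEGMENT_LENGTHS)
print(f"{genome_rate:.5f}")
```

With these inputs the weighted mean comes out at roughly 0.0029, between the slowest segment (`mp`) and the fastest (`ha`), as expected for an average dominated by the longer polymerase segments.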
151 changes: 151 additions & 0 deletions config/h5n1-d1.1/auspice_config_h5n1-d1.1.json
@@ -0,0 +1,151 @@
{
"title": "Full genome analysis of the ongoing influenza A/H5N1 D1.1 outbreak in North America",
"maintainers": [
{"name": "Moncla lab", "url": "https://lmoncla.github.io/monclalab/"},
{"name": "the Nextstrain team", "url": "https://nextstrain.org/team"}
],
"build_url": "https://github.com/nextstrain/avian-flu",
"data_provenance": [
{
"name": "USDA",
"url": "https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1102327"
},
{
"name": "Andersen Lab",
"url": "https://github.com/andersen-lab/avian-influenza/"
},
{
"name": "GenBank",
"url": "https://www.ncbi.nlm.nih.gov/genbank/"
}
],
"extensions": {
"nextclade": {
"pathogen": {
"schemaVersion":"3.0.0",
"defaultCds": "HA",
"cdsOrderPreference":[
"PB2",
"PB1",
"PA",
"HA",
"NP",
"NA",
"M1",
"M2",
"NS1",
"NS2"
],
"attributes": {
"name": "H5N1 D1.1 Genome analysis",
"reference name": "concatenated ancestral sequences",
"reference accession": "none"
}
}
}
},
"colorings": [
{
"key": "gt",
"title": "Genotype",
"type": "categorical"
},
{
"key": "num_date",
"title": "Date",
"type": "continuous"
},
{
"key": "region",
"title": "Region",
"type": "categorical"
},
{
"key": "country",
"title": "Country",
"type": "categorical"
},
{
"key": "division",
"title": "Admin Division",
"type": "categorical"
},
{
"key": "host",
"title": "Host",
"type": "categorical"
},
{
"key": "subtype",
"title": "Subtype",
"type": "categorical"
},
{
"key": "genoflu",
"title": "GenoFLU",
"type": "categorical"
},
{
"key": "h5_label_clade",
"title": "Provisional LABEL Clade",
"type": "categorical"
},
{
"key": "furin_cleavage_motif",
"title": "Furin Cleavage Motif",
"type": "categorical"
},
{
"key": "cleavage_site_sequence",
"title": "Cleavage Site Sequence",
"type": "categorical"
},
{
"key": "author",
"title": "Authors",
"type": "categorical"
},
{
"key": "originating_lab",
"title": "Originating Lab",
"type": "categorical"
},
{
"key": "submitting_lab",
"title": "Submitting Lab",
"type": "categorical"
},
{
"key": "data_source",
"title": "Data Source",
"type": "categorical"
}
],
"geo_resolutions": [
"region",
"country",
"division"
],
"display_defaults": {
"map_triplicate": false,
"color_by": "host",
"geo_resolution": "division",
"distance_measure": "div",
"panels": ["tree", "entropy"]
},
"filters": [
"host",
"region",
"country",
"division",
"subtype",
"author",
"originating_lab",
"submitting_lab",
"data_source"
],
"metadata_columns": [
"genbank_accession",
"sra_accessions"
]
}
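
The `auspice_config` rule in the Snakefile adjusts this JSON at build time: for genome builds it sets `display_defaults.distance_measure` to `num_date` for the cattle outbreak and `div` for D1.1, and it looks up the position of the `division` coloring. A minimal sketch of those two steps on a cut-down config follows. The helper names `patch_distance_measure` and `coloring_index` are hypothetical, and the coloring that the rule inserts after `division` is deliberately left out because its definition is truncated in the diff.

```python
import copy

GENOME_BUILDS = ("h5n1-cattle-outbreak", "h5n1-d1.1")


def patch_distance_measure(auspice_config, subtype, segment):
    """Return a copy with the default distance measure set per build (sketch)."""
    patched = copy.deepcopy(auspice_config)
    if subtype in GENOME_BUILDS and segment == "genome":
        measure = "num_date" if subtype == "h5n1-cattle-outbreak" else "div"
        patched["display_defaults"]["distance_measure"] = measure
    return patched


def coloring_index(auspice_config, key):
    """Index of the coloring with the given key, or None if absent."""
    return next(
        (i for i, c in enumerate(auspice_config["colorings"]) if c["key"] == key),
        None,
    )


# Cut-down stand-in for the JSON above: first five colorings only.
config = {
    "colorings": [{"key": k} for k in ("gt", "num_date", "region", "country", "division")],
    "display_defaults": {"distance_measure": "div"},
}

print(patch_distance_measure(config, "h5n1-cattle-outbreak", "genome")["display_defaults"])
print(coloring_index(config, "division"))
```

Returning a patched copy keeps the loaded config untouched, and comparing the lookup result with `is not None` avoids treating index `0` as missing.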
14 changes: 14 additions & 0 deletions config/h5n1-d1.1/description.md
@@ -0,0 +1,14 @@
We gratefully acknowledge the authors, originating and submitting laboratories of the genetic sequences and metadata for sharing their work. Please note that although data generators have generously shared data in an open fashion, that does not mean there should be free license to publish on this data. Data generators should be cited where possible and collaborations should be sought in some circumstances. Please try to avoid scooping someone else's work. Reach out if uncertain.

Genomic data from the ongoing H5N1 outbreaks in the US was shared by the [National Veterinary Services Laboratories (NVSL)](https://www.aphis.usda.gov/labs/about-nvsl) of the [Animal and Plant Health Inspection Service (APHIS)](https://www.aphis.usda.gov/) of the U.S. Department of Agriculture (USDA) in an open fashion to NCBI GenBank (consensus genomes and complete metadata) and to the SRA (raw reads with redacted metadata) in [BioProject PRJNA1102327](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1102327). Other groups have contributed sequence data here, but the majority of viral genomes have been shared by the USDA. The Andersen Lab has assembled raw reads from this SRA BioProject and publicly shared consensus genomes to [GitHub](https://github.com/andersen-lab/avian-influenza). We thank the USDA for genomic data sharing and the Andersen Lab for sharing assembled consensus genomes.

In this analysis, we've curated data from NCBI GenBank and merged this data with SRA data via the Andersen Lab GitHub repository.
We will make curated sequence & metadata files available shortly. The data source (GenBank vs. SRA-via-Andersen-Lab) is included in this metadata and is available as a [coloring on this page](?c=data_source).

### Limitations
Importantly, SRA-derived genomes only have the year of collection (e.g. 2024-XX-XX or 2025-XX-XX) and "USA" as collection location. In this analysis, we've inferred collection date and collection location for these samples along with confidence in date and location; however, these must be treated with caution. We've added two colorings for geographic division: [one using inferred values](?c=division) and one only reporting [known values](?c=division_metadata). For these reasons we have toggled the map panel off by default.

In addition to this outbreak-specific view, we have broader views of H5N1 evolution available as:
- [nextstrain.org/avian-flu/h5n1/ha/2y](https://nextstrain.org/avian-flu/h5n1/ha/2y)
- [nextstrain.org/avian-flu/h5n1/na/2y](https://nextstrain.org/avian-flu/h5n1/na/2y)
- etc...
Empty file.
Empty file.
1 change: 1 addition & 0 deletions ingest/Snakefile
@@ -21,6 +21,7 @@ rule upload_all:

include: "rules/ingest_fauna.smk"
include: "rules/merge_segment_metadata.smk"
include: "rules/genoflu.smk"
include: "rules/upload_to_s3.smk"

# Allow users to import custom rules provided via the config.