H5N1 D1.1 genome build #126

jameshadfield · 2025-02-13T23:17:31Z

Work in progress

Adds an A/H5N1 D1.1 genome build. Tree available on staging. Salient points:

GenoFLU results are hardcoded, so new strains (in fauna) will never make it into this build without regenerating this TSV. I think the medium-term approach is to add GenoFLU to the ingest workflow. Currently there are n=238 D1.1 strains.
Genome reference is the same as we used for the (H5N1) cattle-flu outbreak
Tree is temporally rooted
The snakemake code is very ad-hoc. Rather than invest time in cleaning it up I think we should be working on top of Rewrite config syntax #104.
Analysis command: snakemake --cores 4 --configfile config/h5n1-d1.1.yaml
I haven't spent much time looking into the data or the tree yet. Don't rely on things being perfect at this stage!

update 1

We now mask out positions where <50% of samples have a base called - 37f11b4 - which fixes the tree topology significantly. These masked bases are completely hidden from the analysis.
We manually correct the collection date for 1 sample - 98a3f5e

lmoncla

Thank you @jameshadfield this looks great! Tree looks reasonable, these edits look sensible, and this all makes sense to me for right now.

I agree that getting GenoFLU into the ingest pipeline will be a good path forward and help separate out sequences for these various concatenated cattle genome trees.
I think using the same genome reference is totally fine for this - nothing in terms of coding region structure has changed to my knowledge
the masking is a good idea and I think the tree looks good

rules/cattle-flu.smk

jameshadfield · 2025-02-17T00:38:46Z

Updated this to use #127, which adds genoFLU genomes at the ingest stage. Very much a WIP - and hard to run! #103 and #104 would really make our life a lot easier.

GISAID analysis, n=236
NCBI / SRA analysis n=209, including the n=5 2025 cattle/human samples

Important caveat: #127 currently uses GenoFLU v1.05, not the latest (v1.06)

jameshadfield · 2025-02-20T02:34:04Z

Updated tree on staging with n=615 tips

This PR is runnable from start-to-finish but you have to run ingest as well to generate the GenoFLU annotations (thanks to @joverlee521 and @jordan-ort).

cd ingest
snakemake --cores 4 -pf ingest_joined_ncbi --configfile build-configs/ncbi/defaults/config.yaml
cd ..
snakemake --cores 4 --configfile config/h5n1-d1.1.yaml -pf

config/h5n1-d1.1.yaml

…vendored-GenoFLU-multi

subrepo:
  subdir:   "ingest/vendored-GenoFLU-multi"
  merged:   "2a548a9"
upstream:
  origin:   "https://github.com/moncla-lab/GenoFLU-multi"
  branch:   "main"
  commit:   "2a548a9"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"

subrepo:
  subdir:   "ingest/vendored-GenoFLU-multi"
  merged:   "4884266"
upstream:
  origin:   "https://github.com/moncla-lab/GenoFLU-multi"
  branch:   "main"
  commit:   "4884266"
git-subrepo:
  version:  "0.4.9"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "c06a924"

Add rules to run the vendored GenoFLU on sequences. Creates a final_metadata.tsv that has a new column `genoflu_genotype`.

This is currently not directly runnable via `nextstrain build` because we do not have the GenoFLU dependencies in the Nextstrain runtimes. I was able to run this locally by installing the genoflu dependencies in my Nextstrain conda runtime with

```
mamba install -c conda-forge -c bioconda genoflu \
    --prefix ~/.nextstrain/runtimes/conda/env/ \
    --platform osx-64
```

Then I could run the new rules for the NCBI data with

```
nextstrain build --conda ingest \
    joined-ncbi/results/final_metadata.tsv \
    --configfile build-configs/ncbi/defaults/config.yaml
```

Combined with the previous commit we now have the ability to produce 'ingest/<data-source>/results/metadata.tsv' with GenoFLU calls. This is config-controllable and is currently set up to run GenoFLU for NCBI & Andersen lab but not for Fauna.

The number of threads for actually running GenoFLU has been increased to 12 as (from my testing) each thread has low CPU & memory usage, so setting a large number of threads (even threads >> cores) improves performance. We should revisit what exactly is happening here.

Note that the filtering approach for fauna may not be correct as implemented here (see <#127 (comment)>); however, fauna is currently not run through GenoFLU.

This builds off the GenoFLU metadata added in previous commits to build a D1.1-specific build. We reuse ~all of the cattle-outbreak machinery but at this time only build the genome trees. Due to the SRA data having limited geographic and temporal metadata we default to the divergence tree and hide the map panel. As of 2025-02-20 the build has 615 genomes of which 412 (2/3rds) have only the collection year and country.
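To get a feel for how much of the metadata is year-only, something like the following could be used to tally date precision. This is a minimal sketch, assuming dates follow the Nextstrain convention of `YYYY-MM-DD` with `XX` for unknown parts; the function name `date_precision` and the example dates are made up for illustration and are not part of this PR.

```python
import re
from collections import Counter

# Matches "YYYY", "YYYY-MM", "YYYY-MM-DD", with "XX" allowed for month/day.
DATE_PATTERN = re.compile(r"(\d{4})(?:-(\d{2}|XX)(?:-(\d{2}|XX))?)?")


def date_precision(date_string):
    """Classify a date string as 'year', 'month', 'day' or 'unknown'."""
    match = DATE_PATTERN.fullmatch(date_string.strip())
    if match is None:
        return "unknown"
    _, month, day = match.groups()
    if month is None or month == "XX":
        return "year"
    if day is None or day == "XX":
        return "month"
    return "day"


# Hypothetical dates, not taken from the real metadata.
example_dates = ["2025-XX-XX", "2025-01-14", "2024-12-XX", "2025"]
precision_counts = Counter(date_precision(d) for d in example_dates)
```

Running the same tally over the `date` column of the metadata TSV would give the breakdown behind numbers like the 412 of 615 quoted above.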

Masks out positions in the genome which have nucleotides called in <50% of samples. This is especially important in genome builds as the terminal ends of segments are no longer terminal and thus sparse sequence data was resulting in artefactual partitioning of the tree. I've left the cattle-flu builds unchanged (i.e. no masking) but we should revisit this.
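For illustration, the masking idea can be sketched in a few lines of Python. This is not the code in the commit, just a minimal version under some assumptions: the input is an already-aligned set of equal-length sequences, only A/C/G/T count as "called" (gaps, N and other ambiguity codes do not), and masked columns are overwritten with N. The function names are hypothetical.

```python
CALLED_BASES = set("ACGT")  # assumption: anything else counts as "not called"


def low_coverage_positions(sequences, min_fraction=0.5):
    """Return 0-based alignment columns where fewer than `min_fraction`
    of the sequences have a called base."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    total = len(sequences)
    positions = []
    for column in range(length):
        called = sum(1 for seq in sequences if seq[column].upper() in CALLED_BASES)
        if called / total < min_fraction:
            positions.append(column)
    return positions


def apply_mask(sequence, positions, mask_char="N"):
    """Replace the given columns of one sequence with `mask_char`."""
    chars = list(sequence)
    for column in positions:
        chars[column] = mask_char
    return "".join(chars)


# Toy alignment: only the last column has a base called in <50% of samples.
alignment = ["ACGT-", "ACNTA", "A-NT-", "ACGTN"]
sites = low_coverage_positions(alignment)
masked_alignment = [apply_mask(seq, sites) for seq in alignment]
```

Note that a column with exactly 50% coverage (the third column in the toy alignment) is kept, matching the "<50%" wording above.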

jameshadfield · 2025-02-23T20:57:58Z

The ingest here adds a genoflu column to the metadata TSV which the phylo workflow relies on.
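As a rough picture of that dependency, selecting strains by genotype from the metadata could look like the sketch below. It assumes a tab-separated file with a `strain` column and a genotype column named `genoflu` (going by the wording above; an earlier commit message calls it `genoflu_genotype`, so check the actual header). The function name and the example rows are invented for illustration.

```python
import csv
import io


def strains_with_genotype(tsv_handle, genotype, column="genoflu", strain_column="strain"):
    """Return strain names whose genotype column equals `genotype`."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    fieldnames = reader.fieldnames or []
    for required in (column, strain_column):
        if required not in fieldnames:
            raise KeyError(f"metadata is missing the {required!r} column")
    return [row[strain_column] for row in reader if row[column] == genotype]


# Invented example rows, not real metadata.
example_tsv = (
    "strain\tgenoflu\n"
    "sample_A\tD1.1\n"
    "sample_B\tB3.13\n"
    "sample_C\tD1.1\n"
)
d11_strains = strains_with_genotype(io.StringIO(example_tsv), "D1.1")
```

If the column is missing (for example when ingest was run without GenoFLU), this fails loudly rather than silently producing an empty build.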

I've uploaded all NCBI ingest files to S3 (via the upload_all_ncbi rule) and also updated the core avian-flu/h5n1-d1.1/genome dataset, now with n=617 genomes (access is pending nextstrain/nextstrain.org#1119).

Note that the automated ingest is currently disabled and the D1.1 phylo has not yet been added to a GitHub action

lmoncla reviewed Feb 14, 2025

AngieHinrichs reviewed Feb 14, 2025

jameshadfield force-pushed the james/D1.1-genome branch from 4543b10 to bae0b1f Compare February 20, 2025 02:08

jameshadfield mentioned this pull request Feb 20, 2025

WIP: Ingest with GenoFLU #127

Closed

1 task

jameshadfield force-pushed the james/D1.1-genome branch from bae0b1f to eee6e0a Compare February 20, 2025 02:16

jameshadfield changed the title ~~WIP H5N1 D1.1 genome build~~ H5N1 D1.1 genome build Feb 20, 2025

This was referenced Feb 20, 2025

Add avian-flu/h5n1-d1.1/genome core build nextstrain/nextstrain.org#1119

Merged

Fix cattle-outbreak strain selection #133

Closed

joverlee521 and others added 7 commits February 21, 2025 15:04

[D1.1] add to README

dd107cd

jameshadfield force-pushed the james/D1.1-genome branch from eee6e0a to dd107cd Compare February 23, 2025 20:47

jameshadfield merged commit ba8d73c into master Feb 23, 2025
6 checks passed

jameshadfield deleted the james/D1.1-genome branch February 23, 2025 20:58
