H5N1 D1.1 genome build #126

Merged
merged 7 commits into from
Feb 23, 2025

jameshadfield
Member

@jameshadfield jameshadfield commented Feb 13, 2025

Work in progress

Builds an A/H5N1 D1.1 genome build. Tree available on staging. Salient points:

  • GenoFLU results are hardcoded, so new strains (in fauna) will never make it into this build without regenerating this TSV. I think the medium-term approach is to add GenoFLU to the ingest workflow. Currently there are n=238 D1.1 strains.
  • Genome reference is the same as we used for the (H5N1) cattle-flu outbreak
  • Tree is temporally rooted
  • The snakemake code is very ad-hoc. Rather than invest time in cleaning it up I think we should be working on top of Rewrite config syntax #104.
  • Analysis command: snakemake --cores 4 --configfile config/h5n1-d1.1.yaml
  • I haven't spent much time looking into the data or the tree yet. Don't rely on things being perfect at this stage!

update 1

  • We now mask out positions where <50% of samples have a base called - 37f11b4 - which fixes the tree topology significantly. These masked bases are completely hidden from the analysis.
  • We manually correct the collection date for 1 sample - 98a3f5e
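The masking rule in update 1 can be sketched roughly as below. This is a minimal illustration and not the code from 37f11b4: the function names are hypothetical, a "called" base is assumed to mean A/C/G/T (so N, gaps and ambiguity codes count as uncalled), and masked sites are replaced with N here, whereas the actual build may hide them by a different mechanism.

```python
from typing import Dict, List

# Assumption: only unambiguous nucleotides count as "called".
CALLED = set("ACGT")


def low_coverage_sites(alignment: Dict[str, str], min_called: float = 0.5) -> List[int]:
    """Return 1-based alignment positions where the fraction of samples
    with a called base is below `min_called`."""
    seqs = list(alignment.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must be aligned to the same length")
    sites = []
    for i in range(length):
        called = sum(1 for s in seqs if s[i].upper() in CALLED)
        if called / len(seqs) < min_called:
            sites.append(i + 1)
    return sites


def mask_sites(seq: str, sites: List[int], mask_char: str = "N") -> str:
    """Replace the given 1-based positions in `seq` with `mask_char`."""
    chars = list(seq)
    for pos in sites:
        chars[pos - 1] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    # Toy alignment: position 1 is called in 1 of 4 samples, position 2 in 2 of 4.
    aln = {"a": "ACGT", "b": "NCGT", "c": "NNGT", "d": "N-GT"}
    sites = low_coverage_sites(aln)
    print(sites)
    print({name: mask_sites(seq, sites) for name, seq in aln.items()})
```

With a strict "<50%" threshold, a position called in exactly half of the samples is kept.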

Collaborator

@lmoncla lmoncla left a comment

Thank you @jameshadfield this looks great! Tree looks reasonable, these edits look sensible, and this all makes sense to me for right now.

  • I agree that getting GenoFLU into the ingest pipeline will be a good path forward and help separate out sequences for these various concatenated cattle genome trees.
  • I think using the same genome reference is totally fine for this - nothing in terms of coding region structure has changed to my knowledge
  • the masking is a good idea and I think the tree looks good

@jameshadfield
Member Author

jameshadfield commented Feb 17, 2025

Updated this to use #127, which adds GenoFLU genomes at the ingest stage. Very much a WIP - and hard to run! #103 and #104 would really make our lives a lot easier.

Important caveat: #127 currently uses GenoFLU v1.05, not the latest (v1.06)

@jameshadfield
Member Author

Updated tree on staging with n=615 tips

This PR is runnable from start-to-finish but you have to run ingest as well to generate the GenoFLU annotations (thanks to @joverlee521 and @jordan-ort).

cd ingest
snakemake --cores 4 -pf ingest_joined_ncbi --configfile build-configs/ncbi/defaults/config.yaml
cd ..
snakemake --cores 4 --configfile config/h5n1-d1.1.yaml -pf

@jameshadfield jameshadfield changed the title WIP H5N1 D1.1 genome build H5N1 D1.1 genome build Feb 20, 2025
joverlee521 and others added 7 commits February 21, 2025 15:04
…vendored-GenoFLU-multi

subrepo:
  subdir:   "ingest/vendored-GenoFLU-multi"
  merged:   "2a548a9"
upstream:
  origin:   "https://github.com/moncla-lab/GenoFLU-multi"
  branch:   "main"
  commit:   "2a548a9"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"

subrepo:
  subdir:   "ingest/vendored-GenoFLU-multi"
  merged:   "4884266"
upstream:
  origin:   "https://github.com/moncla-lab/GenoFLU-multi"
  branch:   "main"
  commit:   "4884266"
git-subrepo:
  version:  "0.4.9"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "c06a924"

Add rules to run the vendored GenoFLU on sequences. Creates a
final_metadata.tsv that has a new column `genoflu_genotype`.

This is currently not directly runnable via `nextstrain build` because
we do not have the GenoFLU dependencies in the Nextstrain runtimes.

I was able to run this locally by installing the genoflu dependencies
in my Nextstrain conda runtime with

```
mamba install -c conda-forge -c bioconda genoflu \
    --prefix ~/.nextstrain/runtimes/conda/env/ \
    --platform osx-64
```

Then I could run the new rules for the NCBI data with
```
nextstrain build --conda ingest \
    joined-ncbi/results/final_metadata.tsv \
    --configfile build-configs/ncbi/defaults/config.yaml
```
Combined with the previous commit we now have the ability to produce
'ingest/<data-source>/results/metadata.tsv' with GenoFLU calls.
This is config-controllable and is currently set up to run GenoFLU
for NCBI & Andersen lab but not for Fauna.

The number of threads for actually running GenoFLU has been increased to
12 as (from my testing) each thread has low CPU & memory usage, so
setting a large number of threads (even threads >> cores) improves
performance. We should revisit what exactly is happening here.

Note that the filtering approach for fauna may not be correct as
implemented here (see <#127 (comment)>); however, fauna is currently
not run through GenoFLU.

This builds off the GenoFLU metadata added in previous commits to build
a D1.1-specific build. We reuse ~all of the cattle-outbreak machinery
but at this time only build the genome trees.

Due to the SRA data having limited geographic and temporal metadata
we default to the divergence tree and hide the map panel. As of
2025-02-20 the build has 615 genomes of which 412 (2/3rds) have only
the collection year and country.
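
A rough way to reproduce the date-precision count quoted above is sketched below. It is an illustration only: it assumes dates follow the Nextstrain convention of marking unknown parts with XX (e.g. 2024-XX-XX), and the function name `date_precision` is hypothetical rather than part of this workflow.

```python
import re
from collections import Counter
from typing import Iterable


def date_precision(date: str) -> str:
    """Classify a date string as 'day', 'month', 'year' or 'unknown'.

    Assumes unknown parts are written as XX, e.g. '2024-XX-XX'.
    """
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", date):
        return "day"
    if re.fullmatch(r"\d{4}-\d{2}(-XX)?", date):
        return "month"
    if re.fullmatch(r"\d{4}(-XX-XX)?", date):
        return "year"
    return "unknown"


def precision_counts(dates: Iterable[str]) -> Counter:
    """Count how many dates fall into each precision class."""
    return Counter(date_precision(d) for d in dates)


if __name__ == "__main__":
    example = ["2025-01-15", "2024-XX-XX", "2024-11-XX", "2024", ""]
    print(precision_counts(example))
    # The figures quoted in the commit message: 412 of 615 genomes.
    print(f"{412 / 615:.0%} of genomes have year-only dates")
```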
Masks out positions in the genome which have nucleotides called in <50%
of samples. This is especially important in genome builds as the
terminal ends of segments are no longer terminal and thus sparse
sequence data was resulting in artefactual partitioning of the tree.

I've left the cattle-flu builds unchanged (i.e. no masking) but we
should revisit this.

@jameshadfield
Member Author

The ingest here adds a genoflu column to the metadata TSV which the phylo workflow relies on.
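
As an illustration of how a downstream step might use that column, here is a minimal sketch that selects D1.1 strains from a metadata TSV. The column names are assumptions: this comment calls the column "genoflu" while the ingest commit message names it `genoflu_genotype`, so the name is a parameter, and `strain` as the ID column is also assumed. The helper itself is hypothetical and is not the phylo workflow's actual filtering code.

```python
import csv
import io
from typing import Iterable, List


def strains_with_genotype(
    tsv_lines: Iterable[str],
    genotype: str = "D1.1",
    column: str = "genoflu",        # assumed column name, may be genoflu_genotype
    strain_column: str = "strain",  # assumed ID column name
) -> List[str]:
    """Return strain names whose genotype column equals `genotype`."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    fields = reader.fieldnames or []
    for required in (column, strain_column):
        if required not in fields:
            raise KeyError(f"metadata has no column named {required!r}")
    return [row[strain_column] for row in reader if row[column] == genotype]


if __name__ == "__main__":
    # Toy metadata; real files would be opened with open(path, newline="").
    metadata = io.StringIO(
        "strain\tdate\tgenoflu\n"
        "sample_1\t2025-01-15\tD1.1\n"
        "sample_2\t2024-XX-XX\tB3.13\n"
        "sample_3\t2024-11-XX\tD1.1\n"
    )
    print(strains_with_genotype(metadata))
```

Failing loudly when the column is missing makes the dependency on the ingest step explicit, which matters here because the phylo workflow cannot run without the GenoFLU annotations.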

I've uploaded all NCBI ingest files to S3 (via the upload_all_ncbi rule) and also updated the core avian-flu/h5n1-d1.1/genome dataset, now with n=617 genomes (access is pending nextstrain/nextstrain.org#1119).

Note that the automated ingest is currently disabled and the D1.1 phylo has not yet been added to a GitHub action

@jameshadfield jameshadfield merged commit ba8d73c into master Feb 23, 2025
6 checks passed
@jameshadfield jameshadfield deleted the james/D1.1-genome branch February 23, 2025 20:58