H5N1 D1.1 genome build #126
Thank you @jameshadfield this looks great! Tree looks reasonable, these edits look sensible, and this all makes sense to me for right now.
- I agree that getting GenoFLU into the ingest pipeline will be a good path forward and help separate out sequences for these various concatenated cattle genome trees.
- I think using the same genome reference is totally fine for this - nothing in terms of coding region structure has changed to my knowledge
- the masking is a good idea and I think the tree looks good
Updated this to use #127, which adds GenoFLU genotypes at the ingest stage. Very much a WIP, and hard to run! #103 and #104 would really make our lives a lot easier.
Important caveat: #127 currently uses GenoFLU v1.05, not the latest (v1.06).
4543b10 to bae0b1f
bae0b1f to eee6e0a
Updated tree on staging with n=615 tips.
This PR is runnable from start to finish, but you have to run ingest as well to generate the GenoFLU annotations (thanks to @joverlee521 and @jordan-ort):
cd ingest
snakemake --cores 4 -pf ingest_joined_ncbi --configfile build-configs/ncbi/defaults/config.yaml
cd ..
snakemake --cores 4 --configfile config/h5n1-d1.1.yaml -pf
…vendored-GenoFLU-multi
subrepo:
  subdir: "ingest/vendored-GenoFLU-multi"
  merged: "2a548a9"
upstream:
  origin: "https://github.com/moncla-lab/GenoFLU-multi"
  branch: "main"
  commit: "2a548a9"
git-subrepo:
  version: "0.4.6"
  origin: "https://github.com/ingydotnet/git-subrepo"
  commit: "110b9eb"
subrepo:
  subdir: "ingest/vendored-GenoFLU-multi"
  merged: "4884266"
upstream:
  origin: "https://github.com/moncla-lab/GenoFLU-multi"
  branch: "main"
  commit: "4884266"
git-subrepo:
  version: "0.4.9"
  origin: "https://github.com/ingydotnet/git-subrepo"
  commit: "c06a924"
Add rules to run the vendored GenoFLU on sequences. Creates a final_metadata.tsv that has a new column `genoflu_genotype`.

This is currently not directly runnable via `nextstrain build` because we do not have the GenoFLU dependencies in the Nextstrain runtimes. I was able to run this locally by installing the GenoFLU dependencies in my Nextstrain conda runtime with

```
mamba install -c conda-forge -c bioconda genoflu \
    --prefix ~/.nextstrain/runtimes/conda/env/ \
    --platform osx-64
```

Then I could run the new rules for the NCBI data with

```
nextstrain build --conda ingest \
    joined-ncbi/results/final_metadata.tsv \
    --configfile build-configs/ncbi/defaults/config.yaml
```
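To illustrate what "a new column `genoflu_genotype`" means for the metadata, here is a minimal Python sketch of appending per-strain genotype calls to a metadata TSV. It is not the rule added in this commit: the `strain` key column, the `add_genotype_column` helper and the in-memory dictionary of calls are all assumptions made for illustration, and only the column name `genoflu_genotype` comes from the commit message.

```python
import csv
import io


def add_genotype_column(metadata_tsv, genotypes, key="strain",
                        column="genoflu_genotype"):
    """Return the metadata TSV text with one extra column of genotype calls.

    `genotypes` maps the value in the `key` column to a genotype string.
    Rows without a call get an empty value (hypothetical convention).
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    fields = list(reader.fieldnames) + [column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = genotypes.get(row[key], "")
        writer.writerow(row)
    return out.getvalue()


# Toy input: two records, only one of which has a genotype call.
metadata = "strain\tdate\nA/x/1\t2025-01-01\nA/y/2\t2024-12-XX\n"
calls = {"A/x/1": "D1.1"}
joined = add_genotype_column(metadata, calls)
print(joined)
```

A left-join style merge like this keeps every metadata row even when no call exists, so downstream filtering on `genoflu_genotype` can decide how to treat uncalled records.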
Combined with the previous commit we now have the ability to produce 'ingest/<data-source>/results/metadata.tsv' with GenoFLU calls. This is config-controllable and is currently set up to run GenoFLU for NCBI & Andersen lab but not for Fauna.

The number of threads for actually running GenoFLU has been increased to 12 as (from my testing) each thread has low CPU & memory usage, so setting a large number of threads (even threads >> cores) improves performance. We should revisit what exactly is happening here.

Note that the filtering approach for fauna may not be correct as implemented here (see <#127 (comment)>); however, fauna is currently not run through GenoFLU.
This builds off the GenoFLU metadata added in previous commits to build a D1.1-specific build. We reuse ~all of the cattle-outbreak machinery but at this time only build the genome trees. Due to the SRA data having limited geographic and temporal metadata we default to the divergence tree and hide the map panel. As of 2025-02-20 the build has 615 genomes of which 412 (2/3rds) have only the collection year and country.
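As a rough way to reproduce a count like "412 of 615 have only the collection year", the sketch below tallies records whose date is year-only. It assumes a `date` column that uses the Nextstrain-style `XX` placeholder for unknown month and day (for example `2025-XX-XX`); the helper names and the toy table are hypothetical and not taken from this PR.

```python
import csv
import io


def is_year_only(date):
    """True for dates such as '2025-XX-XX' or a bare '2025'."""
    parts = date.strip().split("-")
    year_ok = len(parts[0]) == 4 and parts[0].isdigit()
    return year_ok and all(p == "XX" for p in parts[1:])


def count_year_only(metadata_tsv, column="date"):
    """Return (number of year-only records, total records)."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    total = 0
    year_only = 0
    for row in reader:
        total += 1
        if is_year_only(row[column]):
            year_only += 1
    return year_only, total


# Toy metadata table, not real data.
example = (
    "strain\tdate\tcountry\n"
    "s1\t2025-XX-XX\tUSA\n"
    "s2\t2025-01-14\tUSA\n"
    "s3\t2024\tUSA\n"
)
print(count_year_only(example))
```

With the figures quoted above, 412 / 615 is about 0.67, which is why a time-scaled tree would be poorly constrained and the divergence tree is the default.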
Masks out positions in the genome which have nucleotides called in <50% of samples. This is especially important in genome builds as the terminal ends of segments are no longer terminal and thus sparse sequence data was resulting in artefactual partitioning of the tree. I've left the cattle-flu builds unchanged (i.e. no masking) but we should revisit this.
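The masking criterion can be sketched in a few lines of Python. This is an illustration of the idea only, not the rule used in the build: it assumes the sequences are already aligned to equal length, treats only A/C/G/T as "called" (so N, gaps and ambiguity codes count as missing), and the function names and the toy alignment are hypothetical.

```python
def sparse_positions(seqs, min_called=0.5):
    """Return 0-based columns where fewer than `min_called` of the
    sequences have an unambiguous nucleotide (A, C, G or T)."""
    called = set("ACGT")
    n = len(seqs)
    length = len(seqs[0])
    sites = []
    for i in range(length):
        n_called = sum(1 for s in seqs if s[i].upper() in called)
        if n_called / n < min_called:
            sites.append(i)
    return sites


def mask(seq, sites, mask_char="N"):
    """Replace the given 0-based positions in `seq` with `mask_char`."""
    chars = list(seq)
    for i in sites:
        chars[i] = mask_char
    return "".join(chars)


# Toy alignment of four sequences (not real data).
alignment = {
    "a": "NNACGT-A",
    "b": "-NACGTNA",
    "c": "ANACGTNN",
    "d": "ACACGTNC",
}
sites = sparse_positions(list(alignment.values()))
masked = {name: mask(seq, sites) for name, seq in alignment.items()}
print(sites, masked["d"])
```

In the toy alignment the first column is called in exactly half of the sequences, so it is kept; only columns called in strictly fewer than 50% of samples are masked, matching the "<50%" wording above.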
eee6e0a to dd107cd
The ingest here adds a … I've uploaded all NCBI ingest files to S3 (via the …). Note that the automated ingest is currently disabled and the D1.1 phylo has not yet been added to a GitHub action.
Work in progress
Builds an A/H5N1 D1.1 genome build. Tree available on staging. Salient points:
snakemake --cores 4 --configfile config/h5n1-d1.1.yaml
update 1