Add 16 NSPs #1081

jameshadfield · 2023-08-21T05:59:16Z

I looked into the feasibility of adding the 16 NSPs to the exported (Auspice) dataset. This will need Nextclade v3, since RdRp spans the slip site, so it may also be a good time to make some bigger changes. (We've decided not to modify the ORF1a / ORF1b annotations; see the discussion on Slack.)

Nextclade does the translations, so we need to update the genemap.gff for Nextclade's 'sars-cov-2' dataset.
Our ancestral reconstruction of the translations (rule translate) is what creates the annotations block in the JSON. It currently reads the annotations from defaults/reference_seq.gb, and nothing else uses that file.
- We can shift the reconstruction to augur ancestral, and either keep the script to generate the JSON annotations or (preferred) keep a JSON representation of the annotations block in the repo and use that directly. (We'll want more than just the coordinates in the JSON: extra display names, colours and descriptions, the last of these being important to explain why we use ORF1a + ORF1b!)
- This would allow us to remove the GenBank file.
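
To make the "JSON representation of the annotations block" idea concrete, here is a minimal sketch of how an entry for RdRp might be built and sanity-checked. The coordinates are the nsp12 mature-peptide coordinates from the GenBank reference (a join across the slip site at 13468). The key names `segments`, `display_name` and `description`, the gene key `RdRp`, and the helper `make_annotation` are assumptions for illustration only; they would need checking against whatever the Auspice schema actually accepts.

```python
import json

# 1-based, inclusive coordinates for nsp12 (RdRp) on the SARS-CoV-2 reference.
# Position 13468 appears in both segments because of the -1 ribosomal slip.
RDRP_SEGMENTS = [(13442, 13468), (13468, 16236)]


def make_annotation(segments, display_name, description, strand="+"):
    """Build one annotation entry from a list of (start, end) segments.

    Hypothetical helper: the key names used here are assumptions, not a
    confirmed schema.
    """
    length = sum(end - start + 1 for start, end in segments)
    if length % 3 != 0:
        raise ValueError(f"CDS length {length} is not a multiple of 3")
    entry = {
        "strand": strand,
        "display_name": display_name,
        "description": description,
    }
    if len(segments) == 1:
        # A contiguous CDS only needs a single start/end pair.
        entry["start"], entry["end"] = segments[0]
    else:
        # A CDS that spans the slip site is listed segment by segment.
        entry["segments"] = [{"start": s, "end": e} for s, e in segments]
    return entry


annotations = {
    "nuc": {"start": 1, "end": 29903, "strand": "+"},
    "RdRp": make_annotation(
        RDRP_SEGMENTS,
        display_name="RdRp (nsp12)",
        description="Spans the ORF1a/ORF1b slip site",
    ),
}

print(json.dumps(annotations, indent=2))
```

Keeping the multiple-of-3 check next to the data means a mistyped coordinate in the hand-maintained JSON would fail early rather than produce a frame-shifted translation.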

Other things noticed / improvements we could make:

The workflow-config-file.rst has fallen out of date. This is seemingly inevitable with documentation, but this is a good chance to improve it.
We don't use any nextclade datasets other than 'sars-cov-2'; I assumed we'd use the 'sars-cov-2-21L' dataset for our 21L builds, and we have config settings to allow this, but I don't think we do.
rule align uses Nextalign, with a fasta + gff from the ncov repo. Why don't we replace the fasta + gff with the Nextclade dataset we fetch later on in the process?
- My understanding of Nextclade v3 is that we'll replace Nextalign with Nextclade in this step anyway.
rule build_mutation_summary and rule mutation_summary seem unused. If these can be removed, we could then also remove defaults/reference.seq.fasta (alignment_reference) and defaults/annotation.gff (annotation). If the rules are still in use, we may want to use the Nextclade dataset files anyway.
- The 2nd rule here is the only place we use the translations from rule align, so we may be able to avoid translating every genome.

nextstrain-bot added this to Nextstrain planning (archived) Aug 21, 2023

github-project-automation bot moved this to New in Nextstrain planning (archived) Aug 21, 2023
