Overclassification of BA.5 in pangolin 4 #7
Thanks for reporting this @donutbrew! Short version:
Longer version to follow. |
@donutbrew can you share the complete And the output of |
pango-designation is the best repository for reporting pangoLEARN assignment problems because we normally address those by adding more designated sequences to pango-designation/lineages.csv (or changing the designations in the file). However, problems with UShER analysis mode are usually not directly caused or remedied by the specific sequences in pango-designation/lineages.csv. Usually they're caused by my processes that update the UCSC/UShER tree and distill it down to the minimal tree distributed via the pangolin-data repo. There's not a proper repository for those (just a messy bucket of scripts and unpublished notes). Rarely there may also be an issue with usher or pangolin, but I think it'll almost always be a data problem. pango-designation has many watchers, and I think they are mostly interested in new lineage proposals so it would be nice to limit other traffic. I propose using the pangolin-data repository for reporting UShER mode assignment problems because that's where pangolin gets the UShER tree, and where an updated tree can hopefully address the problems. I will transfer this issue to the pangolin-data repo. |
using the latest data
If you're looking for BA.5 sequences specifically then I suggest you use pangolin's assignment cache mode because I used the very latest fixed usher to generate it. (Also, if you are running pangolin on thousands of sequences, the assignment cache makes it faster.) To add the assignment cache to your installation of pangolin, run
Again, I recommend also running
@aineniamh has recently added BA.4 and BA.5 to scorpio/constellations;
running usher
The minor bug in usher caused it to sometimes assign the lineage of a node when the sequence had almost but not quite all of the node's mutations. In the UCSC/UShER tree, BA.5 is placed on a long branch from BA.2. So as you observed, many sequences that were really more like BA.2 than BA.5 could be assigned BA.5 despite not having quite all of its mutations. That bug has been fixed in the latest usher source code, but there has not yet been a new release (and after a new release, there is also a short delay before the new release is available). If you are running on Linux and would like to try my updated usher binary, you can try it like this:
your sequences
I was able to find the hashes for 181 of the 194 IDs in local assignment cache files for v1.6, computed before and after the minor bug fix to usher. Here is a 3-column tab-separated file with those sequences' names/IDs, v1.6 pre-bugfix assignment, and v1.6 post-bugfix assignment:
Here are the counts of each lineage assigned before the usher bugfix:
-- so with v1.6 data and without the bugfix, 81 are still assigned BA.5, but at least that's better than 181. :) Here are the counts of each lineage assigned after the usher bugfix:
The lone sequence still assigned BA.5 is SouthAfrica/CERI-KRISP-K038411/2022 (EPI_ISL_11621351). Nextclade also calls that BA.5 but with 16 reversions (including T22917G/S:L452R and T23018G/S:F486V) -- that's a lot. I exclude any sequence from the big UCSC tree if Nextclade assigns it an Omicron lineage but has more than 5 reversions. I guess although the sequence fits poorly with the BA.5 node, it fits better there than at any other node in the Nextclade and minimal UCSC/UShER trees. |
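The exclusion rule described above (drop a sequence when Nextclade calls it Omicron and reports more than 5 reversions) can be sketched roughly as follows. This is only an illustration, not the actual script used to build the UCSC tree: the function names, the `is_omicron` prefix test, and the idea of passing the reversion count as a plain integer are all assumptions made for the example.

```python
# Hypothetical sketch of the "too many reversions" filter described above.
# Names and the Omicron test are assumptions, not the real pipeline code.

MAX_REVERSIONS = 5  # threshold stated in the comment above


def is_omicron(lineage: str) -> bool:
    """Crude, assumed test: treat B.1.1.529 and BA.* labels as Omicron."""
    return lineage == "B.1.1.529" or lineage.startswith("BA.")


def keep_in_tree(lineage: str, n_reversions: int) -> bool:
    """Return False for Omicron calls with more than MAX_REVERSIONS reversions."""
    return not (is_omicron(lineage) and n_reversions > MAX_REVERSIONS)


if __name__ == "__main__":
    # The sequence discussed above: called BA.5 with 16 reversions.
    print(keep_in_tree("BA.5", 16))
```

Under these assumptions, a BA.5 call with 16 reversions would be excluded, while an Omicron call with exactly 5 reversions would be kept because the rule is "more than 5".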
Thanks Angie (I somehow missed this repo, so thanks also for redirecting). To follow up, @corneliusroemer, here is the version output:
I've run pangolin in several ways. My understanding was that running with
|
Hi @donutbrew, we have scorpio in use to give exact SNP-threshold based assignments for VOCs specifically. I'd recommend not skipping it and am curious why you think it should be the default setting? |
@aineniamh that recommendation is my fault. An earlier version of scorpio was overwriting recombinants and a few other lineages, so --skip-scorpio was recommended. We are testing the upgraded version both ways to see how they perform. |
Probably over-discussion and under-sleep. :) Thanks for the explanations here. We'll test the new versions soon. |
Ah that makes sense! The XE recombinant constellation files have been added into latest versions, but I understand now. Recombinant assignments are tricky: if scorpio does overwrite, the notes column will report what the original assignment was too. I know that's not ideal, but it at least gives a start for the moment! |
I ran your sequences through Nextclade and they are almost all problematic: they all have far too many reversions. So I wouldn't really put too much weight on any lineage assignment. It's just a guess, hard to say anything definite. So I guess, yes, usher should maybe not have called most of these BA.5, but then this is a bit of a case of garbage in, garbage out. |
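As a quick sanity check on a BA.5 call, one could look directly at the two spike changes mentioned in this thread, T22917G (S:L452R) and T23018G (S:F486V). The sketch below is a minimal illustration and assumes the genome has already been aligned to reference coordinates with no insertions, so that string index `pos - 1` corresponds to reference position `pos`. The function name and that input convention are assumptions for the example; this is not how pangolin, usher, or Nextclade make their calls.

```python
# Hypothetical check for the two BA.5 spike substitutions named in this thread.
# Assumes the input is aligned to reference coordinates (no insertions).

HALLMARKS = {
    "S:L452R": (22917, "G"),  # T22917G
    "S:F486V": (23018, "G"),  # T23018G
}


def missing_hallmarks(aligned_genome: str) -> list:
    """Return the hallmark names whose expected base is not observed.

    Ambiguous bases such as N are reported as not observed; they are not
    evidence of the reference allele.
    """
    last_pos = max(pos for pos, _ in HALLMARKS.values())
    if len(aligned_genome) < last_pos:
        raise ValueError("sequence is too short to cover the hallmark positions")
    missing = []
    for name, (pos, alt) in HALLMARKS.items():
        if aligned_genome[pos - 1].upper() != alt:
            missing.append(name)
    return missing
```

A sequence that returns both names here carries neither substitution at those positions, which is the situation reported for the 194 accessions in this issue.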
(Not sure if this is the right issue tracker for this, so please direct me to the right place if not)
Currently, a large number of sequences that Pangolin 4.05 (data 1.3, --skip-scorpio) classified as BA.5 are missing both S:L452R and S:F486V. (194 BA.5 accessions with Wuhan alleles attached.)
Removing --skip-scorpio pushes 111 of these into BA.2, 24 into BA.3, 1 into BA.1, and 44 into Unassigned; 14 remain BA.5.
What is going on here? I'm now a little confused as to whether --skip-scorpio should be the default behavior or not. Happy to have some discussion.
accessions.txt
@AngieHinrichs
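For anyone wanting to reproduce a tally like the one above (where the BA.5 calls from one run end up in a second run), here is a minimal sketch. It assumes two pangolin lineage reports in CSV form with the usual `taxon` and `lineage` columns, passed in as text. The function names and the `"missing"` placeholder for taxa absent from the second report are made up for this example.

```python
import csv
import io
from collections import Counter


def lineages_by_taxon(report_text: str) -> dict:
    """Map taxon -> lineage from the text of a pangolin lineage report CSV."""
    reader = csv.DictReader(io.StringIO(report_text))
    return {row["taxon"]: row["lineage"] for row in reader}


def reassignment_counts(first_report: str, second_report: str,
                        lineage: str = "BA.5") -> Counter:
    """Count second-run lineages for taxa assigned `lineage` in the first run."""
    first = lineages_by_taxon(first_report)
    second = lineages_by_taxon(second_report)
    return Counter(
        second.get(taxon, "missing")  # placeholder for taxa not in run two
        for taxon, assigned in first.items()
        if assigned == lineage
    )
```

Usage would be along the lines of `reassignment_counts(open("with_skip.csv").read(), open("without_skip.csv").read())`, where both file names are hypothetical.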