Unexpected assignment of (potential) recombinants #54

Open
MarieLataretu opened this issue Feb 20, 2024 · 18 comments

@MarieLataretu

Hi there,

First, thanks for your work and the latest updates!

We stumbled across a few samples from the last months that pangolin assigns to a top-level lineage, namely BA.2 or XBB.1.
The Nextclade clade assignment resolves to recombinant; the Nextclade_pango assignment is XDD or XCT.1. Since XDD and XCT.1 were not part of pangolin-data version 1.23.1, it's not surprising that pangolin does not assign these lineages.

However, we'd expect that pangolin would assign a (new) recombinant with the latest data release.
I did a little test series:

| sample | pangolin-data 1.23.1 | pangolin-data 1.24 | pangolin-data 1.25 | pangolin-data 1.25.1 | nextclade2 2024-01-15 | nextclade3 2024-01-16 | nextclade3 2024-02-16 |
|---|---|---|---|---|---|---|---|
| 82 | BA.2 | BA.2 | JN.1.1 | JN.1.1 | XDD | XDD | XDS |
| 84 | BA.2 | JN.1.1 | JN.1.1 | JN.1.1 | XDD | XDD | XDS |
| 85 | XBB.1 | JN.1.1 | JN.1.1 | JN.1.1 | XDD | XDD | XDS |
| 63 | XBB.1 | BA.2 | BA.2 | BA.2 | XCT.1 | XCT.1 | XCT.1 |
| 30 | XBB.1 | XCT.1 | XCT.1 | XCT.1 | XCT.1 | XCT.1 | XCT.1 |
| 51 | BA.2 | XDD | JN.1.1 | JN.1.1 | XDD | XDD | XDD |

(Tool versions: pangolin v4.3, nextclade3 v3.2.1, nextclade2 v2.14.0)
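
To make the comparison in the table easier to quantify, here is a minimal Python sketch that counts how often each sample's call changes between successive releases. The assignments are copied from the table; the helper names (`count_changes`, `changes_per_sample`) are illustrative only and are not part of pangolin or Nextclade.

```python
# Calls copied from the table above, in release order.
# pangolin-data 1.23.1, 1.24, 1.25, 1.25.1
PANGOLIN = {
    "82": ["BA.2", "BA.2", "JN.1.1", "JN.1.1"],
    "84": ["BA.2", "JN.1.1", "JN.1.1", "JN.1.1"],
    "85": ["XBB.1", "JN.1.1", "JN.1.1", "JN.1.1"],
    "63": ["XBB.1", "BA.2", "BA.2", "BA.2"],
    "30": ["XBB.1", "XCT.1", "XCT.1", "XCT.1"],
    "51": ["BA.2", "XDD", "JN.1.1", "JN.1.1"],
}
# nextclade2 2024-01-15, nextclade3 2024-01-16, nextclade3 2024-02-16
NEXTCLADE = {
    "82": ["XDD", "XDD", "XDS"],
    "84": ["XDD", "XDD", "XDS"],
    "85": ["XDD", "XDD", "XDS"],
    "63": ["XCT.1", "XCT.1", "XCT.1"],
    "30": ["XCT.1", "XCT.1", "XCT.1"],
    "51": ["XDD", "XDD", "XDD"],
}


def count_changes(calls):
    """Number of times the call differs from the previous release's call."""
    return sum(1 for prev, cur in zip(calls, calls[1:]) if prev != cur)


def changes_per_sample(table):
    """Map each sample ID to its number of call changes across releases."""
    return {sample: count_changes(calls) for sample, calls in table.items()}


if __name__ == "__main__":
    print("pangolin :", changes_per_sample(PANGOLIN))
    print("nextclade:", changes_per_sample(NEXTCLADE))
```

On this data, sample 51 changes twice across the four pangolin-data releases, while samples 63, 30 and 51 never change across the three Nextclade runs.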

I'm now wondering whether this is a problem in pangolin or whether we are seeing an undesignated lineage. I read that Nextclade is not perfect at assigning recombinants; however, it is (more) consistent across dataset versions.

I'm happy for any input or feedback! 🙂

Best
Marie

@MarieLataretu changed the title from "of recombinants" to "Unexpected assignment of (potential) recombinants" on Feb 20, 2024
@AngieHinrichs
Member

Hi Marie -- without looking at the sequences, I can't say for sure what's going on. Are they in GISAID? If not, are you able to upload them to https://usher.bio/ (select the full tree of 16M sequences including GISAID and increase sample size to >= 500) in order to see which sequences they most closely resemble, and what mutations make your sequences different?

Unlike nextclade, pangolin doesn't have a general 'recombinant' category; it can only assign Pango lineages. Some things that may lead to flip-flopping assignments in successive releases are a high number of N or other ambiguous bases, or a mix of mutations associated with different lineages, whether that's due to a new recombinant, mixed infection or contamination in sequencing.

If the sequences are in GISAID, there are some very keen volunteers such as @aviczhl2, @JosetteSchoenma and @FedeGueli who search for new potential recombinants and may have already taken a look.

@FedeGueli

FedeGueli commented Feb 20, 2024

> Hi Marie -- without looking at the sequences, I can't say for sure what's going on. Are they in GISAID? If not, are you able to upload them to https://usher.bio/ (select the full tree of 16M sequences including GISAID and increase sample size to >= 500) in order to see which sequences they most closely resemble, and what mutations make your sequences different?
>
> Unlike nextclade, pangolin doesn't have a general 'recombinant' category; it can only assign Pango lineages. Some things that may lead to flip-flopping assignments in successive releases are a high number of N or other ambiguous bases, or a mix of mutations associated with different lineages, whether that's due to a new recombinant, mixed infection or contamination in sequencing.
>
> If the sequences are in GISAID, there are some very keen volunteers such as @aviczhl2, @JosetteSchoenma and @FedeGueli who search for new potential recombinants and may have already taken a look.

Recombinants have been tracked by @aviczhl2, @JosetteSchoenma and @Over-There-Is; I don't think anything has gone under the radar. I suggest checking whether any EPI_ISL ID of this putative lineage is present in sars-cov-2-variants/lineage-proposals#957 (comment), either via a simple query with the GitHub search tool or, more specifically, by looking for them in this .tsv: https://github.com/sars-cov-2-variants/lineage-proposals/blob/main/recombinants.tsv

If I can get a list of the IDs, I can search for them myself and post an update here.
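
The lookup suggested above can be sketched in a few lines of Python. This is only a sketch: it makes no assumption about the column layout of `recombinants.tsv` and simply scans every line of an already downloaded copy for EPI_ISL accessions. The function name `find_known_ids` and the example text are made up for illustration.

```python
import re

# GISAID accessions as they are written in this thread, e.g. EPI_ISL_18599826.
EPI_ISL_PATTERN = re.compile(r"EPI_ISL_\d+")


def find_known_ids(tsv_text, query_ids):
    """Return {accession: [1-based line numbers]} for query IDs found in tsv_text."""
    wanted = set(query_ids)
    hits = {}
    for line_no, line in enumerate(tsv_text.splitlines(), start=1):
        for found in EPI_ISL_PATTERN.findall(line):
            if found in wanted:
                hits.setdefault(found, []).append(line_no)
    return hits


if __name__ == "__main__":
    # Made-up stand-in for the real file contents (hypothetical columns).
    example = "label\tmembers\nsome-proposal\tEPI_ISL_18599826, EPI_ISL_1\n"
    print(find_known_ids(example, ["EPI_ISL_18599826", "EPI_ISL_999"]))
```

Because the pattern matches the full run of digits, a query for `EPI_ISL_1` does not accidentally match a longer accession that merely starts with the same digits.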

@JosetteSchoenma

IMO, the best way to know whether a batch of samples includes recombinants (if you are not used to recognizing them in Nextclade) is to look through GitHub issues and run the mentioned GISAID queries.
Which of course takes time!

Nextclade and Pangolin will always be a bit behind and sometimes inaccurate.

But if you have a list with EPI_ISL numbers or if you could tell me which country and dates you're interested in, one of us will probably be happy to have a look.

@aviczhl2

There are hundreds of different undesignated recombinants.
Most of them are registered in sars-cov-2-variants/lineage-proposals#991
and https://github.com/sars-cov-2-variants/lineage-proposals/blob/main/recombinants.tsv
If you see new ones, you are welcome to register them in that repo too.

@MarieLataretu
Author

MarieLataretu commented Feb 21, 2024

Hi all, thanks for all the feedback!

Unfortunately, only one sequence is on GISAID - I can keep you posted on that (best case, next week, I'd say).
EPI_ISL_18599826 is the 4th sample (63 in the table).

> Some things that may lead to flip-flopping assignments in successive releases are a high number of N or other ambiguous bases, or a mix of mutations associated with different lineages, whether that's due to a new recombinant, mixed infection or contamination in sequencing.

The N content is decent (below 3.9 %), and ambiguous bases are masked.

I checked the mapping and it does not look like a mixed infection.

Nextclade's qc.privateMutations.status ranges from good, to mediocre, to bad - not sure if this is a good proxy for a mix of mutations of different lineages 🤔

I threw the samples in https://usher.bio/ (full tree, sample size to 1000). Here is a screenshot of the overview:
[Screenshot: recombinants_hgPyhyloPlace]

For pangolin-data 1.25.1, only one sample differs (JN.1.1 vs XDD; was XDD with 1.24)

@JosetteSchoenma

The first 3 are linked to this singlet that @aviczhl2 found. You would have to put them all together in Nextclade to see if they match.

EPI_ISL_18715763
sars-cov-2-variants/lineage-proposals#991 (comment)

@JosetteSchoenma

The 4th one, called BA.2, is linked to a pretty clean XCT.1 with only a reversion of C7051T. EPI_ISL_18599826

@JosetteSchoenma

The 5th is linked to a completely normal XCT.1 from Austria. EPI_ISL_18385324

@JosetteSchoenma

The 6th is linked to a completely normal looking XDD from France. You could check yours for mutations C6541T, A7842G, T15756A and A26275G to confirm it is an XDD.
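
As a rough illustration of that check, the Python sketch below tests a sample's nucleotide substitutions for the four mutations listed above. It assumes the substitutions are available as a comma-separated string, as in the `substitutions` column of Nextclade's TSV output; the function name `check_markers` and the example strings are hypothetical.

```python
# The four mutations named in the comment above.
XDD_MARKERS = ["C6541T", "A7842G", "T15756A", "A26275G"]


def check_markers(substitutions, markers=XDD_MARKERS):
    """Map each marker mutation to True/False depending on whether it is present.

    `substitutions` is a comma-separated string such as "C241T,C3037T".
    """
    present = {s.strip() for s in substitutions.split(",") if s.strip()}
    return {marker: marker in present for marker in markers}


if __name__ == "__main__":
    # Made-up example string, not taken from any real sample.
    result = check_markers("C241T,C6541T,A7842G,T15756A,A26275G")
    print(result)
    print("all markers present:", all(result.values()))
```

Presence of all four markers supports the assignment but does not prove it; the surrounding mutations and the placement in the tree still matter.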

@AngieHinrichs
Member

Thanks for the insights @JosetteSchoenma. @MarieLataretu you can see a lot more detail about the neighboring sequences, and what mutations separate your sequences from those sequences, if you click on the 'view in Nextstrain' links.

@MarieLataretu
Author

> The 6th is linked to a completely normal looking XDD from France. You could check yours for mutations C6541T, A7842G, T15756A and A26275G to confirm it is an XDD.

I checked the four mutations (in the Nextclade output), and all 4 are present!

The subtree in Nextstrain does not show any mutations:
[Image]

Do I interpret it correctly that it's indeed an XDD (most probably)?

@JosetteSchoenma

JosetteSchoenma commented Feb 22, 2024

> > The 6th is linked to a completely normal looking XDD from France. You could check yours for mutations C6541T, A7842G, T15756A and A26275G to confirm it is an XDD.
>
> I checked the four mutations (in the Nextclade output), and all 4 are present!
>
> The subtree in Nextstrain does not show any mutations:
> [Image]
>
> Do I interpret it correctly that it's indeed an XDD (most probably)?

Yes, very likely an XDD.

@MarieLataretu
Author

> The 4th one, called BA.2 is linked to a pretty clean XCT.1 with only a reversion of C7051T. EPI_ISL_18599826

Oh shoot, I overlooked that one sample is already on GISAID! 🙈

The 4th sample (63 in the table) is exactly EPI_ISL_18599826!

@MarieLataretu
Author

> The first 3 are linked to this singlet that @aviczhl2 found. You would have to put them all together in Nextclade to see if they match.
>
> EPI_ISL_18715763 sars-cov-2-variants/lineage-proposals#991 (comment)

They are linked, but the 3 sequences have 4 additional mutations in the ORF1ab compared to EPI_ISL_18715763:

[Image]

@AngieHinrichs
Member

@MarieLataretu I would like to look into why your sixth sample (51) is not classified as XDD by recent versions of pangolin-data. Can you share the sequence (email: angie at soe dot ucsc dot edu), or if that's not allowed, update this issue with its EPI_ISL ID when it is in GISAID? Thanks!

@aviczhl2

aviczhl2 commented Feb 22, 2024

> > The first 3 are linked to this singlet that @aviczhl2 found. You would have to put them all together in Nextclade to see if they match.
> > EPI_ISL_18715763 sars-cov-2-variants/lineage-proposals#991 (comment)
>
> They are linked, but the 3 sequences have 4 additional mutations in the ORF1ab compared to EPI_ISL_18715763:
>
> [Image]

This looks like an independent new HV.1/JN.1 recombinant with a breakpoint similar to that of EPI_ISL_18715763 (which is a JG.3/JN.1 recombinant). The "additional mutations" basically revert the JG.3-defining mutations and add the HV.1-defining ones.

@AngieHinrichs
Member

Thanks @MarieLataretu for sharing the sample 51 sequence. It turns out that one missing mutation (or reversion to reference relative to XDD) is causing it to be placed just short of XDD in the pangolin-data 1.25.1 minimized tree.

In the minimized tree, the final node on the path to XDD has these mutations:

C6541T, G11727A, C18894T, T22926C, A26275G, C26529G, T26681C, T26833C, C29625T

Sample 51 has all of those except T22926C. If it had an N at 22926, usher would impute a C because of all the other matches, but it has the reference allele T there. So usher splits that node, creating a new node with all mutations except T22926C and moving the original node (labeled XDD) to become a child of the new node, carrying only T22926C. Sample 51 also becomes a child of the new node, i.e. a sibling of XDD, so it misses the assignment. That's the long way of saying that, unfortunately, missing a single mutation at the final node can cause a missed assignment.
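
The effect described above can be illustrated with a deliberately simplified toy model in Python. It is not usher's placement algorithm: it only checks whether a sample carries every mutation on the final node, treating positions that are N in the sample as imputable. The function names are hypothetical; the mutation list is the one quoted above.

```python
# Mutations on the final node on the path to XDD, as listed above.
FINAL_NODE = ["C6541T", "G11727A", "C18894T", "T22926C", "A26275G",
              "C26529G", "T26681C", "T26833C", "C29625T"]


def position(mutation):
    """Extract the reference position from a mutation such as 'T22926C'."""
    return int(mutation[1:-1])


def reaches_labeled_node(node_mutations, sample_mutations, n_positions=()):
    """Toy check: does the sample match every mutation on the labeled node?

    Returns (reached, missing). A node mutation counts as matched if the
    sample carries it, or if the sample has N at that position (so the
    allele could be imputed from the other matches).
    """
    sample = set(sample_mutations)
    masked = set(n_positions)
    missing = [m for m in node_mutations
               if m not in sample and position(m) not in masked]
    return (not missing), missing


if __name__ == "__main__":
    # Sample 51 as described: everything on the node except T22926C.
    sample_51 = [m for m in FINAL_NODE if m != "T22926C"]
    print(reaches_labeled_node(FINAL_NODE, sample_51))
    # Same sample, but with N instead of the reference allele at 22926.
    print(reaches_labeled_node(FINAL_NODE, sample_51, n_positions=[22926]))
```

In this toy model a single reference allele at 22926 is enough to stop the sample short of the labeled node, whereas an N at the same position would not.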

In the full tree, there are some XDD sequences that share the mutation G5155A with sample 51, so sample 51 is placed in XDD on that branch, with one private mutation (T21810C) and multiple reversions to reference (T21711C, C22926T, G26610A):

[Image]

https://nextstrain.org/fetch/hgwdev.gi.ucsc.edu/~angie/pangolin-data-54.json?branchLabel=nuc%20mutations&label=id:node_6955286

How strong is the read-level evidence for sample 51 having the reference allele instead of the expected XDD mutations at reference positions 21711, 22926 and 26610? If the coverage is very low there, it would be better from the usher point of view to have N instead of reference allele.
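
As a sketch of what "N instead of reference allele" could look like in practice, the Python snippet below masks consensus positions whose read depth falls below a threshold. The function name, the per-position depth list and the default threshold of 10 are assumptions made for illustration; real pipelines usually apply such masking during consensus calling.

```python
def mask_low_coverage(sequence, depth, min_depth=10):
    """Replace bases with 'N' wherever read depth is below min_depth.

    `sequence` is the consensus string and `depth` a list with one read
    depth per base. The threshold of 10 is an arbitrary example value.
    """
    if len(sequence) != len(depth):
        raise ValueError("sequence and depth must have the same length")
    return "".join(base if d >= min_depth else "N"
                   for base, d in zip(sequence, depth))


if __name__ == "__main__":
    # Tiny made-up example: positions 2 and 4 have too little coverage.
    print(mask_low_coverage("ACGT", [30, 2, 15, 0]))
```

With masking like this, a position such as 22926 would reach usher as N rather than as a reference call if its coverage were too low.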

I can make the matching a little less stringent in the next release of pangolin-data by adding a pseudo-lineage label "XDD_dropout" in the full tree, a couple of nodes upstream of XDD. When minimizing the full tree to make the next release of pangolin-data, the "_dropout" will be truncated so there will be a second "XDD" label a bit upstream of where XDD really starts, and that will assign XDD a bit more broadly (hopefully not too broadly).
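
The label handling described above can be pictured with a small Python sketch. It only illustrates the idea of truncating a "_dropout" suffix; it is not the actual pangolin-data build code, and the function name `minimized_label` is hypothetical.

```python
DROPOUT_SUFFIX = "_dropout"


def minimized_label(full_tree_label):
    """Strip a trailing '_dropout' so the pseudo-lineage maps to the real lineage."""
    if full_tree_label.endswith(DROPOUT_SUFFIX):
        return full_tree_label[:-len(DROPOUT_SUFFIX)]
    return full_tree_label


if __name__ == "__main__":
    for label in ["XDD_dropout", "XDD", "JN.1.1"]:
        print(label, "->", minimized_label(label))
```

After truncation, both the real XDD node and the upstream pseudo-lineage node carry the label "XDD", which is what widens the assignment.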

@MarieLataretu
Author

Thanks for the insight, @AngieHinrichs !
I'll check the mentioned positions in detail and get back to you. (It might take some time, because I'm travelling atm)
