Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling adapter sequences #101

Open
martinghunt opened this issue Oct 11, 2022 · 9 comments
Open

Handling adapter sequences #101

martinghunt opened this issue Oct 11, 2022 · 9 comments

Comments

@martinghunt
Copy link
Member

We need to handle adapter sequences.

Observed behaviour:

  1. amplicon N is ok, amplicon N+1 is dropped.
  2. reads at the end of amplicon N have adapters.
  3. a few bp of adapter are included in the consensus output by cylon, so the consensus is amplicon + (a little adapter sequence)
  4. the reads all map fine to the consensus, including the adapters because the consensus now has some adapter seq
  5. self-QC does not mask the adapters

Desired behaviour: the adapter sequence is masked. I think self-qc could do this. An alternative would be to remove adapters from the reads earlier on in the pipeline so they are never seen again.

Example is the same as in #99. Please also see #100 for more detail on the amplicon in question.

The adapter sequence is included in the consensus at the end of amplicon SARS-CoV-2_76. All the reads there end either with the primer, or with the primer plus a few bp of adapter. None of them should contribute to Clean.Tot.cons at the positions of the primer or past the end of the primer.

First columns of all_stats.tsv at the end of the primer -- start of adapter -- start of dropped amplicon:

23055    A         23030     A          335  0    2    3    2    342       335             7
23056    A         23031     A          322  0    2    0    0    324       322             2
23057    C         23032     C          1    309  0    4    2    316       309             7
23058    C         23033     A          170  1    0    0    21   192       170             22
23058    -         23034     G          2    2    181  0    1    186       181             5
23059    C         23035     C          1    171  0    4    1    177       171             6
23060    A         23036     A          169  2    0    0    0    171       169             2
23061    C         23037     A          164  0    0    0    0    164       164             0
23062    T         23038     T          0    0    2    155  0    157       155             2
23063    A         23039     A          141  2    0    0    0    143       141             2
23064    A         23040     N          0    0    0    0    0    0         0               0
23065    T         23041     N          0    0    0    0    0    0         0               0
23066    G         23042     N          0    0    0    0    0    0         0               0
23067    G         23043     N          0    0    0    0    0    0         0               0
23068    T         23044     N          0    0    0    0    0    0         0               0
@jeff-k
Copy link
Contributor

jeff-k commented Oct 11, 2022

I've added an additional bounds check in self_qc to the read position vs. the start of the left primer and the end of the right primer that should fix this as long as the adapter sequence doesn't extend too far beyond the primer match distance threshold (otherwise we have the unfortunate such as in issue #99)

Can you point me to some fastqs where this appears? I'd like to test the solution

@martinghunt
Copy link
Member Author

Example is the same as in #99. Please also see #100 for more detail on the amplicon in question.

@jeff-k
Copy link
Contributor

jeff-k commented Oct 12, 2022

@iqbal-lab I understand that there are a number of real world samples where this has produced artefacts, would it be possible for me to see them?

@iqbal-lab
Copy link
Contributor

Sure, can you confirm first that this works on the simulated data?

@jeff-k
Copy link
Contributor

jeff-k commented Oct 12, 2022

Yes, I am happy about the bounds checking fix for this one example.

Since this is a synthetic test case that deviates from our QC model of real-world data in a relevant way (many nanopore reads of exactly matching lengths and adapter noise) I would like to confirm that the fix properly satisfies the real-world conditions.

@iqbal-lab
Copy link
Contributor

  1. Well, we're about to dive in and dissect 12000 S African samples via pileups and the TSV to check everything works, so this is definitely going to be tested on real world data as soon as it is committed.
  2. i dont think this issue is fundamentally about many reads exactly matching lengths, is about not labelling adapters as true sequence, so i think the fix you've implemented does need to go in.

Anyway, here's a sample ERR8959196 from the oxford validation set which has loads of adapters.

@jeff-k
Copy link
Contributor

jeff-k commented Oct 12, 2022

Thanks for the sample.

The instances of this artefact seem a bit marginal to me:

Primer, Base position, count of occurrences
...
(Primer(name='SARS-CoV-2_9_RIGHT', seq='CACAGGCGAACTCATTTACTTCTG', left=False, forward=False, ref_start=2861, ref_end=2884), 2886) 2
(Primer(name='SARS-CoV-2_9_RIGHT', seq='CACAGGCGAACTCATTTACTTCTG', left=False, forward=False, ref_start=2861, ref_end=2884), 2887) 1
(Primer(name='SARS-CoV-2_9_RIGHT', seq='CACAGGCGAACTCATTTACTTCTG', left=False, forward=False, ref_start=2861, ref_end=2884), 2888) 1
(Primer(name='SARS-CoV-2_56_RIGHT', seq='TGACTCTTACCAGTACCAGGTGG', left=False, forward=False, ref_start=17082, ref_end=17104), 17105) 13
(Primer(name='SARS-CoV-2_56_RIGHT', seq='TGACTCTTACCAGTACCAGGTGG', left=False, forward=False, ref_start=17082, ref_end=17104), 17106) 1
(Primer(name='SARS-CoV-2_86_RIGHT', seq='TCAATTGAGTTGAGTACAGCTGGT', left=False, forward=False, ref_start=26026, ref_end=26049), 26050) 55
(Primer(name='SARS-CoV-2_86_RIGHT', seq='TCAATTGAGTTGAGTACAGCTGGT', left=False, forward=False, ref_start=26026, ref_end=26049), 26051) 2
...

The last column is the count of bases attributed to this artefact. I do not see a difference in any calls made in the final VCF. Do you have a list of the positions where you have observed this artefact?

@jeff-k
Copy link
Contributor

jeff-k commented Oct 12, 2022

  1. Well, we're about to dive in and dissect 12000 S African samples via pileups and the TSV to check everything works, so this is definitely going to be tested on real world data as soon as it is committed.

No, I am sorry but I do not feel comfortable with committing untested code. In my experience this can lead to confusion where it can be tricky to separate issues raised by experimental results from true-positive bugs.

  1. i dont think this issue is fundamentally about many reads exactly matching lengths, is about not labelling adapters as true sequence, so i think the fix you've implemented does need to go in.

We have a quality control model that is designed to help with adapter artefacts but it does not expect to see a lot of nanopore reads that have the same exact lengths. One way to avoid speculation about this stuff is to look at some examples.

Is it the case that we have observed instances of this issue in real world data? If not I would be happy to move the more open-ended data exploration to a different venue.

@iqbal-lab
Copy link
Contributor

OK this issue is on hold until Friday

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants