Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Constant background value in coverage files #55

Closed
gprezza opened this issue Sep 20, 2023 · 14 comments
Closed

Constant background value in coverage files #55

gprezza opened this issue Sep 20, 2023 · 14 comments

Comments

@gprezza
Copy link

gprezza commented Sep 20, 2023

Running v. 2.0.3

reademption align -p 40 --poly_a_clipping -f analysis_reademption -q -g -S
reademption coverage -f analysis_reademption -p 40

In the coverage files, a constant value seems to be added at each position of the genome, as exemplified here (top two files are fw and rev of raw coverage, bottom two are fw and rev of min normalized):

Screenshot from 2023-09-20 09-55-26

Also another user sees the same with unrelated data and Reademption run.

@konrad
Copy link
Collaborator

konrad commented Sep 26, 2023

Thanks for reporting this, @gprezza. Did you have a look at the BAM files in a genome browser and can you exclude that this the actual coverage? It would be helpful if you could provide a screenshot of the same position after loading the BAM files.

@Tillsa
Copy link
Collaborator

Tillsa commented Oct 4, 2023

@gprezza did you find the signal in the BAM files?

@gprezza
Copy link
Author

gprezza commented Oct 4, 2023

Sorry, kinda slipped out of my mind.

Screencast.2023-10-04.20.23.18.mp4

It doesn't seem to be coming from actual reads (as also shown by the coverage plot generated by IGV from the bam file). However, there is an enormous amount of what I think are spliced reads (horizontal lines in the bam track). I guess they shouldn't be the cause of the background in the wiggle files, but I also don't think this amount of spliced reads is normal.

Btw this (background in the wig files and spliced reads in the bam file) is not restricted to this locus only and happens throughout the entire genome.

edit: actually, thinking about it, maybe the coverage generating step does take into account spliced reads as actual reads spanning the entire region? That would explain the constant background.

@Tillsa
Copy link
Collaborator

Tillsa commented Oct 6, 2023

You could run the analysis without read splitting and compare with the results with the results that use read splitting. Also can you check if you have split reads, where the segments map far away from each other?

@gprezza
Copy link
Author

gprezza commented Oct 11, 2023

Sorry for the late reply.

Removing split alignment does indeed solve the issue (In the image: with split alignment on top, without on the bottom. That's a view of the whole genome btw.).

However, turning off this function is not really an option for me, since I have human RNA in these samples and splicing is happening. Just to clarify, what's shown in the image is not the human genome, but a bacterial one.

Screenshot from 2023-10-06 17-25-36

From a quick look, it doesn't seem like the split reads start and end all at the same loci. Rather they seem to simply 1) be a lot 2) have in general a distance between the two halves of tens of thousands of nucleotides.

Unfortunately, I don't have the time right now to further look into this, but I'd be happy to share these data privately if you guys want to.

@Tillsa
Copy link
Collaborator

Tillsa commented Oct 12, 2023

You could use the option --coverage_style when running the coverage subcommand and select either 'first base only' or 'last base only'. That means that only the first base or the last base of an alignment is counted for coverage calculation. However, you will lose some information in either case.
Another possible solutions is to filter out reads "manually" from the bam file that stretch too long. You could use pysam for example and extract the start and the end of each alignment to get the total length of the alignment. Then chose a threshold that makes biologically sense and discard all reads that are above this threshold.

@gprezza
Copy link
Author

gprezza commented Oct 12, 2023

Thanks for the suggestions. However, I feel like these are ways to circumvent the problem rather than solving it. I am confident that this amount of split reads is not due to actual spliced transcripts in the samples. We never see this amount of split reads in similar samples of the same organism (even when analyzed with earlier versions of reademption), so my assumption is that the split reads are somehow not being aligned properly. Discarding them as you suggest would work as a temporary solution, but equates to throwing away reads that normally would be valuable information.

@Tillsa
Copy link
Collaborator

Tillsa commented Oct 12, 2023

You could take a couple of sample split reads that appear to map too far away from each other and align them once with segemehl and once with e.g. star (also a split aware aligner) and see if the mapping results are different.

@gprezza
Copy link
Author

gprezza commented Oct 20, 2023

Sorry, but I really don't have the time to look at this right now; I'll be able to do it in one or two months. In the meantime, happy to share any data if you want to invistigate.

@gprezza
Copy link
Author

gprezza commented Dec 18, 2023

I extracted split-aligned reads and re-aligned them with segemehl and bbmap, enabling very long read splitting in both cases. The reads with very long splits are aligned by segemehl, but generally not by bbmap, so it seems to be a difference in how the two aligners operate.

The issue of the coverage files generated by reademption however remains: why are split reads counted as full, very long reads? E.g. in my video above, compare the coverage track produced by reademption (blue track) with the coverage generated by IGV from the bam file (gray track, visible from 0:14).

@Tillsa
Copy link
Collaborator

Tillsa commented Dec 19, 2023

Segemehl considers a split mapped read as one alignment. That means the the start of the alignment is the leftmost mapping position of any of the splits and the end of the alignment is the rightmost position of any of the splits. READemption builds the coverage from these start and the end positions. If the splits are very far away from each other, you will get an alignment that also has a start that is far away from its end. And READemption will produce coverage from this information. Depending on your problem that you want to solve with the coverage it might be worth to change READemptions coverage style when creating coverage files (see: https://reademption.readthedocs.io/en/latest/subcommands.html#coverage and use the option --coverage_style). Another alternative could be discarding the reads where the splits map very far away from each other, if they make no biological sense to you. You could use pysam to retrieve the read names of alignments with very long reference lengths (see: https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment.reference_length). Afterwards you can delete these reads and from your Fasta files and rerun the mapping.

@gprezza
Copy link
Author

gprezza commented Dec 20, 2023

Thanks for the quick reply!

I honestly find it hard to imagine a case where it would make sense to consider split reads as full alignments and produce coverage files accordingly. Is this something introduced in the latest reademption versions? As I mentioned before, we never saw this constant background on coverage files with similar samples analyzed with an earlier version of reademption (v0.4.3).

I guess changing the --coverage_style option should solve this, but neither option is ideal. If i may advance a feature request here, another coverage option that uses all aligned base but not the indels would be great :)

@Tillsa
Copy link
Collaborator

Tillsa commented Dec 20, 2023

READemption has always used the start and the end position of an alignment for calculating coverage. However, READemption version 0.4.3. used segemehl version 0.2.0 while newer versions use segemehl version 0.3.4. It might be that the internal split read algorithm has changed. Unfortunately, there is no option for segemehl to indicate how long the insert size between two splits should be. Sometimes segemehl even counts alignments with single-nucleotide mismatches as split alignments. I agree that an option to calculate coverage not only based on the start and stop position but based on actual mapped nucleotides would be beneficial for READemption and I created a feature request issue #57

@gprezza
Copy link
Author

gprezza commented Jan 10, 2024

Happy new year!

I aligned the same sample with READemption version 0.4.3 and 2.0.3 (allowing split reads in both cases) and can confirm that split reads are much more frequent in v 2.0.3. It seems to derive from the different segemehl versions as you said.

I'll close this issue now; let me just stress once again that a fix for the coverage files would be much appreciated :) .

@gprezza gprezza closed this as completed Jan 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants