Won't identify TEs that are created with EDTA for non-model organism #10

cahende · 2022-03-29T00:48:23Z

Hello,

I am trying to run this on a set of raw sequences for Anopheles gambiae. I used EDTA to create a TE library from the AgamP4 genome assembly and then used my raw sequences as input for this pipeline to identify which TEs are present in each of our samples. After trimming and mapping, the pipeline attempts to identify TEs, but I get the following error for every TE identified by EDTA.

Starting analysis of [TE] in [RAW DATA]-final.fastq.fused.sort.bam..

No annotaions found for: [TE]

Traceback (most recent call last):
File "/home/ch943/bin/miniconda/envs/deviaTE_env/bin/deviaTE_analyse", line 100, in
sample.write_frame(out=args.output + '.raw', insertions=ihat, command=comm, t=timestamp, norm='raw')
File "/home/ch943/bin/miniconda/envs/deviaTE_env/lib/python3.6/site-packages/deviaTE/deviaTE_pileup.py", line 204, in write_frame
with open(out, 'w') as outfile:
FileNotFoundError: [Errno 2] No such file or directory: '[RAW DATA]-final.fastq.[TE].raw'

Any guidance would be appreciated.

W-L · 2022-03-29T07:52:38Z

Hi! Thanks for reporting this. Looks like the code has some trouble writing the results to a file. My first guesses would be:

- the actual string of [RAW DATA] or [TE] contains some symbol that turns it into an invalid filepath, e.g. / or a space? Seems odd though if this happens for all TEs
- Permissions of the directory that it tries to write to could be another issue, but then I would expect a different Error.

Would you mind sharing the command used to run deviaTE? And maybe double-check that the library of TE sequences is a valid fasta file?
cheers

cahende · 2022-03-29T21:49:27Z

Hi,

So the TE names that EDTA output actually had a "/" in all the names, so I think that is the issue. I corrected this in my reference library and am rerunning now; I will let you know if the issue persists.

Thanks!
Cory
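
The renaming step described above could be sketched roughly as follows. This is an illustration only, not deviaTE's own code: the function names, the whitelist of allowed characters, and the example TE name are assumptions made for this sketch.

```python
import re

# Characters kept as-is in a sequence ID; everything else becomes a dash.
# This whitelist is an assumption for illustration, not deviaTE's actual rule.
UNSAFE = re.compile(r"[^A-Za-z0-9_.#-]")


def sanitize_name(name: str) -> str:
    """Replace characters that are unsafe in file names with dashes."""
    return UNSAFE.sub("-", name)


def sanitize_fasta(lines):
    """Yield FASTA lines with every header reduced to a filename-safe ID."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the ID (text up to the first whitespace);
            # any free-text description after it is dropped.
            parts = line[1:].split(None, 1)
            ident = parts[0] if parts else ""
            yield ">" + sanitize_name(ident)
        else:
            yield line


# Hypothetical header containing a "/" between classification levels.
print(sanitize_name("TE_00000718_INT#LTR/unknown"))
```

Feeding the lines of the library file through `sanitize_fasta` and writing the result to a new file leaves the sequences untouched and changes only the headers. The renamed library has to be used for mapping as well, so that the reference names in the BAM file match.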


cahende · 2022-03-31T18:13:53Z

Hi,

The naming convention was the issue; it seems to be running fine now. On a side note: I am scanning for the presence of a large list of transposable elements, and many don't have any reads mapping. Is there any way to prevent output from being produced when there are no reads mapping to a particular element?

Thank you,
Cory


cahende · 2022-03-31T18:20:53Z

This is the output I get for each transposable element in my test. To me this suggests there are no reads mapping to this particular TE?

******************** Analysis
Starting analysis of TE_00000718_INT#LTR-unknown in SRR10235406-final.fastq.fused.sort.bam..
No annotaions found for: TE_00000718_INT#LTR-unknown
Normalization: none (values are raw abundances)
Analysis completed - output written to: SRR10235406-final.fastq.TE_00000718_INT#LTR-unknown
******************** Visualization
Loading data: SRR10235406-final.fastq.TE_00000718_INT#LTR-unknown
Visualization written to: SRR10235406-final.fastq.TE_00000718_INT#LTR-unknown.pdf


W-L · 2022-04-05T10:27:27Z

Hi! Glad that your original issue was solved. To be fair, deviaTE should probably check for such situations itself. I'll implement a fix for that.
Concerning your second question: if there are no reads mapping to a TE reference, then deviaTE should give a message like this:

...
******************** Analysis
Starting analysis of [TE] in [BAM-FILE]..

No reads mapped to the specified reference sequence
...

The program should then exit without producing any output. Hope this helps!
Lukas
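
For a large library, one way to find out beforehand which elements have any reads at all is to look at the per-reference counts that `samtools idxstats` reports for a sorted, indexed BAM file. The sketch below only parses that tab-separated output (reference name, sequence length, mapped read-segments, unmapped read-segments). The function name and the example counts are made up for illustration, and this is not part of deviaTE.

```python
def references_with_reads(idxstats_text: str, min_reads: int = 1):
    """Return reference names with at least `min_reads` mapped reads.

    Expects the tab-separated output of `samtools idxstats`:
    name, sequence length, mapped read-segments, unmapped read-segments.
    """
    keep = []
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            # Final row of idxstats: reads not assigned to any reference.
            continue
        if int(mapped) >= min_reads:
            keep.append(name)
    return keep


# Made-up example output for two hypothetical references.
example = "TE_A\t5000\t120\t0\nTE_B\t3000\t0\t0\n*\t0\t0\t42\n"
print(references_with_reads(example))
```

The returned names could then be used to restrict the analysis to elements that actually have reads mapped.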

W-L · 2022-04-05T10:54:28Z

I added a check to replace invalid characters in TE names, which should prevent the original error (10d2b70). I'm not going to make a new release of the package at this point. But if you would like to make use of this change, you can replace the code file on your computer with the updated one (bin/deviaTE_analyse in this repository). In case you installed the tool via conda, it should be located somewhere along the lines of:

~/miniconda3/envs/deviaTE_env/bin/deviaTE_analyse

cahende · 2022-04-05T15:44:31Z

Thank you for creating a fix for that naming issue. I am still curious about the other issue, where it said I had no annotations but I still received output. Can you explain what that means?

Cory


W-L · 2022-04-05T16:30:39Z

No problem! Forgot to mention that the fix is basically replacing problematic characters with dashes, so that the analysis can proceed without issues.
The message about "no annotations" refers to the optional parameter --annotation. This can be used to provide GFF3 files with annotations of the TE sequences, e.g. the location of CDS and other defined genetic elements. These will mainly be used in the visualisation, e.g. at the bottom of this one:

[image: https://user-images.githubusercontent.com/16755298/161801714-24779b2b-0c4d-4aeb-82e3-e7a74214f75b.png]

cahende · 2022-04-05T16:57:40Z

Ahh, I see, thanks for clarifying! So it is working as intended, fantastic.

I also wanted to raise a broader question since I have your attention: I am trying to identify TEs in unassembled natural genomes (the coverage is not high enough for a full assembly, especially in high-repeat regions), so the library I am using is built from TEs identified in a chromosome-level genome build of a colony population. I feel like this method will miss potentially novel TEs circulating in these natural populations, and finding those is the intent of this analysis. Can you suggest how to build a more fitting library, so that I can identify TEs that might not be represented in the colony genome?

Thank you,
Cory


W-L · 2022-04-07T10:01:07Z

That's a tricky one. I think a two-pronged approach might be worth considering in this case.

- Repository-based: Try and collect all relevant sequences from already existing TE databases for the species (and related ones) that you are studying
- De-novo assembly of repeats from raw reads: There are quite a few tools that can do this, but I don't know for which species and coverage they are suitable. Some that come to my mind are RepeatExplorer (https://pubmed.ncbi.nlm.nih.gov/23376349/), dnaPipeTE (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419797/), REPdenovo (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792456/).

You could then, for example, use a combined library of TE sequences from these with deviaTE to quantify the TE content.
A possibly helpful review with lots of links to databases & tools: https://www.nature.com/articles/s41576-018-0050-x#ref-CR77
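
Combining libraries from several sources could be sketched as below. This is a minimal illustration under stated assumptions, not a recommended pipeline: the function names and source labels are hypothetical, and only exact duplicate sequences are collapsed. Near-identical or reverse-complemented copies would need a proper clustering step, which this sketch does not attempt.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the ID only (text up to the first whitespace).
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def merge_libraries(libraries):
    """Merge {source_label: fasta_text} into one list of unique records.

    Exact duplicate sequences (compared case-insensitively) are kept once.
    Names are prefixed with the source label so that IDs from different
    sources cannot clash.
    """
    seen, merged = set(), []
    for label, text in libraries.items():
        for name, seq in read_fasta(text):
            key = seq.upper()
            if key in seen:
                continue
            seen.add(key)
            merged.append((f"{label}_{name}", seq))
    return merged


# Toy input: one database-derived and one de novo library (made-up sequences).
libs = {
    "db": ">te1\nACGT\nAC\n>te2\nGGGG\n",
    "denovo": ">c1\nacgtac\n>c2\nTTTT\n",
}
for name, seq in merge_libraries(libs):
    print(f">{name}\n{seq}")
```

Prefixing each name with its source also makes it easy to see later which hits come from elements that are absent from the reference-derived library.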

cahende · 2022-04-07T18:27:43Z

Thank you for the very useful information. Let me get back to you when I have had a chance to run this. I appreciate your help!

Cory

