Won't identify TEs that are created with EDTA for non-model organism #10

Open
cahende opened this issue Mar 29, 2022 · 11 comments

cahende commented Mar 29, 2022

Hello,

I am trying to run this on a set of raw sequences for Anopheles gambiae. I used EDTA to create a TE library from the agamP4 genome assembly and then used my raw sequences as input for this pipeline to identify which TEs are present in each of our samples. Following trimming/mapping, the pipeline attempts to identify TEs, but I get the following error for every TE identified by EDTA.

Starting analysis of [TE] in [RAW DATA]-final.fastq.fused.sort.bam..

No annotaions found for: [TE]

Traceback (most recent call last):
  File "/home/ch943/bin/miniconda/envs/deviaTE_env/bin/deviaTE_analyse", line 100, in <module>
    sample.write_frame(out=args.output + '.raw', insertions=ihat, command=comm, t=timestamp, norm='raw')
  File "/home/ch943/bin/miniconda/envs/deviaTE_env/lib/python3.6/site-packages/deviaTE/deviaTE_pileup.py", line 204, in write_frame
    with open(out, 'w') as outfile:
FileNotFoundError: [Errno 2] No such file or directory: '[RAW DATA]-final.fastq.[TE].raw'

Any guidance would be appreciated.

Owner

W-L commented Mar 29, 2022

Hi! Thanks for reporting this. It looks like the code has trouble writing the results to a file. My first guesses would be:

  • the actual string behind [RAW DATA] or [TE] contains a character that turns it into an invalid file path, e.g. / or a space. It seems odd, though, that this happens for all TEs.
  • the permissions of the directory it tries to write to could be another issue, but then I would expect a different error (such as a PermissionError).

Would you mind sharing the command used to run deviaTE? And maybe double-check that the library of TE sequences is a valid fasta file?
cheers
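
A quick way to test the first guess is to scan the FASTA headers of the TE library for characters that could break a file path. The following is only a minimal sketch: the set of "safe" characters, the function name `find_unsafe_headers`, and the example headers are assumptions for illustration, not deviaTE's own rules or real EDTA output.

```python
import re

# Anything outside this set is flagged. The choice of "safe" characters is an
# assumption for this sketch, not a rule taken from deviaTE.
UNSAFE = re.compile(r"[^A-Za-z0-9._-]")


def find_unsafe_headers(fasta_lines):
    """Return (sequence name, sorted unsafe characters) for each flagged header."""
    flagged = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        name = line[1:].strip()
        bad = sorted(set(UNSAFE.findall(name)))
        if bad:
            flagged.append((name, bad))
    return flagged


# Hypothetical headers, for illustration only
example = [">TE_1#DNA/Helitron", "ACGT", ">clean_name", "ACGT"]
for name, bad in find_unsafe_headers(example):
    print(name, bad)
```

To check a real library, the same function can be given an open file handle, e.g. `find_unsafe_headers(open("TE_library.fa"))`, where the file name is a placeholder.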

Author

cahende commented Mar 29, 2022 via email

Author

cahende commented Mar 31, 2022 via email

Author

cahende commented Mar 31, 2022 via email

Owner

W-L commented Apr 5, 2022

Hi! Glad that your original issue was solved. To be fair, deviaTE should probably check for such situations itself. I'll implement a fix for that.
Concerning your second question: if no reads map to a TE reference, deviaTE should print a message like this:

...
******************** Analysis
Starting analysis of [TE] in [BAM-FILE]..

No reads mapped to the specified reference sequence
...

The program should then exit without producing any output. Hope this helps!
Lukas

Owner

W-L commented Apr 5, 2022

I added a check to replace invalid characters in TE names, which should prevent the original error (10d2b70). I'm not going to make a new release of the package at this point. But if you would like to make use of this change, you can replace the code file on your computer with the updated one (bin/deviaTE_analyse in this repository). If you installed the tool via conda, it should be located somewhere along the lines of:

~/miniconda3/envs/deviaTE_env/bin/deviaTE_analyse 
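
For readers who want to see the idea rather than patch the file, here is a minimal sketch of this kind of check. It is not the code from the commit: the function names `safe_name` and `output_path`, the set of allowed characters, and the example names are assumptions. The output name pattern follows the one visible in the traceback above, and the dash replacement follows the description later in this thread.

```python
import re


def safe_name(te_name, replacement="-"):
    """Replace characters that are risky in file names (e.g. '/', '#', spaces)."""
    # The allowed character set is an assumption made for this sketch.
    return re.sub(r"[^A-Za-z0-9._-]", replacement, te_name)


def output_path(input_file, te_name, suffix=".raw"):
    """Build an output file name of the form <input>.<TE name><suffix>."""
    return f"{input_file}.{safe_name(te_name)}{suffix}"


# Hypothetical input file and TE name, for illustration only
print(output_path("sample-final.fastq", "TE_1#DNA/Helitron"))
```

Without the replacement, a "/" in the TE name would be read as a directory separator, and opening the file for writing would fail with FileNotFoundError if that directory does not exist.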

Author

cahende commented Apr 5, 2022 via email

Owner

W-L commented Apr 5, 2022

No problem! Forgot to mention that the fix is basically replacing problematic characters with dashes, so that the analysis can proceed without issues.
The message about "no annotations" refers to the optional parameter --annotation. This can be used to provide GFF3 files with annotations of the TE sequences, e.g. the location of CDS and other defined genetic elements. These will mainly be used in the visualisation, e.g. at the bottom of this one:
[image]
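
For orientation, a GFF3 file is a tab-separated text file with nine columns per feature (seqid, source, type, start, end, score, strand, phase, attributes). The sketch below builds one such line in Python. The TE name, coordinates, and attribute values are made up, and the assumption that the seqid column should match the TE name in the library is mine; which feature types deviaTE actually draws is not covered here.

```python
def gff3_line(seqid, source, ftype, start, end, attributes, strand="+", phase="."):
    """Build one GFF3 feature line: nine tab-separated columns, 1-based coordinates."""
    attr = ";".join(f"{key}={value}" for key, value in attributes.items())
    # Column 6 (score) is left as "." because no score is available.
    return "\t".join([seqid, source, ftype, str(start), str(end), ".", strand, phase, attr])


# Hypothetical annotation of a coding region on a TE called "TE_example"
lines = [
    "##gff-version 3",
    gff3_line("TE_example", "manual", "CDS", 100, 900, {"ID": "cds1", "Name": "ORF1"}, phase="0"),
]
print("\n".join(lines))
```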

Author

cahende commented Apr 5, 2022 via email

Owner

W-L commented Apr 7, 2022

That's a tricky one. I think a two-pronged approach might be worth considering in this case.

You could then, for example, use a combined library of TE sequences from these with deviaTE to quantify the TE content.
A possibly helpful review with lots of links to databases & tools: https://www.nature.com/articles/s41576-018-0050-x#ref-CR77

Author

cahende commented Apr 7, 2022 via email
