output explanations for "diamond.out", "hmmer.out","dtemp.out","overview.txt" #138
Thanks, Aaron, for providing all these excellent answers to Jigyasa. All your answers are correct, and I am happy that you really understand run_dbcan output despite our sloppy documentation in the readme. Just some additional information:
2. hmmer.out is already parsed with the best cazyme domain hits for each query protein.
4. overview.txt is the final cazyme annotation file. It is not for CGCs. Keeping those with >=2 tools is our recommendation. Not all cazymes are located in CGCs, so those not in CGCs but with support from >=2 tools are still highly likely cazymes. Even those without EC predictions are still good cazyme candidates.
5. Yes, dbsub.out can be used to extract predicted substrates for cazymes. I do not know why you didn't get any CGCs. One possible reason is that your query genome/contig is too fragmented and no CGCs are found.
Yanbin
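For readers who want to apply the ">=2 tools" recommendation, here is a minimal sketch in Python. It assumes overview.txt is tab-separated and that the count column is named "#ofTools" (R renames it to X.ofTools on import); the sample rows and their annotation strings are made up for illustration, so check the header of your own file before relying on it.

```python
import csv
import io

# Hypothetical excerpt of overview.txt. The header names are an assumption;
# the rows are invented for illustration only.
SAMPLE = (
    "Gene ID\tEC#\tHMMER\tdbCAN_sub\tDIAMOND\t#ofTools\n"
    "gene_1\t-\tGH1(5-480)\tGH1_e65\tGH1\t3\n"
    "gene_2\t-\t-\t-\tGT2\t1\n"
    "gene_3\t-\tGT4(10-200)\t-\tGT4\t2\n"
)


def filter_by_tool_support(handle, min_tools=2, count_column="#ofTools"):
    """Return the rows supported by at least `min_tools` tools."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if int(row[count_column]) >= min_tools]


kept = filter_by_tool_support(io.StringIO(SAMPLE))
kept_ids = [row["Gene ID"] for row in kept]
print(kept_ids)
```

Note that gene_1 and gene_3 are kept even though their EC column is empty, which matches the point above that proteins without EC predictions can still be good CAZyme candidates. To run it on a real file, pass an open file handle instead of the in-memory sample.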
From: Aaron Oliver ***@***.***>
Sent: Wednesday, December 6, 2023 5:12 PM
To: linnabrown/run_dbcan ***@***.***>
Cc: Yanbin Yin ***@***.***>; Mention ***@***.***>
Subject: Re: [linnabrown/run_dbcan] output explanations for "diamond.out", "hmmer.out","dtemp.out","overview.txt" (Issue #138)
Hello Jigyasa,
I am not a developer, I only made a very small code contribution as a user, so I cannot answer your questions for sure. But here is my experience as someone who uses this program regularly and is familiar with the code:
1. The diamond.out file is generated using diamond against a protein database with the parameter -k 1, which means that diamond will only return a single target CAZyme annotation per gene. It also uses a low evalue, -e 1e-102, to keep only good hits. So, this file is not filtered after running diamond and only includes the best hit.
2. The hmmer.out file includes all valid hits based on HMMs. It seems like the overall best hit used for the final annotation is based on the HMM hit with the lowest evalue and highest coverage.
3. sub.prediction.out gives substrate prediction for CGCs. Based on your response to q5, the reason you don't get this file is because you didn't get any predicted CGCs.
4. overview.txt is a summary of individual protein annotations, while cgc.out summarizes annotated gene clusters. I personally tend to use the annotation associated with each gene in cgc.out as my personal "final" annotation but I don't know if that is the recommended usage.
5. I would say that the count of X.ofTools is less important than making sure all of the annotations agree. Substrate prediction for whole clusters tends to be better than for individual enzymes. I think the larger issue is not getting any CGCs out of the software. I would try downloading a genome with CGCs from this group's dbCAN_seq database (https://bcb.unl.edu/dbCAN_seq/) and running that through your installation to make sure everything is working properly.
Best,
Aaron
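The suggestion in point 5 to check that the annotations agree can be sketched roughly as follows. This is an illustration only, not how run_dbcan itself compares tools: the annotation formats ("GH1(5-480)", "GH1_e65", "GH1"), the use of "-" for a missing hit, and the helper names family and tools_agree are all assumptions, and only the first domain of a multi-domain annotation is considered.

```python
import re


def family(annotation):
    """Return the leading CAZy family token, e.g. 'GH1' from 'GH1_e65'.

    Simplification: only the first family in the string is extracted.
    """
    match = re.match(r"[A-Za-z]+\d+", annotation)
    return match.group(0) if match else None


def tools_agree(hmmer, dbcan_sub, diamond):
    """True if every tool that reported a hit points to the same family."""
    # "-" and "" are assumed to mean "no hit" and are skipped.
    hits = [a for a in (hmmer, dbcan_sub, diamond) if a not in ("", "-")]
    families = {family(a) for a in hits}
    families.discard(None)
    return len(families) == 1


agree = tools_agree("GH1(5-480)", "GH1_e65", "GH1")
disagree = tools_agree("GH1(5-480)", "-", "GT2")
print(agree, disagree)
```

A row where the tools disagree is worth inspecting by hand rather than discarding automatically, since the sketch ignores subfamilies and additional domains.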
|
Thank you so much for replying, @yinlabniu and @AaronAOliver. I am working with very fragmented plasmid genomes, which may explain the lack of CGCs in the output. I just wanted to make sure that the code that I was using: Thank you! |
Hi @AaronAOliver and @yinlabniu, I took your advice and tested the installed software on a previously analyzed genome (MGYG000002712). a) But I don't understand the columns in c) Does the I am attaching the output files for reference in case my understanding is wrong. Looking forward to your reply! |
Please see #127 for the answer about dbsub.out. cgc.out is explained at https://bcb.unl.edu/dbCAN_seq_old/help.php, but it is still hard to understand, which is why we made cgc_standard.out, a simplified version of cgc.out. The columns in cgc_standard.out are CGC_id, type, contig_id, gene_id, start, end, strand, annotation. |
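As a rough illustration of how the eight columns listed above can be used, the sketch below groups genes by CGC_id. It assumes cgc_standard.out is tab-separated; whether your file starts with a header line should be checked, hence the has_header switch. The sample rows and the helper name group_by_cgc are invented for the example.

```python
import csv
import io
from collections import defaultdict

# Column order as listed in the comment above.
COLUMNS = ["CGC_id", "type", "contig_id", "gene_id",
           "start", "end", "strand", "annotation"]

# Hypothetical rows; real annotation strings may look different.
SAMPLE = (
    "CGC1\tCAZyme\tcontig_1\tgene_5\t1000\t2500\t+\tGH1\n"
    "CGC1\tTC\tcontig_1\tgene_6\t2600\t3900\t+\t2.A.1\n"
    "CGC2\tCAZyme\tcontig_2\tgene_40\t500\t1700\t-\tGT2\n"
)


def group_by_cgc(handle, has_header=False):
    """Map each CGC_id to the list of gene records it contains."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line if the file has one
    clusters = defaultdict(list)
    for fields in reader:
        record = dict(zip(COLUMNS, fields))
        clusters[record["CGC_id"]].append(record)
    return dict(clusters)


clusters = group_by_cgc(io.StringIO(SAMPLE))
cgc_sizes = {cgc_id: len(genes) for cgc_id, genes in clusters.items()}
print(cgc_sizes)
```

An empty result from a real file would be consistent with the earlier point that very fragmented contigs may yield no CGCs at all.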
Thank you very much! We have rewritten our README in Read the Docs format; any suggestions and comments on it are welcome. In addition, we have updated our tools with several additional functions. |
Hi @linnabrown , @yinlabniu , @AaronAOliver
Thank you for a great resource! I am looking through the output of run_dbcan script and was wondering if you could guide me in the right direction. My outputs are different from #127
script used-
run_dbcan ${IN_DIR}/${file1} prok --out_dir ${OUT_DIR}/ --db_dir ${DB_DIR}/ --use_signalP=TRUE -sp /shared/software/signalp --cgc_substrate
output-
1. diamond.out - There appears to be a single output per Gene.ID. Does that mean that the file is already pre-processed for best hit and E.value and we can directly examine it without any filtration of results? It looks similar to blast.out file in Interpretation of output results #127
2. hmmer.out - It has HMM profile per Gene.ID. But there are some Gene.IDs with multiple hits. What's the difference between hmmer.out and diamond.out? Is hmmer.out the final file for Gene.ID annotation or it needs to be filtered?
3. dbsub.out is also similar to Interpretation of output results #127 but the code I ran (as shown above) do not generate sub.prediction.out.
4. overview.txt is what is mentioned in the README.md file. Is this the final file to examine the CGCs and substrates?
5. In the overview.txt file, there are 6 columns (namely, EC., HMMER, dbCAN_sub, DIAMOND, Signalp, X.ofTools). How do I extract the best hit per Gene.ID? Can I say that if X.ofTools is more than 3, I can trust the annotation? Currently, I am filtering the overview.txt file to extract columns where EC. is not empty, and adding the substrate info from dbsub.out to the filtered overview.txt file. For example, one of the hits in overview.txt is GH1_e65. This hit maps to beta-galactan substrate in dbsub.out file. Is that the correct way to proceed? At the same time, I don't have the CGCs output. Why do you think that's happening?
Looking forward to your reply!
Jigyasa