
output explanations for "diamond.out", "hmmer.out","dtemp.out","overview.txt" #138

Closed
Jigyasa3 opened this issue Dec 6, 2023 · 6 comments

Comments

@Jigyasa3

Jigyasa3 commented Dec 6, 2023

Hi @linnabrown , @yinlabniu , @AaronAOliver

Thank you for a great resource! I am looking through the output of the run_dbcan script and was wondering if you could point me in the right direction. My outputs are different from those in #127.

Command used:
run_dbcan ${IN_DIR}/${file1} prok --out_dir ${OUT_DIR}/ --db_dir ${DB_DIR}/ --use_signalP=TRUE -sp /shared/software/signalp --cgc_substrate

Output:

  1. diamond.out: There appears to be a single hit per Gene.ID. Does that mean the file is already filtered for the best hit by E-value, so that we can use it directly without any further filtering? It looks similar to the blast.out file in Interpretation of output results #127.
  2. hmmer.out: It lists an HMM profile per Gene.ID, but some Gene.IDs have multiple hits. What is the difference between hmmer.out and diamond.out? Is hmmer.out the final file for Gene.ID annotation, or does it need to be filtered?
  3. dbsub.out is also similar to the one in Interpretation of output results #127, but the command I ran (shown above) does not generate sub.prediction.out.
  4. overview.txt is what is mentioned in the README.md file. Is this the final file to examine for CGCs and substrates?
  5. In the overview.txt file there are 6 columns (namely EC., HMMER, dbCAN_sub, DIAMOND, Signalp, X.ofTools). How do I extract the best hit per Gene.ID? Can I say that if X.ofTools is more than 3, I can trust the annotation?
    Currently, I am filtering the overview.txt file to keep the rows where EC. is not empty, and adding the substrate info from dbsub.out to the filtered overview.txt table.
    For example, one of the hits in overview.txt is GH1_e65. This hit maps to the beta-galactan substrate in the dbsub.out file. Is that the correct way to proceed?
    At the same time, I don't get any CGC output. Why do you think that is happening?

Looking forward to your reply!
Jigyasa

@AaronAOliver
Contributor

Hello Jigyasa,

I am not a developer; I only made a very small code contribution as a user, so I cannot answer your questions with certainty. But here is my experience as someone who uses this program regularly and is familiar with the code:

  1. The diamond.out file is generated by running diamond against a protein database with the parameter -k 1, which means that diamond returns only a single target CAZyme annotation per gene. It also uses a strict E-value cutoff, -e 1e-102, to keep only good hits. So this file is not filtered after running diamond, and it only includes the best hit.

  2. The hmmer.out file includes all valid hits based on HMMs. It seems like the overall best hit used for the final annotation is based on the HMM hit with the lowest evalue and highest coverage.

  3. sub.prediction.out gives substrate predictions for CGCs. Based on what you wrote in question 5, the reason you don't get this file is that you didn't get any predicted CGCs.

  4. overview.txt is a summary of individual protein annotations, while cgc.out summarizes annotated gene clusters. I personally tend to use the annotation associated with each gene in cgc.out as my personal "final" annotation but I don't know if that is the recommended usage.

  5. I would say that the count of X.ofTools is less important than making sure all of the annotations agree. Substrate prediction for whole clusters tends to be better than for individual enzymes. I think the larger issue is not getting any CGCs out of the software. I would try downloading a genome with CGCs from this group's dbCAN_seq database (https://bcb.unl.edu/dbCAN_seq/) and running that through your installation to make sure everything is working properly.
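
To make the "do the annotations agree" check from point 5 concrete, here is a minimal Python sketch. It is only an illustration and not part of run_dbcan. It assumes that overview.txt is tab-separated with a header row, that the columns are named "Gene ID", "HMMER", "dbCAN_sub" and "DIAMOND", that "-" means no hit, that "+" joins multiple domains, and that suffixes such as "(5-470)", "_e65" or "_2" can be stripped to recover the family. The helper names and the sample rows are made up for this example.

```python
import csv
import io
import re

# Assumed header names of the three annotation columns in overview.txt.
TOOL_COLUMNS = ("HMMER", "dbCAN_sub", "DIAMOND")


def families(cell):
    """Return the set of CAZy families named in one overview.txt cell."""
    result = set()
    for part in cell.split("+"):          # "+" is assumed to join domains
        part = part.strip()
        if part in ("", "-"):             # "-" is assumed to mean "no hit"
            continue
        part = re.sub(r"\(.*\)$", "", part)  # drop a trailing "(start-end)"
        part = re.sub(r"_e\d+$", "", part)   # drop a dbCAN_sub suffix like "_e65"
        part = re.sub(r"_\d+$", "", part)    # drop a subfamily suffix like "_2"
        result.add(part)
    return result


def tools_agree(row):
    """True if at least two tools made a call and all calls share a family."""
    called = [s for s in (families(row[c]) for c in TOOL_COLUMNS) if s]
    return len(called) >= 2 and bool(set.intersection(*called))


def agreeing_genes(handle):
    """List the gene IDs whose tool annotations agree."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Gene ID"] for row in reader if tools_agree(row)]


# Invented rows in the assumed layout, used instead of a real overview.txt.
SAMPLE = (
    "Gene ID\tEC#\tHMMER\tdbCAN_sub\tDIAMOND\tSignalp\t#ofTools\n"
    "gene_1\t-\tGH1(5-470)\tGH1_e65\tGH1\tN\t3\n"
    "gene_2\t-\tGH5_2(30-300)\t-\tGH9\tN\t2\n"
    "gene_3\t-\t-\t-\tGT2\tN\t1\n"
)

print(agreeing_genes(io.StringIO(SAMPLE)))  # prints ['gene_1']
```

To run it on a real file, pass `open("overview.txt", newline="")` instead of the `io.StringIO` sample, after checking that the header names match your output.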

Best,
Aaron

@yinlabniu
Collaborator

yinlabniu commented Dec 7, 2023 via email

@Jigyasa3
Author

Jigyasa3 commented Dec 7, 2023

Thank you so much for replying @yinlabniu and @AaronAOliver ,

I am working with very fragmented plasmid genomes, so maybe that explains the lack of CGCs in the output. I just wanted to make sure that the command I was using was correct: run_dbcan ${IN_DIR}/${file1} prok --out_dir ${OUT_DIR}/ --db_dir ${DB_DIR}/ --use_signalP=TRUE -sp /shared/software/signalp --cgc_substrate
I really appreciate the detailed replies. I will definitely test on a positive control sample to make sure that my installation is working correctly.

Thank you!

@Jigyasa3
Author

Jigyasa3 commented Dec 9, 2023

Hi @AaronAOliver and @yinlabniu ,

I took your advice and tested the installed software on a previously analyzed genome (MGYG000002712).
I am getting the cgc.out file and dbsub.out file. So the code works!

a) I don't understand the columns in the cgc.out file and how they connect to the dbsub.out file. For example, the software finds that CGC1 contains MGYG000002712_77_9, the only protein with a substrate annotation. But this protein has multiple domains, GH5_e273 and CBM2_e118, which are annotated as degrading different substrates. Which one should be used?
b) What are the column names in the cgc.out file?
Are columns 7 and 8 genomic positions?
What does the column 11 annotation DB=gnl|TC-DB|Q48476|3.A.1.103.1;ID=MGYG000002712_5_16 DB=gnl|TC-DB|Q48476| mean?

c) Does the cgc.out file need to be filtered, or can I summarize the results from this file as it is?

I am attaching the output files for reference in case my understanding is wrong.
dbsub.out.txt
cgc.out.txt

Looking forward to your reply!

@yinlabniu
Collaborator

Please see #127 for the answer about dbsub.out.

The cgc.out format is explained at https://bcb.unl.edu/dbCAN_seq_old/help.php. It is still hard to understand, though, which is why we made cgc_standard.out, a simplified version of cgc.out. The columns in cgc_standard.out are CGC_id, type, contig_id, gene_id, start, end, strand, annotation.
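
As a rough illustration of how the simplified file could be summarized, the Python sketch below groups genes by cluster. It assumes that cgc_standard.out is tab-separated, starts with one header row, and has the eight columns listed above in that order. It also keys each cluster on contig_id plus CGC_id, in case cluster numbering restarts on each contig. The helper name `summarize_cgcs` and the sample rows are invented for the example, and the annotation values are placeholders rather than real dbCAN output.

```python
import csv
import io
from collections import Counter, defaultdict

# Column order as described for cgc_standard.out (assumed to match the file).
COLUMNS = ["CGC_id", "type", "contig_id", "gene_id",
           "start", "end", "strand", "annotation"]


def summarize_cgcs(handle):
    """Map (contig_id, CGC_id) to the span and gene-type counts of a cluster."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    next(reader, None)  # skip the assumed header row
    clusters = defaultdict(list)
    for row in reader:
        clusters[(row["contig_id"], row["CGC_id"])].append(row)

    summary = {}
    for key, genes in clusters.items():
        summary[key] = {
            "start": min(int(g["start"]) for g in genes),
            "end": max(int(g["end"]) for g in genes),
            "types": dict(Counter(g["type"] for g in genes)),
        }
    return summary


# Invented rows in the assumed layout, used instead of a real cgc_standard.out.
SAMPLE = "\t".join(COLUMNS) + "\n" + (
    "CGC1\tCAZyme\tcontig_77\tgene_9\t100\t1500\t+\tplaceholder\n"
    "CGC1\tTC\tcontig_77\tgene_10\t1600\t2800\t+\tplaceholder\n"
    "CGC1\tCAZyme\tcontig_5\tgene_16\t50\t900\t-\tplaceholder\n"
)

for key, info in summarize_cgcs(io.StringIO(SAMPLE)).items():
    print(key, info)
```

To use it on real output, pass `open("cgc_standard.out", newline="")` instead of the `io.StringIO` sample, after confirming that the file layout matches these assumptions.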

@linnabrown
Owner

Thank you very much! We have already rewritten our README in Read the Docs format; any suggestions and comments on it are welcome. In addition, we have updated our tools with several additional functions.

https://dbcan.readthedocs.io/en/latest/index.html
