manual on the tool output #2

Open
smb20200615 opened this issue May 13, 2021 · 9 comments
Labels
documentation Improvements or additions to documentation

Comments

@smb20200615

smb20200615 commented May 13, 2021

Hello,

Is there a manual that explains the output files? I am interested in seeing which BGCs are shared across a range of genomes. The command to run the tool seems very simple, but I am having trouble interpreting the output.

Many thanks!

@althonos
Member

Hi Sara! There is a small disclaimer on the website (gecco.embl.de) but nothing detailed, so I'm going to explain it right here 😃

Formats

When you run GECCO, you get three types of files:

  • XXX.features.tsv, which contains the genes and protein domains found in your input sequence, in tab-separated values format.
  • XXX.clusters.tsv, which is created if BGCs were detected in your input and contains one line per BGC, in tab-separated values format.
  • the XXX_cluster_N.gbk files, with one GenBank file created for each cluster.

You may also want to run GECCO in verbose mode (gecco -v run ...) to get a better sense of what is going on. Now, about the files:

features.tsv

The features.tsv file is really about domain annotation, so it is probably not very interesting for your use case. This table has one line per protein domain. The columns are:

  • sequence_id: the identifier of the sequence this domain belongs to
  • protein_id: the identifier of the protein this domain belongs to (named after sequence_id, since GECCO handles the gene finding)
  • start, end and strand: the coordinates of the gene within the sequence, and its strand (as + or -)

Since each row corresponds to a domain, and a protein can have several domains, you may see many lines with these values in common. The columns that change from row to row are:

  • domain: The accession of the domain
  • hmm: The HMM library the domain comes from (either Pfam or Tigrfam)
  • i_evalue: The independent e-value that was given to this domain by hmmsearch
  • domain_start and domain_end: the coordinates of the domain in the protein sequence
  • bgc_probability: the probability, assigned by GECCO, that this gene belongs to a BGC.
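
As a rough illustration of the layout described above, here is a minimal Python sketch that groups the domain rows of a features.tsv file by protein. It assumes the file has a header line whose column names match the ones listed in this comment; the function name and the filename in the usage comment are made up for the example.

```python
import csv
from collections import defaultdict


def domains_per_protein(lines):
    """Group domain rows by protein_id.

    `lines` is any iterable of TSV lines (an open file works).
    Assumes a header row with the column names listed above.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(lines, delimiter="\t"):
        grouped[row["protein_id"]].append(
            (
                row["domain"],                  # domain accession
                row["hmm"],                     # HMM library of origin
                float(row["i_evalue"]),         # independent e-value
                float(row["bgc_probability"]),  # per-gene BGC probability
            )
        )
    return dict(grouped)


# Usage (hypothetical filename):
# with open("genome.features.tsv", newline="") as handle:
#     for protein, domains in domains_per_protein(handle).items():
#         print(protein, [accession for accession, *_ in domains])
```

Since the gene-level columns are repeated on every domain row, grouping by protein_id is enough to collapse the table back to one entry per gene.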

clusters.tsv

Here, you get a row for each BGC that was detected in your input:

  • sequence_id: the identifier of the sequence this BGC was found in
  • bgc_id: the identifier given to the BGC by GECCO
  • start and end: the coordinates of the BGC within the sequence
  • average_p: the average probability of the genes in the BGC (sum of the per-gene probabilities / number of genes)
  • max_p: the highest BGC probability among the genes of the BGC
  • type: the predicted BGC type / biosynthetic class
  • alkaloid_probability, polyketide_probability, ripp_probability, saccharide_probability, terpene_probability, nrp_probability, other_probability: the probability that the BGC is of each given type (you can use these to check how confident the assignment reported in the type column is)
  • proteins: a semicolon-separated list of the proteins belonging to the BGC
  • domains: a semicolon-separated list of the protein domains within the BGC
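
To make the per-type probability columns more concrete, the following Python sketch reads a clusters.tsv file and reports, for each BGC, which probability column is the highest next to the reported type. It is only a sketch: it assumes a header row with exactly the column names listed above, and the function name is invented for the example.

```python
import csv

# Per-type probability columns, as named in the list above.
TYPE_COLUMNS = [
    "alkaloid_probability", "polyketide_probability", "ripp_probability",
    "saccharide_probability", "terpene_probability", "nrp_probability",
    "other_probability",
]


def summarize_clusters(lines):
    """Return one summary dict per BGC row of a clusters.tsv file."""
    summaries = []
    for row in csv.DictReader(lines, delimiter="\t"):
        probabilities = {name: float(row[name]) for name in TYPE_COLUMNS}
        best = max(probabilities, key=probabilities.get)
        summaries.append({
            "bgc_id": row["bgc_id"],
            "type": row["type"],
            "best_type_column": best,
            "best_type_probability": probabilities[best],
            # The proteins column is a semicolon-separated list.
            "proteins": row["proteins"].split(";"),
        })
    return summaries
```

Comparing `type` with `best_type_column` is a quick way to spot BGCs whose type assignment rests on a weak probability.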

GenBank files

Each GenBank file contains only the sequence of the BGC; genes found by Prodigal are annotated as CDS features, and domains found by HMMER are annotated as misc_feature features.
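
For a quick sanity check of a cluster GenBank file, the sketch below counts how many entries of each kind (for example CDS and misc_feature) appear in its feature table. It uses only the Python standard library and relies on the fixed-column layout of the GenBank flat-file format (feature keys start in column 6); a real parser such as Biopython would be more robust, and the function name here is made up for the example.

```python
from collections import Counter


def count_feature_keys(lines):
    """Count feature keys in the FEATURES table of a GenBank flat file.

    `lines` is any iterable of text lines (an open .gbk file works).
    """
    counts = Counter()
    in_features = False
    for line in lines:
        if line.startswith("FEATURES"):
            in_features = True
        elif line.startswith("ORIGIN") or line.startswith("//"):
            in_features = False
        elif in_features and line.startswith("     ") and len(line) > 5 and line[5] != " ":
            # Feature keys occupy columns 6-21; qualifier and wrapped
            # location lines are indented further and are skipped.
            counts[line[5:21].strip()] += 1
    return counts
```

The number of CDS entries should match the number of proteins listed for that BGC in clusters.tsv, which makes this a cheap consistency check.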

Checking similar BGCs

For this, I'd recommend having a look at MMseqs2, in particular the linclust command. Once you are done finding BGCs with GECCO, you can just use that to check if the ones you found cluster together, and then map that back to the genomes they originate from. Notably, if you use --cov-mode=1 you should be able to detect fragmented BGCs at the nucleotide level. Note that it only helps to detect BGCs with the same synteny (because you stay at the nucleotide level).

Otherwise, there are more dedicated tools such as BiG-SLiCE for exploring similar BGCs, but you may need to convert the inputs to the proper format to make it work.
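
For the MMseqs2 route, the "map that back to the genomes" step could look roughly like the Python sketch below. It assumes the clustering result is available as a two-column TSV (representative, member), which is the layout MMseqs2 writes for cluster results, and that you have built your own dictionary from BGC identifier to genome name; both function names and that dictionary are made up for the example.

```python
import csv
from collections import defaultdict


def read_pairs(lines):
    """Read (representative, member) pairs from a two-column cluster TSV."""
    return [tuple(row[:2]) for row in csv.reader(lines, delimiter="\t") if row]


def shared_bgc_groups(cluster_pairs, bgc_to_genome):
    """Return the sequence clusters whose members come from several genomes.

    `bgc_to_genome` is a user-built mapping (an assumption of this sketch)
    from each BGC identifier to the genome it was detected in.
    """
    genomes = defaultdict(set)
    for representative, member in cluster_pairs:
        genomes[representative].add(bgc_to_genome[member])
    return {
        representative: sorted(found_in)
        for representative, found_in in genomes.items()
        if len(found_in) > 1
    }
```

Each key of the result is the representative BGC of a sequence cluster, and the value lists the genomes that share a BGC from that cluster.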

@althonos althonos added the documentation Improvements or additions to documentation label May 13, 2021
@smb20200615
Author

Thank you so much for your reply. Just to clarify, I see that BiG-SLiCE can work with the output of DeepBGC and antiSMASH. Is it also possible to run it with the output of your tool? I have the gbk file but also need a csv file describing all region coordinates. https://github.com/medema-group/bigslice/blob/master/misc/generate_antismash_gbk/generate_antismash_gbk.py

@althonos
Member

@smb20200615 : you can find the region coordinates in the {something}.clusters.tsv file among the GECCO outputs. You'll just need to adapt the script you linked to load from that 👍
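
As a starting point for that adaptation, here is a minimal Python sketch that pulls the region coordinates out of a clusters.tsv file and writes them as CSV. It assumes the header names match the clusters.tsv columns described earlier in this thread; the output column layout is a placeholder and is not taken from the linked script, so it would need to be adjusted to whatever that script expects.

```python
import csv

# Placeholder output columns; adjust to what the consuming script expects.
REGION_FIELDS = ["sequence_id", "bgc_id", "start", "end"]


def clusters_to_regions(lines):
    """Extract one region record per BGC row of a clusters.tsv file."""
    regions = []
    for row in csv.DictReader(lines, delimiter="\t"):
        regions.append({
            "sequence_id": row["sequence_id"],
            "bgc_id": row["bgc_id"],
            "start": int(row["start"]),
            "end": int(row["end"]),
        })
    return regions


def write_regions_csv(regions, handle):
    """Write the region records as CSV to an open, writable text handle."""
    writer = csv.DictWriter(handle, fieldnames=REGION_FIELDS)
    writer.writeheader()
    writer.writerows(regions)
```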

@althonos
Member

Hi @smb20200615,

In v0.7.0 I added a dedicated subcommand to help use GECCO results with BiG-SLiCE without having to write the conversion script yourself. Have a look at the new documentation page for BiG-SLiCE integration!

@smb20200615
Author

Thank you! Is this version downloadable via bioconda?

@althonos
Member

althonos commented Jun 1, 2021

@smb20200615 it will be soon, I need to address #3 first.

@smb20200615
Author

@althonos, thank you so much for your help. I just tried the region gbk files output by gecco convert, and they are still not parsed correctly by tools such as BiG-SCAPE. Is there any way to generate them so that they more closely resemble the antiSMASH gbks? I am not sure what the difference between the two is.

@althonos
Member

althonos commented Jun 4, 2021

Ah, I haven't tried with BiG-SCAPE, I'll see if there is a way to make the GenBank files compatible. IIRC, the issue is that BiG-SCAPE expects the GenBank files to label genes by kind (e.g. biosynthetic, transport, regulatory), but GECCO does not do that, and there is no simple way to get those labels without an extra round of annotation with HMMER and AntiSMASH smCOGs.

Another issue is that AntiSMASH GenBank files (1) have non-standard features and qualifiers that often make sense only in an AntiSMASH context, and (2) have different type predictions compared to GECCO (and MIBiG or DeepBGC).

@smb20200615
Author

Makes sense. Are there any other methods for clustering with known BGCs? I am not fully sure what you used in your paper.
