over-prediction?over-extended? #2

Open
igortru opened this issue Sep 9, 2020 · 10 comments
Labels: bug (Something isn't working), question (Further information is requested)

Comments

@igortru

igortru commented Sep 9, 2020

hundreds or thousands of “extra” genes per genome,
but a black box is a black box.

For example, PGAP reports 434 genes on this plasmid:
https://www.ncbi.nlm.nih.gov/protein?LinkName=nuccore_protein&from_uid=1887424984
but Balrog reports 499. Are they real?

Another issue: the protein start position.
NC_016612.1 Balrog CDS 488100 488564 . + 0 inference=ab initio prediction:Balrog;product=hypothetical protein
Does it choose the longest possible ORF?

I prefer the shorter version, which corresponds to the conserved domain.
Compare with:
ref|WP_014226776.1|
ref|WP_016239709.1|
ref|WP_016247216.1|
ref|WP_021555225.1|
ref|WP_023303013.1|
ref|WP_032154610.1|
ref|WP_060617254.1|
ref|WP_109862812.1|
ref|WP_135564269.1|
ref|WP_171279038.1|
ref|WP_172833565.1|
ref|WP_172901084.1|
ref|WP_172949498.1|
ref|WP_181654360.1|

@Markusjsommer
Contributor

Markusjsommer commented Sep 9, 2020

Hi Igortru,

Balrog tends to predict fewer genes than other gene finders with default parameters on complete bacterial genomes. We did not look specifically at plasmids, but it's an interesting place for comparison. I'm not super familiar with all the steps in PGAP, especially with plasmids, but Balrog is intended as one piece in a larger pipeline, more akin to Prodigal/Glimmer/GeneMark than a replacement for a whole pipeline like Prokka or PGAP.

Looking at the "Klebsiella aerogenes strain RHBSTW-00938 plasmid pRHBSTW-00938_2, complete sequence" plasmid you mentioned, Balrog predicts 497 genes. I ran GeneMarkS-2 as a quick comparison on the same sequence and it predicted 544 genes, so it appears that here Balrog actually predicts fewer.
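Raw gene counts only say so much. A common way to compare two sets of predictions is to match CDS features by contig, strand, and 3' end, so that calls which differ only in their start site still count as the same gene. The snippet below is a minimal sketch of that idea, assuming both tools write standard tab-separated GFF3; the function names `read_cds_keys` and `compare_predictions` are made up for illustration and are not part of Balrog, PGAP, or GeneMarkS-2.

```python
def read_cds_keys(gff_text):
    """Return a set of (seqid, strand, stop_coordinate) for every CDS feature."""
    keys = set()
    for line in gff_text.splitlines():
        # Skip blank lines and GFF comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        # The 3' end is `end` on the + strand and `start` on the - strand.
        stop = end if strand == "+" else start
        keys.add((seqid, strand, stop))
    return keys


def compare_predictions(gff_a, gff_b):
    """Count CDS shared by both annotations and CDS unique to each one."""
    a, b = read_cds_keys(gff_a), read_cds_keys(gff_b)
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}
```

Genes in `only_a` or `only_b` are the ones worth inspecting by hand; genes in `shared` may still disagree on the start codon.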

For the start sites it's a bit more complicated. Balrog takes into account each potential ORF's start codon and the sequence around it, as well as the length. All else being equal, Balrog will tend to choose longer genes, but good hits with the Translation Initiation Site (TIS) model, or incompatibilities with other high-scoring genes, can shift the start site of any individual gene to be shorter. The global maximal gene score is found, rather than the maximal score for any one gene. I would still trust start sites based on evolutionary conservation more than the predictions of a gene finder, so a good step after running Balrog may be a better start site predictor which takes into account more complex information like that.
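To illustrate the difference between a per-gene maximum and a global maximum, here is a minimal sketch using classic weighted interval scheduling. It is not Balrog's actual implementation: it assumes that candidate genes may not overlap at all and that each candidate already has a single combined score, and the name `best_gene_set` and the scores in the usage note are hypothetical.

```python
from bisect import bisect_right


def best_gene_set(candidates):
    """Pick non-overlapping candidates with the highest total score.

    candidates: list of (start, end, score) with 1-based inclusive coordinates.
    Returns (total_score, chosen_candidates).
    """
    cands = sorted(candidates, key=lambda c: c[1])  # sort by end coordinate
    ends = [c[1] for c in cands]
    # best[i] holds the best (score, chosen) using only the first i candidates.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(cands):
        # j = how many of the first i candidates end strictly before `start`.
        j = bisect_right(ends, start - 1, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]
```

With candidates `(100, 564, 5.0)` and `(160, 564, 4.5)` for the long and short versions of one gene, plus an upstream gene `(50, 130, 2.0)`, the long version has the best individual score but collides with the upstream gene, so the global optimum keeps the upstream gene and the shorter start.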

@Markusjsommer Markusjsommer added the question Further information is requested label Sep 9, 2020
@igortru
Author

igortru commented Sep 10, 2020

> but Balrog would be intended as one piece in the larger pipeline

imho:
After 30 years of genome annotation pipeline development, starting from scratch to invent a smarter ORF finder is like developing a new operating system.

It can be a very good exercise, but there are much more difficult and interesting problems for deep learning, for example, improving already existing annotations.

Just an idea:
take all non-hypothetical protein names from RefSeq and the corresponding protein sequences
(100M unique sequences, many of them with good names, thanks to Daniel Haft),
and try to find distributed sequence motifs that would allow predicting protein names and fixing incorrect ones.

I am absolutely sure it is possible, and it is a really interesting problem.

P.S.
We really need a command-line version of your tool;
it would allow running it in batch mode and checking it more thoroughly.
I want to compare it with PHANOTATE.

@Markusjsommer
Contributor

Ideally, Balrog would not be entirely replacing whole annotation pipelines. Rather, it's an attempt at using the vast amounts of data we have to train a more complex gene model than would have been possible 10 years ago. Hopefully we can integrate with and complement many of the other great tools out there (a faster command line version is definitely on my todo list).

Using a language model to predict/correct protein names based on sequence seems like an interesting idea, but a bit beyond the scope of this tool right now :)

@Markusjsommer Markusjsommer reopened this Sep 10, 2020
@Markusjsommer
Contributor

As an aside, Balrog can process multiple genomes at once if you select multiple fasta files during the upload step, though Colab can be annoying about downloading more than 10 files at once

@igortru
Author

igortru commented Sep 10, 2020 via email

@igortru
Author

igortru commented Sep 10, 2020

It looks like speed is proportional to the number of contigs, not the total length of the sequences.
I uploaded a 1.4 Mb FASTA file containing 31 phages,
but it has already taken more than an hour; your server is probably heavily loaded.
Another reason to outsource your tool :)
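When reporting a slowdown like this, it helps to state both the number of records and the total sequence length, since the two can point at different bottlenecks. A minimal sketch for collecting those two numbers from FASTA text follows; `fasta_stats` is a hypothetical helper and not part of Balrog.

```python
def fasta_stats(fasta_text):
    """Return (number_of_records, total_bases) for FASTA-formatted text."""
    n_records, total = 0, 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            n_records += 1      # each header line starts a new record
        else:
            total += len(line)  # sequence lines contribute their length
    return n_records, total
```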

@Markusjsommer
Contributor

It definitely shouldn't take that long regardless of the number of contigs; I'm thinking it might be because of the way I'm calling MMseqs2. If you send me the FASTA (zipped via GitHub comment, or you can email at [email protected]), I can try to find out why it's taking so long for you.

@Markusjsommer Markusjsommer added the bug Something isn't working label Sep 10, 2020
@igortru
Author

igortru commented Sep 15, 2020

Another issue: circular genome support.
For phages and plasmids it is a major issue.
For bacterial genomes it could also be useful.

@igortru
Author

igortru commented Sep 15, 2020

Phage KC576783 has 66 genes; Balrog misses 12 of them in comparison with GenBank.
Other genomes from Butters+Charlie+Redi viruses have the same problem.

The method looks very interesting, but the training set does not look perfect.
It would be very kind of you to make the training part open source as well.
I just want to feed it all phage proteins.

@Markusjsommer
Contributor

Balrog is not super optimized for very short sequences like phage. I'll add circular genome support in a future release, as that is not too hard, but it will likely only result in finding 1-2 more genes.
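Until that exists, one generic workaround for circular sequences is to append the beginning of the sequence to its end before gene finding, then map the resulting coordinates back onto the circle. The sketch below shows only that coordinate bookkeeping; it is an assumption about how one could do it, not how Balrog will do it, and `extend_circular` and `to_circular_coords` are hypothetical names.

```python
def extend_circular(seq, overlap):
    """Append the first `overlap` bases to the end of `seq`.

    Genes that span the origin become contiguous in the extended sequence.
    """
    overlap = min(overlap, len(seq))
    return seq + seq[:overlap]


def to_circular_coords(start, end, genome_len):
    """Map 1-based inclusive coordinates on the extended sequence to the circle.

    Returns a list of (start, end) segments: one segment for an ordinary
    feature, two segments for a feature that crosses the origin.
    """
    if start > genome_len:
        # Entirely inside the duplicated region: this duplicates a feature
        # near the start of the genome, so callers should de-duplicate.
        return [(start - genome_len, end - genome_len)]
    if end > genome_len:
        # Crosses the origin: split into the tail and the head of the circle.
        return [(start, genome_len), (1, end - genome_len)]
    return [(start, end)]
```

The overlap only needs to be as long as the longest gene one expects to span the origin, which fits the estimate that this recovers only a gene or two per replicon.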

Releasing a new model trained on viruses would not be too difficult, though other parameters may need to be retuned as well to get good performance.
