over-prediction?over-extended? #2

igortru · 2020-09-09T01:35:15Z

Hundreds or thousands of “extra” genes per genome,
but a black box is a black box.

For example, PGAP reports 434 genes on this plasmid:
https://www.ncbi.nlm.nih.gov/protein?LinkName=nuccore_protein&from_uid=1887424984
but Balrog reports 499. Are they real?

Another issue: protein start position.
NC_016612.1 Balrog CDS 488100 488564 . + 0 inference=ab initio prediction:Balrog;product=hypothetical protein
Does it choose the longest possible ORF?

I prefer the shorter version, which corresponds to the conserved domain.
Compare:
ref|WP_014226776.1|
ref|WP_016239709.1|
ref|WP_016247216.1|
ref|WP_021555225.1|
ref|WP_023303013.1|
ref|WP_032154610.1|
ref|WP_060617254.1|
ref|WP_109862812.1|
ref|WP_135564269.1|
ref|WP_171279038.1|
ref|WP_172833565.1|
ref|WP_172901084.1|
ref|WP_172949498.1|
ref|WP_181654360.1|

Markusjsommer · 2020-09-09T18:24:32Z

Hi Igortru,

Balrog tends to predict fewer genes than other gene finders with default parameters on complete bacterial genomes. We did not look specifically at plasmids, but it's an interesting place for comparison. I'm not super familiar with all the steps in PGAP, especially with plasmids, but Balrog is intended as one piece in a larger pipeline, more akin to Prodigal/Glimmer/GeneMark than a replacement for a whole pipeline like Prokka or PGAP.

Looking at the "Klebsiella aerogenes strain RHBSTW-00938 plasmid pRHBSTW-00938_2, complete sequence" plasmid you mentioned, Balrog predicts 497 genes. I ran GeneMarkS-2 as a quick comparison on the same sequence and it predicted 544 genes, so it appears that here Balrog actually predicts fewer.

For the start sites it's a bit more complicated. Balrog takes into account each potential ORF's start codon and the sequence around it, as well as the length. All else being equal, Balrog will tend to choose longer genes, but good hits with the Translation Initiation Site (TIS) model, or incompatibilities with other high-scoring genes, can shift the start site of any individual gene to be shorter. The global maximal gene score is found, rather than the maximal score for any one gene. I would still trust start sites based on evolutionary conservation more than the predictions of a gene finder, so a good step after running Balrog may be a better start site predictor which takes into account more complex information like that.
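To illustrate the "global maximal gene score" idea, here is a minimal sketch, not Balrog's actual implementation. It assumes each candidate (one start/stop pair) already has a single combined score and that chosen genes may not overlap at all, which is stricter than a real gene finder. `Candidate` and `best_gene_set` are hypothetical names, and the scores are made up.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Candidate(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # assumed to already combine length, TIS and gene-model scores


def best_gene_set(cands: List[Candidate]) -> List[Candidate]:
    """Pick the non-overlapping subset with the highest total score."""
    cands = sorted(cands, key=lambda c: c.end)
    ends = [c.end for c in cands]
    best = [0.0] * (len(cands) + 1)  # best[i]: optimum over the first i candidates
    prev = []                        # prev[i]: how many earlier candidates end before i starts
    take = []                        # take[i]: is candidate i part of the optimum for i + 1?
    for i, c in enumerate(cands):
        p = bisect_right(ends, c.start - 1, 0, i)
        prev.append(p)
        with_c = best[p] + c.score
        take.append(with_c > best[i])
        best[i + 1] = max(with_c, best[i])
    # Walk back through the table to recover the chosen candidates.
    chosen, i = [], len(cands)
    while i > 0:
        if take[i - 1]:
            chosen.append(cands[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]


# Two alternative starts for the same stop (600) plus an upstream neighbour.
long_form = Candidate(100, 600, 5.0)
short_form = Candidate(200, 600, 4.5)
neighbour = Candidate(50, 150, 4.0)
result = best_gene_set([long_form, short_form, neighbour])
```

Here the long form has the best individual score, but it collides with the neighbour, so the best total comes from the neighbour plus the shorter start. That is the sense in which a high-scoring adjacent gene can shift a start site.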

igortru · 2020-09-10T04:21:18Z

> but Balrog would be intended as one piece in the larger pipeline

IMHO: after 30 years of genome annotation pipeline development, starting from scratch and inventing a smarter ORF finder is like developing a new operating system.

It can be a very good exercise, but there are much more difficult and interesting problems for deep learning, for example, improving already existing annotations.

Just an idea: take all non-hypothetical protein names from RefSeq and the corresponding protein sequences. That is about 100M unique sequences, many of which have good names, thanks to Daniel Haft. Then try to find distributed sequence motifs that would allow predicting protein names and fixing incorrect ones.

I am absolutely sure it is possible, and it is a really interesting problem.

P.S.
We really need a command-line version of your tool.
It would allow running it in batch mode and checking it more deeply.
I want to compare it with PHANOTATE.

Markusjsommer · 2020-09-10T04:38:31Z

Ideally, Balrog would not be entirely replacing whole annotation pipelines. Rather, it's an attempt at using the vast amounts of data we have to train a more complex gene model than would have been possible 10 years ago. Hopefully we can integrate with and complement many of the other great tools out there (a faster command line version is definitely on my todo list).

Using a language model to predict/correct protein names based on sequence seems like an interesting idea, but a bit beyond the scope of this tool right now :)

Markusjsommer · 2020-09-10T04:42:47Z

As an aside, Balrog can process multiple genomes at once if you select multiple fasta files during the upload step, though Colab can be annoying about downloading more than 10 files at once

igortru · 2020-09-10T13:00:31Z

OK. I am developing a pipeline that will cluster and annotate closely related phages, say, genomes from one genus, subfamily, or family. At the end of each round I expect to have a set of really good alignments. Regular clustering is not always useful: very tight clusters are not interesting, and with very wide ones the multiple alignment is bad. The truth is somewhere in the middle: produce the widest clusters inside a taxonomic node that still have good alignments, using the T-Coffee TCS as the criterion.

As a first step I am using "orffinderplus". It is probably not public yet, but I can share it with you if you are interested. It is a new NCBI program that combines the existing GenBank annotation with all top-level ORFs (it marks the corresponding ORFs with the GenBank accession if they were found in the annotation). It is useful: on one hand it shows which ORFs are new, and on the other hand it adds to the processing some ORFs that a normal ORF finder just ignores. Actually, this is a problem for your tool as well: missing pseudogenes. I see it on the Klebsiella plasmid.

orffinderplus produces too many ORFs, and I am looking for a way to replace it. PGAP, Prokka and PHANOTATE produce too few genes on phages, and this is a major problem: the results differ and I don't know which program I can trust. For now I prefer to have more models rather than fewer. Again, I am talking only about phages/prophages. My current set is about 15K complete GenBank genomes.

P.S. I'll try ten genomes: Charlie + Redi + Butters viruses.


igortru · 2020-09-10T16:52:44Z

It looks like speed is proportional to the number of contigs, not the total length of the sequences.
I uploaded a 1.4 Mb FASTA file containing 31 phages,
but it has already taken more than an hour. Probably your server is already heavily loaded.
Another reason to open-source your tool :)

Markusjsommer · 2020-09-10T17:02:31Z

Definitely shouldn't take that long regardless of the number of contigs, I'm thinking it might be because of the way I'm calling MMseqs2. If you send me the fasta (zipped via github comment or you can email at [email protected]) I can try to find why it's taking so long for you

igortru · 2020-09-15T00:17:49Z

Another issue: circular genome support.
For phages and plasmids it is a major issue.
For bacterial genomes it could also be useful.
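For illustration, one common way to handle circular sequences is to unroll the circle once so that an ORF crossing the origin becomes contiguous. The sketch below is only that, a sketch under simplifying assumptions: forward strand only, ATG starts only, the standard three stop codons, and a hypothetical helper name `origin_spanning_orfs`. It says nothing about how Balrog itself would implement this.

```python
STOPS = {"TAA", "TAG", "TGA"}


def origin_spanning_orfs(seq: str, min_len: int = 90):
    """Forward-strand ATG..stop ORFs that cross the origin of a circular sequence.

    Returns (start, end) pairs in 1-based coordinates of the original sequence,
    with end < start because the ORF wraps around the origin.
    """
    n = len(seq)
    doubled = seq + seq  # unroll the circle once
    found = []
    for start in range(n):  # starts are taken from the first copy only
        if doubled[start:start + 3] != "ATG":
            continue
        # Scan in frame, never letting an ORF get longer than the circle itself.
        for pos in range(start + 3, start + n - 2, 3):
            if doubled[pos:pos + 3] in STOPS:
                end = pos + 3  # exclusive end in doubled coordinates
                if end > n and end - start >= min_len:
                    found.append((start + 1, end - n))
                break
    return found


# Toy circular sequence: ATG AAA at the very end, GCC TAA at the very beginning.
toy = "GCCTAA" + "C" * 10 + "ATGAAA"
wrapped = origin_spanning_orfs(toy, min_len=9)
```

Every in-frame ATG is reported, so one stop codon can show up with several alternative starts. The reverse strand could be covered by running the same function on the reverse complement and mapping coordinates back.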

igortru · 2020-09-15T00:34:46Z

Phage KC576783 has 66 genes; Balrog misses 12 of them compared with GenBank.
Other genomes from the Butters + Charlie + Redi viruses have the same problem.
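A comparison like this can be made reproducible by matching CDS features on their 3' end (sequence, strand, stop coordinate), so that a different start site does not count as a missed gene. The following is a minimal sketch that assumes both annotations are available as tab-separated GFF3 lines. `stop_keys` and `missed_genes` are hypothetical helper names, and the example records are synthetic, not taken from KC576783.

```python
def stop_keys(gff_lines):
    """Set of (seqid, strand, stop coordinate) for every CDS line."""
    keys = set()
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9 or fields[2] != "CDS":
            continue
        seqid, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        # The stop codon sits at the end on "+" and at the start on "-".
        keys.add((seqid, strand, end if strand == "+" else start))
    return keys


def missed_genes(reference_lines, predicted_lines):
    """Reference CDS stops that have no prediction with the same stop."""
    return sorted(stop_keys(reference_lines) - stop_keys(predicted_lines))


# Synthetic example: the prediction finds gene 1 with a different start, misses gene 2.
reference = [
    "##gff-version 3",
    "phage1\tGenBank\tCDS\t100\t400\t.\t+\t0\tID=ref1",
    "phage1\tGenBank\tCDS\t500\t800\t.\t-\t0\tID=ref2",
]
predicted = [
    "phage1\tBalrog\tCDS\t130\t400\t.\t+\t0\tID=pred1",
]
missing = missed_genes(reference, predicted)
```

Matching on the stop alone is a deliberate choice: it separates "gene not found" from "gene found with a different start", which are two different complaints in this thread.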

The method looks very interesting, but the training set does not look perfect.
It would be very kind of you to make the training part open source as well.
I just want to feed it all phage proteins.

Markusjsommer · 2020-09-17T17:43:26Z

Balrog is not super optimized for very short sequences like phage. I'll add circular genome support in a future release, as that is not too hard, but it will likely only result in finding 1-2 more genes.

Releasing a new model trained on viruses would not be too difficult, though other parameters may need to be retuned as well to get good performance.

Markusjsommer added the question Further information is requested label Sep 9, 2020

Markusjsommer closed this as completed Sep 9, 2020

Markusjsommer reopened this Sep 10, 2020

Markusjsommer added the bug Something isn't working label Sep 10, 2020
