Over-prediction? Over-extended? #2
Comments
Hi Igortru, Balrog tends to predict fewer genes than other gene finders with default parameters on complete bacterial genomes. We did not look specifically at plasmids, but it's an interesting place for comparison. I'm not super familiar with all the steps in PGAP, especially with plasmids, but Balrog is intended as one piece in a larger pipeline, more akin to Prodigal/Glimmer/GeneMark than a replacement for a whole pipeline like Prokka or PGAP.
Looking at the "Klebsiella aerogenes strain RHBSTW-00938 plasmid pRHBSTW-00938_2, complete sequence" plasmid you mentioned, Balrog predicts 497 genes. I ran GeneMarkS-2 as a quick comparison on the same sequence and it predicted 544 genes, so it appears that here Balrog actually predicts fewer.
For the start sites it's a bit more complicated. Balrog takes into account each potential ORF's start codon and the sequence around it, as well as the length. All else being equal, Balrog will tend to choose longer genes, but good hits with the Translation Initiation Site (TIS) model, or incompatibilities with other high-scoring genes, can shift the start site of any individual gene so that the gene is shorter. The global maximal gene score is found, rather than the maximal score for any one gene. I would still trust start sites based on evolutionary conservation more than the predictions of a gene finder, so a good step after running Balrog may be a better start site predictor that takes more complex information like that into account. |
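To illustrate the "global maximal gene score" point, here is a minimal sketch. It is not Balrog's actual implementation: it assumes candidate genes are plain (start, end, score) intervals that may not overlap at all, and it uses the classic weighted interval scheduling dynamic program. The function name `best_gene_set` and the toy scores are hypothetical.

```python
from bisect import bisect_right

def best_gene_set(candidates):
    """Pick non-overlapping (start, end, score) intervals with maximal total score.

    Assumption (not from Balrog): coordinates are 1-based inclusive and
    any overlap between two chosen genes is forbidden.
    """
    cands = sorted(candidates, key=lambda c: c[1])  # sort by end coordinate
    ends = [c[1] for c in cands]
    # best[i] = (total score, chosen genes) using only the first i candidates
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(cands):
        # j = number of earlier candidates that end strictly before this start
        j = bisect_right(ends, start - 1, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]

# Toy case: the long version of gene A scores highest on its own,
# but it collides with an upstream gene B, so the shorter start wins globally.
gene_b = (1, 90, 3.0)
gene_a_long = (61, 300, 5.0)
gene_a_short = (100, 300, 4.5)
total, chosen = best_gene_set([gene_b, gene_a_long, gene_a_short])
print(total, chosen)
```

In this toy case the shorter start for gene A is selected even though the longer one has the higher individual score, because keeping gene B as well gives a higher total.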
IMHO it could be a very good exercise. Just an idea:
I am absolutely sure it is possible, and it is a really interesting problem. P.S. |
Ideally, Balrog would not be entirely replacing whole annotation pipelines. Rather, it's an attempt at using the vast amounts of data we have to train a more complex gene model than would have been possible 10 years ago. Hopefully we can integrate with and complement many of the other great tools out there (a faster command line version is definitely on my todo list). Using a language model to predict/correct protein names based on sequence seems like an interesting idea, but a bit beyond the scope of this tool right now :) |
As an aside, Balrog can process multiple genomes at once if you select multiple fasta files during the upload step, though Colab can be annoying about downloading more than 10 files at once |
Ok.
I am developing a pipeline that will cluster and annotate closely related phages,
say, genomes from one genus, subfamily, or family.
At the end of each round I expect to have
a set of really good alignments.
Regular clustering is not always useful:
very tight clusters are not interesting,
very wide clusters give bad multiple alignments,
and the truth is somewhere in the middle:
produce the widest clusters inside a taxonomic node that still have good alignments, using the T-Coffee TCS as the criterion.
As a first step I am using “orffinderplus”.
It is probably not public yet,
but I can share it with you if you are interested.
It is a new NCBI program that combines the existing GenBank annotation with all top-level ORFs (it marks the corresponding ORFs with the GenBank accession if one was found in the annotation).
It is useful: on the one hand it shows which ORFs are new, and on the other hand it lets me add to the processing some ORFs that the normal ORFfinder just ignores.
Actually, this is a problem for your tool as well: it misses pseudogenes. I see it on the Klebsiella plasmid.
orffinderplus produces too many ORFs,
and I am looking for a way to replace it.
PGAP, Prokka and PHANOTATE produce too few genes on phages,
and the major problem is that the results differ and I don’t know which program I can trust.
For now I prefer to have more models rather than fewer.
Again, I am talking only about phages/prophages.
My current set is about 15K complete GenBank genomes.
P.S.
I’ll try ten genomes: charlie+redi+butter viruses
|
It looks like the run time is proportional to the number of contigs, not to the total length of the sequences. |
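One quick way to check this is to record both numbers for each input file and compare them with the run time. A minimal sketch in plain Python, with no Balrog code involved; `fasta_stats` is a hypothetical helper name:

```python
def fasta_stats(lines):
    """Return (number of contigs, total bases) for FASTA text given as lines."""
    n_contigs = 0
    total_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_contigs += 1  # each header line starts a new contig
        else:
            total_bases += len(line)
    return n_contigs, total_bases

example = [">contig_1", "ACGT", "AC", ">contig_2", "GG"]
print(fasta_stats(example))
```

For a real file, pass the open file handle instead of a list, e.g. `with open(path) as fh: fasta_stats(fh)`.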
It definitely shouldn't take that long regardless of the number of contigs. I'm thinking it might be because of the way I'm calling MMseqs2. If you send me the fasta (zipped via github comment or you can email at [email protected]) I can try to find out why it's taking so long for you |
Another issue: circular genome support. |
Phage KC576783: 66 genes; Balrog is missing 12 genes in comparison with GenBank. The method looks very interesting, but the training set does not look perfect. |
Balrog is not super optimized for very short sequences like phage. I'll add circular genome support in a future release, as that is not too hard, but it will likely only result in finding 1-2 more genes. Releasing a new model trained on viruses would not be too difficult, though other parameters may need to be retuned as well to get good performance. |
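For reference, one common way to handle circular sequences is to scan the sequence concatenated with itself and keep ORFs that start in the first copy and end in the second. The sketch below is only an illustration of that idea, not how Balrog does or will do it. It assumes forward strand only, the usual bacterial start codons (ATG, GTG, TTG), and a hypothetical function name `origin_spanning_orfs`.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}

def origin_spanning_orfs(seq, min_len=90):
    """Forward-strand ORFs (start codon through stop codon) that cross the origin.

    Returns 0-based (start, end_exclusive) pairs on the doubled sequence;
    end_exclusive > len(seq) means the ORF wraps around the origin.
    Only the longest ORF per stop codon is reported.
    """
    seq = seq.upper()
    n = len(seq)
    doubled = seq + seq
    found = []
    seen_ends = set()
    for start in range(n):  # only consider starts in the first copy
        if doubled[start:start + 3] not in STARTS:
            continue
        # walk codon by codon; an ORF cannot be longer than the genome itself
        for pos in range(start + 3, start + n - 2, 3):
            if doubled[pos:pos + 3] in STOPS:
                end = pos + 3
                if end > n and end - start >= min_len and end not in seen_ends:
                    seen_ends.add(end)
                    found.append((start, end))
                break
    return found

# Toy circular sequence: ATG sits near the end, its stop codon near the start.
print(origin_spanning_orfs("CTAACCCCCCATGCC", min_len=9))
```

To report such an ORF on the original coordinates, subtract `len(seq)` from the end position.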
For example,
PGAP reports 434 genes on the plasmid
https://www.ncbi.nlm.nih.gov/protein?LinkName=nuccore_protein&from_uid=1887424984
but Balrog reports 499. Are they real?
Another issue: protein start position.
NC_016612.1 Balrog CDS 488100 488564 . + 0 inference=ab initio prediction:Balrog;product=hypothetical protein
Does it choose the longest possible ORF?
I prefer the shorter version, which corresponds to the conserved domain.
Compare:
ref|WP_014226776.1|
ref|WP_016239709.1|
ref|WP_016247216.1|
ref|WP_021555225.1|
ref|WP_023303013.1|
ref|WP_032154610.1|
ref|WP_060617254.1|
ref|WP_109862812.1|
ref|WP_135564269.1|
ref|WP_171279038.1|
ref|WP_172833565.1|
ref|WP_172901084.1|
ref|WP_172949498.1|
ref|WP_181654360.1|
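For the start-site comparison, a small sketch that parses the Balrog GFF line quoted above and reports the CDS length, so it can be checked against the lengths of the reference proteins. It assumes whitespace-separated columns as pasted here (real GFF files are tab-separated); `parse_gff_line` is a hypothetical helper, not part of Balrog.

```python
def parse_gff_line(line):
    """Split one GFF-style line into a dict; the 9th column keeps its spaces."""
    seqid, source, ftype, start, end, score, strand, phase, attrs = line.split(None, 8)
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if "=" in item)
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "strand": strand, "phase": phase, "attributes": attributes,
    }

line = ("NC_016612.1 Balrog CDS 488100 488564 . + 0 "
        "inference=ab initio prediction:Balrog;product=hypothetical protein")
cds = parse_gff_line(line)
length_nt = cds["end"] - cds["start"] + 1  # GFF coordinates are 1-based, inclusive
codons = length_nt // 3                    # counts the stop codon if the CDS includes it
print(cds["seqid"], length_nt, codons)
```

If the conserved-domain versions of the protein are noticeably shorter than this codon count, that supports the suspicion that the predicted start is too far upstream.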