Required FASTA header format? #10
Comments
Hi Maddy,
The secret is to maintain the format. Hope it helps, cheers. PS: try the new code; I already made changes to avoid the double |
Thank you for your quick response!
You're welcome. Could you share some headers to reproduce the error? Your entire FASTA (or half of it) would be perfect. I'm wondering if something goes wrong with the ID selection in the code. Please also paste the command line you are using. Cheers
I attached a portion of my FASTA file. I ran pasteTaxID with:
I was able to get the example multifasta file to work, with all IDs added.
Thanks Maddy, I found the bug and already fixed it. The problem was only in the parallel jobs, which crashed when a taxID was found while the connection retries were still trying to fetch the ID. Try this fixed code: https://github.com/Sanrrone/pasteTaxID while I create the pull request to this repo. I also attached the new FASTA with the corresponding taxIDs. Let me know if it works, so we can close the issue or keep improving the code. Cheers
The fix seems to be giving me the taxon IDs now. Thank you! However, I have noticed that the script sometimes gets stuck getting IDs from NCBI (the output with --debug shows that it tries to fetch the ID from NCBI over and over again with no luck). It does not appear to get stuck on the file I sent you (though it has been running for over 40 minutes now with 8 processors and is not done), but another set of sequences (attached) gets stuck. It sounds similar to the problem mladen5000 had in March. I am also submitting it as a remote job.
I checked some info about parallel jobs with an NCBI API key and found that they allow a maximum of 10 requests per second (https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)
I tested this with 8 and with 20 parallel jobs. With 8 parallel processes I get no errors and the new FASTA is fine, but with 20 (more than NCBI lets you fetch) the script gets stuck: the retries keep forcing a new connection because NCBI returns an error or just a "blank" result. The retry was only implemented to survive errors like a dropped internet connection; I didn't expect an over-request error. So the solution is simply not to let the parallel jobs exceed 10. Even if you have 50 cores, set it to 10. Let me know if it keeps getting stuck using just 10 processes.
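To illustrate the advice above, here is a minimal Python sketch of capping the job count and spacing requests out. It is not taken from pasteTaxID; `cap_jobs`, `Throttle`, and the constant name are hypothetical and only show the idea. Capping the number of jobs bounds concurrency but not the request rate itself, so the sketch also includes a small throttle that the workers could share.

```python
import threading
import time

# Limit with an API key, as stated in the NCBI post linked above.
NCBI_MAX_REQUESTS_PER_SECOND = 10


def cap_jobs(requested, limit=NCBI_MAX_REQUESTS_PER_SECOND):
    """Return a job count between 1 and the request limit (hypothetical helper)."""
    return max(1, min(requested, limit))


class Throttle:
    """Space calls at least 1/rate seconds apart across all threads (hypothetical)."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def wait(self):
        """Block until this caller's time slot arrives; return the delay applied."""
        with self._lock:
            now = self._clock()
            # Reserve the next free slot, then release the lock before sleeping.
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

Each worker would call `throttle.wait()` right before contacting NCBI, so even 50 cores would share the same 10 requests per second budget.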
@bioinfoMMS an update, Maddy: the script sometimes makes more than one fetch per process, so try setting the jobs to 9 instead of 10. Check this performance:
Cheers
I still had the problem with 8 processors. It still could be something due to NCBI's limit but it is strange that you are able to run it with no problems and really quickly with the same number of processors. It also seems to have gotten stuck for me on the first file I sent you as well :(
Ok so I ran it with no parallelization:
And it still gets stuck. The problematic ID was JF714137.1, if that helps. I have it as 'acc' even though it is a GenBank sequence. Would this cause a problem? Does it need to be labeled as 'gbk'?
Hmm, that is weird behaviour. I just ran the script with a fake FASTA
and it works.
I'm thinking it could be the internet connection; I remember it getting stuck sometimes when the connection was slow or intermittent. Do you have another connection (maybe a cellphone)? Sorry, I have no more ideas at the moment; it does not seem to be a code problem :(
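For the stuck-retry symptom discussed in this thread, one possible mitigation is to bound the retries and back off between attempts instead of retrying forever. The Python sketch below is only an illustration under that assumption and is not how pasteTaxID implements its retries. `fetch_with_retries` and its parameters are hypothetical, and `fetch` stands for any callable that returns the raw response text for one accession.

```python
import time


def fetch_with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() up to max_attempts times; return the stripped result or None.

    A blank response counts as a failure, since the thread reports NCBI
    sometimes returning an empty result when over-requested.
    """
    for attempt in range(max_attempts):
        try:
            result = fetch()
        except OSError:
            # Network-level errors (dropped or intermittent connection).
            result = None
        if result and result.strip():
            return result.strip()
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

A caller that gets `None` back could log the accession and move on to the next sequence, so a single problematic ID does not block the whole run.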
It works well enough that I should be able to use your script, thank you again for all your help and quick replies! It is much appreciated!
You're welcome :) Feel free to ask any other questions. Good luck with your work.
Hello,
I wonder if you could give me more information about the required format for the fasta headers. I have been running pasteTaxID and while I don't get any errors, the tax ids do not show up in the results.
This header:
'>acc|GENBANK|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'
Comes out as:
'>ti||acc|GENBANK|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'
I am guessing it may have something to do with the header format. I did try to remove the GENBANK part so the header was:
'>acc|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'
A few of the tax ids were found, but most were not. For example:
'>ti||acc|FJ640294.1|Uncultured_marine_virus_isolate_CBSM-188_genomic_sequence.|uncultured_marine_virus|ENV|07-APR-2009'
'>ti|186617|acc|FJ640295.1|Uncultured_marine_virus_isolate_CBSM-189_genomic_sequence.|uncultured_marine_virus|ENV|07-APR-2009'
Any help would be appreciated.
Thanks,
Maddy
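As a way to check results like the ones above, the following Python sketch lists output headers whose `ti` field came back empty. It only assumes the output shape shown in this post (`>ti|<taxid>|...`, with `>ti||...` when no ID was found); `split_taxid` and `headers_missing_taxid` are hypothetical helper names and not part of pasteTaxID.

```python
def split_taxid(header):
    """Split a '>ti|<taxid>|rest' header into (taxid or None, rest)."""
    text = header.strip()
    if not text.startswith(">ti|"):
        raise ValueError("header does not start with '>ti|': " + text)
    # Everything up to the next '|' is the taxid; it is empty for '>ti||...'.
    taxid, _, rest = text[len(">ti|"):].partition("|")
    return (taxid or None, rest)


def headers_missing_taxid(lines):
    """Return the '>ti|' headers in lines whose taxid field is empty."""
    missing = []
    for line in lines:
        text = line.strip()
        if text.startswith(">ti|") and split_taxid(text)[0] is None:
            missing.append(text)
    return missing
```

Running `headers_missing_taxid(open("output.fasta"))` on a result file (the file name here is a placeholder) would give the list of sequences to retry or inspect by hand.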