Required FASTA header format? #10

bioinfoMMS · 2019-08-21T20:06:48Z

Hello,

I wonder if you could give me more information about the required format for the fasta headers. I have been running pasteTaxID and while I don't get any errors, the tax ids do not show up in the results.

This header:

'>acc|GENBANK|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'

Comes out as:

'>ti||acc|GENBANK|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'

I am guessing it may have something to do with the header format. I did try to remove the GENBANK part so the header was:

'>acc|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'

A few of the tax ids were found, but most were not. For example:

'>ti||acc|FJ640294.1|Uncultured_marine_virus_isolate_CBSM-188_genomic_sequence.|uncultured_marine_virus|ENV|07-APR-2009'
'>ti|186617|acc|FJ640295.1|Uncultured_marine_virus_isolate_CBSM-189_genomic_sequence.|uncultured_marine_virus|ENV|07-APR-2009'

Any help would be appreciated.

Thanks,
Maddy

Sanrrone · 2019-08-21T22:19:20Z

Remove the GENBANK field after acc|: sed 's/GENBANK|//g' yourfasta.fasta
Remove "empty" entries (like ti||acc|xxxxx): sed 's/ti||//g' yourfasta.fasta

The key is to maintain the format ncbiID|number.
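The two clean-up steps above can be combined into one pass that only touches header lines. This is a minimal sketch, assuming headers shaped like the ones earlier in this thread; the function name `clean_headers` is hypothetical and is not part of pasteTaxID.

```shell
# clean_headers: hypothetical helper, not part of pasteTaxID.
# Reads FASTA on stdin and, on header lines only:
#   1. drops an empty ">ti||" prefix left by a failed lookup
#   2. drops the "GENBANK|" field so the header starts with acc|<accession>
# An unescaped | is a literal character in sed's basic regular expressions.
clean_headers() {
  sed -e '/^>/ s/^>ti||/>/' \
      -e '/^>/ s/^>acc|GENBANK|/>acc|/'
}

# Demo on the header from the first post, followed by a dummy sequence line.
printf '%s\n' \
  '>ti||acc|GENBANK|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014' \
  'ACGT' | clean_headers
```

The demo should print the header starting with >acc|AB866984.1| and leave the sequence line unchanged. On a real file it could be used as clean_headers < yourfasta.fasta > cleaned.fasta.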

hope it can help

cheers

PS: try the new code. I have already made changes to avoid the double ti|xxxx|ti|xxxxx in the final fasta.

bioinfoMMS · 2019-08-22T14:55:30Z

Thank you for your quick response!
I put all my headers into the format of >acc|xxxx blablabla but only the first entry in the multifasta file had its taxon id successfully fetched. The others all have the empty ti|| at the beginning. Any thoughts?
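One way to see exactly which records came back without a taxon ID is to list the headers whose ti field is empty. This is a minimal sketch; `list_missing_taxids` is a hypothetical helper, not something pasteTaxID provides, and it assumes failed lookups always appear as a leading ti||.

```shell
# list_missing_taxids: hypothetical helper, not part of pasteTaxID.
# Prints every header line whose ti field is empty (">ti||...").
# Exits with status 1 when no such header exists (grep's behaviour).
list_missing_taxids() {
  grep '^>ti||' "$1"
}

# Example call (the file name is a placeholder for the script's output):
# list_missing_taxids new_multifasta.fasta
```

Sharing that list along with the command line used makes the failure easier to reproduce.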

Sanrrone · 2019-08-22T15:03:20Z

You're welcome. Could you share some headers to reproduce the error? Your entire fasta (or half of it) would be perfect. I'm wondering if something goes wrong with the ID selection in the code. Please also paste the command line you are using.

cheers

bioinfoMMS · 2019-08-22T15:15:45Z

I attached a portion of my fasta file. I ran pasteTaxID with:

bash pasteTaxID.bash --multifasta viralseqs.txt --parallelJobs 8 --apikey myncbiapikey

I was able to get the example multifasta file to work, with all IDs added

viralseqs.txt

Sanrrone · 2019-08-22T16:34:12Z

Thanks Maddy, I found the bug and have already fixed it. The problem was only with the parallel jobs, which crashed when a taxID had already been found while the connection retries were still trying to fetch the ID. Try using this fixed code: https://github.com/Sanrrone/pasteTaxID while I create the pull request to this repo.

I have also attached the new fasta with the corresponding taxIDs.
new_viralseqs.txt

Let me know if it works so I can close the issue, or otherwise continue improving the code.

cheers

bioinfoMMS · 2019-08-22T18:51:49Z

The fix seems to be giving me the taxon ids now. Thank you!

However, I have noticed that the script sometimes gets stuck getting IDs from NCBI (the output with --debug shows that it tries to fetch the ID from NCBI over and over again with no luck). It does not appear to get stuck on the file I sent you (though it's been running over 40 mins now with 8 processors and is not done), but another set of sequences (attached) gets stuck. It sounds similar to the problem mladen5000 had in March. I am also submitting it as a remote job.

test_viral_2.txt

Sanrrone · 2019-08-22T19:21:00Z

I checked some info about parallel jobs with an NCBI API key and found that they allow a maximum of 10 requests per second (https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/):

Now that I have a key, are there still access limits?

Yes. By default, your key will increase the limit to 10 requests/second for all activity from that key. 
If we receive requests at a higher frequency that include the same key, all requests using that key will receive an error message. 
If you need higher rates of access, please contact us and we can discuss your situation.

I checked this using 8 and 20 parallel jobs. With 8 parallel processes I get no errors and the new fasta is fine, but with 20 (more than NCBI lets you fetch) the script gets stuck: the retries keep forcing a new connection because NCBI returns an error or just a "blank" result. (The retry was only implemented to handle errors such as a dropped internet connection; I didn't expect an over-request error.)

So the solution is simply not to let the parallel jobs exceed 10. Even if you have 50 cores, set it to 10.

Let me know if it continues getting stuck using just 10 processes.
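A retry loop that never gives up is what turns a rate-limit error into a hang. The sketch below shows one way to bound the retries; it is not taken from pasteTaxID, and the name `retry_nonempty` and its arguments are assumptions made for illustration.

```shell
# retry_nonempty: hypothetical helper, not taken from pasteTaxID.
# Usage: retry_nonempty MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to MAX_TRIES times and prints the first non-empty
# output. Returns 1 if every attempt fails or prints nothing, so the
# caller can report the record and move on instead of looping forever.
retry_nonempty() {
  local max_tries=$1 delay=$2
  shift 2
  local attempt=1 out
  while [ "$attempt" -le "$max_tries" ]; do
    if out=$("$@") && [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    attempt=$((attempt + 1))
    # Pause between attempts, but not after the last one.
    if [ "$attempt" -le "$max_tries" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Stand-in command for the demo; a real call would wrap the NCBI request.
retry_nonempty 3 1 echo "fetched"
```

Treating a blank response as a failure matches the behaviour described above, where NCBI may answer with an error or with nothing at all.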

Sanrrone · 2019-08-22T19:33:16Z

@bioinfoMMS an update, Maddy: the script sometimes makes more than one fetch per process, so try setting the jobs to 9 instead of 10. Check this performance:

sandro@elitedesk1$ time $(bash pasteTaxID.bash --multifasta test_viral_2.txt --parallelJobs 10 --apikey mykey)

real	2m46.568s
user	1m8.963s
sys	0m26.808s
sandro@elitedesk1$ time $(bash pasteTaxID.bash --multifasta test_viral_2.txt --parallelJobs 9 --apikey mykey)

real	0m54.901s
user	0m17.986s
sys	0m7.231s
sandro@elitedesk1:$ time $(bash pasteTaxID.bash --multifasta test_viral_2.txt --parallelJobs 8 --apikey mykey)

real	0m55.204s
user	0m15.851s
sys	0m6.516s
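Given these timings, the job count can be clamped before the script is called, whatever the number of available cores. This is a minimal sketch; `cap_jobs` is a hypothetical helper, and the default ceiling of 9 simply follows the measurements above.

```shell
# cap_jobs: hypothetical helper, not part of pasteTaxID.
# Usage: cap_jobs REQUESTED [CEILING]
# Prints REQUESTED, lowered to CEILING (default 9) when it is larger.
cap_jobs() {
  local requested=$1 ceiling=${2:-9}
  if [ "$requested" -gt "$ceiling" ]; then
    echo "$ceiling"
  else
    echo "$requested"
  fi
}

# A machine with 50 cores would still run only 9 parallel jobs.
njobs=$(cap_jobs 50)
echo "$njobs"
# bash pasteTaxID.bash --multifasta test_viral_2.txt --parallelJobs "$njobs" --apikey mykey
```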

cheers

bioinfoMMS · 2019-08-22T19:53:58Z

I still had the problem with 8 processors. It still could be something due to NCBI's limit but it is strange that you are able to run it with no problems and really quickly with the same number of processors. It also seems to have gotten stuck for me on the first file I sent you as well :(

bioinfoMMS · 2019-08-22T19:59:45Z

Ok so I ran it with no parallelization:

bash pasteTaxID.bash --multifasta test_viral_2.txt --debug

And it still gets stuck. The problematic accession was JF714137.1, if that helps. I have it as 'acc' even though it is a GenBank sequence. Would this cause a problem? Does it need to be labeled as 'gbk'?

Sanrrone · 2019-08-22T20:11:25Z

Hmm, that's weird behaviour. I just ran the script with a fake fasta:

>acc|JF714137.1 fakeID
CGCAGTCAGTCAGTCACGTACAGTCACGATCA

and it works:

>ti|1000308|acc|JF714137.1 fakeID
CGCAGTCAGTCAGTCACGTACAGTCACGATCA

I'm thinking about the internet connection; I remember it getting stuck sometimes when the connection was slow or "intermittent". Do you have another connection (maybe a cellphone)? Sorry, I have no more ideas at the moment; it does not seem to be a code problem :(.

bioinfoMMS · 2019-08-23T20:59:22Z

It works well enough that I should be able to use your script, thank you again for all your help and quick replies! It is much appreciated!

Sanrrone · 2019-08-26T14:33:47Z

You're welcome :), feel free to ask any other questions.

good luck with your work.
