Control point to resume the download manually #27

Open
Phismil opened this issue Apr 10, 2023 · 6 comments

Comments

@Phismil

Phismil commented Apr 10, 2023

Dear Developer,
Thank you for maintaining and updating the repository.
I am trying to download all COI sequences from NCBI, which is around four million entries.
The download consistently failed after approximately 600 000 sequences, and adding my API key and changing the source code to stay asleep for longer than 8s did not help.
Is there any trick that might solve the issue or resume the download later on, exactly from the last entry in the interrupted fasta file (e.g., similar to a control point in the web history)?
Thank you

@StuntsPT
Owner

Dear @Phismil,
First of all, thank you for reaching out.
If you run the same command again, the download should resume from where it left off. It might take a while to restart, as the program will download the accession number list and compare it with the accessions already in the FASTA file, but it should resume the download.
Also, I am curious: how exactly is the program failing? Does it crash? Is there an error message? Or does it just freeze and stop downloading data?
Thank you.

Francisco
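
For readers following along, the resume step described above (compare the accession list against what is already in the FASTA file) can be sketched roughly as follows. This is a minimal illustration and not the program's actual code: the function names `accessions_in_fasta` and `missing_accessions` are hypothetical, the accessions are made up, and it assumes the accession is the first whitespace-delimited token of each `>` header line.

```python
from typing import Iterable, List, Set


def accessions_in_fasta(lines: Iterable[str]) -> Set[str]:
    """Collect the first whitespace-delimited token of every '>' header line."""
    found = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                found.add(tokens[0])
    return found


def missing_accessions(wanted: Iterable[str], fasta_lines: Iterable[str]) -> List[str]:
    """Return the wanted accessions that are absent from the FASTA, keeping order."""
    have = accessions_in_fasta(fasta_lines)
    return [acc for acc in wanted if acc not in have]


# Made-up accessions and records, for illustration only.
fasta = [
    ">AB000001.1 hypothetical COI record",
    "ACGT",
    ">AB000002.1 another hypothetical record",
    "GGCC",
]
todo = missing_accessions(["AB000001.1", "AB000002.1", "AB000003.1"], fasta)
print(todo)
```

With a real file, the `fasta` list would be replaced by an open file handle, so that a multi-million-record FASTA is scanned line by line rather than loaded into memory.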

@Phismil
Author

Phismil commented Apr 12, 2023

Dear Francisco,
Thank you for your response. I replicated the error.
Below is the error I receive, which usually happens after downloading ~500-600K records.
It might be directly linked to our local server/proxy setting. I will try it on an AWS or GC engine and update you.

Downloading records 692401 to 692600 of 4122811
Downloading record 692601 to 692800 of 4122811
NCBI is not retuning sequence data or we're getting a Timeout. Trying the same chunk again in 8''.
NCBI is not retuning sequence data or we're getting a Timeout. Trying the same chunk again in 8''.
NCBI is not retuning sequence data or we're getting a Timeout. Trying the same chunk again in 8''.
NCBI is not retuning sequence data or we're getting a Timeout. Trying the same chunk again in 8''.

Cheers
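
The log above shows the same chunk being retried at a fixed 8-second interval. One common alternative, sketched here purely as an illustration and not as the program's behaviour, is to cap the number of attempts and double the wait each time. The names `fetch_with_backoff`, `fetch`, `max_attempts`, and `base_delay` are hypothetical; `fetch` stands in for whatever call retrieves one chunk and is assumed to return an empty value or raise on failure.

```python
import time
from typing import Callable, List, Optional


def fetch_with_backoff(
    fetch: Callable[[], Optional[str]],
    max_attempts: int = 5,
    base_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it returns data, waiting 8 s, 16 s, 32 s, ... between tries."""
    for attempt in range(max_attempts):
        try:
            data = fetch()
        except (TimeoutError, ConnectionError):
            data = None  # treat a network failure like an empty response
        if data:
            return data
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"no data after {max_attempts} attempts")


# Demonstration with a fake fetch that fails twice, and a fake sleep that
# only records the requested delays instead of actually waiting.
responses = iter([None, "", ">AB000001.1\nACGT\n"])
delays: List[float] = []
result = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
print(delays)
```

Raising after the last attempt, instead of retrying forever, lets a wrapper script notice the failure and rerun the command, which resumes the download as described earlier in the thread.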

@StuntsPT
Owner

And then it just stops there?
Or does it actually crash?
Can you please also post the exact command you are using, program version, and the last few sequence names (just the > lines) from the resulting FASTA file?

What happens if you just Ctrl+C the program, and run the same exact command again?

Best,
Francisco

@Phismil
Author

Phismil commented Apr 19, 2023

Dear Francisco
I apologize for the delay; I wanted to spin up a new computing engine before updating you.
The pipeline downloaded all COI sequences (~4 million) in three to four attempts on both Amazon AWS and Google Cloud Engine, where there is no proxy. On the university's local server, with typical proxy settings, the pipeline occasionally needs more than 10 attempts to download all sequences. I checked the tail of the generated .fasta files, and there was nothing unusual, such as an error or a warning from the NCBI server. Each file simply ended normally, and when I restarted the pipeline to resume the download of missing records, the new records were appended to the existing .fasta file.
Thank you for your time, and please let me know if more information is needed.

@StuntsPT
Owner

Dear @Phismil,

Thank you for the follow-up. I'm happy to read that you managed to get your sequences.
You are not the first person to have issues when behind a proxy server, but I wouldn't even know where to start debugging. More likely than not, it is an issue between NCBI and the proxy server. The problem is that the program is neither getting an error response, nor is the requests library issuing a timeout (which means it somehow still thinks it's receiving data).
I will leave the issue open for now, and try to get to the bottom of it during the summer.
I may then request your help again in running the program and reporting on whether or not it worked. =-)
Best,

Francisco
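
One possible angle on the "no error, no timeout" symptom: a per-read timeout only fires when nothing at all arrives for that long, so a connection that keeps trickling bytes can stall indefinitely. A wall-clock deadline over the whole chunk is one way to catch that case. The sketch below is an assumption-laden illustration, not the program's code: `read_with_deadline`, `max_seconds`, and `clock` are hypothetical names, and `chunks` stands in for any iterable of byte blocks (for example, what `response.iter_content()` yields on a streamed `requests` response).

```python
import time
from typing import Callable, Iterable


def read_with_deadline(
    chunks: Iterable[bytes],
    max_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Join byte blocks, aborting once the total elapsed time exceeds max_seconds."""
    start = clock()
    received = bytearray()
    for chunk in chunks:
        # Checked after each block arrives, so a slow trickle is caught
        # even though no single read ever times out.
        if clock() - start > max_seconds:
            raise TimeoutError(f"download not finished after {max_seconds} s")
        received.extend(chunk)
    return bytes(received)


# Fast case: all blocks arrive well within the deadline.
body = read_with_deadline([b">AB000001.1\n", b"ACGT\n"], max_seconds=5.0)
print(body)
```

Because the check runs only between blocks, this complements a per-read timeout and does not replace it: a connection that goes completely silent still has to be caught by the read timeout.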

@Phismil
Author

Phismil commented Apr 19, 2023

I will do this with pleasure.
Yes, the problem is exactly what you mentioned, and it consistently happens after downloading 500K to 600K sequences.
Cheers
