
Investigate: Improve Design to Handle Large Downloads #222

Open
b-f-chan opened this issue Mar 29, 2022 · 2 comments

Comments

@b-f-chan
Contributor

The issue of failed downloads for large sets will require more analysis and effort. The root cause is that the platform as a whole (particularly singularity) was not designed to handle such large data sets: we had to stand this platform up in about six weeks, and back then we weren't even sure we would get much data (of course, things have changed in the past year).

So to do this properly, a re-architecting of some services may be required. We also need to look at adding controls and restrictions on what people can do on this portal. Currently it is a 100% free-for-all: any unauthenticated person can download any amount of data or run any large query at any time. As mentioned before, this could even open us up to DoS attacks. So this whole thing needs to be re-thought. For example, should we limit downloads to a certain size, with anything beyond that served via other means (e.g. an asynchronous notification once the download is ready, or, for technical users, opening up the SONG CLI and providing a manifest to download)?
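One way to picture the proposed size limit is a simple gate in the download handler: small requests stream immediately, while anything over the cap is queued for asynchronous delivery with a notification. This is only a sketch of the idea being discussed; the `MAX_SYNC_BYTES` threshold and the `queue_async_download` helper are hypothetical, not part of the current portal.

```python
from typing import Optional

# Assumed threshold for illustration only; the actual cap would need discussion.
MAX_SYNC_BYTES = 500 * 1024 * 1024  # e.g. cap synchronous downloads at 500 MB


def queue_async_download(size_bytes: int, email: str) -> None:
    """Placeholder: in practice this would enqueue a background job
    that builds the archive and emails a link when it is ready."""
    print(f"queued {size_bytes} bytes for {email}")


def handle_download(request_size_bytes: int, user_email: Optional[str]) -> str:
    """Serve small downloads directly; defer large ones to an async job."""
    if request_size_bytes <= MAX_SYNC_BYTES:
        return "stream"  # small enough: stream the archive in the response
    if user_email is None:
        # Beyond the cap, we need an address to notify once the bundle is ready.
        raise ValueError("large downloads require an email for async delivery")
    queue_async_download(request_size_bytes, user_email)
    return "queued"  # caller is notified out-of-band when the download is ready
```

The same gate is also where per-user rate limits or authentication checks could later be attached, which addresses the free-for-all concern above.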

More thinking and discussion needs to happen here before a hasty decision is made. Further, this should not be considered a "bug" or something to fix while we are in maintenance mode. This is a re-design and optimization, and it would fit perfectly in Work Package #1 of the new project extension being proposed, which is to improve system stability and performance. Hence we should only tackle this as part of the new proposal once it is signed off.

@b-f-chan b-f-chan added the enhancement New feature or request label Mar 29, 2022
@scottcain
Contributor

This issue is exactly the one causing the data release builds to fail, which is why I bumped it up to critical. As a workaround for the time being, the only way to get "complete" data dumps from the portal is to use the explorer and download distinct "chunks", keeping each chunk at around 100k sequences or less. I've achieved this using the Study ID filter to select BC, then ON and AB, then the rest, producing three downloads that together constitute a complete dump of the portal data.
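The manual chunking above could be scripted as a loop over filters, one download per chunk. This is a sketch only: the `portal_download` callable and the exact filter shapes are assumptions (the real workaround was done through the explorer UI), and the grouping of BC / ON+AB / the rest mirrors the split described above.

```python
# Hypothetical filter definitions mirroring the three manual chunks;
# "study_id_not" (exclude) is an assumed filter form, not a known portal API.
CHUNK_FILTERS = [
    {"study_id": ["BC"]},                   # chunk 1: BC alone
    {"study_id": ["ON", "AB"]},             # chunk 2: ON and AB together
    {"study_id_not": ["BC", "ON", "AB"]},   # chunk 3: everything else
]


def download_all(portal_download):
    """Run one download per filter so each chunk stays near or under
    ~100k sequences, then return the list of chunk results."""
    chunks = []
    for filt in CHUNK_FILTERS:
        chunks.append(portal_download(filt))
    return chunks
```

The point of the sketch is that the chunk boundaries are hand-tuned to the current data distribution; as provinces grow, the filters would need re-balancing, which is part of why a proper server-side fix is preferable.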

@ghost ghost added the WP_1 label Jun 3, 2022
@b-f-chan b-f-chan assigned ghost Jun 3, 2022
@b-f-chan
Contributor Author

b-f-chan commented Jun 3, 2022

Assigned to @sifavahora to investigate and discuss options with DEV team
