
Investigate: Improve Design to Handle Large Downloads #222

Open
b-f-chan opened this issue Mar 29, 2022 · 2 comments

Comments

@b-f-chan
Contributor

The issue of failed downloads for large sets will require more analysis and effort. The root cause is that the platform as a whole (particularly singularity) was not designed to handle such large data sets: we had to stand this platform up in about six weeks, and back then we weren't even sure we would get much data (of course, things have changed in the past year).

So to do this properly, a re-architecting of some services may be required. We also need to look at adding controls and restrictions on what people can do on this portal. Currently it is a 100% free-for-all: any unauthenticated person can download any amount of data or run any large query at any time. As mentioned before, this could even open us up to DoS attacks. So this whole thing needs to be re-thought. For example, should we limit downloads to a certain size, with anything beyond that served via other means (e.g. an asynchronous notification once the download is ready, or, for technical users, opening up the SONG CLI and providing a manifest to download)?
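One way to picture the proposed size limit is a simple gate in the download handler: small requests stream immediately, while anything over the cap is queued for asynchronous delivery with a notification. This is only a sketch of the idea being discussed; the `MAX_SYNC_BYTES` threshold and the `queue_async_download` helper are hypothetical, not part of the current portal.

```python
from typing import Optional

# Assumed threshold for illustration only; the actual cap would need discussion.
MAX_SYNC_BYTES = 500 * 1024 * 1024  # e.g. cap synchronous downloads at 500 MB


def queue_async_download(size_bytes: int, email: str) -> None:
    """Placeholder: in practice this would enqueue a background job
    that builds the archive and emails a link when it is ready."""
    print(f"queued {size_bytes} bytes for {email}")


def handle_download(request_size_bytes: int, user_email: Optional[str]) -> str:
    """Serve small downloads directly; defer large ones to an async job."""
    if request_size_bytes <= MAX_SYNC_BYTES:
        return "stream"  # small enough: stream the archive in the response
    if user_email is None:
        # Beyond the cap, we need an address to notify once the bundle is ready.
        raise ValueError("large downloads require an email for async delivery")
    queue_async_download(request_size_bytes, user_email)
    return "queued"  # caller is notified out-of-band when the download is ready
```

The same gate is also where per-user rate limits or authentication checks could later be attached, which addresses the free-for-all concern above.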

More thinking and discussion needs to happen here before a hasty decision is made. Further, this should not be considered a "bug" or something to fix while we are in maintenance mode. This is a re-design and optimization, and it would fit perfectly in Work Package #1 of the new project extension being proposed, which is to improve system stability and performance. Hence we should only tackle this as part of the new proposal once it is signed off.

@b-f-chan b-f-chan added the enhancement New feature or request label Mar 29, 2022
@scottcain
Contributor

This issue is exactly the one causing the data release builds to fail, which is why I bumped it up to critical. As a workaround for the time being, the only way to get "complete" data dumps from the portal is to use the explorer and download distinct "chunks", keeping each chunk at around 100k sequences or less. I've achieved this using the Study ID filter to select BC, then ON and AB, then the rest, producing three downloads that together constitute a complete dump of the portal data.
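The manual chunking above could be scripted as a loop over filters, one download per chunk. This is a sketch only: the `portal_download` callable and the exact filter shapes are assumptions (the real workaround was done through the explorer UI), and the grouping of BC / ON+AB / the rest mirrors the split described above.

```python
# Hypothetical filter definitions mirroring the three manual chunks;
# "study_id_not" (exclude) is an assumed filter form, not a known portal API.
CHUNK_FILTERS = [
    {"study_id": ["BC"]},                   # chunk 1: BC alone
    {"study_id": ["ON", "AB"]},             # chunk 2: ON and AB together
    {"study_id_not": ["BC", "ON", "AB"]},   # chunk 3: everything else
]


def download_all(portal_download):
    """Run one download per filter so each chunk stays near or under
    ~100k sequences, then return the list of chunk results."""
    chunks = []
    for filt in CHUNK_FILTERS:
        chunks.append(portal_download(filt))
    return chunks
```

The point of the sketch is that the chunk boundaries are hand-tuned to the current data distribution; as provinces grow, the filters would need re-balancing, which is part of why a proper server-side fix is preferable.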

@ghost ghost added the WP_1 label Jun 3, 2022
@b-f-chan b-f-chan assigned ghost Jun 3, 2022
@b-f-chan
Contributor Author

b-f-chan commented Jun 3, 2022

Assigned to @sifavahora to investigate and discuss options with DEV team
