
Need to have and implement general download policy #1079

Closed
jwj61 opened this issue Apr 21, 2016 · 12 comments
Labels: backend (Backend server/database), downloads
Milestone: v1.2

@jwj61
Member

jwj61 commented Apr 21, 2016

Different areas of the site let you download search results. When there are many results, this fails with a proxy error, which actually comes from a timeout on the server. This can easily happen with number fields (there are over 5 million fields in the database), and it will happen in other areas as they expand.

Options I see:

  1. Remove the download buttons when there are too many search results, possibly providing a link to the API page instead.
  2. Put up a message saying the results are being compiled, start a process on the server to build the file (one not subject to the timeout), and then provide the link when the file is ready. Files would have to be cleaned up afterward.

A drawback of (1) is that the download links produce a customized version of the results whereas the API doesn't, and the API output is not as friendly to the systems (Sage, gp, Magma) where the data may be destined. Also, the API currently has a total limit on the number of entries that can be retrieved, although that can be changed. On the other hand, for (2) I am not sure how to set up a side process that is not governed by the server timeout.

@JohnCremona
Member

I don't really understand why the server should time out when it can do the query. Is it caused by the fact that the query produces an iterator, and the iterator itself times out if left open for too long? I ran into this a while back when I was looking through elliptic curves and doing some very nontrivial computations with each one.

If that is the case, here is a solution: (1) make sure that the query which produces the download data only delivers the data fields which are needed; (2) convert the iterator into a list as soon as it is created; (3) create the download file from that list instead of directly from the iterator.

For a quick test, try changing the download_search function call to give it list(res) instead of res and see if it still times out.
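
As a rough illustration of that suggestion (the collection name, field names, and download_search signature here are assumptions for illustration, not the actual number field code):

```python
# Hypothetical sketch only: restrict the query to the needed fields and
# materialize the cursor before handing it to the download code.
needed_fields = ['label', 'coeffs', 'disc_abs']        # assumed field names
res = fields.find(query, needed_fields)                # assumed query call
res = list(res)   # exhaust the cursor immediately so it cannot go stale
response = download_search(info, res)                  # instead of passing the raw cursor
```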

@haraldschilly
Member

As you already said, the proper way to really address this download question is to use a background task manager. The most natural fit for the LMFDB setup is http://www.celeryproject.org/

On the front end, you then periodically poll an endpoint that gives you progress information or other status updates (or none), and when the async task has finished, you present the link to download the result.

The drawback of all this is that it roughly doubles the number of moving parts one has to take care of on the server. Also, for local development, you need to implement some simple fallback for when such a task manager isn't available.
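
A minimal sketch of that pattern, assuming a Redis broker and Flask front end; all names (build_download, write_results_to_file, the routes) are made up for illustration, not existing LMFDB code:

```python
from celery import Celery
from flask import Flask, jsonify, request, url_for

celery = Celery('lmfdb_downloads',
                broker='redis://localhost:6379/0',
                backend='redis://localhost:6379/0')

@celery.task
def build_download(query):
    # Run the long search and write the result file; this worker process is
    # not subject to the web server's request timeout.
    return write_results_to_file(query)   # assumed helper returning a file path

app = Flask(__name__)

@app.route('/download/start')
def start_download():
    # Kick off the background task and tell the browser where to poll.
    task = build_download.delay(dict(request.args))
    return jsonify(status_url=url_for('download_status', task_id=task.id))

@app.route('/download/status/<task_id>')
def download_status(task_id):
    # Polled periodically by the front end until the file is ready.
    task = build_download.AsyncResult(task_id)
    if task.ready():
        return jsonify(state='done', file=task.result)
    return jsonify(state=task.state)   # e.g. PENDING / STARTED
```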

@jwj61
Member Author

jwj61 commented Apr 22, 2016

Thanks Harald. It sounds like it is worth looking into.

@JohnCremona : This is not the same as a cursor going bad. If run locally, the current code works. The Warwick servers for beta and production have timeouts for web requests, and that is what is killing the download.

I have not had a chance to try your suggestion, but here is a reason why it is different. Say you do a search with a million results. The "find" takes no time -- it is essentially recording the parameters of your search. For the search results page, we use skip and limit, and then process say 20 objects. If the user then clicks download, we have to process a million objects, which will take roughly 50000 times as long.
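
To illustrate the asymmetry (MongoDB-style calls; the collection name and counts are assumptions for illustration):

```python
cursor = fields.find(query)                 # cheap: just records the query parameters
page = list(cursor.skip(start).limit(20))   # search results page: ~20 objects processed
full = list(fields.find(query))             # download: all ~1,000,000 objects processed
```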

@JohnCremona
Member

In fact, the point is that for each object to be downloaded a WebNumberField object is currently created, and that is taking much too long. I have a version of the code which avoids that and produces exactly the same output.
Example: all quadratic fields, 1368485 search results, took about 2m15s to create the file. Not brilliant and perhaps worth a warning to the user, but quite possible and no timeout.
I'll make a pull request and then you can try it out.

@JohnCremona
Member

On the same theme: if, for example, one searches for number fields with no constraints, there are 5 million hits of which 20 are displayed, and one is not allowed to download the full search results since there are too many. But I can imagine a user just wanting to download the 20 fields which are displayed (or a larger number if they have asked to see more). This would be easy to implement, here and in all similar situations: we could have both download_all and download_displayed, the former as now (with a limit as now) and the latter just the same but applied after restricting res to the 'count' items after 'start'.
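
Something along these lines (hypothetical names; download_all stands for whatever the existing download code is):

```python
# download_displayed only formats the slice of results currently shown,
# reusing the existing full-download code path.
def download_displayed(info, res, start, count):
    return download_all(info, res[start:start + count])
```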

@AndrewVSutherland AndrewVSutherland added the backend Backend server/database label May 13, 2016
@AndrewVSutherland AndrewVSutherland added this to the v1.2 milestone Jul 11, 2016
@AndrewVSutherland
Member

See #1404 for a related issue.

@roed314
Contributor

roed314 commented Nov 9, 2019

There is partial progress on this issue in lmfdb/utils/downloader.py. Currently the mechanism used there to download doesn't use a stream (the __call__ method ends with a call to _wrap rather than _wrap_generator) but this could be changed reasonably easily if we want to enable downloads of large numbers of search results.
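
For context, a hedged sketch of what a streamed variant might look like in Flask, yielding the file a row at a time instead of building the whole string in memory (the Downloader internals and names here are illustrative, not the actual lmfdb/utils/downloader.py code):

```python
from flask import Response

def stream_download(results, filename):
    def generate():
        yield '# search results\n'
        for rec in results:
            yield '%s\n' % rec
    headers = {'Content-Disposition': 'attachment; filename=%s' % filename}
    return Response(generate(), mimetype='text/plain', headers=headers)
```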

@jwj61
Member Author

jwj61 commented Jun 27, 2020

Another issue with the download policy is how much data to provide:

  1. Minimal information about the objects
  2. Minimal plus everything showing in the search results
  3. Everything we have for the objects

Via e-mail, there was a request which could only have been satisfied by option 3. It would be the slowest and take more time to implement, but if someone wants data on multiple objects and some of that data is hard to compute, 3 seems like the best solution.

@edgarcosta
Member

Getting a minimal version of 3 is easy, since we would just be dumping rows of a table; however, this might be unusable for most users.
On the other hand, some objects already have enough data associated to them that downloading all stored data as text fails.
For example: https://beta.lmfdb.org/ModularForm/GL2/Q/holomorphic/983/2/c/a/

@jwj61
Member Author

jwj61 commented Jun 27, 2020

The idea, though, would be to output it in a format of the user's choosing, so some work would be needed in all areas to produce output in Magma, Pari, etc. formats.
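
Roughly what that could look like (illustrative only; the delimiters and function are assumptions, and the real downloader's conventions may differ):

```python
# Per-system delimiters so downloaded lists can be pasted directly into
# Sage, Magma or gp.
DELIMITERS = {
    'sage':  ('[', ']', '#'),     # (list start, list end, comment marker)
    'magma': ('[*', '*]', '//'),
    'gp':    ('[', ']', '\\\\'),
}

def format_results(rows, lang):
    start, end, comment = DELIMITERS[lang]
    lines = ['%s search results' % comment, start]
    lines.append(',\n'.join(str(r) for r in rows))
    lines.append(end)
    return '\n'.join(lines)
```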

@jvoight
Member

jvoight commented Jul 31, 2020

From associate editors conversation yesterday.

This LMFDB "off ramp" turned out to be considered a priority by a wide subset of editors. The general sense was that we want a flexible ability to download, and that a "raw format" is less useful than something to copy/paste into our favorite computer algebra systems. One suggestion was to give control over the columns of the search results to indicate which fields to download. No consensus emerged, so we need to try some things and see what sticks.

@roed314
Contributor

roed314 commented Nov 8, 2024

@jwj61, I think this issue can be closed in view of the changes in #5702 and followup issue #6221. If we don't want to close this, we should clarify what still needs to be done.

@jwj61 jwj61 closed this as completed Nov 17, 2024