
Need to have and implement general download policy #1079

Closed
jwj61 opened this issue Apr 21, 2016 · 12 comments
Labels: backend (Backend server/database), downloads
Milestone: v1.2

@jwj61
Member

jwj61 commented Apr 21, 2016

Different areas of the site let you download search results. When there are many results, this fails with a proxy error, which actually comes from a timeout on the server. This can easily happen with number fields (there are over 5 million fields in the database), and it will happen in other areas as they expand.

Options I see:

  1. Remove the download buttons when there are too many search results, possibly providing a link to the API page instead.
  2. Put up a message saying the results are being compiled, start a process on the server to build the file (one not subject to the timeout), and then provide the link when the file is ready. Files would have to be cleaned up afterward.

A drawback of (1) is that the download links produce a customized version of the results whereas the API doesn't, and the API output is not as friendly to the systems (Sage, gp, Magma) where the data may be destined. Also, the API currently has a total limit on the number of entries that can be retrieved, although that can be changed. On the other hand, for (2) I am not sure how to set up a side process that is not governed by the server timeout.

@JohnCremona
Member

I don't really understand why the server should time out when it can do the query. Is it caused by the fact that the query produces an iterator, and the iterator itself times out if left open for too long? I ran into this a while back when I was looking through elliptic curves and doing some very nontrivial computations with each one.

If that is the case, here is a solution: (1) make sure that the query which produces the download data only delivers the data fields which are needed; (2) convert the iterator into a list as soon as it is created; (3) create the download file from that list instead of directly from the iterator.

For a quick test, try changing the download_search function call to give it list(res) instead of res and see if it still times out.
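
As a rough illustration of that suggestion (the collection name, field names, and download_search signature here are assumptions for illustration, not the actual number field code):

```python
# Hypothetical sketch only: restrict the query to the needed fields and
# materialize the cursor before handing it to the download code.
needed_fields = ['label', 'coeffs', 'disc_abs']        # assumed field names
res = fields.find(query, needed_fields)                # assumed query call
res = list(res)   # exhaust the cursor immediately so it cannot go stale
response = download_search(info, res)                  # instead of passing the raw cursor
```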

@haraldschilly
Member

As you already said, the proper way to really address this download question is to use a background task manager. The most natural fit for the LMFDB setup is http://www.celeryproject.org/

On the front end, you then periodically poll an endpoint that gives you progress information or other status updates (or none), and when the async task has finished, you present the link to download the result.

The drawback of all this is that it roughly doubles the number of moving parts one has to take care of on the server. Also, for local development, you need to implement some simple fallback for when such a task manager isn't available.
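
A minimal sketch of that pattern, assuming a Redis broker and Flask front end; all names (build_download, write_results_to_file, the routes) are made up for illustration, not existing LMFDB code:

```python
from celery import Celery
from flask import Flask, jsonify, request, url_for

celery = Celery('lmfdb_downloads',
                broker='redis://localhost:6379/0',
                backend='redis://localhost:6379/0')

@celery.task
def build_download(query):
    # Run the long search and write the result file; this worker process is
    # not subject to the web server's request timeout.
    return write_results_to_file(query)   # assumed helper returning a file path

app = Flask(__name__)

@app.route('/download/start')
def start_download():
    # Kick off the background task and tell the browser where to poll.
    task = build_download.delay(dict(request.args))
    return jsonify(status_url=url_for('download_status', task_id=task.id))

@app.route('/download/status/<task_id>')
def download_status(task_id):
    # Polled periodically by the front end until the file is ready.
    task = build_download.AsyncResult(task_id)
    if task.ready():
        return jsonify(state='done', file=task.result)
    return jsonify(state=task.state)   # e.g. PENDING / STARTED
```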

@jwj61
Member Author

jwj61 commented Apr 22, 2016

Thanks Harald. It sounds like it is worth looking into.

@JohnCremona : This is not the same as a cursor going bad. If run locally, the current code works. The Warwick servers for beta and production have timeouts for web requests, and that is what is killing the download.

I have not had a chance to try your suggestion, but here is a reason why it is different. Say you do a search with a million results. The "find" takes no time -- it is essentially recording the parameters of your search. For the search results page, we use skip and limit, and then process say 20 objects. If the user then clicks download, we have to process a million objects, which will take roughly 50000 times as long.
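
To illustrate the asymmetry (MongoDB-style calls; the collection name and counts are assumptions for illustration):

```python
cursor = fields.find(query)                 # cheap: just records the query parameters
page = list(cursor.skip(start).limit(20))   # search results page: ~20 objects processed
full = list(fields.find(query))             # download: all ~1,000,000 objects processed
```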

@JohnCremona
Member

In fact, the point is that for each object to be downloaded a WebNumberField object is currently created, and that is taking much too long. I have a version of the code which avoids that and produces exactly the same output.
Example: all quadratic fields, 1368485 search results, took about 2m15s to create the file. Not brilliant and perhaps worth a warning to the user, but quite possible and no timeout.
I'll make a pull request and then you can try it out.

@JohnCremona
Member

On the same theme: if, for example, one searches for number fields with no constraints, there are 5 million hits of which 20 are displayed, and one is not allowed to download the full search results since there are too many. But I can imagine a user just wanting to download the 20 fields which are displayed (or a larger number if they have asked to see more). This would be easy to implement, here and in all similar situations: we could have both download_all and download_displayed, the former as now (with a limit as now) and the latter just the same but applied after restricting res to the 'count' items after 'start'.
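
Something along these lines (hypothetical names; download_all stands for whatever the existing download code is):

```python
# download_displayed only formats the slice of results currently shown,
# reusing the existing full-download code path.
def download_displayed(info, res, start, count):
    return download_all(info, res[start:start + count])
```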

@AndrewVSutherland AndrewVSutherland added the backend Backend server/database label May 13, 2016
@AndrewVSutherland AndrewVSutherland added this to the v1.2 milestone Jul 11, 2016
@AndrewVSutherland
Member

See #1404 for a related issue.

@roed314
Contributor

roed314 commented Nov 9, 2019

There is partial progress on this issue in lmfdb/utils/downloader.py. Currently the mechanism used there to download doesn't use a stream (the __call__ method ends with a call to _wrap rather than _wrap_generator) but this could be changed reasonably easily if we want to enable downloads of large numbers of search results.
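
For context, a hedged sketch of what a streamed variant might look like in Flask, yielding the file a row at a time instead of building the whole string in memory (the Downloader internals and names here are illustrative, not the actual lmfdb/utils/downloader.py code):

```python
from flask import Response

def stream_download(results, filename):
    def generate():
        yield '# search results\n'
        for rec in results:
            yield '%s\n' % rec
    headers = {'Content-Disposition': 'attachment; filename=%s' % filename}
    return Response(generate(), mimetype='text/plain', headers=headers)
```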

@jwj61
Member Author

jwj61 commented Jun 27, 2020

Another issue with the download policy is how much data to provide:

  1. Minimal information about the objects
  2. Minimal plus everything showing in the search results
  3. Everything we have for the objects

Via e-mail, there was a request which could only have been satisfied by option 3. It would be the slowest and take more time to implement, but if someone wants data on multiple objects and some of that data is hard to compute, 3 seems like the best solution.

@edgarcosta
Member

Getting a minimal version of 3 is easy, since we would just be dumping rows of a table; however, this might be unusable for most users.
On the other hand, some objects already have enough data associated to them that downloading all stored data as text fails.
For example: https://beta.lmfdb.org/ModularForm/GL2/Q/holomorphic/983/2/c/a/

@jwj61
Member Author

jwj61 commented Jun 27, 2020

The idea, though, would be to output it in a format of the user's choosing, so some work would be needed in all areas to produce output in Magma, Pari, etc. formats.
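
Roughly what that could look like (illustrative only; the delimiters and function are assumptions, and the real downloader's conventions may differ):

```python
# Per-system delimiters so downloaded lists can be pasted directly into
# Sage, Magma or gp.
DELIMITERS = {
    'sage':  ('[', ']', '#'),     # (list start, list end, comment marker)
    'magma': ('[*', '*]', '//'),
    'gp':    ('[', ']', '\\\\'),
}

def format_results(rows, lang):
    start, end, comment = DELIMITERS[lang]
    lines = ['%s search results' % comment, start]
    lines.append(',\n'.join(str(r) for r in rows))
    lines.append(end)
    return '\n'.join(lines)
```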

@jvoight
Member

jvoight commented Jul 31, 2020

From associate editors conversation yesterday.

This LMFDB "off ramp" turned out to be considered a priority by a wide subset of editors. The general sense was that we want a flexible ability to download, and that a "raw format" is less useful than something to copy/paste into our favorite computer algebra systems. One suggestion was to give control over the columns of the search results to indicate which fields to download. No consensus emerged, so we need to try some things and see what sticks.

@roed314
Contributor

roed314 commented Nov 8, 2024

@jwj61, I think this issue can be closed in view of the changes in #5702 and followup issue #6221. If we don't want to close this, we should clarify what still needs to be done.

@jwj61 jwj61 closed this as completed Nov 17, 2024