Need to have and implement general download policy #1079
I don't really understand why the server should time out when it can do the query. Is it caused by the fact that the query produces an iterator, and the iterator itself times out if left open for too long? I ran into this a while back when I was looking through elliptic curves and doing some very nontrivial computations with each one. If that is the case, here is a solution: (1) make sure that the query which produces the download data only delivers the data fields which are needed; (2) convert the iterator into a list as soon as it is created; (3) create the download file from that list instead of directly from the iterator. For a quick test, try changing the download_search function call to give it list(res) instead of res and see if it still times out.
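A minimal sketch of that three-step fix, assuming a pymongo-style collection `db.fields`; the projected field names and the `get_download_data` helper are illustrative, not the real LMFDB schema or code:

```python
from pymongo import MongoClient

db = MongoClient()["numberfields"]  # hypothetical database name

def get_download_data(query):
    # (1) project only the fields the download needs, so whole
    #     documents are never shipped over the wire
    cursor = db.fields.find(query, {"label": True, "coeffs": True})
    # (2) materialize the cursor immediately; a cursor left open
    #     while the file is being built can time out server-side
    return list(cursor)

# (3) build the file from the list rather than the live iterator:
#     download_search(info, list(res)) instead of download_search(info, res)
```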
As you already said, the proper way to really address this download question is to use a background task manager. The most natural fit for the LMFDB setup is Celery: http://www.celeryproject.org/
Regarding the front end, you would then periodically poll an endpoint that gives you progress information or other status updates (or none), and at some point, when the async task has finished, present the link to download the result. The drawback of all this is that it roughly doubles the number of moving parts one has to take care of on the server. Also, for local development, you need to implement some simple fallbacks for when such a task manager isn't available.
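A rough sketch of what that could look like with Flask and Celery, assuming a Redis broker; the task and endpoint names here (`prepare_download`, `/download-status`) and the `build_download_file` helper are invented for illustration, not existing LMFDB code:

```python
from celery import Celery
from flask import Flask, jsonify

flask_app = Flask(__name__)
celery = Celery("lmfdb_downloads",
                broker="redis://localhost:6379/0",
                backend="redis://localhost:6379/0")

@celery.task
def prepare_download(query):
    # run the expensive search and file generation outside the web
    # request; build_download_file is a hypothetical helper that
    # writes the result to disk and returns its path
    return build_download_file(query)

@flask_app.route("/download-status/<task_id>")
def download_status(task_id):
    # the front end polls this endpoint until the task finishes,
    # then shows a link to the finished file
    result = prepare_download.AsyncResult(task_id)
    if result.ready():
        return jsonify(state="done", file=result.get())
    return jsonify(state=result.state)

# kicking off a download returns immediately with a task id to poll:
#     task_id = prepare_download.delay(query).id
```

A local-development fallback could simply call the task function synchronously when no broker is configured.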
Thanks Harald. It sounds like it is worth looking into. @JohnCremona: This is not the same as a cursor going bad. If run locally, the current code works. The Warwick servers for beta and prod have timeouts for web requests, and that is what is killing the download. I have not had a chance to try your suggestion, but here is a reason why it is different. Say you do a search with a million results. The "find" takes no time; it is essentially recording the parameters of your search. For the search results page, we use skip and limit, and then process, say, 20 objects. If the user then clicks download, we have to process a million objects, which will take roughly 50,000 times as long.
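To illustrate the asymmetry (pymongo-style; the collection and the rendering helpers are placeholders):

```python
res = db.fields.find(query)             # lazy: only records the query
for rec in res.skip(start).limit(20):   # results page touches ~20 docs
    render_row(rec)                     # hypothetical renderer; fast

for rec in db.fields.find(query):       # download walks every hit, so a
    write_download_line(rec)            # million-hit query does ~50,000x
                                        # the work of one results page
```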
In fact the point is that for each object to be downloaded, a WebNumberField object is currently created, and that is taking much too long. I have a version of the code which avoids that and produces exactly the same output.
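A hedged sketch of that kind of change, assuming each stored record already contains the values the download needs (the field names are illustrative):

```python
def download_line(rec):
    # slow path: wnf = WebNumberField(rec["label"]); ...
    # (constructs a heavyweight wrapper object per downloaded record)
    # fast path: read the raw stored fields directly
    return "%s %s\n" % (rec["label"], rec["coeffs"])
```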
On the same theme: if, for example, one searches for number fields with no constraints, there are 5 million hits, of which 20 are displayed, and one is not allowed to download the full search results since there are too many. But I can imagine a user just wanting to download the 20 fields which are displayed (or a larger number if they have asked to see more). This would be easy to implement here and in all similar situations: we could have both download_all and download_displayed, the former as now (with a limit as now) and the latter just the same but applied after restricting res to the 'count' items after 'start'; see the sketch below.
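A sketch of download_displayed alongside download_all, reusing the 'start' and 'count' paging values the results page already has; download_search is the existing call mentioned above, and the rest is assumed:

```python
def download_displayed(info, res, start, count):
    # identical machinery to download_all, but restricted to the
    # slice of results the user is currently looking at
    shown = list(res.skip(start).limit(count))
    return download_search(info, shown)
```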
See #1404 for a related issue. |
There is partial progress on this issue in |
Another issue with the policy for downloads is how much data to give.
Via e-mail, there was a request which would only have been satisfied by option 3. It would be the slowest and take the most time to implement, but if someone wants data on multiple objects and some of that data is hard to compute, option 3 seems like the best solution.
Getting a minimal version of option 3 is easy, as we would just be dumping rows of a table; however, this might be unusable for most users.
The idea, though, would be to output the data in a format of the user's choosing, so some work would be needed in all areas to produce output in Magma, Pari/GP, etc. formats.
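One possible shape for that, sketched under assumed delimiters and comment syntax for each system (none of this is existing LMFDB code, and real per-record formatting would also be language-dependent):

```python
LANG = {
    "sage":  {"comment": "#",    "open": "[",  "close": "]"},
    "magma": {"comment": "//",   "open": "[*", "close": "*]"},
    "gp":    {"comment": "\\\\", "open": "[",  "close": "]"},
}

def format_download(records, lang, header=""):
    # emit a commented header plus one list/sequence literal in the
    # target system's syntax; str(r) stands in for real formatting
    spec = LANG[lang]
    body = ",\n".join(str(r) for r in records)
    return "%s %s\n%s\n%s\n%s" % (spec["comment"], header,
                                  spec["open"], body, spec["close"])
```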
From the associate editors' conversation yesterday: this LMFDB "off ramp" turned out to be considered a priority for a
Different areas let you try to download search results. When there are lots of results, this fails with a proxy error, which really comes from a timeout on the server. This can easily happen with number fields (there are over 5 million fields in the database), and it will happen in other areas as they expand.
Options I see:
A drawback of (1) is that the download links produce a customized version of the results whereas the API doesn't, and the result is not as friendly to the systems (Sage, GP, Magma) where the data may be destined. Also, the API currently has a total limit on the number of entries to be retrieved, although that can be changed. On the other hand, I am not sure how to create a side process that is not governed by the server timeout.