-
Notifications
You must be signed in to change notification settings - Fork 312
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support pagination #962
Comments
Looks like we could use https://uriyyo-fastapi-pagination.netlify.app for Jupyverse. |
Cross-referencing previous related discussion: #538 #538 jupyterlab/jupyterlab#8700 |
Thanks for opening this issue. Do we know where the cost is occurring? (I apologize that I haven't caught up on the related discussions in Lab.) If I touch 10,000 files, the GET request is around 1.7 secs (which isn't cheap, although direct Is the idea that we'd capture the result-set and hold on to it across (stateful) requests? Since If we were to forgo state management, we could recalculate on each access but could encounter "missing" or "duplicate" items if file-system changes affected a previously traversed page. This approach doesn't seem very user-friendly. I also see Lab repeating the GET request about every 10 seconds or so, presumably in an attempt to keep its presentation consistent with the actual "file system". I'm assuming this would constitute another series of traversals (i.e., a fresh result-set) and essentially enable the ability to "see changes" that occurred during the previous traversal (as is the case today). As a result, we'd need to be able to distinguish the difference between a fresh GET request vs. a traversal GET request that happens to use the same parameters (
I'm not following the concern here. Isn't the last page specified in the link header? |
I don't think so, we want to keep the server state-less. I think that if a client wants a paginated result, it takes a risk of getting inconsistent results. |
I agree with this in principle, although the server is already stateful when it comes to kernel management (unless you proxy to a Gateway server) and Sessions for that matter. Introducing state is definitely a commitment and something we should be in agreement with - which is why I asked if we know where the cost really is. If the cost is building the response then perhaps we can introduce an option that responds with a "raw payload". But if the true cost is in sorting and rendering the results, it seems like this should be addressed in the client (which appears to be what much of the discussion is about in the lab issues). If we retained a stateless approach but implemented paging, we're looking at the 7-8 ms (for large directories) plus sorting of the entire results per page requested, along with the cost to build the response for that page. This would make things worse than they are now since we'd be multiplying the overall cost by a factor of N (pages) that would otherwise be amortized in a stateful approach. This seems like an expensive proposition only to reduce the payload to a page of results. |
There is no sorting needed on the server side, right? |
If the results of the |
But I don't know if "arbitrary order" means "not reproducible order"? If there is no particular order but two consecutive calls on the same file system yield the same result, that's fine. |
And from what I can see, it is reproducible. |
True, (and great!) although I think the end-user experience will be odd since you'd see results from each page interleaved amongst the existing pages because that display is sorted. As a result, the user wouldn't see the correct set of entries until all entries have been fetched. |
I'd say the end-user experience if the responsibility of the client. It could display a spinner showing that the directory content is being rendered, for instance. |
I am not following the discussion here closely, but wanted to voice
agreement that the REST APIs should remain stateless, especially as the
server is used in multi-user settings now.
…On Wed, Sep 7, 2022 at 9:07 AM David Brochart ***@***.***> wrote:
I'd say the end-user experience if the responsibility of the client. It
could display a spinner showing that the directory content is being
rendered, for instance.
—
Reply to this email directly, view it on GitHub
<#962 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAGXUGXITAXT2R6R7NYXJLV5C4UTANCNFSM6AAAAAAQEZL6FI>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
--
Brian E. Granger
Senior Principal Technologist, AWS AI/ML ***@***.***)
On Leave - Professor of Physics and Data Science, Cal Poly
@ellisonbg on GitHub
|
Yes, I believe we're all in agreement here. It's really a matter of protecting the server from making things worse and paging w/o state, IMHO, makes things worse - for both the server and end-user. @fcollonval @krassowski - what would Lab do with paging? Would it retain a partial result set based on where the user's cursor is within the results - discarding other results that are (sufficiently) "out of view"? Would it assume the results are ordered relative to the complete result set? Or would it essentially build a complete result set, retaining all previous sets (which would imply the crux of the issue is simply the response containing the complete results)? |
Thanks for the discussion. Preliminary comment: to clarify the proposal, the new query arguments will be transmitted to the content manager. So virtual file system backed by database can optimize requests.
The first bottleneck is usually the JupyterLab frontend (too much DOM nodes to be handled). But in a previous discussion, developers at Siemens saw that the cost of calling
I would go for a stateless request too. I tested in Gmail; there, pagination is stateless because if you receive emails when navigating, the page N will not match the previous page N as elements will be shifted by number of new elements received. I think this is fine. But it will forbid endless scrolling in frontend. Regarding the sorting of items, this is actually a pretty good point. It seems that we should in addition to pagination add query argument for ordering. But that raises the question of which value would be accepted for ordering; it seems that in Notebook and JupyterLab only the name and the last modification time are available. But the name ordering is not guaranteed by definition of
Indeed this is to keep the displayed list consistent with the backend content.
The question is how to return the last page from the
From all previous elements, the best approach will be a panel displaying only one page at a time (that gets refresh periodically or when we file system actions is triggered from the interface). Lately to add one more element in the discussion, in JupyterLab the user can filter the list of folder items. To be consistent with pagination, this sounds like yet another query arguments to pass to the backend... Another approach A simpler API could probably be to only support pagination to provide directory listing by chunks for the frontend. So the UI can quickly show something but the final list will only be available when all chunks have been obtained. Something similar to endless scrolling list - but with side effect as sorting and filter may change the order of the initial list. |
Maybe we should wait and take advantage of the event system to implement that feature. I think it would be a good fit because in the case of huge directories, polling the REST API at regular intervals is extremely inefficient anyway, as the server basically has to send almost the same information repeatedly. With the event system, we can send only the file system changes, which should be very small even for big directories. Initially sending the whole directory content could be done when a client connects, and in chunks. |
I've gone ahead and added this to the agenda for tomorrow's Jupyter Server meeting for those interested in this topic. I hope the interested parties can join. |
I would rather not mix API to achieve this by virtue of the separation of concerns. |
I didn't mean to use both APIs, but one or the other. |
The event API will not provide you the initial directory listing, is it? So how will you get it except by using the contents API?
This is partially true as the contents API is still used for anything other than file edition supporting collaboration. |
Yes, we can decide that it provides the initial directory listing on client connection. |
IIUC, lab only issues content requests for the currently viewed directory. So if the user navigates to another directory (even a sub-directory of the former), it must use the Contents service to acquire the initial set of files. It seems that an event-based approach would only satisfy changes to the existing view (directory). Is that correct? |
The event system could send initial content and changes of the whole directory tree (including sub-directories). |
Some discussion points from today's meeting that might be simpler than introducing pagination:
|
Advantages of the above: Relies on standard HTTP headers to be do all the needed changes, so that clients can negotiate with the server without any breakages on either side. E.g.
|
Note that there is also improvements to be made on the server side by using https://docs.python.org/3/library/os.html#os.scandir instead of
|
Problem
It will be good to support optionally pagination for get requests (at least for the content, sessions and kernels are probably less relevant).
Proposed Solution
Implement pagination using GitHub logic (except the cursor-based pagination): https://docs.github.com/en/rest/overview/resources-in-the-rest-api#pagination
The idea is to implement this in a backward compatible manner so that
ContentsManager
that do not accept pagination will work and no Link header will be set to notify the frontend. And vice-versa, frontend not sending query argument per_page will receive all elements.On the Python side,
ContentsManager.get
will have two new keywords:page
andper_page
.The question is how to return the last page. I propose to add it to the model returned by
get
as an optional attribute.The handler will take care of setting the Link header if pagination is supported.
Pagination
Requests that return multiple items will be paginated to 30 items by default. You can specify further pages with the page parameter. For some resources, you can also set a custom page size up to 100 with the per_page parameter. Note that for technical reasons not all endpoints respect the per_page parameter, see events for example.
For more information on pagination, check out our guide on Traversing with Pagination.
Link header
Note: It's important to form calls with Link header values instead of constructing your own URLs.
The Link header includes pagination information. For example:
The example includes a line break for readability.
This Link response header contains one or more Hypermedia link relations, some of which may require expansion as URI templates.
The possible rel values are:
Additional context
UI responsiveness is highly reduced when a folder contains lots of items. This will allow to improve that.
The text was updated successfully, but these errors were encountered: