Cache strategy questions. #438

Open
mgautierfr opened this issue Oct 27, 2020 · 6 comments

Comments

@mgautierfr
Collaborator

Following PR #430, here are a few open questions about the cache strategy in libzim:

  • Should we still use a dirent cache? We have greatly improved the cache performance with previous commits, but it seems that this cache may become useless: we are greatly reducing the search range, so there is less chance that we search in the same range again.
  • We may introduce a url->dirent cache instead of the current one (index->dirent), as some content may be loaded several times (main pages, CSS, ...).
  • Should we keep the namespace cache? We now use it only to count the number of articles in each namespace. We don't use the boundaries to search content.
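
To make the second bullet concrete, a url->dirent cache could be sketched as a small LRU map. This is only an illustration under stated assumptions: `DirentStub` and `UrlDirentCache` are hypothetical names, not libzim's API, and the LRU eviction policy and fixed capacity are choices made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a parsed directory entry (not libzim's Dirent class).
struct DirentStub {
    std::uint32_t index;  // position of the entry in the URL-ordered list
    std::string title;
};

// Minimal LRU cache keyed by URL. A sketch, not libzim's actual cache.
class UrlDirentCache {
public:
    explicit UrlDirentCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns the cached dirent and marks it most recently used,
    // or nullptr on a miss. The pointer is valid until the entry is evicted.
    const DirentStub* get(const std::string& url) {
        auto it = lookup_.find(url);
        if (it == lookup_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }

    // Inserts or refreshes an entry, evicting the least recently used one when full.
    void put(const std::string& url, DirentStub dirent) {
        if (capacity_ == 0) return;
        auto it = lookup_.find(url);
        if (it != lookup_.end()) {
            it->second->second = std::move(dirent);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (lookup_.size() >= capacity_) {
            lookup_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(url, std::move(dirent));
        lookup_[url] = order_.begin();
    }

    std::size_t size() const { return lookup_.size(); }

private:
    using Item = std::pair<std::string, DirentStub>;
    std::size_t capacity_;
    std::list<Item> order_;  // front = most recently used
    std::unordered_map<std::string, std::list<Item>::iterator> lookup_;
};
```

A reader would call `get(url)` first and fall back to the binary search over dirents only on a miss, then `put` the result. Frequently requested entries such as the main page or a shared stylesheet would then skip the search entirely.
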
@Jaifroid

Jaifroid commented Nov 1, 2020

It seems to me that these questions could only be decided definitively with some metrics comparing time to first paint, binary search times (for complex searches) and time to complete load including all assets, in a large ZIM file on a lower-spec device (e.g. mobile).

All I can say is that in Kiwix JS, because JS file access is relatively slow, we have found a combination of a persistent assets cache and a separate Directory Entry cache (which speeds up binary search) provides very acceptable performance even with very slow file access (e.g. over a network).

@kelson42 kelson42 added this to the 7.3.0 milestone Feb 3, 2022
@kelson42
Contributor

How much work would it be to build a url->dirent cache? To me it seems we could close this ticket once this is implemented.

@kelson42
Contributor

@mgautierfr @veloman-yunkan Two years have passed. I wonder if the terms of the problems are still the same?

@veloman-yunkan
Collaborator

  • We may introduce a url->dirent cache instead of the current one (index->dirent), as some content may be loaded several times (main pages, CSS, ...).

From my experience with zimcheck, the performance of the internal link check would benefit from such a cache. On the other hand, a custom cache in zimcheck that only keeps track of whether a given internal URL is valid would be more efficient. The fastest approach would be to build a read-only, memory-efficient, trie-like data structure over all entries of the ZIM file and use it to answer "is this internal link valid?" queries.
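
As a rough illustration of the trie idea, the sketch below builds a plain character trie once and then answers membership queries without touching the archive. It is a minimal sketch, not zimcheck code: `UrlTrie` is a hypothetical name, the URLs are assumed to be already collected into a vector, and a pointer-per-node trie like this one is not memory-efficient (a real implementation would likely use a compact, array-based layout).

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Minimal character trie answering "is this URL an entry of the archive?".
// Built once from the full list of entry URLs, then only queried.
class UrlTrie {
public:
    explicit UrlTrie(const std::vector<std::string>& urls) {
        for (const auto& url : urls) insert(url);
    }

    // True only if the exact URL was inserted (a mere prefix does not count).
    bool contains(const std::string& url) const {
        const Node* node = &root_;
        for (char c : url) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;
            node = it->second.get();
        }
        return node->terminal;
    }

private:
    struct Node {
        bool terminal = false;  // a URL ends at this node
        std::map<char, std::unique_ptr<Node>> children;
    };

    void insert(const std::string& url) {
        Node* node = &root_;
        for (char c : url) {
            auto& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->terminal = true;
    }

    Node root_;
};
```

Because entry URLs share long prefixes (namespace, directory), a trie stores each shared prefix once, and a lookup costs time proportional to the URL length rather than to the number of entries.
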

@mgautierfr
Collaborator Author

I wonder if the terms of the problems are still the same?

Not a lot has been done on this side since the last cache strategy work (no hidden improvements), so the questions are still the same.

We have to do some measurements before trying to improve things, even if @veloman-yunkan's analysis seems sound.

@kelson42 kelson42 modified the milestones: 8.2.0, 8.3.0 Apr 6, 2023
@kelson42 kelson42 modified the milestones: 9.0.0, 9.1.0 Sep 26, 2023
@kelson42 kelson42 modified the milestones: 9.1.0, 10.0.0 Nov 1, 2023
@mgautierfr
Collaborator Author

I have opened a new issue, #946, to narrow down the scope a bit.

This issue is pretty wide in scope, and it is difficult (if not impossible) to really close it without proper measurement of where we spend time, and therefore of where we should add, remove, or improve caching.
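
One possible starting point for such measurements is to count hits and misses per cache and time the slow path taken on a miss. The sketch below is an assumption about how that could look, not existing libzim instrumentation; `CacheStats` and `timed_miss` are hypothetical names.

```cpp
#include <chrono>
#include <cstdint>

// Per-cache counters: how often the cache helped and how much time misses cost.
struct CacheStats {
    std::uint64_t hits = 0;
    std::uint64_t misses = 0;
    std::chrono::nanoseconds miss_time{0};

    // Fraction of lookups served from the cache (0.0 when nothing was recorded).
    double hit_rate() const {
        const std::uint64_t total = hits + misses;
        return total == 0 ? 0.0 : static_cast<double>(hits) / static_cast<double>(total);
    }
};

// Runs the slow-path loader, records one miss and the time it took,
// and returns whatever the loader produced.
template <typename Fn>
auto timed_miss(CacheStats& stats, Fn&& load) -> decltype(load()) {
    const auto start = std::chrono::steady_clock::now();
    auto value = load();
    stats.miss_time += std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now() - start);
    ++stats.misses;
    return value;
}
```

A cache with a low hit rate and a cheap miss path is a candidate for removal, while a high accumulated `miss_time` points to where a new or larger cache would pay off.
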
