About cache limited by memory #947
After discussion with @kelson42, @rgaudin and @benoit74, here is what we agreed on and what I propose. We concentrate on the cluster cache only for now (the question of preloading cache configuration will be handled in #946). We assume that the dirent cache is pretty small in comparison to the cluster cache, so we ignore it.
The idea is to use a fixed size (computed at the time we add the cluster to the cache) and not have to recompute its size each time we simply use the cluster. We change the configuration of the cluster cache from a number of cached clusters to a "maximum" memory limit. Please note that:
OK to me, but what seems mandatory to me is a global monitoring of memory consumption, not a global cache. If doing a global cache - which is "disconnected" from each ZIM reader - brings a lot of challenges, then this should be considered carefully.
OK to me if it delivers a reasonably functioning cache, i.e. (1) it efficiently prevents memory exhaustion and (2) it brings the value we expect from a cache: delivering immediately the often-requested content.
OK to me if it allows proper tracking of the cache size. I cannot judge exactly what is implied by what is written.
No environment variable, only API. This is a feature only for the devs using the libzim.
This is 100% new to me, I would have expected the exact size of the cluster. What is the worst case scenario here? By how many percent could this approximation be wrong?
This is the whole purpose of this issue and the main reason I didn't want to do it right now without thinking twice about it.
This is due to:
The worst case scenario is the one where we are always reading the last blob of all the clusters in the cache.
Difficult to say right now, but: 1MB / (2MB * 1.1) => 45%. The memory allocated by the decompression context is not a known value; it depends on the algorithm and the compression level used. But for zstd level 19 (our default), the window is 8MB, so the memory allocated is at least 8MB. So 1MB / (2MB * 1.1 + structure size + 8MB + other decompression context memory) => less than 10%.
@mgautierfr We need to stop talking about all the memory needed to run the software. If we talk about the memory cache size, this is only the cache size, not the memory allocated to fill it. I don't understand your computation, but basically it seems that between the cache size measured/tracked and the real cache size used in memory, the error can be at most 10%. Do I understand right?
I am not speaking about the memory needed to run the software. I am speaking about the memory that we will store (directly or indirectly) in the cache.
This value depends on the algorithm used to compress the cluster and on the cluster size. They are decided at ZIM file creation, so at reading time we cannot guarantee anything.
OK, but considering the ZIM file landscape we have out there... there is almost zero chance that we are wrong by more than 10%, which is acceptable to me.
Today, the internal cache system is limited by the number of items in the cache.
And this is a cache per ZIM file, so opening several ZIM files increases the total number of items kept.
On top of that, limiting by number of items is kind of useless for the user, as we don't know the memory used by each item.
From a lot of discussion, it seems that we all agree that it would be better to limit the cache with a memory limit.
This issue tries to list all the changes that this implies in libzim.
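To make the proposal concrete, here is a minimal sketch (hypothetical names, not the actual libzim API) of a cluster cache bounded by a memory budget rather than an item count, where each cluster's cost is estimated once, at insertion time, and never recomputed on later reads:

```cpp
// Hypothetical sketch, not the actual libzim API: a cluster cache bounded by
// a memory budget. The cost of an entry is estimated once, at insertion time,
// and never recomputed when the cluster is later read from the cache.
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

struct Cluster {
  std::vector<uint8_t> data;  // (simplified) decompressed cluster content
};

// Fixed cost estimate, decided when the cluster enters the cache.
// The 1.1 factor is an assumed bookkeeping overhead, not a measured value.
inline size_t estimatedCost(const Cluster& c) {
  return static_cast<size_t>(c.data.size() * 1.1);
}

class ClusterCache {
 public:
  explicit ClusterCache(size_t maxMemory) : maxMemory_(maxMemory) {}

  void put(uint32_t index, std::shared_ptr<Cluster> cluster) {
    const size_t cost = estimatedCost(*cluster);  // computed once
    currentMemory_ += cost;
    entries_[index] = {std::move(cluster), cost};
    // Eviction (see the LRU sketch in the "Cache eviction" part below)
    // would run here whenever currentMemory_ exceeds maxMemory_.
  }

  std::shared_ptr<Cluster> get(uint32_t index) const {
    auto it = entries_.find(index);
    return it == entries_.end() ? nullptr : it->second.cluster;
  }

  // True whenever the fixed-size estimates stored so far exceed the budget.
  bool overBudget() const { return currentMemory_ > maxMemory_; }

 private:
  struct Entry {
    std::shared_ptr<Cluster> cluster;
    size_t cost;  // remembered so that eviction can subtract it back
  };
  size_t maxMemory_;
  size_t currentMemory_ = 0;
  std::unordered_map<uint32_t, Entry> entries_;
};
```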
We have to compute the size of each item.
For dirents, it is the sum of:
For clusters, it is the sum of:
The variable size of a cluster means that the memory used by items stored in the cache may increase after we have added them, without any modification of the cache itself.
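As a purely illustrative sketch (the components here are assumptions, not the exact sums the two lists above refer to), the per-item estimation could look like this: a dirent costs roughly its fixed struct size plus its url and title strings, and a cluster costs its (possibly estimated) data size plus an allowance for the decompression context:

```cpp
// Hypothetical per-item size estimation; the exact components are assumptions
// made for illustration only.
#include <cstddef>
#include <string>

struct Dirent {
  std::string url;
  std::string title;
  // ... other fixed-size fields elided
};

inline size_t estimatedMemorySize(const Dirent& d) {
  // Fixed part of the struct plus the heap storage of its strings.
  return sizeof(Dirent) + d.url.capacity() + d.title.capacity();
}

struct ClusterInfo {
  size_t dataSize;             // (estimated) uncompressed data size
  size_t decompressionCtxSize; // e.g. ~8MB window for zstd level 19 (see above)
};

inline size_t estimatedMemorySize(const ClusterInfo& c) {
  return sizeof(ClusterInfo) + c.dataSize + c.decompressionCtxSize;
}
```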
We have to create a global cache.
As we want to limit the cache memory size of the whole process, we want only one cache for the whole process.
(Or maybe only one limit shared by all caches)
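A minimal sketch of the "one limit shared by all caches" variant (hypothetical names): each per-archive cache keeps its own entries but charges a single process-wide budget, and evicts whenever that budget is exceeded:

```cpp
// Hypothetical sketch of "one limit shared by all caches": every per-archive
// cache keeps its own map, but they all charge the same process-wide budget.
#include <atomic>
#include <cstddef>

class SharedMemoryBudget {
 public:
  explicit SharedMemoryBudget(size_t limit) : limit_(limit) {}

  // Charge `bytes` to the budget; returns true if we are still under the limit.
  bool charge(size_t bytes) {
    return used_.fetch_add(bytes, std::memory_order_relaxed) + bytes <= limit_;
  }
  void release(size_t bytes) {
    used_.fetch_sub(bytes, std::memory_order_relaxed);
  }
  bool overLimit() const {
    return used_.load(std::memory_order_relaxed) > limit_;
  }

 private:
  const size_t limit_;
  std::atomic<size_t> used_{0};
};

// Each per-archive cache would hold a reference to the same SharedMemoryBudget
// and evict its own entries whenever overLimit() returns true.
```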
There are two unifications to do:
So we have to extend the key from a u32 (`entry_index_type` or `cluster_index_type`) to a tuple `(u8[16], u8, u32)` (zim_uuid, type, index). So the key size goes from 4 bytes to 21 bytes (likely 24 bytes with padding).
As we have a cache precisely to avoid parsing a (16 bytes + title + url) sized buffer, and the cache key will have to be hashed and compared each time we want to access an item, we may want to investigate a bit here.
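A sketch of what such an extended key could look like (the field layout follows the description above; the hash is shown only for illustration). On common ABIs the 21 bytes of payload are padded to 24, and the whole key has to be hashed and compared on every cache access:

```cpp
// Sketch of the extended cache key (zim_uuid, type, index) described above.
// On common ABIs the 21 bytes of payload are padded to 24 bytes.
#include <cstddef>
#include <cstdint>
#include <cstring>

struct CacheKey {
  uint8_t zim_uuid[16];
  uint8_t type;    // e.g. 0 = dirent, 1 = cluster (assumed encoding)
  uint32_t index;  // entry_index_type or cluster_index_type

  bool operator==(const CacheKey& o) const {
    return std::memcmp(zim_uuid, o.zim_uuid, 16) == 0 &&
           type == o.type && index == o.index;
  }
};

// The whole key has to be hashed on every access, which is the cost the
// paragraph above suggests investigating. FNV-1a is used here purely as an
// illustration.
struct CacheKeyHash {
  size_t operator()(const CacheKey& k) const {
    uint64_t h = 14695981039346656037ull;
    auto mix = [&h](uint8_t b) { h = (h ^ b) * 1099511628211ull; };
    for (uint8_t b : k.zim_uuid) mix(b);
    mix(k.type);
    for (int i = 0; i < 4; ++i) mix(static_cast<uint8_t>(k.index >> (8 * i)));
    return static_cast<size_t>(h);
  }
};
// Usage: std::unordered_map<CacheKey, Entry, CacheKeyHash>
```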
If, in the future, we implement a `url->dirent` cache (as suggested in #438), the key itself will have to change again.
Cache eviction
There are three moments when we want to remove entries from the cache:
The first point is pretty easy. The other points need us to define an algorithm to select which items to remove.
A basic LRU cache eviction, removing items until we have freed enough memory, may be enough. Or not.
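A minimal sketch of that basic LRU option (hypothetical names, reusing the fixed per-item cost idea from above): entries are dropped in least-recently-used order until the tracked memory fits back under the limit:

```cpp
// Hypothetical memory-limited LRU cache: evict least-recently-used entries
// until the tracked memory is back under the limit.
#include <cstddef>
#include <functional>
#include <list>
#include <unordered_map>
#include <utility>

template <typename Key, typename Value, typename Hash = std::hash<Key>>
class MemoryLimitedLru {
 public:
  explicit MemoryLimitedLru(size_t maxMemory) : maxMemory_(maxMemory) {}

  // Assumes `key` is not already cached; a real implementation would handle
  // updates and subtract the old cost first.
  void put(const Key& key, Value value, size_t cost) {
    lruOrder_.push_front(key);
    entries_.emplace(key, Entry{std::move(value), cost, lruOrder_.begin()});
    currentMemory_ += cost;
    evictUntilFits();
  }

  Value* get(const Key& key) {
    auto it = entries_.find(key);
    if (it == entries_.end()) return nullptr;
    // Bump to the most-recently-used position.
    lruOrder_.splice(lruOrder_.begin(), lruOrder_, it->second.lruIt);
    return &it->second.value;
  }

 private:
  struct Entry {
    Value value;
    size_t cost;
    typename std::list<Key>::iterator lruIt;
  };

  void evictUntilFits() {
    // Never evict the entry that was just inserted (hence size() > 1).
    while (currentMemory_ > maxMemory_ && lruOrder_.size() > 1) {
      const Key victim = lruOrder_.back();
      auto it = entries_.find(victim);
      currentMemory_ -= it->second.cost;
      entries_.erase(it);
      lruOrder_.pop_back();
    }
  }

  size_t maxMemory_;
  size_t currentMemory_ = 0;
  std::list<Key> lruOrder_;
  std::unordered_map<Key, Entry, Hash> entries_;
};
```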
Out of scope:
This issue is only about limiting the memory usage.
Our cache system also contains other caches which (mainly) cause issues other than memory.
FastDirentLookup and the yet-to-be-implemented xapian preloading (#617) consume time at ZIM opening. #946 may still be necessary to configure them.
This will of course not limit the memory usage of the whole process. Opening too many ZIM files and reading from them means that we have to keep at least several clusters in memory (not necessarily in the cache). On low-memory devices this can be a problem. This is not new, and pretty rare. It is just that limiting the cache will not magically fix it.