Limit the cluster cache by memory instead of number of clusters #956
base: cache_config
Changes from all commits: 5d0a98a, 216cfce, 96e3db6, 2d4c2b3, 3b88a81, be24ef7, f41bb7f, 28cd5e2, 10134b0
First changed file (the Cluster implementation):

@@ -86,7 +86,8 @@ getClusterReader(const Reader& zimReader, offset_t offset, Cluster::Compression*
 Cluster::Cluster(std::unique_ptr<IStreamReader> reader_, Compression comp, bool isExtended)
   : compression(comp),
     isExtended(isExtended),
-    m_reader(std::move(reader_))
+    m_reader(std::move(reader_)),
+    m_memorySize(0)
 {
   if (isExtended) {
     read_header<uint64_t>();
@@ -180,3 +181,33 @@ getClusterReader(const Reader& zimReader, offset_t offset, Cluster::Compression*
   }
 }
+
+// This function must return a constant size for a given cluster.
+// This is important: when the cluster is removed from the cache, we want to
+// subtract the same size that we added when it entered.
+// However, because of partial decompression, the actual size can change:
+// - As decompression advances, new blob readers are created in `m_blobReaders`.
+// - The stream itself may allocate memory.
+// To handle the first point we take an average: we assume that half of a
+// cluster's blob readers will eventually be created, and so we charge a
+// readers size of half the full uncompressed cluster data size.
+// It also appears that by the time we query the stream's size, it has reached
+// a state where it will not allocate any further memory. Probably because:
+// - we have already started to decompress the stream in order to read the offsets;
+// - the cluster data size is smaller than the window size associated with the
+//   compression level (?).
+// We check this anyway and print a warning if it does not hold, hoping that the
+// user will open an issue and allow us to analyse the case further.
+size_t zim::ClusterMemorySize::get_cluster_size(const Cluster& cluster) {
+  if (!cluster.m_memorySize) {
+    auto base_struct = sizeof(Cluster);
+    auto offsets_size = sizeof(offset_t) * cluster.m_blobOffsets.size();
+    auto readers_size = cluster.m_blobOffsets.back().v / 2;
+    cluster.m_streamSize = cluster.m_reader->getSize();
+    cluster.m_memorySize = base_struct + offsets_size + readers_size + cluster.m_streamSize;
+  }
+  auto streamSize = cluster.m_reader->getSize();
+  if (streamSize != cluster.m_streamSize) {
+    std::cerr << "WARNING: stream size has changed from " << cluster.m_streamSize << " to " << streamSize << std::endl;
+    std::cerr << "Please open an issue on https://github.com/openzim/libzim/issues with this message and the zim file you use" << std::endl;
+  }
+  return cluster.m_memorySize;
+}
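To make the estimate concrete, here is a standalone sketch of the same formula. This is a hypothetical helper, not code from this PR: the function name, the 200-byte figure standing in for sizeof(Cluster), and the example numbers are illustrative assumptions.

#include <cstddef>
#include <cstdint>

// Standalone illustration of the estimate computed by get_cluster_size.
// `uncompressedSize` plays the role of m_blobOffsets.back().v and
// `streamSize` the role of m_reader->getSize().
size_t estimateClusterMemory(size_t blobCount,
                             uint64_t uncompressedSize,
                             size_t streamSize) {
  const size_t base_struct = 200;                           // assumed sizeof(Cluster)
  const size_t offsets_size = sizeof(uint64_t) * blobCount; // one offset per blob
  const size_t readers_size = uncompressedSize / 2;         // "half the readers exist"
  return base_struct + offsets_size + readers_size + streamSize;
}

// Example: a cluster holding 100 blobs totalling 1 MiB of uncompressed data,
// with a 128 KiB decompression stream, is charged roughly
//   200 + 800 + 524288 + 131072 = 656360 bytes (~641 KiB)
// against the cache's memory budget.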
Comment on lines +185 to +213:

I think that for our next revision of the ZIM file format we should consider including the uncompressed size of a compressed cluster in the cluster header.

Also, though this is not the best place for discussing such an idea, I propose to think about the possible benefits of introducing an item data cache. In some scenarios, clusters may be too large a unit of caching. Consider a usage pattern where multiple relatively small items, all from different clusters, are constantly used. The total memory consumed by those items can be orders of magnitude smaller than the size of their clusters, and if the latter exceeds the cluster cache limit, libzim may keep thrashing for no good reason.

Reply:

This is what I have done in Jubako.
Second changed file (the Cluster header):
@@ -36,6 +36,8 @@ namespace zim
   class Blob;
   class Reader;
   class IStreamReader;
+  class Cluster;
+
   class LIBZIM_PRIVATE_API Cluster : public std::enable_shared_from_this<Cluster> {
     typedef std::vector<offset_t> BlobOffsets;

@@ -70,6 +72,8 @@ namespace zim
     mutable std::mutex m_readerAccessMutex;
     mutable BlobReaders m_blobReaders;
+    mutable size_t m_memorySize;
+    mutable size_t m_streamSize;

     template<typename OFFSET_TYPE>

@@ -91,6 +95,15 @@ namespace zim
     Blob getBlob(blob_index_t n, offset_t offset, zsize_t size) const;

     static std::shared_ptr<Cluster> read(const Reader& zimReader, offset_t clusterOffset);
+    friend struct ClusterMemorySize;
   };

+  struct ClusterMemorySize {
+    static size_t cost(const std::shared_ptr<const Cluster>& cluster) {
+      return get_cluster_size(*cluster);
+    }
+
+    static size_t get_cluster_size(const Cluster& cluster);
+  };
 }

Inline comment on the added friend struct ClusterMemorySize; line: Isn't it better to have a public […]
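The cost() member in the struct above is the hook that lets a cache evict by memory budget instead of by entry count. For context, here is a minimal sketch of such a cost-limited LRU cache; this is not libzim's actual cache class (whose real interface may differ), only an illustration of how a per-entry cost policy like ClusterMemorySize plugs in:

#include <cstddef>
#include <list>
#include <unordered_map>

// Minimal cost-limited LRU cache sketch. CostEstimate must expose a static
// cost(const Value&) function, matching the shape of ClusterMemorySize::cost.
// Eviction subtracts the cost recorded at insertion time, which is why
// get_cluster_size must return a constant value for a given cluster.
// (For brevity, this sketch assumes each key is inserted only once.)
template<typename Key, typename Value, typename CostEstimate>
class CostLimitedCache {
  struct Entry {
    Value value;
    size_t cost;
  };

public:
  explicit CostLimitedCache(size_t maxCost) : m_maxCost(maxCost) {}

  void put(const Key& key, Value value) {
    const size_t cost = CostEstimate::cost(value);
    m_lruOrder.push_front(key);
    m_entries.emplace(key, Entry{std::move(value), cost});
    m_currentCost += cost;
    // Evict least-recently-used entries until we fit in the budget,
    // always keeping at least the entry we just added.
    while (m_currentCost > m_maxCost && m_lruOrder.size() > 1) {
      const Key victim = m_lruOrder.back();
      m_lruOrder.pop_back();
      m_currentCost -= m_entries.at(victim).cost;
      m_entries.erase(victim);
    }
  }

private:
  size_t m_maxCost;
  size_t m_currentCost = 0;
  std::list<Key> m_lruOrder;
  std::unordered_map<Key, Entry> m_entries;
};

Such a cache would be instantiated along the lines of CostLimitedCache<cluster_index_t, std::shared_ptr<const Cluster>, ClusterMemorySize> cache(512 * 1024 * 1024). The constancy requirement stressed in the get_cluster_size comment is exactly what keeps m_currentCost accurate across insertions and evictions.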
Comment:

If you want to be that precise in your approximate estimation of the memory usage by the cluster, shouldn't you multiply readers_size by the additional memory usage of an individual blob reader?
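A refinement along the lines suggested here might also charge a fixed bookkeeping overhead per expected blob reader, on top of the decompressed data each reader exposes. A sketch, where PER_READER_OVERHEAD is a made-up constant that would have to be measured against the real blob reader type:

// Hypothetical refinement of the readers_size term: still assume half the
// blob readers get created, but charge each of them a fixed overhead in
// addition to its share of the decompressed data.
constexpr size_t PER_READER_OVERHEAD = 64; // bytes per reader (assumed, unmeasured)

auto readers_size = cluster.m_blobOffsets.back().v / 2
                  + (cluster.m_blobOffsets.size() / 2) * PER_READER_OVERHEAD;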