feat(cv): expose cache_bytes_limit #905
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@            Coverage Diff            @@
##               main      #905   +/-  ##
==========================================
  Coverage   100.00%   100.00%
==========================================
  Files          142       142
  Lines         6110      6111     +1
==========================================
+ Hits          6110      6111     +1

☔ View full report in Codecov by Sentry.
I don't think there is a good way around this, but we should note somewhere that if you initially initialise a CV with some cache byte limit (or the default), reinitialising it later will not change the byte limit.
Where should this note be placed?
A little more complicated than just a note, but I think lines 40-41 of zetta_utils/layer/volumetric/cloudvol/backend.py should probably check the cached backend's cache_bytes_limit.
Actually, how does that scenario happen?
Ah, I see the
I don't think it's massively better, but I still think just checking whether the value is the same is a little safer. We don't want a user who doesn't know the internal workings to accidentally reinitialise, thinking they're just resizing the cache, when they're actually making a fresh copy with a different cache size and a separate cache, and then having to download everything all over again.
Just raising an error can be too restrictive, I think. I can imagine scenarios where the user wants to specify one thing but a routine elsewhere opens the same path with the default value, forcing the user to use the default just to avoid errors.
Maybe the best solution is to check and resize the LRU cache to the smallest value seen? Seems easy to do. Additionally,
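A minimal sketch of that check-and-shrink idea, assuming a module-level backend cache keyed by path and a backend whose byte limit can be lowered in place without evicting already-cached data; the names `get_backend` and `FakeBackend` are illustrative stand-ins, not the actual zetta_utils classes:

```python
import logging

logger = logging.getLogger(__name__)


class FakeBackend:
    """Stand-in for the real CloudVolume-backed backend (illustrative only)."""

    def __init__(self, path: str, cache_bytes_limit: int) -> None:
        self.path = path
        self.cache_bytes_limit = cache_bytes_limit


# Module-level cache of already-constructed backends, keyed by path.
_backend_cache: dict[str, FakeBackend] = {}


def get_backend(path: str, cache_bytes_limit: int) -> FakeBackend:
    """Return the cached backend for `path`, shrinking its byte limit if needed."""
    cached = _backend_cache.get(path)
    if cached is None:
        cached = FakeBackend(path, cache_bytes_limit=cache_bytes_limit)
        _backend_cache[path] = cached
    elif cache_bytes_limit < cached.cache_bytes_limit:
        # Shrink to the smallest limit seen so far instead of raising, so a caller
        # passing the default does not break a caller with a tighter memory budget.
        logger.info(
            "Shrinking cache_bytes_limit for %s: %d -> %d",
            path,
            cached.cache_bytes_limit,
            cache_bytes_limit,
        )
        cached.cache_bytes_limit = cache_bytes_limit  # assumes an in-place resize
    return cached
```

Whether the underlying CloudVolume LRU can actually be resized without evicting entries is the open question raised below.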
If the CV LRU is resizeable without evicting the cache, we should definitely do that. @supersergiy Do you have thoughts on exposing the number of cached CVs as an env variable?
Can we just make the cloudvolume limit really large, like 256? We don't want to cycle layers in and out of memory for any of our existing use cases.
Seems like a good idea, but maybe more like 1024? Off the top of my head, the current EDC spec already uses close to 256 vols: 6 merge thresholds * 4 layers (seg, gt seg, aff or emb, mask) * 9 cutouts = 216. I wonder if there's a drawback to making the LRU really big?
It sounds like you've got a huge training dataset. Are you sure all of this will fit into the workers' memory? Because if not, the LRU cache will not really be helpful here.
I'm running it with no LRU cache (but with disk cache). I guess I don't know whether making a new
Ohh, I get it now. I think 1024 is an absolutely reasonable limit. Until we actually start running OOM because of it, we shouldn't worry too much about evicting CloudVolumes from the cache. Could shoot for even more than 1024.
Then I'll go with 2048 for future-proofing :). I'll also use RRCache instead of LRU so it doesn't update a (potentially very long) list on every access (https://github.com/tkem/cachetools/blob/master/src/cachetools/__init__.py#L210-L214). We could also think about refactoring it so that
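A sketch of that direction using cachetools (the library linked above), with the cache size read from an env variable; the env variable name `ZETTA_MAX_CACHED_CVS` and the `get_cv` helper are illustrative assumptions, not the actual zetta_utils interface:

```python
import os
import random

from cachetools import RRCache, cached
from cachetools.keys import hashkey
from cloudvolume import CloudVolume

# ZETTA_MAX_CACHED_CVS is a hypothetical env variable name; 2048 matches the
# limit floated above.
_MAX_CACHED_CVS = int(os.environ.get("ZETTA_MAX_CACHED_CVS", "2048"))

# RRCache evicts a random entry on overflow instead of tracking access order,
# so cache hits don't have to reorder a potentially very long internal list.
_cv_cache: RRCache = RRCache(maxsize=_MAX_CACHED_CVS, choice=random.choice)


@cached(_cv_cache, key=lambda path, **kwargs: hashkey(path))
def get_cv(path: str, **kwargs) -> CloudVolume:
    """Construct, or fetch from the cache, the CloudVolume handle for `path`."""
    return CloudVolume(path, **kwargs)
```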
It seems doable. CV provides the ... The only complication of this scheme is that ...
cache_bytes_limit was not exposed in build_cv_layer like it is with ts. This is necessary for trainings not to go OOM.
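For context, a usage sketch of what this change enables once cache_bytes_limit is exposed; the import path, bucket path, and the 512 MiB figure are illustrative assumptions:

```python
from zetta_utils.layer.volumetric.cloudvol import build_cv_layer

# Cap the in-memory CloudVolume cache at 512 MiB for this layer.
# The bucket path and the 512 MiB value are placeholders.
layer = build_cv_layer(
    path="gs://my-bucket/my-dataset",
    cache_bytes_limit=512 * 1024**2,
)
```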