
feat(cv): expose cache_bytes_limit #905

Open
wants to merge 1 commit into base: main
Conversation

@trivoldus28 (Contributor)

cache_bytes_limit was not exposed in build_cv_layer the way it is for the ts backend. Exposing it is necessary so that trainings don't go OOM.
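For context, a usage sketch of what exposing the parameter enables; the import path and the rest of the call are illustrative assumptions, and only cache_bytes_limit itself comes from this PR:

```python
# Hypothetical usage sketch: the exact build_cv_layer signature may differ;
# only the cache_bytes_limit kwarg is what this PR adds.
from zetta_utils.layer.volumetric.cloudvol import build_cv_layer  # import path assumed

layer = build_cv_layer(
    path="gs://my-bucket/my-dataset/img",  # placeholder path
    cache_bytes_limit=2 * 1024**3,         # cap the backend's in-memory cache at ~2 GiB
)
```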


codecov bot commented Feb 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (163b4fb) to head (fa1dd08).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #905   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          142       142           
  Lines         6110      6111    +1     
=========================================
+ Hits          6110      6111    +1     

☔ View full report in Codecov by Sentry.

@dodamih (Collaborator) left a comment

I don't think there is a good way around this, but we should note somewhere that if you initially initialise a CV with some cache byte limit (or the default), then trying to reinitialise it will not change the byte limit.

@trivoldus28 (Contributor, Author)

Where should this note be placed?

@dodamih (Collaborator) commented Feb 17, 2025

A little more complicated than just a note, but I think lines 40-41 of zetta_utils/layer/volumetric/cloudvol/backend.py should probably check the cached backend's cache_bytes_limit and raise a ValueError if it differs from what's being requested. Better to be explicit.
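A minimal sketch of that check: the _cv_cache size and the _get_cv_cached name come from this thread, while the helper function and the tuple-valued cache entries are illustrative assumptions rather than the actual backend code.

```python
import cachetools

# Module-level cache of CloudVolume handles (maxsize=16 per the current code).
_cv_cache = cachetools.LRUCache(maxsize=16)

def _make_cloudvolume(cloudpath, cache_bytes_limit, **kwargs):
    """Stand-in for the real CloudVolume construction (elided in this sketch)."""
    raise NotImplementedError

def _get_cv_cached(cloudpath, cache_bytes_limit, **kwargs):
    key = (cloudpath, frozenset(kwargs.items()))  # kwargs assumed hashable
    if key in _cv_cache:
        cvol, cached_limit = _cv_cache[key]
        if cached_limit != cache_bytes_limit:
            # Reinitialising would silently keep the old limit, so refuse explicitly.
            raise ValueError(
                f"'{cloudpath}' was already opened with cache_bytes_limit="
                f"{cached_limit}; cannot change it to {cache_bytes_limit} "
                "without constructing a fresh backend."
            )
        return cvol
    cvol = _make_cloudvolume(cloudpath, cache_bytes_limit, **kwargs)
    _cv_cache[key] = (cvol, cache_bytes_limit)
    return cvol
```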

@trivoldus28 (Contributor, Author)

Actually, how does that scenario happen? _get_cv_cached is a function internal to the file, and nothing that calls it changes cache_bytes_limit, which is only set at construction.

@trivoldus28 (Contributor, Author)

Ah, I see the _cv_cache code. If you want, I can add cache_bytes_limit to the cache key so that different limits just get different instances.
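For reference, the key-based alternative could look roughly like this; the function name is made up, and only cache_bytes_limit and _cv_cache come from the thread.

```python
# Hypothetical sketch: fold cache_bytes_limit into the cache key so that requests
# with different limits simply map to different CloudVolume instances.
def _cv_cache_key(cloudpath: str, cache_bytes_limit: int, **kwargs) -> tuple:
    return (cloudpath, cache_bytes_limit, frozenset(kwargs.items()))
```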

@dodamih (Collaborator) commented Feb 17, 2025

I don't think it's massively better, but I still think just checking whether the value is the same is a little safer. We don't want a user who doesn't know the internal workings to reinitialise accidentally, thinking they're just resizing the cache, when they're actually making a fresh copy with a different cache size and a separate cache, and then having to download everything all over again.

@trivoldus28 (Contributor, Author)

Just raising an error can be too restrictive, I think. I can imagine scenarios where the user wants to specify one value but a routine elsewhere opens the same path with the default, forcing the user to stick with the default just to avoid the error.

@trivoldus28 (Contributor, Author)

Maybe the best solution is to check and resize the LRU cache to the smallest value seen? Seems easy to do.

Additionally, cachetools.LRUCache(maxsize=16) should be customizable; I think my training script uses more than 16. Should we use an environment variable?
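A sketch of the env-variable idea; the variable name ZETTA_CV_CACHE_MAXSIZE is made up here, while cachetools.LRUCache(maxsize=16) is the current code per this thread.

```python
import os
import cachetools

# Keep the current default of 16, but allow overriding it per process.
_CV_CACHE_MAXSIZE = int(os.environ.get("ZETTA_CV_CACHE_MAXSIZE", "16"))
_cv_cache = cachetools.LRUCache(maxsize=_CV_CACHE_MAXSIZE)
```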

@dodamih (Collaborator) commented Feb 19, 2025

If the CV LRU is resizable without evicting its contents, we should definitely do that. @supersergiy, do you have thoughts on exposing the number of cached CVs as an env variable?

@supersergiy (Member)

Can we just make the CloudVolume limit really large, like 256? We don't want to cycle layers in and out of memory for any of our existing use cases.

@trivoldus28 (Contributor, Author)

Seems like a good idea, but maybe more like 1024? Off the top of my head, the current EDC spec already uses close to 256 vols:

6 merge thresholds * 4 layers (seg, gt seg, aff or emb, mask) * 9 cutouts = 216

I wonder if there's a drawback to making the LRU really big?

@supersergiy (Member)

It sounds like you've got a huge training dataset. Are you sure all of this will fit into the workers' memory? If not, the LRU cache will not really be helpful here.

@trivoldus28 (Contributor, Author)

I'm running it with no LRU cache (but with disk cache). I guess I don't know whether making a new CloudVolume() on every batch would be significant overhead. Probably not, in which case 256 is reasonable, and probably safer to roll out as a global change.

@supersergiy (Member)

Ohh, I get it now. I think 1024 is an absolutely reasonable limit. Until we actually start running OOM because of it, we shouldn't worry too much about evicting CloudVolumes from the cache. We could even shoot for more than 1024.

@trivoldus28 (Contributor, Author)

Then I'll go with 2048 for future-proofing :). I'll also use RRCache instead of LRU so it doesn't update a (potentially very long) list on every access (https://github.com/tkem/cachetools/blob/master/src/cachetools/__init__.py#L210-L214).

We could also think about refactoring it so that cachetools tracks sizes across all CVs and prevents accidental OOM altogether. From https://cachetools.readthedocs.io/en/latest/

In general, a cache’s size is the total size of its item’s values. Therefore, Cache provides a getsizeof() method, which returns the size of a given value.

It seems doable. CV provides the lru.nbytes interface to query how much it's currently holding (source, note: self.size is actually the specified max size).

The only complication of this scheme is that cachetools only updates item size on __setitem__ (src), so for accurate tracking you'd need to add extra _cv_cache[key] = cvol in backend.read/write() after data operations.
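A rough sketch of that byte-tracking scheme: the getsizeof hook and RRCache are real cachetools features, but the cvol.image.lru.nbytes attribute path, the byte budget, and the helper name are assumptions for illustration.

```python
import cachetools

def _cv_nbytes(cvol) -> int:
    # Bytes currently held by this CloudVolume's internal LRU (attribute path assumed).
    return cvol.image.lru.nbytes

# Random-replacement cache whose "size" is the summed in-memory bytes of all cached CVs.
_cv_cache = cachetools.RRCache(
    maxsize=8 * 1024**3,   # total in-memory budget across all cached CVs (e.g. ~8 GiB)
    getsizeof=_cv_nbytes,
)

def _refresh_cache_entry(key, cvol):
    # cachetools only recomputes an entry's size on __setitem__, so re-insert the
    # same object after read/write operations to keep the accounting accurate.
    _cv_cache[key] = cvol
```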
