
feat(cv): expose cache_bytes_limit #905

Open
wants to merge 1 commit into base: main
Conversation

@trivoldus28 (Contributor)

cache_bytes_limit was not exposed in build_cv_layer the way it is for the ts backend. Exposing it is necessary so that trainings don't go OOM.
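For context, a usage sketch of what exposing the parameter enables; the import path and the rest of the call are illustrative assumptions, and only cache_bytes_limit itself comes from this PR:

```python
# Hypothetical usage sketch: the exact build_cv_layer signature may differ;
# only the cache_bytes_limit kwarg is what this PR adds.
from zetta_utils.layer.volumetric.cloudvol import build_cv_layer  # import path assumed

layer = build_cv_layer(
    path="gs://my-bucket/my-dataset/img",  # placeholder path
    cache_bytes_limit=2 * 1024**3,         # cap the backend's in-memory cache at ~2 GiB
)
```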


codecov bot commented Feb 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (163b4fb) to head (fa1dd08).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #905   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          142       142           
  Lines         6110      6111    +1     
=========================================
+ Hits          6110      6111    +1     

☔ View full report in Codecov by Sentry.

@dodamih (Collaborator) left a comment

I don't think there is a good way around this, but we should note somewhere that if you initially initialise a CV with some cache byte limit (or the default), then trying to reinitialise it will not change the byte limit.

@trivoldus28 (Contributor, Author)

Where should this note be placed?

@dodamih (Collaborator) commented Feb 17, 2025

A little more complicated than just a note, but I think lines 40-41 of zetta_utils/layer/volumetric/cloudvol/backend.py should probably check the cached backend's cache_bytes_limit and raise a ValueError if it differs from what's being requested. Better to be explicit.
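A minimal sketch of that check: the _cv_cache size and the _get_cv_cached name come from this thread, while the helper function and the tuple-valued cache entries are illustrative assumptions rather than the actual backend code.

```python
import cachetools

# Module-level cache of CloudVolume handles (maxsize=16 per the current code).
_cv_cache = cachetools.LRUCache(maxsize=16)

def _make_cloudvolume(cloudpath, cache_bytes_limit, **kwargs):
    """Stand-in for the real CloudVolume construction (elided in this sketch)."""
    raise NotImplementedError

def _get_cv_cached(cloudpath, cache_bytes_limit, **kwargs):
    key = (cloudpath, frozenset(kwargs.items()))  # kwargs assumed hashable
    if key in _cv_cache:
        cvol, cached_limit = _cv_cache[key]
        if cached_limit != cache_bytes_limit:
            # Reinitialising would silently keep the old limit, so refuse explicitly.
            raise ValueError(
                f"'{cloudpath}' was already opened with cache_bytes_limit="
                f"{cached_limit}; cannot change it to {cache_bytes_limit} "
                "without constructing a fresh backend."
            )
        return cvol
    cvol = _make_cloudvolume(cloudpath, cache_bytes_limit, **kwargs)
    _cv_cache[key] = (cvol, cache_bytes_limit)
    return cvol
```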

@trivoldus28 (Contributor, Author)

Actually, how does that scenario happen? _get_cv_cached is a function internal to the file, and nothing that calls it changes cache_bytes_limit, which is only set at construction.

@trivoldus28 (Contributor, Author)

Ah, I see the _cv_cache code. If you want, I can add cache_bytes_limit to the cache key so that different limits just get different instances.
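For reference, the key-based alternative could look roughly like this; the function name is made up, and only cache_bytes_limit and _cv_cache come from the thread.

```python
# Hypothetical sketch: fold cache_bytes_limit into the cache key so that requests
# with different limits simply map to different CloudVolume instances.
def _cv_cache_key(cloudpath: str, cache_bytes_limit: int, **kwargs) -> tuple:
    return (cloudpath, cache_bytes_limit, frozenset(kwargs.items()))
```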

@dodamih (Collaborator) commented Feb 17, 2025

I don't think it's massively better, but I still think just checking whether the value is the same is a little safer. We don't want a user who doesn't know the internal workings to reinitialise accidentally, thinking they're just resizing the cache, when they're actually making a fresh copy with a different cache size and a separate cache, and then having to download everything all over again.

@trivoldus28 (Contributor, Author)

Just raising an error can be too restrictive, I think. I can imagine scenarios where the user wants to specify one value but a routine elsewhere opens the same path with the default, forcing the user to stick with the default just to avoid the error.

@trivoldus28 (Contributor, Author)

Maybe the best solution is to check and resize the LRU cache to the smallest value seen? Seems easy to do.

Additionally, cachetools.LRUCache(maxsize=16) should be customizable; I think my training script uses more than 16. Should we use an environment variable?
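A sketch of the env-variable idea; the variable name ZETTA_CV_CACHE_MAXSIZE is made up here, while cachetools.LRUCache(maxsize=16) is the current code per this thread.

```python
import os
import cachetools

# Keep the current default of 16, but allow overriding it per process.
_CV_CACHE_MAXSIZE = int(os.environ.get("ZETTA_CV_CACHE_MAXSIZE", "16"))
_cv_cache = cachetools.LRUCache(maxsize=_CV_CACHE_MAXSIZE)
```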

@dodamih (Collaborator) commented Feb 19, 2025

If the CV LRU is resizable without evicting its contents, we should definitely do that. @supersergiy, do you have thoughts on exposing the number of cached CVs as an env variable?

@supersergiy (Member)

Can we just make the CloudVolume limit really large, like 256? We don't want to cycle layers in and out of memory for any of our existing use cases.

@trivoldus28 (Contributor, Author)

Seems like a good idea, but maybe more like 1024? Off the top of my head, the current EDC spec already uses close to 256 vols:

6 merge thresholds * 4 layers (seg, gt seg, aff or emb, mask) * 9 cutouts = 216

I wonder if there's a drawback to making the LRU really big?

@supersergiy (Member)

It sounds like you've got a huge training dataset. Are you sure all of this will fit into the workers' memory? If not, the LRU cache will not really be helpful here.

@trivoldus28 (Contributor, Author)

I'm running it with no LRU cache (but with disk cache). I guess I don't know whether making a new CloudVolume() on every batch would be significant overhead. Probably not, in which case 256 is reasonable, and probably safer to roll out as a global change.

@supersergiy (Member)

Ohh, I get it now. I think 1024 is an absolutely reasonable limit. Until we actually start running OOM because of it, we shouldn't worry too much about evicting CloudVolumes from the cache. We could even shoot for more than 1024.

@trivoldus28 (Contributor, Author)

Then I'll go with 2048 for future-proofing :). I'll also use RRCache instead of LRU so it doesn't update a (potentially very long) list on every access (https://github.com/tkem/cachetools/blob/master/src/cachetools/__init__.py#L210-L214).

We could also think about refactoring it so that cachetools tracks sizes across all CVs and prevents accidental OOM altogether. From https://cachetools.readthedocs.io/en/latest/

In general, a cache’s size is the total size of its item’s values. Therefore, Cache provides a getsizeof() method, which returns the size of a given value.

It seems doable. CV provides the lru.nbytes interface to query how much it's currently holding (source, note: self.size is actually the specified max size).

The only complication of this scheme is that cachetools only updates item size on __setitem__ (src), so for accurate tracking you'd need to add extra _cv_cache[key] = cvol in backend.read/write() after data operations.
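A rough sketch of that byte-tracking scheme: the getsizeof hook and RRCache are real cachetools features, but the cvol.image.lru.nbytes attribute path, the byte budget, and the helper name are assumptions for illustration.

```python
import cachetools

def _cv_nbytes(cvol) -> int:
    # Bytes currently held by this CloudVolume's internal LRU (attribute path assumed).
    return cvol.image.lru.nbytes

# Random-replacement cache whose "size" is the summed in-memory bytes of all cached CVs.
_cv_cache = cachetools.RRCache(
    maxsize=8 * 1024**3,   # total in-memory budget across all cached CVs (e.g. ~8 GiB)
    getsizeof=_cv_nbytes,
)

def _refresh_cache_entry(key, cvol):
    # cachetools only recomputes an entry's size on __setitem__, so re-insert the
    # same object after read/write operations to keep the accounting accurate.
    _cv_cache[key] = cvol
```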
