Persistence: purge unreferenced Objs #9688

Open
wants to merge 1 commit into
base: main

Commits on Nov 11, 2024

  1. Persistence: purge unreferenced Objs

    This is an attempt to implement the algorithm mentioned in the PR projectnessie#9401.
    
    The `Obj.referenced()` attribute contains the timestamp when the object was last "referenced" (aka: attempted to be written). It is ...
    * set when an object is first persisted via a `storeObj()`
    * updated in the database when `storeObj()` finds that the object already exists and therefore does not persist it again
    * set/updated via `upsertObj()`
    * updated via `updateConditional()`
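
    The rules above can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the class `ReferencedSketch`, the string IDs, the caller-supplied `now` value and the simplified method signatures are hypothetical and are not Nessie's actual persistence API.

    ```java
    import java.util.HashMap;
    import java.util.Map;

    /** Hypothetical in-memory stand-in for a persistence backend; not the real API. */
    final class ReferencedSketch {
      /** Maps an object ID to its `referenced` timestamp (unit chosen by the caller). */
      private final Map<String, Long> referencedById = new HashMap<>();

      /** Returns true if the object was newly persisted, false if it already existed. */
      boolean storeObj(String id, long now) {
        boolean created = !referencedById.containsKey(id);
        // Either way `referenced` ends up at `now`: set on first write, bumped otherwise.
        referencedById.put(id, now);
        return created;
      }

      /** Unconditional write: always sets or updates `referenced`. */
      void upsertObj(String id, long now) {
        referencedById.put(id, now);
      }

      /** Bumps `referenced` only if the object exists; the actual update condition is elided. */
      boolean updateConditional(String id, long now) {
        return referencedById.replace(id, now) != null;
      }

      /** Returns the stored `referenced` timestamp, or null if the object does not exist. */
      Long referenced(String id) {
        return referencedById.get(id);
      }
    }
    ```

    In this sketch, a second `storeObj()` for the same ID returns `false` but still moves `referenced` forward, which is the property the purge algorithm relies on.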
    
    Let's assume that there is a mechanism to identify the IDs of all referenced objects (it would be very similar to what the export functionality does). The algorithm to purge unreferenced objects must never delete an object that is referenced at any point in time, and must handle the case of an object that was unreferenced when the purge-unreferenced-objects routine started but became referenced while the routine was running.
    
    An approach could work as follows:
    
    1. Memoize the current timestamp (minus some wall-clock drift adjustment).
    2. Identify the IDs of all referenced objects. We could leverage a bloom filter, if the set of IDs is big.
    3. Then scan all objects in the repository. Objects can be purged, if ...
        * the ID is not in the set (or bloom filter) generated in step 2 ...
        * _AND_ the `referenced` timestamp is less than the memoized timestamp.
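
    The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `PurgeScanSketch`, `IdFilter`, `cutoff` and `purgeCandidates` are hypothetical names, the repository is modeled as a plain map from ID to `referenced` timestamp, and the hand-rolled Bloom filter only stands in for whatever filter a real implementation would use. A Bloom filter has no false negatives, so a referenced ID is never reported as absent; a false positive only means that an unreferenced object survives this purge run.

    ```java
    import java.util.ArrayList;
    import java.util.BitSet;
    import java.util.List;
    import java.util.Map;

    /** Hypothetical sketch of the purge scan; names and types are illustrative only. */
    final class PurgeScanSketch {

      /** Tiny Bloom filter over string IDs, using double hashing to derive the bit indexes. */
      static final class IdFilter {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        IdFilter(int size, int hashCount) {
          this.bits = new BitSet(size);
          this.size = size;
          this.hashCount = hashCount;
        }

        void add(String id) {
          for (int i = 0; i < hashCount; i++) {
            bits.set(index(id, i));
          }
        }

        /** False means "definitely not added"; true means "possibly added". */
        boolean mightContain(String id) {
          for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(id, i))) {
              return false;
            }
          }
          return true;
        }

        private int index(String id, int i) {
          int h1 = id.hashCode();
          int h2 = ((h1 * 31) ^ (h1 >>> 16)) | 1; // second hash, forced to be odd
          return Math.floorMod(h1 + i * h2, size);
        }
      }

      /** Step 1: memoize "now" minus an allowance for wall-clock drift. */
      static long cutoff(long now, long driftAllowance) {
        return now - driftAllowance;
      }

      /** Step 3: IDs that are not in the filter AND whose `referenced` is older than the cutoff. */
      static List<String> purgeCandidates(
          Map<String, Long> referencedById, IdFilter referencedIds, long cutoff) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Long> e : referencedById.entrySet()) {
          if (!referencedIds.mightContain(e.getKey()) && e.getValue() < cutoff) {
            candidates.add(e.getKey());
          }
        }
        return candidates;
      }
    }
    ```

    Step 2 corresponds to calling `add()` for every ID found while walking the referenced objects. The returned list holds candidates only; the `referenced` check has to be repeated by the delete itself.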
    
    Any deletion in the backing database would follow the meaning of this pseudo SQL: `DELETE FROM objs WHERE obj_id = :objId AND referenced < :memoizedTimestamp`.
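
    For a backend without SQL, the same conditional delete could look like the sketch below. It assumes a hypothetical in-memory store (`ConditionalDeleteSketch`, `touch`, `deleteIfReferencedBefore` are made-up names) and relies on `ConcurrentHashMap.computeIfPresent`, which evaluates the condition and removes the entry atomically, so a `referenced` bump that lands before the delete is honored.

    ```java
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    /** Hypothetical store that mirrors the pseudo SQL above; not a real backend. */
    final class ConditionalDeleteSketch {
      private final ConcurrentMap<String, Long> referencedById = new ConcurrentHashMap<>();

      /** Sets or bumps the `referenced` timestamp of an object. */
      void touch(String id, long referenced) {
        referencedById.put(id, referenced);
      }

      /**
       * DELETE FROM objs WHERE obj_id = :objId AND referenced < :memoizedTimestamp.
       * Returns true only if the object existed and was deleted.
       */
      boolean deleteIfReferencedBefore(String id, long memoizedTimestamp) {
        boolean[] deleted = {false};
        referencedById.computeIfPresent(id, (key, referenced) -> {
          if (referenced < memoizedTimestamp) {
            deleted[0] = true;
            return null; // returning null removes the mapping
          }
          return referenced; // touched at or after the cutoff: keep the object
        });
        return deleted[0];
      }

      boolean exists(String id) {
        return referencedById.containsKey(id);
      }
    }
    ```

    An object that was touched after the memoized timestamp survives even if the scan had already selected it as a candidate.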
    
    Note that the `referenced` attribute is likely inaccurate when an object is retrieved from the objects cache (that is, during normal operations). This is not a problem, because the `referenced` attribute is irrelevant for production accesses.
    
    There are two edge cases / race conditions:
    * (for some backends): A `storeObj()` operation detects that the object already exists, then the purge routine deletes that object, and then the `storeObj()` tries to update the `referenced` attribute. The result is the loss of that object. This race condition can only occur if the object existed but was not referenced.
    * While the referenced objects are being identified, a new named reference (branch / tag) is created that points to commit(s) that would be identified as unreferenced and later purged.
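
    The first race can be replayed deterministically in a single thread. The sketch below is only an illustration: splitting `storeObj()` into an "exists" phase and a "bump `referenced`" phase is an assumed decomposition for backends that behave this way, and `StoreObjRaceSketch` with all of its methods is hypothetical.

    ```java
    import java.util.HashMap;
    import java.util.Map;

    /** Hypothetical replay of the storeObj()/purge interleaving; not real backend code. */
    final class StoreObjRaceSketch {
      final Map<String, Long> referencedById = new HashMap<>();

      /** First phase of storeObj(): check whether the object already exists. */
      boolean existsPhase(String id) {
        return referencedById.containsKey(id);
      }

      /** Second phase of storeObj(): bump `referenced`; returns the number of rows updated. */
      int bumpReferencedPhase(String id, long now) {
        return referencedById.replace(id, now) != null ? 1 : 0;
      }

      /** Purge: delete the object if its `referenced` timestamp is older than the cutoff. */
      boolean purge(String id, long cutoff) {
        Long referenced = referencedById.get(id);
        if (referenced != null && referenced < cutoff) {
          referencedById.remove(id);
          return true;
        }
        return false;
      }

      /** Replays the problematic interleaving; returns whether the object survived. */
      static boolean replay() {
        StoreObjRaceSketch db = new StoreObjRaceSketch();
        db.referencedById.put("obj", 100L);              // existing but unreferenced object
        boolean exists = db.existsPhase("obj");          // writer: already exists, skip the insert
        db.purge("obj", 500L);                           // purge: 100 < 500, object is deleted
        int rows = db.bumpReferencedPhase("obj", 600L);  // writer: update touches 0 rows
        return exists && rows == 1;
      }
    }
    ```

    In this replay the update count of the second phase is zero, which is the only signal the writer gets that the object disappeared between the two phases.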
    snazy committed Nov 11, 2024
    defb191