Server admin tool: add command to purge unreferenced Objs #9753

Draft
wants to merge 2 commits into
base: main

snazy
Member

@snazy snazy commented Oct 14, 2024

Adds a new command to access the implementation provided by #9688

@snazy snazy added this to the 0.100.0 milestone Oct 14, 2024
@snazy snazy force-pushed the admin-tool-purge-unreferenced branch 3 times, most recently from db40395 to ad9bb6d Compare October 15, 2024 12:42
@snazy snazy force-pushed the admin-tool-purge-unreferenced branch 2 times, most recently from 942da29 to d5cd449 Compare November 4, 2024 17:05
This is an attempt to implement the algorithm mentioned in the PR projectnessie#9401.

The `Obj.referenced()` attribute contains the timestamp when the object was last "referenced" (aka: attempted to be written). It is ...
* set when an object is first persisted via a `storeObj()`
* updated in the database when `storeObj()` finds that the object already exists and therefore does not persist it again
* set/updated via `upsertObj()`
* updated via `updateConditional()`
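The rules above can be modeled with a small in-memory sketch. It is only an illustration: `InMemoryPersist`, `StoredObj`, and the snake_case method names are hypothetical stand-ins for the real `Persist` API, timestamps are plain integers, and the compare-by-payload condition in `update_conditional` is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StoredObj:
    payload: bytes
    referenced: int  # timestamp of the last write attempt


class InMemoryPersist:
    """Hypothetical stand-in for a database backend, not the real API."""

    def __init__(self) -> None:
        self.objs: Dict[str, StoredObj] = {}

    def store_obj(self, obj_id: str, payload: bytes, now: int) -> bool:
        """Returns True if the object was newly persisted."""
        existing = self.objs.get(obj_id)
        if existing is None:
            # First write: `referenced` is set together with the object.
            self.objs[obj_id] = StoredObj(payload, now)
            return True
        # Object already exists: it is not persisted again,
        # only its `referenced` timestamp is updated.
        existing.referenced = now
        return False

    def upsert_obj(self, obj_id: str, payload: bytes, now: int) -> None:
        # Insert or overwrite; `referenced` is set/updated either way.
        self.objs[obj_id] = StoredObj(payload, now)

    def update_conditional(
        self, obj_id: str, expected: bytes, new_payload: bytes, now: int
    ) -> bool:
        """Assumed condition: the stored payload must equal `expected`."""
        existing = self.objs.get(obj_id)
        if existing is None or existing.payload != expected:
            return False
        existing.payload = new_payload
        existing.referenced = now
        return True
```

In this model every successful write path leaves `referenced` at the time of the most recent write attempt, which is the property the purge algorithm relies on.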

Let's assume that there is a mechanism to identify the IDs of all referenced objects (it would be very similar to what the export functionality does). The algorithm to purge unreferenced objects must never delete an object that is referenced at any point in time, and must handle the case of an object that was unreferenced when the purge-unreferenced-objects routine started but became referenced while the routine is running.

An approach could work as follows:

1. Memoize the current timestamp (minus some wall-clock drift adjustment).
2. Identify the IDs of all referenced objects. We could leverage a bloom filter, if the set of IDs is big.
3. Then scan all objects in the repository. Objects can be purged, if ...
    * the ID is not in the set (or bloom filter) generated in step 2 ...
    * _AND_ the `referenced` timestamp is less than the memoized timestamp.

Any deletion in the backing database would follow the meaning of this pseudo SQL: `DELETE FROM objs WHERE obj_id = :objId AND referenced < :memoizedTimestamp`.
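As a rough sketch of the three steps and the conditional delete, the following models the object store as a plain dict from object ID to `referenced` timestamp. All names here (`purge_unreferenced`, `delete_if_older`, `find_referenced_ids`, `clock`, `drift_allowance`) are hypothetical, and a Python `set` stands in for the optional bloom filter.

```python
from typing import Callable, Dict, Iterable, List


def delete_if_older(objs: Dict[str, int], obj_id: str, cutoff: int) -> bool:
    """Models: DELETE FROM objs WHERE obj_id = :objId AND referenced < :cutoff."""
    referenced = objs.get(obj_id)
    if referenced is not None and referenced < cutoff:
        del objs[obj_id]
        return True
    return False


def purge_unreferenced(
    objs: Dict[str, int],
    find_referenced_ids: Callable[[], Iterable[str]],
    clock: Callable[[], int],
    drift_allowance: int,
) -> List[str]:
    # Step 1: memoize the timestamp BEFORE identifying referenced objects.
    cutoff = clock() - drift_allowance
    # Step 2: collect the IDs of all referenced objects.
    live = set(find_referenced_ids())
    # Step 3: scan all objects; purge only those that are not live
    # and whose `referenced` timestamp is older than the cutoff.
    purged: List[str] = []
    for obj_id in list(objs):
        if obj_id in live:
            continue
        if delete_if_older(objs, obj_id, cutoff):
            purged.append(obj_id)
    return purged
```

A short usage example under the same assumptions: with a clock value of 100 and a drift allowance of 10, an unreferenced object written at 95 survives, because it may have been written after the referenced set was computed.

```python
objs = {"live": 5, "stale": 5, "fresh": 95}
purged = purge_unreferenced(objs, lambda: ["live"], lambda: 100, 10)
# purged == ["stale"]; "live" and "fresh" remain in objs
```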

Note that the `referenced` attribute is likely inaccurate when retrieved from the objects cache (i.e. during normal operations). This is not a problem, because the `referenced` attribute is irrelevant for production accesses.

There are two edge cases / race conditions:
* (for some backends): A `storeObj()` operation detects that the object already exists, then the purge routine deletes that object, and then the `storeObj()` tries to update the `referenced` attribute. The result is the loss of that object. This race condition can only occur if the object existed but was not referenced.
* While the referenced objects are being identified, a new named reference (branch / tag) is created that points to commit(s) that would be identified as unreferenced and later purged.
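The first race condition can be replayed deterministically as a sequence of steps. This is a sketch under assumptions: the store is a dict from object ID to `referenced` timestamp, the two-phase "check existence, then update" behavior of `storeObj()` is modeled by the hypothetical helpers `exists` and `bump_referenced`, and the values 5, 90, and 100 are arbitrary.

```python
from typing import Dict, NamedTuple


def exists(objs: Dict[str, int], obj_id: str) -> bool:
    return obj_id in objs


def bump_referenced(objs: Dict[str, int], obj_id: str, now: int) -> bool:
    """Models: UPDATE objs SET referenced = :now WHERE obj_id = :objId.
    Returns True only if a row was actually updated."""
    if obj_id in objs:
        objs[obj_id] = now
        return True
    return False


def conditional_delete(objs: Dict[str, int], obj_id: str, cutoff: int) -> bool:
    """Models the purge routine's conditional DELETE."""
    if obj_id in objs and objs[obj_id] < cutoff:
        del objs[obj_id]
        return True
    return False


class RaceOutcome(NamedTuple):
    writer_saw_existing: bool
    purged: bool
    referenced_updated: bool
    object_present: bool


def replay_race() -> RaceOutcome:
    objs = {"c1": 5}  # the object exists, is unreferenced, and is old
    cutoff = 90       # timestamp memoized by the purge routine

    # Writer, phase 1: storeObj() sees the object and skips the insert.
    saw = exists(objs, "c1")
    # Purge routine: the object is not in the referenced set and is old.
    purged = conditional_delete(objs, "c1", cutoff)
    # Writer, phase 2: tries to update `referenced`, but the row is gone.
    updated = bump_referenced(objs, "c1", 100)

    return RaceOutcome(saw, purged, updated, "c1" in objs)
```

In this interleaving the writer believes the object is stored, while the store no longer contains it.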
Adds a new command to access the implementation provided by projectnessie#9688
@snazy snazy force-pushed the admin-tool-purge-unreferenced branch from d5cd449 to 7a79323 Compare November 11, 2024 13:35
@snazy snazy modified the milestones: 0.100.0, 0.101.0 Nov 11, 2024