Persistence: purge unreferenced Objs #9688

Open
wants to merge 1 commit into main

@snazy snazy commented Oct 2, 2024

This is an attempt to implement the algorithm mentioned in the PR #9401.

The Obj.referenced() attribute contains the timestamp at which the object was last "referenced" (that is, when a write was last attempted). It is ...

  • set when an object is first persisted via storeObj()
  • updated in the database when storeObj() did not persist the object because it already exists
  • set/updated via upsertObj()
  • updated via updateConditional()

Let's assume that there is a mechanism to identify the IDs of all referenced objects (it would be very similar to what the export functionality does). The algorithm to purge unreferenced objects must never delete an object that is referenced at any point in time. It must also handle the case in which an object was unreferenced when the purge-unreferenced-objects routine started but became referenced while the routine was running.

An approach could work as follows:

  1. Memoize the current timestamp (minus some wall-clock drift adjustment).
  2. Identify the IDs of all referenced objects. We could leverage a bloom filter, if the set of IDs is big.
  3. Then scan all objects in the repository. An object can be purged if ...
        * its ID is not in the set (or bloom filter) generated in step 2 ...
        * AND its referenced timestamp is less than the memoized timestamp.
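
A bloom filter fits step 2 because it can report false positives but never false negatives: an unreferenced object may occasionally be kept for another purge run, but a referenced object is never reported as absent. The following is a minimal sketch of that property only. The class name `ObjIdBloomFilter`, the use of `String` IDs, and the hashing scheme are illustrative assumptions, not code from this PR.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical, minimal bloom filter over object IDs (IDs modeled as strings). */
final class ObjIdBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  ObjIdBloomFilter(int numBits, int numHashes) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  /** Records a referenced ID by setting numHashes bit positions. */
  void add(String objId) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(objId, i));
    }
  }

  /**
   * Returns false only if the ID was definitely never added.
   * A true result may be a false positive, which merely keeps an object longer.
   */
  boolean mightContain(String objId) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(objId, i))) {
        return false;
      }
    }
    return true;
  }

  /** Derives the i-th bit position from two base hashes (double hashing). */
  private int index(String objId, int i) {
    int h1 = objId.hashCode();
    int h2 = fnv1a(objId);
    return Math.floorMod(h1 + i * h2, numBits);
  }

  /** 32-bit FNV-1a hash, used here only as a second, independent-ish hash. */
  private static int fnv1a(String s) {
    int h = 0x811c9dc5;
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      h ^= (b & 0xff);
      h *= 0x01000193;
    }
    return h;
  }
}
```

In the purge scan, an object would be a deletion candidate only when `mightContain` returns false for its ID.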

Any deletion in the backing database would have the semantics of this pseudo SQL: DELETE FROM objs WHERE obj_id = :objId AND referenced < :memoizedTimestamp.
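
The three steps and the conditional delete can be sketched together as follows. This is an illustration under stated assumptions, not the implementation in this PR: `PurgeSketch`, `ObjStore`, `deleteIfReferencedBefore`, and the in-memory map are hypothetical stand-ins, and a real backend would have to execute the conditional delete atomically.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class PurgeSketch {

  /** Hypothetical in-memory stand-in for the backing database: object ID -> referenced timestamp. */
  static final class ObjStore {
    final Map<String, Long> referencedById = new HashMap<>();

    /** Same meaning as: DELETE FROM objs WHERE obj_id = :objId AND referenced < :memoizedTimestamp. */
    boolean deleteIfReferencedBefore(String objId, long memoizedTimestamp) {
      Long referenced = referencedById.get(objId);
      if (referenced == null || referenced >= memoizedTimestamp) {
        return false; // missing, or written again since the purge run started
      }
      referencedById.remove(objId);
      return true;
    }
  }

  /** Runs one purge pass and returns the number of deleted objects. */
  static int purgeUnreferenced(
      ObjStore store,
      LongSupplier clock,
      long driftAllowance,
      Supplier<Set<String>> identifyReferenced) {
    // Step 1: memoize the timestamp before identifying anything.
    long memoizedTimestamp = clock.getAsLong() - driftAllowance;

    // Step 2: identify the IDs of all referenced objects (could be a bloom filter instead).
    Set<String> referencedIds = identifyReferenced.get();

    // Step 3: scan all objects; delete only unreferenced ones that are old enough.
    int purged = 0;
    for (String objId : new ArrayList<>(store.referencedById.keySet())) {
      if (referencedIds.contains(objId)) {
        continue;
      }
      if (store.deleteIfReferencedBefore(objId, memoizedTimestamp)) {
        purged++;
      }
    }
    return purged;
  }
}
```

Because the timestamp is memoized before the referenced IDs are collected, any object written during steps 2 or 3 carries a newer referenced value and survives the conditional delete.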

Note that the referenced attribute may be inaccurate when retrieved from the objects cache (that is, during normal operations). This is not a problem, because the attribute is irrelevant for production accesses.

In #9401 two edge cases / race conditions were identified:

  1. (for some backends): A storeObj() operation detects that the object already exists, then the purge routine deletes that object, and then storeObj() tries to update the referenced attribute. The result is the loss of that object. This race condition can only occur if the object existed but was not referenced.
  2. While the referenced objects are being identified, a new named reference (branch / tag) is created that points to commit(s) which would be identified as unreferenced and later purged.
        * This case is still valid.
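
The interleaving behind the first race condition can be reproduced deterministically. The sketch below also shows one conceivable mitigation: checking whether the update of referenced actually hit a row, and re-inserting if it did not. This is an assumption for illustration and not necessarily how this PR or any backend handles it. All names (`StoreObjRaceSketch`, `insertIfAbsent`, `updateReferenced`, the `concurrentAction` hook) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class StoreObjRaceSketch {
  /** Hypothetical stand-in for the database table: object ID -> referenced timestamp. */
  final Map<String, Long> referencedById = new HashMap<>();

  /** Returns true if the object was newly inserted, false if it already existed. */
  boolean insertIfAbsent(String objId, long now) {
    return referencedById.putIfAbsent(objId, now) == null;
  }

  /** Returns true if an existing row was updated, false if the row is gone. */
  boolean updateReferenced(String objId, long now) {
    return referencedById.computeIfPresent(objId, (id, old) -> now) != null;
  }

  /** Conditional delete as performed by the purge routine. */
  boolean purge(String objId, long memoizedTimestamp) {
    Long referenced = referencedById.get(objId);
    if (referenced == null || referenced >= memoizedTimestamp) {
      return false;
    }
    referencedById.remove(objId);
    return true;
  }

  /**
   * Sketch of a storeObj() that tolerates a purge between its two steps.
   * concurrentAction simulates whatever another process does in that window.
   */
  void storeObj(String objId, long now, Runnable concurrentAction) {
    if (insertIfAbsent(objId, now)) {
      return; // first write: referenced is set on insert
    }
    concurrentAction.run(); // the purge routine may delete the object here
    if (!updateReferenced(objId, now)) {
      // The update hit no row, so the object vanished: write it again.
      insertIfAbsent(objId, now);
    }
  }
}
```

Without the check on the update result, the object stored by the caller would silently be missing after the purge.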

@snazy snazy force-pushed the purge-unreferenced-objs branch 10 times, most recently from 5a8495a to 9b93435 Compare October 9, 2024 18:09
@snazy snazy changed the title Persistence: purge unreferenced Objs (WIP) Persistence: purge unreferenced Objs Oct 10, 2024
@snazy snazy force-pushed the purge-unreferenced-objs branch 2 times, most recently from ef04484 to d4a47b5 Compare October 10, 2024 14:04
snazy added a commit to snazy/nessie that referenced this pull request Oct 10, 2024
Just implements the functionality to "shorten" the commit log history of a `Reference`. No API yet, and this should still be optimized to reduce heap pressure by memoizing only `ObjId`s instead of `CommitObj`s.

Exposing this functionality is not yet advisable, because it would be meaningless without projectnessie#9688 in place and exposed as a functionality as well.

Fixes projectnessie#9733
@snazy snazy added this to the 0.100.0 milestone Oct 10, 2024
@snazy snazy force-pushed the purge-unreferenced-objs branch 4 times, most recently from f6ec1d3 to f75d41b Compare October 14, 2024 12:01
snazy added a commit that referenced this pull request Oct 14, 2024
#9735)

Just implements the functionality to "shorten" the commit log history of a `Reference`. No API yet, and this should still be optimized to reduce heap pressure by memoizing only `ObjId`s instead of `CommitObj`s.

Exposing this functionality is not yet advisable, because it would be meaningless without #9688 in place and exposed as a functionality as well.

Fixes #9733
@snazy snazy force-pushed the purge-unreferenced-objs branch 3 times, most recently from b5e8a37 to 871c84e Compare October 14, 2024 13:13
snazy added a commit to snazy/nessie that referenced this pull request Oct 14, 2024
Adds a new command to access the implementation provided by projectnessie#9688
@snazy snazy marked this pull request as ready for review October 14, 2024 17:47
snazy added a commit to snazy/nessie that referenced this pull request Oct 15, 2024
Adds a new command to access the implementation provided by projectnessie#9688
snazy added a commit to snazy/nessie that referenced this pull request Oct 30, 2024
Adds a new command to access the implementation provided by projectnessie#9688
snazy added a commit to snazy/nessie that referenced this pull request Nov 4, 2024
Adds a new command to access the implementation provided by projectnessie#9688
snazy added a commit to snazy/nessie that referenced this pull request Nov 11, 2024
Adds a new command to access the implementation provided by projectnessie#9688
@snazy snazy modified the milestones: 0.100.0, 0.101.0 Nov 11, 2024