Persistence: purge unreferenced `Obj`s #9688
Open. snazy wants to merge 1 commit into projectnessie:main from snazy:purge-unreferenced-objs
snazy force-pushed the purge-unreferenced-objs branch 10 times, most recently from 5a8495a to 9b93435 (October 9, 2024 18:09)
snazy force-pushed the purge-unreferenced-objs branch from 9b93435 to c47c0a6 (October 10, 2024 09:12)
snazy changed the title from "Persistence: purge unreferenced `Obj`s (WIP)" to "Persistence: purge unreferenced `Obj`s" (Oct 10, 2024)
snazy force-pushed the purge-unreferenced-objs branch 2 times, most recently from ef04484 to d4a47b5 (October 10, 2024 14:04)
snazy added a commit to snazy/nessie that referenced this pull request (Oct 10, 2024):
Just implements the functionality to "shorten" the commit log history of a `Reference`. No API yet, and this should still be optimized to reduce heap pressure by memoizing only `ObjId`s instead of `CommitObj`s. Exposing this functionality is not yet advisable, because it would be meaningless without projectnessie#9688 in place and exposed as functionality as well.
snazy added a commit to snazy/nessie that referenced this pull request (Oct 10, 2024):
Just implements the functionality to "shorten" the commit log history of a `Reference`. No API yet, and this should still be optimized to reduce heap pressure by memoizing only `ObjId`s instead of `CommitObj`s. Exposing this functionality is not yet advisable, because it would be meaningless without projectnessie#9688 in place and exposed as functionality as well. Fixes projectnessie#9733
snazy force-pushed the purge-unreferenced-objs branch 4 times, most recently from f6ec1d3 to f75d41b (October 14, 2024 12:01)
snazy added a commit that referenced this pull request (Oct 14, 2024):
#9735) Just implements the functionality to "shorten" the commit log history of a `Reference`. No API yet, and this should still be optimized to reduce heap pressure by memoizing only `ObjId`s instead of `CommitObj`s. Exposing this functionality is not yet advisable, because it would be meaningless without #9688 in place and exposed as functionality as well. Fixes #9733
snazy force-pushed the purge-unreferenced-objs branch 3 times, most recently from b5e8a37 to 871c84e (October 14, 2024 13:13)
snazy added a commit to snazy/nessie that referenced this pull request (Oct 14, 2024):
Adds a new command to access the implementation provided by projectnessie#9688
snazy force-pushed the purge-unreferenced-objs branch from 871c84e to 6b0158d (October 14, 2024 17:47)
snazy force-pushed the purge-unreferenced-objs branch from 6b0158d to 4453b29 (October 15, 2024 08:11)
snazy added a commit to snazy/nessie that referenced this pull request (Oct 15, 2024):
Adds a new command to access the implementation provided by projectnessie#9688
snazy added a commit to snazy/nessie that referenced this pull request (Oct 15, 2024):
Adds a new command to access the implementation provided by projectnessie#9688
snazy added a commit to snazy/nessie that referenced this pull request (Oct 15, 2024):
Adds a new command to access the implementation provided by projectnessie#9688
snazy force-pushed the purge-unreferenced-objs branch from 4453b29 to 0e5bd73 (October 16, 2024 07:45)
snazy force-pushed the purge-unreferenced-objs branch from 0e5bd73 to 86478ac (October 30, 2024 09:53)
snazy added a commit to snazy/nessie that referenced this pull request (Oct 30, 2024):
Adds a new command to access the implementation provided by projectnessie#9688
snazy force-pushed the purge-unreferenced-objs branch from 86478ac to 1d1f8b6 (November 4, 2024 17:04)
snazy added a commit to snazy/nessie that referenced this pull request (Nov 4, 2024):
Adds a new command to access the implementation provided by projectnessie#9688
This is an attempt to implement the algorithm mentioned in the PR projectnessie#9401.

The `Obj.referenced()` attribute contains the timestamp when the object was last "referenced" (aka: attempted to be written). It is ...
* set when an object is first persisted via a `storeObj()`
* updated in the database when `storeObj()` did not persist the object (because it already exists)
* set/updated via `upsertObj()`
* updated via `updateConditional()`

Let's assume that there is a mechanism to identify the IDs of all referenced objects (it would be very similar to what the export functionality does). The algorithm to purge unreferenced objects must never delete an object that is referenced at any point in time, and must handle the case of an object that was unreferenced when a purge-unreferenced-objects routine started but became referenced while it is running.

An approach could work as follows:
1. Memoize the current timestamp (minus some wall-clock drift adjustment).
2. Identify the IDs of all referenced objects. We could leverage a bloom filter, if the set of IDs is big.
3. Then scan all objects in the repository. Objects can be purged, if ...
   * the ID is not in the set (or bloom filter) generated in step 2 ...
   * _AND_ have a `referenced` timestamp less than the memoized timestamp.

Any deletion in the backing database would follow the meaning of this pseudo SQL: `DELETE FROM objs WHERE obj_id = :objId AND referenced < :memoizedTimestamp`.

Note that the `referenced` attribute may be inaccurate when retrieved from the objects cache (aka: during normal operations), which is not a problem, because that `referenced` attribute is irrelevant for production accesses.

There are two edge cases / race conditions:
* (for some backends): A `storeObj()` operation detected that the object already exists, then the purge routine deletes that object, and then the `storeObj()` tries to update the `referenced` attribute. The result is the loss of that object. This race condition can only occur if the object existed but was not referenced.
* While the referenced objects are being identified, a new named reference (branch / tag) is created that points to commit(s) that would be identified as unreferenced and later purged.
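The three steps of the approach can be sketched in memory as follows. This is only an illustration under stated assumptions: the class `PurgeSketch`, the map standing in for the `objs` table, and the method names are hypothetical and are not Nessie's persistence API. Step 2 is passed in as a supplier so that it runs only after the timestamp has been memoized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/** In-memory sketch of the purge steps; hypothetical, not Nessie's actual API. */
class PurgeSketch {
  /** Hypothetical stand-in for the objs table: object ID -> `referenced` timestamp. */
  final Map<String, Long> objs = new HashMap<>();

  /** Mirrors: DELETE FROM objs WHERE obj_id = :objId AND referenced < :memoizedTimestamp. */
  boolean deleteIfReferencedBefore(String objId, long memoizedTimestamp) {
    Long referenced = objs.get(objId);
    if (referenced == null || referenced >= memoizedTimestamp) {
      return false; // missing, or written at/after the memoized timestamp: keep it
    }
    objs.remove(objId);
    return true;
  }

  /** Runs steps 1 to 3 and returns the IDs that were purged. */
  List<String> purge(long now, long driftAllowance, Supplier<Set<String>> identifyReferenced) {
    // Step 1: memoize the timestamp, minus a wall-clock drift allowance.
    long memoizedTimestamp = now - driftAllowance;
    // Step 2: identify the IDs of all referenced objects (assumed to be given by the caller).
    Set<String> referencedIds = identifyReferenced.get();
    // Step 3: scan all objects; purge only those that are unreferenced AND old enough.
    List<String> purged = new ArrayList<>();
    for (String objId : new ArrayList<>(objs.keySet())) {
      if (!referencedIds.contains(objId) && deleteIfReferencedBefore(objId, memoizedTimestamp)) {
        purged.add(objId);
      }
    }
    return purged;
  }

  public static List<String> demo() {
    PurgeSketch store = new PurgeSketch();
    store.objs.put("commit-1", 100L);   // reachable, old
    store.objs.put("orphan-old", 100L); // unreachable, old
    store.objs.put("orphan-new", 990L); // unreachable, but written recently
    return store.purge(1000L, 50L, () -> Set.of("commit-1"));
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints [orphan-old]
  }
}
```

Only `orphan-old` is removed: `commit-1` is in the referenced set, and `orphan-new` has a `referenced` value (990) that is not below the memoized timestamp (950), so it survives even though step 2 did not see it.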
snazy force-pushed the purge-unreferenced-objs branch from 1d1f8b6 to defb191 (November 11, 2024 13:35)
snazy added a commit to snazy/nessie that referenced this pull request (Nov 11, 2024):
Adds a new command to access the implementation provided by projectnessie#9688
This is an attempt to implement the algorithm mentioned in the PR #9401.

The `Obj.referenced()` attribute contains the timestamp when the object was last "referenced" (aka: attempted to be written). It is ...
* set when an object is first persisted via a `storeObj()`
* updated in the database when `storeObj()` did not persist the object (because it already exists)
* set/updated via `upsertObj()`
* updated via `updateConditional()`

Let's assume that there is a mechanism to identify the IDs of all referenced objects (it would be very similar to what the export functionality does). The algorithm to purge unreferenced objects must never delete an object that is referenced at any point in time, and must handle the case of an object that was unreferenced when a purge-unreferenced-objects routine started but became referenced while it is running.

An approach could work as follows:
1. Memoize the current timestamp (minus some wall-clock drift adjustment).
2. Identify the IDs of all referenced objects. We could leverage a bloom filter, if the set of IDs is big.
3. Then scan all objects in the repository. Objects can be purged, if ...
   * the ID is not in the set (or bloom filter) generated in step 2 ...
   * _AND_ have a `referenced` timestamp less than the memoized timestamp.

Any deletion in the backing database would follow the meaning of this pseudo SQL: `DELETE FROM objs WHERE obj_id = :objId AND referenced < :memoizedTimestamp`.

Note that the `referenced` attribute may be inaccurate when retrieved from the objects cache (aka: during normal operations), which is not a problem, because that `referenced` attribute is irrelevant for production accesses.

In #9401 two edge cases / race conditions were identified:
* (for some backends): A `storeObj()` operation detected that the object already exists, then the purge routine deletes that object, and then the `storeObj()` tries to update the `referenced` attribute. The result is the loss of that object. This race condition can only occur if the object existed but was not referenced.
  … `deleteWithReferenced(Obj)` … #9731 by adding the `IF EXISTS` clause to the corresponding update of the `referenced` attribute.
* While the referenced objects are being identified, a new named reference (branch / tag) is created that points to commit(s) that would be identified as unreferenced and later purged.
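To make the `storeObj()` race concrete, here is a minimal in-memory sketch of why an update that only applies to an existing row helps: the writer learns that the row vanished and can store the object again, instead of silently losing it. Everything here is an assumption for illustration. The class `StoreRaceSketch` and its methods are hypothetical, the map stands in for the `objs` table, and the actual change in #9731 may work differently.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the storeObj()/purge race; not Nessie's actual code. */
class StoreRaceSketch {
  /** Hypothetical stand-in for the objs table: object ID -> `referenced` timestamp. */
  final ConcurrentHashMap<String, Long> objs = new ConcurrentHashMap<>();

  /** Conditional update: touches an existing row only and reports whether it applied. */
  boolean touchReferencedIfExists(String objId, long now) {
    return objs.computeIfPresent(objId, (id, old) -> now) != null;
  }

  /** Conditional delete as used by the purge routine. */
  boolean deleteIfReferencedBefore(String objId, long memoizedTimestamp) {
    boolean[] deleted = {false};
    objs.computeIfPresent(objId, (id, referenced) -> {
      if (referenced < memoizedTimestamp) {
        deleted[0] = true;
        return null; // returning null removes the mapping atomically
      }
      return referenced;
    });
    return deleted[0];
  }

  /** storeObj()-like write: insert, else touch; if the touch finds nothing, insert again. */
  void storeObj(String objId, long now) {
    while (objs.putIfAbsent(objId, now) != null) {
      if (touchReferencedIfExists(objId, now)) {
        return;
      }
      // The row vanished between the two calls: loop and insert again.
    }
  }

  /** Replays the problematic interleaving step by step on a single thread. */
  public static String demo() {
    StoreRaceSketch store = new StoreRaceSketch();
    store.objs.put("obj-1", 100L); // exists, but is unreferenced and old
    boolean existed = store.objs.putIfAbsent("obj-1", 1000L) != null; // writer sees it exists
    boolean purged = store.deleteIfReferencedBefore("obj-1", 950L);   // purge deletes it
    boolean touched = store.touchReferencedIfExists("obj-1", 1000L);  // update reports the miss
    if (!touched) {
      store.storeObj("obj-1", 1000L); // writer recovers by storing again
    }
    return existed + " " + purged + " " + touched + " " + store.objs.get("obj-1");
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints true true false 1000
  }
}
```

With an unconditional update in place of `touchReferencedIfExists`, the writer would have no signal that the purge routine won the race, which is the object loss described above.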