Skip to content

[ENH]: sysdb changes to support moving collection hard deletes to garbage collector #4607

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

codetheweb
Copy link
Contributor

@codetheweb codetheweb commented May 22, 2025

Description of changes

  • sysdb no longer has a flag for soft delete mode, soft delete is always enabled
  • sysdb no longer automatically transitions soft deleted collections to hard deleted
  • lineage file name is now uuidv7
    • It was previously based on the new target collection ID that was just added to the file, which works when adding dependencies but not when removing dependencies. UUIDv7 is time-sortable and random.
  • added BatchGetCollectionSoftDeleteStatus method (is called/tested in the next PR)

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

n/a

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

Copy link

⚠️ The Helm chart was updated without a version bump. Your changes will only be published if the version field in k8s/distributed-chroma/Chart.yaml is updated.

// SetDeleteMode sets the delete mode for testing
func (c *Coordinator) SetDeleteMode(mode DeleteMode) {
c.deleteMode = mode
func (s *Coordinator) BatchGetCollectionSoftDeleteStatus(ctx context.Context, req *coordinatorpb.BatchGetCollectionSoftDeleteStatusRequest) (*coordinatorpb.BatchGetCollectionSoftDeleteStatusResponse, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review: should we instead extend the GetCollection method to allow returning soft deleted collections, and then get the statuses with multiple calls?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a single call to reduce the roundtrips. GetCollection should not return soft deleted collections for its current usages.

@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch 3 times, most recently from afe5084 to 9aed3e4 Compare May 22, 2025 20:11
@codetheweb codetheweb force-pushed the feat-validate-file-paths-during-gc branch from c570721 to bcbf758 Compare May 22, 2025 22:23
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from 9aed3e4 to ca845a3 Compare May 22, 2025 22:23
@codetheweb codetheweb marked this pull request as ready for review May 22, 2025 23:16
@codetheweb codetheweb requested a review from sanketkedia May 22, 2025 23:16
Copy link
Contributor

propel-code-bot bot commented May 22, 2025

[ENH]: sysdb changes to support moving collection hard deletes to garbage collector

This major PR re-architects the sysdb collection deletion flow by enabling soft delete as the only deletion mode and shifting responsibility for hard deletion of collections to an external garbage collector. It removes all prior server-side soft delete flags and automatic hard deletion logic, introduces distinct 'SoftDeleteCollection' and 'FinishCollectionDeletion' (hard delete) code paths, adapts the proto/grpc APIs accordingly, and ensures correct lineage file management with UUIDv7 filenames. A new batch API for efficiently querying soft delete status across multiple collections is also added.

Key Changes:
• Removed the soft-delete flag and automatic hard delete logic from sysdb; soft deletes are now always used
• Split deletion logic: 'SoftDeleteCollection' marks collections as deleted, while 'FinishCollectionDeletion' (hard delete) is meant for external GC use
• Removed in-process SoftDeleteCleaner/background cleanup from server implementation
• Migration of the hard-deletion API to the responsibility of an external GC, and added gRPC proto methods reflecting this flow
• Changed lineage file naming to use UUIDv7 for better time-ordering and conflict avoidance
• Added BatchGetCollectionSoftDeleteStatus API and backend for bulk-lookup of soft-deleted status for collections
• Removed all configuration options related to soft delete management from server and CLI
• Adapted table_catalog and related code to enforce invariants and protect lineage file updates with appropriate row-level locking
• Refactored tests to use direct calls to new code paths ('SoftDeleteCollection', 'FinishCollectionDeletion') instead of legacy unified deletion logic
• Updated proto definitions, mocks, and downstream call sites throughout the sysdb module

Affected Areas:
• sysdb/coordinator (deletion API, collection lifecycle, lineage management)
• sysdb/grpc (gRPC server and service implementation, removal of background cleaner)
• idl/chromadb/proto (protocol and service contract, collection deletion and batch status APIs)
• sysdb/metastore/db/dao (database access for deletion, batch soft delete status)
• sysdb/metastore/s3 (lineage/version file handling)
• Command-line flags, configuration, and test suites

Potential Impact:

Functionality: Collection hard deletes are no longer triggered by sysdb itself but must be issued explicitly (e.g., by a garbage collector process). All code and clients must now follow the new two-step deletion process. Legacy flows and flags for soft/hard delete are removed.

Performance: Reduced server-side cron-like activity for purging collections; batch APIs for soft delete status improve efficiency for clients like GC. Slight impact in GC latency depending on garbage collector implementation.

Security: No direct changes to authorization, but deletion pathways are now more explicit and controlled, reducing accidental permanent data loss.

Scalability: Should improve with externalization of collection purging and batch status fetches, and more scalable lineage management with UUIDv7.

Review Focus:
• Correctness of the new deletion flow: no accidental permanent deletes; no collection left in inconsistent state
• Race conditions around the updating or deleting of lineage files (check locking and error paths)
• Correct adaptation/removal of soft delete cleaner code and cron logic across sysdb/grpc
• Proto and gRPC API changes compatibility (ensure all call sites and generated code match)
• BatchGetCollectionSoftDeleteStatus implementation (edge cases on empty and mixed inputs)
• Removal/renaming of configs-ensure CLI, configs, and documentation are updated in sync

Testing Needed

• Verify the two-step deletion flow: collections should move to soft-delete state and only be hard-deleted via the FinishCollectionDeletion API
• Ensure lineage files are managed (including concurrent forks and deletes) without race
• Batch soft delete status API correctness for both existing, new, and missing collections
• All new and adapted integration/unit tests pass (especially for deletion edge-cases, lineage, and batched status)

Code Quality Assessment

go/pkg/sysdb/coordinator/coordinator.go: Cleaner interface for clients, well-abstracted two-step flow.

go/pkg/sysdb/metastore/s3/impl.go: Improved nil checks. Reviewer requested additional guards for robustness, which is acknowledged but could be further improved.

go/pkg/sysdb/coordinator/table_catalog.go: Substantial refactor, mostly clear, includes explicit row-level locking for potential race conditions. Some error handling improvements; edge cases are considered.

go/pkg/sysdb/grpc/collection_service.go: Clear separation of soft vs. hard delete, improved logging. New gRPC methods are well-documented.

go/pkg/sysdb/metastore/db/dao/collection.go: Batch queries implemented as expected; suggestion to check empty input for further efficiency; mostly concise.

idl/chromadb/proto/coordinator.proto: Major service definition changes; structure is consistent and well-documented.

Best Practices

Testing:
• Integration and unit tests updated and improved for new behavior

Config Hygiene:
• Removal of obsolete flags/configs

API Design:
• Separation of concern between soft and hard delete
• Batch status API for efficiency and atomicity

Robustness:
UUIDv7 for file naming prevents collisions and enables efficient sorting
• Reviewer suggests additional guards and comments for further robustness

Potential Issues

• Potential for race conditions on lineage file updates (mitigated with row locks, but should be carefully verified under load/concurrent operations)
• Some downstream consumers may expect the legacy unified DeleteCollection API behavior; must ensure comprehensive migration
• Lineage or version file handling could panic on malformed data or unexpected proto structures (as flagged by review comments)
• Danger if GC process is not deployed or if clients do not issue the hard delete step-collections would remain in limbo

This summary was automatically generated by @propel-code-bot

@codetheweb codetheweb force-pushed the feat-validate-file-paths-during-gc branch from bcbf758 to f72bd63 Compare May 22, 2025 23:24
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from ca845a3 to ccc9b82 Compare May 22, 2025 23:24
@codetheweb codetheweb force-pushed the feat-validate-file-paths-during-gc branch from f72bd63 to c62bf3f Compare May 23, 2025 00:00
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from ccc9b82 to 4dcc3a4 Compare May 23, 2025 00:01
@codetheweb codetheweb force-pushed the feat-validate-file-paths-during-gc branch from c62bf3f to d6bf939 Compare May 23, 2025 00:48
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from 4dcc3a4 to f2adff1 Compare May 23, 2025 00:49
@codetheweb codetheweb force-pushed the feat-validate-file-paths-during-gc branch from d6bf939 to e0a0f88 Compare May 23, 2025 01:27
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from f2adff1 to 660013f Compare May 23, 2025 01:27
@codetheweb codetheweb force-pushed the feat-validate-file-paths-during-gc branch from e0a0f88 to 7f275bb Compare May 23, 2025 02:08
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from 660013f to b10b4b0 Compare May 23, 2025 02:08
@codetheweb codetheweb force-pushed the feat-validate-file-paths-during-gc branch from 7f275bb to a22cc8a Compare May 23, 2025 17:11
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from b10b4b0 to 77a9d62 Compare May 23, 2025 17:11
@codetheweb codetheweb force-pushed the feat-validate-file-paths-during-gc branch 2 times, most recently from 3cd7274 to 8f20043 Compare May 23, 2025 22:55
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch 2 times, most recently from 716f29a to 9356e10 Compare May 23, 2025 22:55
@codetheweb codetheweb force-pushed the feat-validate-file-paths-during-gc branch from 8f20043 to 4a3b406 Compare May 23, 2025 23:36
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from 9356e10 to 8253c18 Compare May 23, 2025 23:36
@codetheweb codetheweb force-pushed the feat-validate-file-paths-during-gc branch from 4a3b406 to e533b0c Compare May 27, 2025 17:22
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from 8253c18 to e1a6311 Compare May 27, 2025 17:22
return err
}
if rootCollection == nil {
return errors.New("root collection not found")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think there is an error constant for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did not see one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in ForkCollection we directly used common.ErrCollectionNotFound. Maybe we should have a separate error constant for this. I'm ok with this now

// Remove collection being deleted from the dependencies
updatedDependencies := make([]*coordinatorpb.CollectionVersionDependency, 0)
for _, dependency := range lineageFile.Dependencies {
if dependency.TargetCollectionId != deleteCollection.ID.String() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens when dependency.SourceCollectionId == deleteCollection.ID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dependency.SourceCollectionId == deleteCollection.ID should never evaluate to true at this point because only leaf nodes in the dependency tree are eligible for hard deletion
but this would be good to add as an invariant 👍

return err
}

tc.metaDomain.CollectionDb(txCtx).UpdateCollectionLineageFilePath(rootCollection.ID, rootCollection.LineageFileName, newLineageFileFullName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warn: potential race condition with forkCollection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be fixed with the lock I added

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the lock seems to be on the deleted collection, while we need to lock the root collection to prevent race right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 whoops yes, we need to lock both
sorry tired brain today

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

careful about the lock order. not sure if a deadlock is possible as forking lock two collections as well

@codetheweb codetheweb changed the base branch from feat-validate-file-paths-during-gc to graphite-base/4607 May 27, 2025 23:03
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from e1a6311 to c85ee20 Compare May 27, 2025 23:03
@codetheweb codetheweb force-pushed the graphite-base/4607 branch from e533b0c to 3baca7c Compare May 27, 2025 23:03
@graphite-app graphite-app bot changed the base branch from graphite-base/4607 to main May 27, 2025 23:03
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch 2 times, most recently from c0914e0 to b2befca Compare May 27, 2025 23:13
@codetheweb codetheweb force-pushed the feat-gc-hard-delete-collection-sysdb branch from b2befca to 950e307 Compare May 28, 2025 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants