Change collection publish logic to stop querying workflow service for each item in the collection #5175

Closed
andrewjbtw opened this issue Sep 20, 2024 · 3 comments

@andrewjbtw

Publishing large collections (100,000+ items) is not working. There are no clear error messages, but there is evidence that the DSA servers are consuming too much memory and crashing. See this Slack thread for details.

Background

When a collection object is published, all the accessioned items in the collection are republished. Since the purl metadata is static, each item has to be updated when the collection changes so that the item purls will pick up things like changes to the collection title.

More specifically, when a collection object is published, some logic is applied to gather the list of items in the collection to publish. This logic applies some filtering:

  • no "Registered" items should be published
  • no "unpublished" items should be published
  • no "Opened" items should be published

The logic queries the workflow service for each item's status, and that seems to be the source of the recent problems with publishing large collections.

New logic

In discussion with @justinlittman, we think we can change the logic to no longer query the workflows. The new logic should be:

When a collection object is published, publish the collection members that:

  • have a last closed version (meaning they are not Registered - they've been accessioned)
  • and there's cocina for that last closed version (meaning they've been closed at least once since we moved to the new version model)

We no longer have to skip all "Opened" items since the version model has introduced a change to how we handle republishing "Opened" objects. We only have to skip "Registered" objects (which are Opened, but still in v1) since those should not have Purls until after being closed.
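
To make the proposed filter concrete, here is a minimal sketch in Python. It is only an illustration of the two conditions above, not the actual DSA code: the `Member` class, its field names, and the druids are all hypothetical, and it assumes the version data is already available locally so that no workflow service call is needed per item.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    """Hypothetical stand-in for a collection member; not the real data model."""
    druid: str
    last_closed_version: Optional[int]  # None means never closed (still Registered)
    has_cocina_for_last_closed: bool    # is cocina stored for that closed version?


def publishable(member: Member) -> bool:
    """Apply the two proposed conditions using only version data."""
    return member.last_closed_version is not None and member.has_cocina_for_last_closed


def members_to_publish(members: List[Member]) -> List[str]:
    # No per-item workflow service query happens here.
    return [m.druid for m in members if publishable(m)]


# Made-up example members covering the cases discussed above.
members = [
    Member("druid:aa111aa1111", None, False),  # Registered: v1, never closed
    Member("druid:bb222bb2222", 1, False),     # closed only before the new version model
    Member("druid:cc333cc3333", 2, True),      # closed, cocina present
    Member("druid:dd444dd4444", 3, True),      # currently Opened, last closed at v3
]
print(members_to_publish(members))
```

In this sketch the "Opened" item is still selected because it has a last closed version with cocina, while the "Registered" item is skipped because it has never been closed.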

Acceptance criteria

We need to be careful to avoid publishing items that should not be published. For testing this change, we need to verify that the following types of items do not get Purls after a collection object is published:

  • Registered items
  • Items with rights set to dark

The publish step should skip over items with rights set to dark, so we shouldn't have to exclude them when gathering the items to publish, but we do need to verify that they are handled correctly.
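
As a rough sketch of what such a check could look like, the Python below models a toy publish pass and asserts that the two item types listed above end up without Purls. Everything here is an assumption for illustration: the `Item` fixture, the `publish_collection_members` function, and the druids are hypothetical and do not reflect the real publish code or test suite.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical test fixture; field names are illustrative only."""
    druid: str
    registered: bool  # never closed, still in v1
    dark: bool        # rights set to dark


def publish_collection_members(items, purls):
    """Toy publish pass: records a Purl for each item that ends up published."""
    for item in items:
        if item.registered:
            continue  # excluded when gathering the members to publish
        if item.dark:
            continue  # skipped by the publish step itself
        purls.add(item.druid)


items = [
    Item("druid:aa111aa1111", registered=True, dark=False),
    Item("druid:bb222bb2222", registered=False, dark=True),
    Item("druid:cc333cc3333", registered=False, dark=False),
]
purls = set()
publish_collection_members(items, purls)

# The two cases called out in the acceptance criteria must not have Purls.
must_not_have_purls = [i.druid for i in items if i.registered or i.dark]
assert not any(druid in purls for druid in must_not_have_purls)
```

On stage, the equivalent check would be done by inspecting the individual druids in a small test collection after publishing the collection object.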

@andrewjbtw andrewjbtw added the bug label Sep 20, 2024
@andrewjbtw
Author

For testing, I created a small collection on stage with various rights statuses and processing statuses. We can't test the scale aspect of this issue but we can verify that the change in logic doesn't have any unintended side effects in terms of publishing (or not publishing) individual druids: https://argo-stage.stanford.edu/view/druid:cz561bz3441

@mjgiarlo mjgiarlo self-assigned this Sep 25, 2024
@andrewjbtw
Author

Closing this now as I've confirmed that the large collection I was trying to publish has now completed publishing.

@andrewjbtw
Author

Since I don't see it linked in this issue, this was the related PR: #5178


2 participants