Publishing large collections (100,000+ items) is not working. There are no clear error messages, but there is evidence that the DSA servers are consuming too much memory and crashing. See this Slack thread for details.
Background
When a collection object is published, all the accessioned items in the collection are republished. Since the purl metadata is static, each item has to be updated when the collection changes so that the item purls will pick up things like changes to the collection title.
More specifically, when a collection object is published, some logic is applied to gather the list of items in the collection to publish. This logic applies some filtering:
no "Registered" items should be published
no "unpublished" items should be published
no "Opened" items should be published
This logic queries the workflow service for item statuses, and that seems to be the source of the recent problems with publishing large collections.
New logic
In discussion with @justinlittman, we think we can change the logic so that it no longer queries the workflow service. The new logic should be:
When a collection object is published, publish the collection members that:
have a last closed version (meaning they are not Registered - they've been accessioned)
and there's cocina for that last closed version (meaning they've been closed at least once since we moved to the new version model)
We no longer have to skip all "Opened" items since the version model has introduced a change to how we handle republishing "Opened" objects. We only have to skip "Registered" objects (which are Opened, but still in v1) since those should not have Purls until after being closed.
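As a rough illustration of the selection rule above, here is a minimal Python sketch. It is not the DSA implementation: the `Member` class, its fields, and the druids are hypothetical stand-ins chosen for the example. It assumes the version data for each member is already available, so no workflow query is made.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Member:
    """Hypothetical stand-in for a collection member (not a real DSA model)."""
    druid: str
    last_closed_version: Optional[int]  # None: never closed, i.e. Registered
    has_cocina_for_last_closed: bool    # cocina exists for that closed version


def publishable_members(members: Iterable[Member]) -> List[Member]:
    """Pick members to republish using only version data, no workflow lookups."""
    return [
        m for m in members
        if m.last_closed_version is not None and m.has_cocina_for_last_closed
    ]


# Made-up example data covering the three cases discussed above.
members = [
    Member("druid:aa111aa1111", last_closed_version=None, has_cocina_for_last_closed=False),  # Registered
    Member("druid:bb222bb2222", last_closed_version=2, has_cocina_for_last_closed=True),      # closed at least once
    Member("druid:cc333cc3333", last_closed_version=1, has_cocina_for_last_closed=False),     # no cocina for closed version
]
selected = [m.druid for m in publishable_members(members)]
print(selected)
```

In this sketch an "Opened" item with an earlier closed version is still selected, which matches the point that only "Registered" items need to be skipped.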
Acceptance criteria
We need to be careful to avoid publishing items that should not be published. For testing this change, we need to verify that the following types of items do not get Purls after a collection object is published:
Registered items
Items with rights set to dark
The publish step should skip over items with rights set to dark, so we shouldn't have to exclude them from the list of items to publish, but we do need to verify that they are handled correctly.
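The acceptance check could be expressed along these lines. This is only a sketch under stated assumptions: `Item`, `publish_collection_members`, and the druids are hypothetical and simply model the two behaviours described above (Registered items are never selected, and the publish step itself skips dark items). A real test would run against the stage collection rather than in-memory data.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Item:
    """Hypothetical test fixture, not a real DSA model."""
    druid: str
    registered: bool  # never closed, still in v1
    dark: bool        # rights set to dark


def publish_collection_members(items: Iterable[Item]) -> Set[str]:
    """Return the druids that would have a Purl after the collection publish."""
    published = set()
    for item in items:
        if item.registered:
            continue  # excluded when gathering members to publish
        if item.dark:
            continue  # selected, but skipped by the publish step itself
        published.add(item.druid)
    return published


# Made-up fixture: one item of each kind named in the acceptance criteria.
fixture = [
    Item("druid:aa111aa1111", registered=True, dark=False),
    Item("druid:bb222bb2222", registered=False, dark=True),
    Item("druid:cc333cc3333", registered=False, dark=False),
]
purls = publish_collection_members(fixture)

# Neither the Registered item nor the dark item should have a Purl.
assert "druid:aa111aa1111" not in purls
assert "druid:bb222bb2222" not in purls
```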
For testing, I created a small collection on stage with various rights statuses and processing statuses. We can't test the scale aspect of this issue, but we can verify that the change in logic doesn't have any unintended side effects in terms of publishing (or not publishing) individual druids: https://argo-stage.stanford.edu/view/druid:cz561bz3441