Improve retrieval of original name #59

ross-spencer · 2020-09-22T15:30:07Z

Having used AIPscan a little more, leaving the name blank is strange
to look at, but we also cannot always retrieve the original name. We
do our best to a) retrieve the name in this commit, as well as b)
populate the name with the transfer UUID as an alternative.

Connected to #57

tw4l

Hi @ross-spencer, good investigative work here! I think you've hit on a good solution and the code looks good to me. I've left one suggestion for a clarifying comment and a question about one of the tests.

I see that you've opened a related mets-reader-writer issue. Forgive me if this is a naive question, but would it be better to make these changes to mets-reader-writer (i.e. add a helper to retrieve the transfer's original name, and more gracefully handle multiple dmdSecs with identical IDs) and then call the new helper here in AIPscan? You've done really good work of teasing this deceptively complicated question apart and it seems like there would be benefits to other projects of having this code in a shared helper library instead of only in AIPscan.

tw4l · 2020-09-22T19:16:37Z

AIPscan/Aggregator/mets_parse_helpers.py

@@ -49,31 +49,32 @@ def get_aip_original_name(mets):
    # Negated as we're going to want to remove this length of values.
    NAMESUFFIX = -len("-00000000-0000-0000-0000-000000000000")

+    # Other intellectual entities exist in the METS, i.e. for empty
+    # directories. We can identify those with an invalid name prefix.


It might be worth adding a comment that this name prefix is an indicator of an invalid <premis:originalName> value specifically within the context of a dmdSec. I was confused for a minute because the <premis:originalName> value in the amdSec of an original file also begins with "%transferDirectory%".

tw4l · 2020-09-22T19:25:09Z

AIPscan/Aggregator/tests/test_mets.py

-        (os.path.join("iso_mets", "iso_mets.xml"), "iso"),
+        (os.path.join("features_mets", "features-mets.xml"), "myTransfer", False),
+        (os.path.join("iso_mets", "iso_mets.xml"), "iso", False),
+        (os.path.join("original_name_mets", "dataverse_example.xml"), "", True),


Am I understanding correctly that this test raises a METSError because Archivematica has written two dmdSecs with an ID of "dmdSec_1" and mets-reader-writer isn't able to read two dmdSecs with the same ID? Forgive me if this naive, but with the naked eye it looks like like the <premis:originalName> value in the second <mets:dmdSec ID="dmdSec_1"> (starting on line 39) should be what we're looking for.

That is correct. I will add notes to the test. Related issue: archivematica/Issues#1310

ross-spencer · 2020-09-30T18:16:50Z

Thanks for the review @tw4l. You are somewhat right about the upstream issues. We had already determined an approach to handling those which was to not look at mets-reader-writer in the sprint and deliver value via the application itself. I don't feel the reactions to mets-reader-writer in this code are necessarily mutually exclusive of fixes there.

I have started an EPIC for us to build on in this repository: #53 and creating related issues in archivematica/issues. I feel we're approaching critical mass with regards to outstanding mets-reader-writer work so, as per the last review, suspect we'll be tackling those soon.

I have added comments to the above and hopefully made the intentions in the code and test clearer,,. but I couldn't see what explicit changes you wanted?

Just a note on marking "request for changes" in a PR. I recognize not everyone approaches our work like this - but we do have branch protection on this project so it's already explicit the PR can't be merged without approval - it try not to tell folks that's doubly the case. My personal approach is to leave a PR conversational by submitting comments unless there is something critically broken or out of keeping with our contributing gude that we should avoid at all costs. I tend to comment without explicit approval to create better understanding, ask questions, and that in most cases does result in some change to code. If in the rare case someone following their process approves a PR in the middle of a conversation and I am the first reviewer then I am comfortable with that as long as I know I've caught the things I had the capacity to catch that we can't really live with.

tw4l

Thanks @ross-spencer. I'm on board delivering value in the application itself and keeping track of potential upstream work for now (thanks for starting that Epic - that'll prove very helpful). I think that I had not yet been in a sprint meeting when I did the last review, so I was missing a bit of the context there.

You're right that I wasn't explict about any changes. I was trying to suggest (perhaps too implicitly) that upstream changes to mets-reader-writer to more gracefully handle the case of duplicative dmdSec IDs might result in downstream changes here - for instance, in the dataverse_example test returning the original value from the second "dmdSec=1" rather than throwing a METSError.

But given the sprint priorities and current focus, I'm quite happy to remove myself as an obstacle for merging - this is good work and definitely adding value. Thanks for adding the comments - I left one additional comment on a docstring typo but otherwise looks great.

Thanks for the PR process note as well - we do seem to be a bit inconsistent as a company in how/whether we use "request for changes" but I see your point and agree that a comment without explicit approval one way or the other would have been appropriate here.

tw4l · 2020-09-30T21:21:01Z

AIPscan/Aggregator/mets_parse_helpers.py

+    """Retrieve PREMIS original name from a METSDocument object.
+
+    If the original name cannot be reliably retrieved from the METS file
+    a METSError is exception is returned to be handled by the called as


I think there's a typo in "METSError is exception is returned"

Ah, good spot. Thanks Tessa.

Having used AIPscan a little more, leaving the name blank is strange to look at, but we also cannot always retrieve the original name. We do our best to a) retrieve the name in this commit, as well as b) populate the name with the transfer UUID as an alternative.

ross-spencer requested review from tw4l and peterVG September 22, 2020 15:30

ross-spencer force-pushed the dev/issue-57-transfer-name branch from 712ce9e to 8e6ec45 Compare September 22, 2020 15:32

tw4l suggested changes Sep 22, 2020

View reviewed changes

ross-spencer force-pushed the dev/issue-57-transfer-name branch from 8e6ec45 to ae08385 Compare September 30, 2020 17:56

ross-spencer requested a review from tw4l September 30, 2020 18:16

tw4l approved these changes Sep 30, 2020

View reviewed changes

ross-spencer force-pushed the dev/issue-57-transfer-name branch from ae08385 to 9a1e708 Compare September 30, 2020 21:54

ross-spencer merged commit 9a1e708 into main Sep 30, 2020

ross-spencer deleted the dev/issue-57-transfer-name branch September 30, 2020 21:57

peterVG mentioned this pull request Feb 16, 2021

AIP original name isn't always in the first dmdSec #57

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve retrieval of original name #59

Improve retrieval of original name #59

ross-spencer commented Sep 22, 2020

tw4l left a comment •

edited

Loading

tw4l Sep 22, 2020

tw4l Sep 22, 2020

ross-spencer Sep 30, 2020

ross-spencer commented Sep 30, 2020

tw4l left a comment

tw4l Sep 30, 2020

ross-spencer Sep 30, 2020

Improve retrieval of original name #59

Improve retrieval of original name #59

Conversation

ross-spencer commented Sep 22, 2020

tw4l left a comment • edited Loading

Choose a reason for hiding this comment

tw4l Sep 22, 2020

Choose a reason for hiding this comment

tw4l Sep 22, 2020

Choose a reason for hiding this comment

ross-spencer Sep 30, 2020

Choose a reason for hiding this comment

ross-spencer commented Sep 30, 2020

tw4l left a comment

Choose a reason for hiding this comment

tw4l Sep 30, 2020

Choose a reason for hiding this comment

ross-spencer Sep 30, 2020

Choose a reason for hiding this comment

tw4l left a comment •

edited

Loading