Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve retrieval of original name #59

Merged
merged 1 commit into from
Sep 30, 2020
Merged

Conversation

ross-spencer
Copy link
Contributor

Having used AIPscan a little more, leaving the name blank is strange
to look at, but we also cannot always retrieve the original name. We
do our best to a) retrieve the name in this commit, as well as b)
populate the name with the transfer UUID as an alternative.

Connected to #57

Copy link
Contributor

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ross-spencer, good investigative work here! I think you've hit on a good solution and the code looks good to me. I've left one suggestion for a clarifying comment and a question about one of the tests.

I see that you've opened a related mets-reader-writer issue. Forgive me if this is a naive question, but would it be better to make these changes to mets-reader-writer (i.e. add a helper to retrieve the transfer's original name, and more gracefully handle multiple dmdSecs with identical IDs) and then call the new helper here in AIPscan? You've done really good work of teasing this deceptively complicated question apart and it seems like there would be benefits to other projects of having this code in a shared helper library instead of only in AIPscan.

@@ -49,31 +49,32 @@ def get_aip_original_name(mets):
# Negated as we're going to want to remove this length of values.
NAMESUFFIX = -len("-00000000-0000-0000-0000-000000000000")

# Other intellectual entities exist in the METS, i.e. for empty
# directories. We can identify those with an invalid name prefix.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding a comment that this name prefix is an indicator of an invalid <premis:originalName> value specifically within the context of a dmdSec. I was confused for a minute because the <premis:originalName> value in the amdSec of an original file also begins with "%transferDirectory%".

(os.path.join("iso_mets", "iso_mets.xml"), "iso"),
(os.path.join("features_mets", "features-mets.xml"), "myTransfer", False),
(os.path.join("iso_mets", "iso_mets.xml"), "iso", False),
(os.path.join("original_name_mets", "dataverse_example.xml"), "", True),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I understanding correctly that this test raises a METSError because Archivematica has written two dmdSecs with an ID of "dmdSec_1" and mets-reader-writer isn't able to read two dmdSecs with the same ID? Forgive me if this naive, but with the naked eye it looks like like the <premis:originalName> value in the second <mets:dmdSec ID="dmdSec_1"> (starting on line 39) should be what we're looking for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. I will add notes to the test. Related issue: archivematica/Issues#1310

@ross-spencer
Copy link
Contributor Author

Thanks for the review @tw4l. You are somewhat right about the upstream issues. We had already determined an approach to handling those which was to not look at mets-reader-writer in the sprint and deliver value via the application itself. I don't feel the reactions to mets-reader-writer in this code are necessarily mutually exclusive of fixes there.

I have started an EPIC for us to build on in this repository: #53 and creating related issues in archivematica/issues. I feel we're approaching critical mass with regards to outstanding mets-reader-writer work so, as per the last review, suspect we'll be tackling those soon.

I have added comments to the above and hopefully made the intentions in the code and test clearer,,. but I couldn't see what explicit changes you wanted?

Just a note on marking "request for changes" in a PR. I recognize not everyone approaches our work like this - but we do have branch protection on this project so it's already explicit the PR can't be merged without approval - it try not to tell folks that's doubly the case. My personal approach is to leave a PR conversational by submitting comments unless there is something critically broken or out of keeping with our contributing gude that we should avoid at all costs. I tend to comment without explicit approval to create better understanding, ask questions, and that in most cases does result in some change to code. If in the rare case someone following their process approves a PR in the middle of a conversation and I am the first reviewer then I am comfortable with that as long as I know I've caught the things I had the capacity to catch that we can't really live with.

Copy link
Contributor

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ross-spencer. I'm on board delivering value in the application itself and keeping track of potential upstream work for now (thanks for starting that Epic - that'll prove very helpful). I think that I had not yet been in a sprint meeting when I did the last review, so I was missing a bit of the context there.

You're right that I wasn't explict about any changes. I was trying to suggest (perhaps too implicitly) that upstream changes to mets-reader-writer to more gracefully handle the case of duplicative dmdSec IDs might result in downstream changes here - for instance, in the dataverse_example test returning the original value from the second "dmdSec=1" rather than throwing a METSError.

But given the sprint priorities and current focus, I'm quite happy to remove myself as an obstacle for merging - this is good work and definitely adding value. Thanks for adding the comments - I left one additional comment on a docstring typo but otherwise looks great.

Thanks for the PR process note as well - we do seem to be a bit inconsistent as a company in how/whether we use "request for changes" but I see your point and agree that a comment without explicit approval one way or the other would have been appropriate here.

"""Retrieve PREMIS original name from a METSDocument object.

If the original name cannot be reliably retrieved from the METS file
a METSError is exception is returned to be handled by the called as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a typo in "METSError is exception is returned"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good spot. Thanks Tessa.

Having used AIPscan a little more, leaving the name blank is strange
to look at, but we also cannot always retrieve the original name. We
do our best to a) retrieve the name in this commit, as well as b)
populate the name with the transfer UUID as an alternative.
@ross-spencer ross-spencer merged commit 9a1e708 into main Sep 30, 2020
@ross-spencer ross-spencer deleted the dev/issue-57-transfer-name branch September 30, 2020 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants