Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Carry prov forward in certain scenarios #151

Open
wants to merge 407 commits into
base: main
Choose a base branch
from

Conversation

jeanetteclark
Copy link
Collaborator

see issue #131

isteves and others added 30 commits May 14, 2018 09:29
Added `cache: packages` back in
Add EML object as an alternative option for metadata_path
create pid_to_eml_entity function deprecate old functions
Added checks within function so it does not create dummy objects on a…
eml_otherEntity_to_dataTable function
Put the metadata check in the `check_first` block and left the resource map check as non-optional.
@jeanetteclark jeanetteclark requested a review from amoeba November 26, 2019 17:39
Copy link
Contributor

@amoeba amoeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good. Especially great to see the tests I was looking for to verify the behavior. After looking closer, I see now what you were saying about provenance being forwarded when the data PIDs don't change, even by default. Missed it on the first go-round. I think that works.

Made a few comments, let me know what you think.

@@ -255,6 +255,7 @@ update_object <- function(mn, pid, path, format_id = NULL, new_pid = NULL, sid =
#' Checks that objects exist and are of the right format type. This speeds up the function, especially when `data_pids` has many elements.
#' @param format_id (character) Optional. When omitted, the updated object will have the same formatId as `metadata_pid`. If set, will attempt
#' to use the value instead.
#' @param keep_prov (logical) Option to force publish_update to keep prov
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest removing two leading spaces before @param and changing to:

Optional. Whether or not to retain provenance statements present in resource_map_pid in the updated resource map. Defaults to FALSE.

in order to match style.

R/editing.R Outdated
# Get the old resource map so we can extract any statements we need out of it
# such as PROV statements
old_resource_map_path <- tempfile()
writeLines(rawToChar(dataone::getObject(mn, resource_map_pid)), old_resource_map_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the call to dataone::getObject fails, the function will stop and show an error about why. This is probably fine since, at this point, no work has been done that needs to get cleaned up. Sound good?

other_statements)
}

prov_pids <- gsub("https://cn-stage-2.test.dataone.org/cn/v[0-9]/resolve/|https://cn.dataone.org/cn/v[0-9]/resolve/|https://cn-stage.test.dataone.org/cn/v[0-9]/resolve/",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works for 99+% of cases, but it might be better if it was generalized to all environments: ^https?\:\/\/.+?\.(test\.)?org\/cn\/v\d+\/resolve\/.

prov_pids <- gsub("https://cn-stage-2.test.dataone.org/cn/v[0-9]/resolve/|https://cn.dataone.org/cn/v[0-9]/resolve/|https://cn-stage.test.dataone.org/cn/v[0-9]/resolve/",
"",
c(statements$subject, statements$object)) %>%
gsub("%3A", ":", .)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the PID portion of the resolve URI can have more than just URL-encoded ":" characters (like "/"), should this use URLdecode(x, reserved = TRUE) instead?

"",
c(statements$subject, statements$object)) %>%
gsub("%3A", ":", .)
prov_pids <- prov_pids[-(grep("^http", prov_pids))] %>% # might need to catch other things besides URLs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I'm understanding this part. Above, prov_pids should turn into a list of PIDs, possibly with dupes. Why do we need to filter it further before uniquifying?

# Create the replacement resource map
if (is.null(identifier)) {
identifier <- paste0("resource_map_", new_uuid())
}

new_rm_path <- generate_resource_map(metadata_pid = metadata_pid,
if (keep_prov == FALSE){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that the call to generate_resource_map shows up four times here, in separate conditional branches. While it totally works, a slightly tidier pattern is to use the conditionals to generate the appropriates values for each parameter of generate_resource_map and call it once after the conditional block. The benefits are simplifying the conditional, and reducing any pain that may come later when if/when generate_resource_map is refactored which would cause use to have to update four call sites instead of one.

This'd look something like:

if (CONDITION) {
  other_statements = NULL
}

generate_resource_map(...) # All args, even NULL'd-out ones like `other_statements`

#' Get a data.frame of prov statements from a resource map pid.
#'
#' This is a function that is useful if you need to recover lost prov statements. It returns
#' a data.frame of statements that can be passed to `update_resource_map` in the `other_statements`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we wanna linkify these references?


expect_true(stringr::str_detect(url_new[2], new_data_pid))
})
prov_pids <- gsub("https://cn-stage-2.test.dataone.org/cn/v[0-9]/resolve/|https://cn.dataone.org/cn/v[0-9]/resolve/|https://cn-stage.test.dataone.org/cn/v[0-9]/resolve/", "", c(t$subject, t$object)) %>%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this code is duplicated, might be good to make a function or two?

old_prov <- recover_prov(mn, rm_pid)
rm_new <- update_resource_map(mn, rm_pid, metadata_pid, data_pids, other_statements = old_prov, keep_prov = T)")

new_rm_path <- generate_resource_map(metadata_pid = metadata_pid,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation is a tad off.

warning("Old provenance contains data pids not in new resource map. Provenance information will be removed. \n
You can get old provenance statements back using:
old_prov <- recover_prov(mn, rm_pid)
rm_new <- update_resource_map(mn, rm_pid, metadata_pid, data_pids, other_statements = old_prov, keep_prov = T)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about changing keep_prov = T to keep_prov = TRUE to encourage people away from using abbreviations which is pretty dangerous.

dmullen17 and others added 7 commits December 11, 2019 14:20
Co-Authored-By: Jeanette Clark <[email protected]>
return eml_validate check in eml_otherEntity_to_dataTable
often the same personnel are listed on multiple grants, this prevents them from being listed multiple times in the EML project section
remove duplicate personnel from NSF helper
Merge remote-tracking branch 'upstream/master' into carry_prov

# Please enter a commit message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.
Base automatically changed from master to main February 23, 2021 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.