-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Carry prov forward in certain scenarios #151
base: main
Are you sure you want to change the base?
Conversation
Added `cache: packages` back in
…tautils into publish_update2
Add EML object as an alternative option for metadata_path
create pid_to_eml_entity function deprecate old functions
Added checks within function so it does not create dummy objects on a…
eml_otherEntity_to_dataTable function
Update eml.R
Put the metadata check in the `check_first` block and left the resource map check as non-optional.
still some failure cases that aren't working
…last names less likely to discard information
add NSF award helper
fix bug if co-pis is empty in new helper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good. Especially great to see the tests I was looking for to verify the behavior. After looking closer, I see now what you were saying about provenance being forwarded when the data PIDs don't change, even by default. Missed it on the first go-round. I think that works.
Made a few comments, let me know what you think.
@@ -255,6 +255,7 @@ update_object <- function(mn, pid, path, format_id = NULL, new_pid = NULL, sid = | |||
#' Checks that objects exist and are of the right format type. This speeds up the function, especially when `data_pids` has many elements. | |||
#' @param format_id (character) Optional. When omitted, the updated object will have the same formatId as `metadata_pid`. If set, will attempt | |||
#' to use the value instead. | |||
#' @param keep_prov (logical) Option to force publish_update to keep prov |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest removing two leading spaces before @param
and changing to:
Optional. Whether or not to retain provenance statements present in
resource_map_pid
in the updated resource map. Defaults toFALSE
.
in order to match style.
R/editing.R
Outdated
# Get the old resource map so we can extract any statements we need out of it | ||
# such as PROV statements | ||
old_resource_map_path <- tempfile() | ||
writeLines(rawToChar(dataone::getObject(mn, resource_map_pid)), old_resource_map_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the call to dataone::getObject
fails, the function will stop and show an error about why. This is probably fine since, at this point, no work has been done that needs to get cleaned up. Sound good?
other_statements) | ||
} | ||
|
||
prov_pids <- gsub("https://cn-stage-2.test.dataone.org/cn/v[0-9]/resolve/|https://cn.dataone.org/cn/v[0-9]/resolve/|https://cn-stage.test.dataone.org/cn/v[0-9]/resolve/", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works for 99+% of cases, but it might be better if it was generalized to all environments: ^https?\:\/\/.+?\.(test\.)?org\/cn\/v\d+\/resolve\/
.
prov_pids <- gsub("https://cn-stage-2.test.dataone.org/cn/v[0-9]/resolve/|https://cn.dataone.org/cn/v[0-9]/resolve/|https://cn-stage.test.dataone.org/cn/v[0-9]/resolve/", | ||
"", | ||
c(statements$subject, statements$object)) %>% | ||
gsub("%3A", ":", .) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the PID
portion of the resolve URI can have more than just URL-encoded ":" characters (like "/"), should this use URLdecode(x, reserved = TRUE)
instead?
"", | ||
c(statements$subject, statements$object)) %>% | ||
gsub("%3A", ":", .) | ||
prov_pids <- prov_pids[-(grep("^http", prov_pids))] %>% # might need to catch other things besides URLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I'm understanding this part. Above, prov_pids
should turn into a list of PIDs, possibly with dupes. Why do we need to filter it further before uniquifying?
# Create the replacement resource map | ||
if (is.null(identifier)) { | ||
identifier <- paste0("resource_map_", new_uuid()) | ||
} | ||
|
||
new_rm_path <- generate_resource_map(metadata_pid = metadata_pid, | ||
if (keep_prov == FALSE){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that the call to generate_resource_map
shows up four times here, in separate conditional branches. While it totally works, a slightly tidier pattern is to use the conditionals to generate the appropriates values for each parameter of generate_resource_map
and call it once after the conditional block. The benefits are simplifying the conditional, and reducing any pain that may come later when if/when generate_resource_map
is refactored which would cause use to have to update four call sites instead of one.
This'd look something like:
if (CONDITION) {
other_statements = NULL
}
generate_resource_map(...) # All args, even NULL'd-out ones like `other_statements`
#' Get a data.frame of prov statements from a resource map pid. | ||
#' | ||
#' This is a function that is useful if you need to recover lost prov statements. It returns | ||
#' a data.frame of statements that can be passed to `update_resource_map` in the `other_statements` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we wanna linkify these references?
|
||
expect_true(stringr::str_detect(url_new[2], new_data_pid)) | ||
}) | ||
prov_pids <- gsub("https://cn-stage-2.test.dataone.org/cn/v[0-9]/resolve/|https://cn.dataone.org/cn/v[0-9]/resolve/|https://cn-stage.test.dataone.org/cn/v[0-9]/resolve/", "", c(t$subject, t$object)) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this code is duplicated, might be good to make a function or two?
old_prov <- recover_prov(mn, rm_pid) | ||
rm_new <- update_resource_map(mn, rm_pid, metadata_pid, data_pids, other_statements = old_prov, keep_prov = T)") | ||
|
||
new_rm_path <- generate_resource_map(metadata_pid = metadata_pid, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation is a tad off.
warning("Old provenance contains data pids not in new resource map. Provenance information will be removed. \n | ||
You can get old provenance statements back using: | ||
old_prov <- recover_prov(mn, rm_pid) | ||
rm_new <- update_resource_map(mn, rm_pid, metadata_pid, data_pids, other_statements = old_prov, keep_prov = T)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about changing keep_prov = T
to keep_prov = TRUE
to encourage people away from using abbreviations which is pretty dangerous.
Co-Authored-By: Jeanette Clark <[email protected]>
return eml_validate check in eml_otherEntity_to_dataTable
often the same personnel are listed on multiple grants, this prevents them from being listed multiple times in the EML project section
remove duplicate personnel from NSF helper
Merge remote-tracking branch 'upstream/master' into carry_prov # Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
see issue #131