-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #268 easier user experience for splitting datasets #272
Conversation
Hi @bms63 , I pushed an intial draft of an updated xportr_write function. My unit test in test-write.R is failing the CMD check but when I run it in my local R studio, it passes so I was wondering if you could take a look into this during your review? Thanks. |
Hi @bms63 , I'm not sure about the test_large_files test since I didn't code it and it was already set to not TRUE. In my unit test, I used the adlb dataset from pharmaverseadam and I suspect that is the reason why it's failing the CMD check but I included pharmaverseadam in the DESCRIPTION file and also referenced library(pharmaverseadam) in the testthat.R file. |
@adchan11 I'm not sure on this failing windows check. @averissimo @vedhav @elimillera @EeethB do you all see any issues happening in this check? thanks for taking a peek if you are able!! |
amazing!! thanks @vedhav !! |
I think we are good - @elimillera can you send this to CRAN? do you want it in main or in this branch? |
Sounds good. I think we can push to the main branch then I'll push out and make the release once its accepted. I'll also do a couple of checks today to make sure we're all set. |
#' | ||
#' @noRd | ||
|
||
export_to_xpt <- function(.df, path, max_size_gb, file_prefix) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the size of the xpt file doesn't exceed the maximum size, no counter should be appended to the filename.
The results of the function are correct. However, the performance could be improved. If the file size is smaller than the maximum size, the file is written log(nr_rows)
times.
Maybe we could write the complete file first and if it exceeds the maximum size, an estimate for the number of rows is calculated based on the file size and the number of rows. If one of the parts exceeds the maximum size, the estimate is adjusted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we should optimize the code. For testing I used an ADLB dataset which written as xpt file has 400MB. I wrote the dataset with different values for max_size_gb
:
> system.time(xportr_write(adlb, "adlb.xpt"))
user system elapsed
25.730 2.258 27.857
> system.time(xportr_write(adlb, "adlb.xpt", max_size_gb = 1))
Data frame exported to 1 xpt files.
user system elapsed
50.154 22.826 73.497
> system.time(xportr_write(adlb, "adlb.xpt", max_size_gb = 0.3))
Data frame exported to 2 xpt files.
user system elapsed
247.439 25.566 275.742
@adchan11 , @bms63 , @rossfarrugia , what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Less of a concern from my side as most commonly would be ran with 4, 5, or 10 (so creating less splits than your examples above) and usually just a one time job at the end only impacting certain large datasets, so not like people will be running this often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super worried about optimizing either. Hopefully, xpts go away in the next couple of years!
R/write.R
Outdated
@@ -73,6 +76,7 @@ xportr_write <- function(.df, | |||
|
|||
assert_data_frame(.df) | |||
assert_string(path) | |||
checkmate::assert_numeric(max_size_gb, null.ok = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkmate::assert_numeric(max_size_gb, null.ok = TRUE) | |
assert_numeric(max_size_gb, null.ok = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bms63 For some reason, removing checkmate::
results in errors in the CMD checks. I took a look but not sure why. Do you have any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to get placed in this file with its other friends - https://github.com/atorus-research/xportr/blob/main/R/xportr-package.R
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated.
@bundfussr are you happy with your requested changes and thanks for being so thorough!! |
I don't know which deprecation strategy you use in xportr. However, in addition to the note in the Changelog I would add a note to the |
Thanks, I will work on this over the next few days. |
Hi @bundfussr, I pushed updates to deprecate the function and improve documentation. I took some guidelines on how to deprecate functions using these sources as I haven't done this before: https://github.com/tidyverse/dplyr/blob/HEAD/R/deprec-lazyeval.R Hopefully this follows best practices for deprecation but please let me know your feedback. Thanks. |
sorry @adchan11 we have a deprecation process in our Wiki https://github.com/atorus-research/xportr/wiki/Deprecation-Process. It is similar to admiral's process |
Thanks @bms63. I've pushed updates as per the deprecation process you've linked. The CMD checks now pass except one regarding validating links. I reviewed it and the links work for me but I'm not sure why they're not rendering properly in the vignettes. It's only the fda.gov links. FYI: @bundfussr ready for review if you have any additional feedback. Please note, I will be OOO for the next 2 weeks and back on Sept 30. I will review again on Sept 30 when I'm back if there are any changes I need to push. FYI @rossfarrugia |
xportr_split <- function(.df, split_by = NULL) { | ||
attr(.df, "_xportr.split_by_") <- split_by | ||
lifecycle::deprecate_warn( | ||
when = "0.5.0", | ||
what = "xportr_split()", | ||
with = "xportr_write()", | ||
details = "Please use the argument `max_gb_size` in the | ||
function xportr_write() instead` instead." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love this! ty for doing this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, the deprecated functionality is still available in Phase 1 (deprecation warning). However, here xportr_split()
still works as before but in xportr_write()
it is ignored. I.e., the functionality is already removed.
I would either go directly to Phase 2 (deprecation error) or keep the old functionality (in addition to the new functionality) in xport_write()
. I assume xportr_split()
is used only by a few users (if at all). Thus I would tend to the first option.
@bms63 , what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me - we can finish up when you return @adchan11 unless @rossfarrugia you need this now?
The links are a perpetual pain
Should be fine for us, thanks! FYI @millerg23 - If this goes back to CRAN early Oct then we should be able to squeeze in implementing it with our next admiralroche release. Although if @bundfussr agrees his comments are resolved now then we wouldn't need to wait for Adrian and could push ahead anytime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rossfarrugia if needed, I can try and do final review EOW and we can push to CRAN. @elimillera do you have time to review as well?
Honestly, feels like this is in pretty good shape
thanks @bms63 - if no extra review comments needing Adrian's input then yes i'd agree. that'd help us as we do have one team already lined up that will be wanting to use this new functionality. |
The new functionality is OK from my side. Just the deprecation of the old functionality deviates from the usual process. |
whew!!! two weeks goes fast. this is a perpetual item on my todo list. Still trying to get to it! |
@bms63 - @bundfussr has offered to make a commit to make the final updates for his review comments. Hopefully this'll help in pushing the release out and getting it off to CRAN |
@bms63 , I have a commit ready for finalizing the deprecation. Could you grant we write access to the repo? Then I can push it. |
@bms63 , should we create a |
Bummer looks like those Phuse sites are dead now...they were very helpful. |
Thank you for your Pull Request!
We have developed a Pull Request template to aid you and our reviewers. Completing the below tasks helps to ensure our reviewers can maximize their time on your code as well as making sure the xportr codebase remains robust and consistent.
The scope of
{xportr}
{xportr}
's scope is to enable R users to write out submission compliantxpt
files that can be delivered to a Health Authority or to downstream validation software programs. We see labels, lengths, types, ordering and formats from a dataset specification object (SDTM and ADaM) as being our primary focus. We also see messaging and warnings to users around applying information from the specification file as a primary focus. Please make sure your Pull Request meets this scope of {xportr}. If your Pull Request moves beyond this scope, please get in touch with the{xportr}
team on slack or create an issue to discuss.Please check off each task box as an acknowledgment that you completed the task. This checklist is part of the Github Action workflows and the Pull Request will not be merged into the
main
branch until you have checked off each task.Changes Description
Updated xportr_write with new function and parameter to split data frame based on specified maximum file size. Also updated unit testing to test ability to split data frame and ensure file size of exported files.
Task List
styler
package and functions to style files accordingly.devtools::document()
so all.Rd
files in theman
folder and theNAMESPACE
file in the project root are updated appropriatelypkgdown::build_site()
and check that all affected examples are displayed correctly and that all new/updated functions occur on the "Reference" page.NEWS.md
if the changes pertain to a user-facing function (i.e. it has an@export
tag) or documentation aimed at users (rather than developers)NEWS.md
entry should go under the# xportr development version
section. Don't worry about updating the version because it will be auto-updated using thevbump.yaml
CI.