Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #268 easier user experience for splitting datasets #272

Merged
merged 17 commits into from
Sep 27, 2024
Merged

Conversation

adchan11
Copy link
Collaborator

@adchan11 adchan11 commented Aug 19, 2024

Thank you for your Pull Request!

We have developed a Pull Request template to aid you and our reviewers. Completing the below tasks helps to ensure our reviewers can maximize their time on your code as well as making sure the xportr codebase remains robust and consistent.

The scope of {xportr}

{xportr}'s scope is to enable R users to write out submission compliant xpt files that can be delivered to a Health Authority or to downstream validation software programs. We see labels, lengths, types, ordering and formats from a dataset specification object (SDTM and ADaM) as being our primary focus. We also see messaging and warnings to users around applying information from the specification file as a primary focus. Please make sure your Pull Request meets this scope of {xportr}. If your Pull Request moves beyond this scope, please get in touch with the {xportr} team on slack or create an issue to discuss.

Please check off each task box as an acknowledgment that you completed the task. This checklist is part of the Github Action workflows and the Pull Request will not be merged into the main branch until you have checked off each task.

Changes Description

Updated xportr_write with new function and parameter to split data frame based on specified maximum file size. Also updated unit testing to test ability to split data frame and ensure file size of exported files.

Task List

  • The spirit of xportr is met in your Pull Request
  • Place Closes #<insert_issue_number> into the beginning of your Pull Request Title (Use Edit button in top-right if you need to update)
  • Summary of changes filled out in the above Changes Description. Can be removed or left blank if changes are minor/self-explanatory.
  • Code is formatted according to the tidyverse style guide. Use styler package and functions to style files accordingly.
  • New functions or arguments follow established convention found in the Wiki.
  • Updated relevant unit tests or have written new unit tests. See our Wiki for conventions used in this package.
  • Creation/updated relevant roxygen headers and examples. See our Wiki for conventions used in this package.
  • Run devtools::document() so all .Rd files in the man folder and the NAMESPACE file in the project root are updated appropriately
  • Run pkgdown::build_site() and check that all affected examples are displayed correctly and that all new/updated functions occur on the "Reference" page.
  • Update NEWS.md if the changes pertain to a user-facing function (i.e. it has an @export tag) or documentation aimed at users (rather than developers)
  • The NEWS.md entry should go under the # xportr development version section. Don't worry about updating the version because it will be auto-updated using the vbump.yaml CI.
  • Address any updates needed for vignettes and/or templates.
  • Link the issue Development Panel so that it closes after successful merging.
  • The developer is responsible for fixing merge conflicts not the Reviewer.
  • Pat yourself on the back for a job well done! Much love to your accomplishment!

@adchan11 adchan11 linked an issue Aug 20, 2024 that may be closed by this pull request
Copy link

github-actions bot commented Aug 20, 2024

Code Coverage

Package Line Rate Health
xportr 100%
Summary 100% (835 / 836)

@adchan11 adchan11 requested a review from bms63 August 20, 2024 14:32
@adchan11
Copy link
Collaborator Author

Hi @bms63 , I pushed an intial draft of an updated xportr_write function. My unit test in test-write.R is failing the CMD check but when I run it in my local R studio, it passes so I was wondering if you could take a look into this during your review? Thanks.

@bms63
Copy link
Collaborator

bms63 commented Aug 20, 2024

Is there some setting that needs to be set here with the testing large files?? Maybe this shouldn't be enabled for CRAN...not sure if they would allow it??

image

@adchan11
Copy link
Collaborator Author

Is there some setting that needs to be set here with the testing large files?? Maybe this shouldn't be enabled for CRAN...not sure if they would allow it??

image

Hi @bms63 , I'm not sure about the test_large_files test since I didn't code it and it was already set to not TRUE. In my unit test, I used the adlb dataset from pharmaverseadam and I suspect that is the reason why it's failing the CMD check but I included pharmaverseadam in the DESCRIPTION file and also referenced library(pharmaverseadam) in the testthat.R file.

tests/testthat.R Outdated Show resolved Hide resolved
@bms63
Copy link
Collaborator

bms63 commented Aug 22, 2024

@adchan11 I'm not sure on this failing windows check.

@averissimo @vedhav @elimillera @EeethB do you all see any issues happening in this check? thanks for taking a peek if you are able!!

R/write.R Outdated Show resolved Hide resolved
@bms63
Copy link
Collaborator

bms63 commented Aug 23, 2024

amazing!! thanks @vedhav !!

@adchan11
Copy link
Collaborator Author

adchan11 commented Sep 3, 2024

amazing!! thanks @vedhav !!

Thanks @vedhav ! Just wanted to follow up @bms63, if there are any other changes needed, now that all CMD checks passed?

@bms63
Copy link
Collaborator

bms63 commented Sep 3, 2024

I think we are good - @elimillera can you send this to CRAN? do you want it in main or in this branch?

DESCRIPTION Show resolved Hide resolved
NEWS.md Show resolved Hide resolved
@elimillera
Copy link
Member

I think we are good - @elimillera can you send this to CRAN? do you want it in main or in this branch?

Sounds good. I think we can push to the main branch then I'll push out and make the release once its accepted. I'll also do a couple of checks today to make sure we're all set.

R/write.R Outdated Show resolved Hide resolved
R/write.R Show resolved Hide resolved
R/write.R Show resolved Hide resolved
#'
#' @noRd

export_to_xpt <- function(.df, path, max_size_gb, file_prefix) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the size of the xpt file doesn't exceed the maximum size, no counter should be appended to the filename.

The results of the function are correct. However, the performance could be improved. If the file size is smaller than the maximum size, the file is written log(nr_rows) times.

Maybe we could write the complete file first and if it exceeds the maximum size, an estimate for the number of rows is calculated based on the file size and the number of rows. If one of the parts exceeds the maximum size, the estimate is adjusted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we should optimize the code. For testing I used an ADLB dataset which written as xpt file has 400MB. I wrote the dataset with different values for max_size_gb:

> system.time(xportr_write(adlb, "adlb.xpt"))
   user  system elapsed 
 25.730   2.258  27.857 
> system.time(xportr_write(adlb, "adlb.xpt", max_size_gb = 1))
Data frame exported to 1 xpt files.
   user  system elapsed 
 50.154  22.826  73.497 
> system.time(xportr_write(adlb, "adlb.xpt", max_size_gb = 0.3))
Data frame exported to 2 xpt files.
   user  system elapsed 
247.439  25.566 275.742 

@adchan11 , @bms63 , @rossfarrugia , what do you think?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Less of a concern from my side as most commonly would be ran with 4, 5, or 10 (so creating less splits than your examples above) and usually just a one time job at the end only impacting certain large datasets, so not like people will be running this often.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super worried about optimizing either. Hopefully, xpts go away in the next couple of years!

R/write.R Outdated
@@ -73,6 +76,7 @@ xportr_write <- function(.df,

assert_data_frame(.df)
assert_string(path)
checkmate::assert_numeric(max_size_gb, null.ok = TRUE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
checkmate::assert_numeric(max_size_gb, null.ok = TRUE)
assert_numeric(max_size_gb, null.ok = TRUE)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bms63 For some reason, removing checkmate:: results in errors in the CMD checks. I took a look but not sure why. Do you have any suggestions?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to get placed in this file with its other friends - https://github.com/atorus-research/xportr/blob/main/R/xportr-package.R

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated.

@bms63
Copy link
Collaborator

bms63 commented Sep 9, 2024

@bundfussr are you happy with your requested changes and thanks for being so thorough!!

@bundfussr
Copy link
Collaborator

bundfussr commented Sep 10, 2024

@bundfussr are you happy with your requested changes and thanks for being so thorough!!

I don't know which deprecation strategy you use in xportr. However, in addition to the note in the Changelog I would add a note to the xportr_split() documentation and issue an error if xport_split() is called as the functionality is no longer available.

@adchan11
Copy link
Collaborator Author

@bundfussr are you happy with your requested changes and thanks for being so thorough!!

I don't know which deprecation strategy you use in xportr. However, in addition to the note in the Changelog I would add a note to the xportr_split() documentation and issue an error if xport_split() is called as the functionality is no longer available.

Thanks, I will work on this over the next few days.

@adchan11
Copy link
Collaborator Author

@bundfussr are you happy with your requested changes and thanks for being so thorough!!

I don't know which deprecation strategy you use in xportr. However, in addition to the note in the Changelog I would add a note to the xportr_split() documentation and issue an error if xport_split() is called as the functionality is no longer available.

Hi @bundfussr, I pushed updates to deprecate the function and improve documentation. I took some guidelines on how to deprecate functions using these sources as I haven't done this before:

https://github.com/tidyverse/dplyr/blob/HEAD/R/deprec-lazyeval.R
https://contributions.bioconductor.org/deprecation.html
https://dplyr.tidyverse.org/reference/se-deprecated.html

Hopefully this follows best practices for deprecation but please let me know your feedback. Thanks.

@bms63
Copy link
Collaborator

bms63 commented Sep 11, 2024

@bundfussr are you happy with your requested changes and thanks for being so thorough!!

I don't know which deprecation strategy you use in xportr. However, in addition to the note in the Changelog I would add a note to the xportr_split() documentation and issue an error if xport_split() is called as the functionality is no longer available.

Hi @bundfussr, I pushed updates to deprecate the function and improve documentation. I took some guidelines on how to deprecate functions using these sources as I haven't done this before:

https://github.com/tidyverse/dplyr/blob/HEAD/R/deprec-lazyeval.R https://contributions.bioconductor.org/deprecation.html https://dplyr.tidyverse.org/reference/se-deprecated.html

Hopefully this follows best practices for deprecation but please let me know your feedback. Thanks.

sorry @adchan11 we have a deprecation process in our Wiki https://github.com/atorus-research/xportr/wiki/Deprecation-Process. It is similar to admiral's process

@adchan11
Copy link
Collaborator Author

@bundfussr are you happy with your requested changes and thanks for being so thorough!!

I don't know which deprecation strategy you use in xportr. However, in addition to the note in the Changelog I would add a note to the xportr_split() documentation and issue an error if xport_split() is called as the functionality is no longer available.

Hi @bundfussr, I pushed updates to deprecate the function and improve documentation. I took some guidelines on how to deprecate functions using these sources as I haven't done this before:
https://github.com/tidyverse/dplyr/blob/HEAD/R/deprec-lazyeval.R https://contributions.bioconductor.org/deprecation.html https://dplyr.tidyverse.org/reference/se-deprecated.html
Hopefully this follows best practices for deprecation but please let me know your feedback. Thanks.

sorry @adchan11 we have a deprecation process in our Wiki https://github.com/atorus-research/xportr/wiki/Deprecation-Process. It is similar to admiral's process

Thanks @bms63. I've pushed updates as per the deprecation process you've linked. The CMD checks now pass except one regarding validating links. I reviewed it and the links work for me but I'm not sure why they're not rendering properly in the vignettes. It's only the fda.gov links.

FYI: @bundfussr ready for review if you have any additional feedback.

Please note, I will be OOO for the next 2 weeks and back on Sept 30. I will review again on Sept 30 when I'm back if there are any changes I need to push. FYI @rossfarrugia

Comment on lines 37 to +44
xportr_split <- function(.df, split_by = NULL) {
attr(.df, "_xportr.split_by_") <- split_by
lifecycle::deprecate_warn(
when = "0.5.0",
what = "xportr_split()",
with = "xportr_write()",
details = "Please use the argument `max_gb_size` in the
function xportr_write() instead` instead."
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love this! ty for doing this!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, the deprecated functionality is still available in Phase 1 (deprecation warning). However, here xportr_split() still works as before but in xportr_write() it is ignored. I.e., the functionality is already removed.

I would either go directly to Phase 2 (deprecation error) or keep the old functionality (in addition to the new functionality) in xport_write(). I assume xportr_split() is used only by a few users (if at all). Thus I would tend to the first option.

@bms63 , what do you think?

Copy link
Collaborator

@bms63 bms63 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me - we can finish up when you return @adchan11 unless @rossfarrugia you need this now?

The links are a perpetual pain

@rossfarrugia
Copy link

This looks good to me - we can finish up when you return @adchan11 unless @rossfarrugia you need this now?

Should be fine for us, thanks! FYI @millerg23 - If this goes back to CRAN early Oct then we should be able to squeeze in implementing it with our next admiralroche release. Although if @bundfussr agrees his comments are resolved now then we wouldn't need to wait for Adrian and could push ahead anytime.

Copy link
Collaborator

@bms63 bms63 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rossfarrugia if needed, I can try and do final review EOW and we can push to CRAN. @elimillera do you have time to review as well?

Honestly, feels like this is in pretty good shape

@rossfarrugia
Copy link

thanks @bms63 - if no extra review comments needing Adrian's input then yes i'd agree. that'd help us as we do have one team already lined up that will be wanting to use this new functionality.

@bundfussr
Copy link
Collaborator

Although if @bundfussr agrees his comments are resolved now then we wouldn't need to wait for Adrian and could push ahead anytime.

The new functionality is OK from my side. Just the deprecation of the old functionality deviates from the usual process.

@bms63
Copy link
Collaborator

bms63 commented Sep 27, 2024

whew!!! two weeks goes fast. this is a perpetual item on my todo list. Still trying to get to it!

@rossfarrugia
Copy link

@bms63 - @bundfussr has offered to make a commit to make the final updates for his review comments. Hopefully this'll help in pushing the release out and getting it off to CRAN

@bundfussr
Copy link
Collaborator

@bms63 , I have a commit ready for finalizing the deprecation. Could you grant we write access to the repo? Then I can push it.

@bundfussr
Copy link
Collaborator

@bms63 , should we create a .lycheeignore file to fix the failing links check or should we just ignore it?

@bms63 bms63 changed the title Closes #268 Closes #268 easier user experience for splitting datasets Sep 27, 2024
@bms63
Copy link
Collaborator

bms63 commented Sep 27, 2024

Bummer looks like those Phuse sites are dead now...they were very helpful.

@bms63 bms63 merged commit 5df5c45 into main Sep 27, 2024
14 of 15 checks passed
@bms63 bms63 deleted the issue_268_split branch September 27, 2024 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature Request: xportr_split() enhancement for easier user experience
6 participants