Closes #268 easier user experience for splitting datasets #272

adchan11 · 2024-08-19T20:40:49Z

Thank you for your Pull Request!

We have developed a Pull Request template to aid you and our reviewers. Completing the below tasks helps to ensure our reviewers can maximize their time on your code as well as making sure the xportr codebase remains robust and consistent.

The scope of `{xportr}`

{xportr}'s scope is to enable R users to write out submission compliant xpt files that can be delivered to a Health Authority or to downstream validation software programs. We see labels, lengths, types, ordering and formats from a dataset specification object (SDTM and ADaM) as being our primary focus. We also see messaging and warnings to users around applying information from the specification file as a primary focus. Please make sure your Pull Request meets this scope of {xportr}. If your Pull Request moves beyond this scope, please get in touch with the {xportr} team on slack or create an issue to discuss.

Please check off each task box as an acknowledgment that you completed the task. This checklist is part of the Github Action workflows and the Pull Request will not be merged into the main branch until you have checked off each task.

Changes Description

Updated xportr_write() with a new helper function and a `max_size_gb` parameter to split a data frame based on a specified maximum file size. Also updated the unit tests to cover splitting the data frame and to check the file size of the exported files.

Task List

github-actions · 2024-08-20T14:02:58Z

Package	Line Rate	Health
xportr	100%	✔
Summary	100% (835 / 836)	✔

adchan11 · 2024-08-20T14:33:49Z

Hi @bms63, I pushed an initial draft of an updated xportr_write function. My unit test in test-write.R is failing the CMD check, but it passes when I run it locally in RStudio, so I was wondering if you could take a look at this during your review? Thanks.

bms63 · 2024-08-20T15:01:13Z

Is there some setting that needs to be set here with the testing large files?? Maybe this shouldn't be enabled for CRAN...not sure if they would allow it??

adchan11 · 2024-08-21T15:19:18Z

Hi @bms63 , I'm not sure about the test_large_files test since I didn't code it and it was already set to not TRUE. In my unit test, I used the adlb dataset from pharmaverseadam and I suspect that is the reason why it's failing the CMD check but I included pharmaverseadam in the DESCRIPTION file and also referenced library(pharmaverseadam) in the testthat.R file.

bms63 · 2024-08-22T22:01:35Z

@adchan11 I'm not sure on this failing windows check.

@averissimo @vedhav @elimillera @EeethB do you all see any issues happening in this check? thanks for taking a peek if you are able!!

bms63 · 2024-08-23T13:02:00Z

amazing!! thanks @vedhav !!

adchan11 · 2024-09-03T14:53:55Z

Thanks @vedhav ! Just wanted to follow up @bms63, if there are any other changes needed, now that all CMD checks passed?

bms63 · 2024-09-03T15:37:59Z

I think we are good - @elimillera can you send this to CRAN? do you want it in main or in this branch?

elimillera · 2024-09-05T11:55:49Z

Sounds good. I think we can push to the main branch, then I'll push out and make the release once it's accepted. I'll also do a couple of checks today to make sure we're all set.

bundfussr · 2024-09-05T12:05:18Z

R/write.R

+#'
+#' @noRd
+
+export_to_xpt <- function(.df, path, max_size_gb, file_prefix) {


If the size of the xpt file doesn't exceed the maximum size, no counter should be appended to the filename.
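The naming rule requested here can be illustrated with a small sketch. `split_file_names()` is a hypothetical helper, not part of xportr, and the scheme of appending the counter directly to the file prefix is an assumption made for illustration.

```r
# Hypothetical helper (not part of xportr): build the output file names for a
# dataset that is written as `n_parts` xpt files.
split_file_names <- function(path, n_parts) {
  stopifnot(is.character(path), length(path) == 1L, n_parts >= 1)
  if (n_parts == 1) {
    # The data fit into a single file, so the name is left untouched.
    return(path)
  }
  prefix <- tools::file_path_sans_ext(path)
  ext <- tools::file_ext(path)
  # Assumed naming scheme: the counter is appended directly to the prefix.
  paste0(prefix, seq_len(n_parts), ".", ext)
}

split_file_names("adlb.xpt", 1) # single file: no counter
split_file_names("adlb.xpt", 3) # three parts: counter appended
```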

The results of the function are correct. However, the performance could be improved. If the file size is smaller than the maximum size, the file is written log(nr_rows) times.

Maybe we could write the complete file first and if it exceeds the maximum size, an estimate for the number of rows is calculated based on the file size and the number of rows. If one of the parts exceeds the maximum size, the estimate is adjusted.
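A rough sketch of the estimate proposed above, in base R. Both helpers are hypothetical and are not the code in this PR; the sketch also assumes decimal gigabytes and a roughly constant number of bytes per row.

```r
# Hypothetical helper (not the xportr implementation): estimate how many rows
# fit into one part, given the size of the complete file.
estimate_rows_per_part <- function(file_size_bytes, n_rows, max_size_gb) {
  max_bytes <- max_size_gb * 1000^3 # assumption: decimal gigabytes
  if (file_size_bytes <= max_bytes) {
    return(n_rows) # the complete file is small enough, no split needed
  }
  bytes_per_row <- file_size_bytes / n_rows
  # Round down so that the estimate errs on the side of smaller parts.
  max(1L, as.integer(floor(max_bytes / bytes_per_row)))
}

# Row index ranges for each part, based on the estimate.
part_ranges <- function(n_rows, rows_per_part) {
  starts <- seq(1, n_rows, by = rows_per_part)
  ends <- pmin(starts + rows_per_part - 1, n_rows)
  data.frame(start = starts, end = ends)
}

# Illustrative numbers only: a 2 GB file with 1000 rows and a 0.5 GB limit.
rows <- estimate_rows_per_part(2e9, n_rows = 1000, max_size_gb = 0.5)
part_ranges(1000, rows)
```

If a written part still exceeds the limit, the estimate would be reduced and that part written again, as suggested in the comment; that adjustment loop is left out of the sketch.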

Thanks, updated.

I'm not sure if we should optimize the code. For testing I used an ADLB dataset which is about 400 MB when written as an xpt file. I wrote the dataset with different values for max_size_gb:

> system.time(xportr_write(adlb, "adlb.xpt"))
   user  system elapsed
 25.730   2.258  27.857
> system.time(xportr_write(adlb, "adlb.xpt", max_size_gb = 1))
Data frame exported to 1 xpt files.
   user  system elapsed
 50.154  22.826  73.497
> system.time(xportr_write(adlb, "adlb.xpt", max_size_gb = 0.3))
Data frame exported to 2 xpt files.
   user  system elapsed
247.439  25.566 275.742

@adchan11 , @bms63 , @rossfarrugia , what do you think?

Less of a concern from my side, as this would most commonly be run with 4, 5, or 10 (so creating fewer splits than in your examples above) and is usually a one-time job at the end that only affects certain large datasets, so people will not be running this often.

I'm not super worried about optimizing either. Hopefully, xpts go away in the next couple of years!

bms63 · 2024-09-06T18:14:53Z

R/write.R

@@ -73,6 +76,7 @@ xportr_write <- function(.df,

  assert_data_frame(.df)
  assert_string(path)
+  checkmate::assert_numeric(max_size_gb, null.ok = TRUE)


Suggested change

-  checkmate::assert_numeric(max_size_gb, null.ok = TRUE)
+  assert_numeric(max_size_gb, null.ok = TRUE)

Thanks, updated

@bms63 For some reason, removing checkmate:: results in errors in the CMD checks. I took a look but not sure why. Do you have any suggestions?

It needs to get placed in this file with its other friends - https://github.com/atorus-research/xportr/blob/main/R/xportr-package.R

Thanks, updated.
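For context, an unqualified `assert_numeric()` call only resolves if the function is imported into the package namespace, which is probably why dropping `checkmate::` broke the checks. A minimal sketch of the kind of roxygen2 tag meant here follows; the exact layout of R/xportr-package.R is an assumption.

```r
# Sketch of an import tag in R/xportr-package.R (the surrounding tags in the
# real file may differ). After re-running roxygen2, NAMESPACE should contain
# importFrom(checkmate,assert_numeric), so the function can be called without
# the checkmate:: prefix.
#' @importFrom checkmate assert_numeric
NULL
```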

bms63 · 2024-09-09T18:29:33Z

@bundfussr are you happy with your requested changes and thanks for being so thorough!!

bundfussr · 2024-09-10T10:50:31Z

I don't know which deprecation strategy you use in xportr. However, in addition to the note in the changelog, I would add a note to the xportr_split() documentation and issue an error if xportr_split() is called, as the functionality is no longer available.

adchan11 · 2024-09-10T19:27:50Z

Thanks, I will work on this over the next few days.

adchan11 · 2024-09-11T18:16:59Z

Hi @bundfussr, I pushed updates to deprecate the function and improve documentation. I took some guidelines on how to deprecate functions using these sources as I haven't done this before:

https://github.com/tidyverse/dplyr/blob/HEAD/R/deprec-lazyeval.R
https://contributions.bioconductor.org/deprecation.html
https://dplyr.tidyverse.org/reference/se-deprecated.html

Hopefully this follows best practices for deprecation but please let me know your feedback. Thanks.

bms63 · 2024-09-11T18:21:18Z

sorry @adchan11 we have a deprecation process in our Wiki https://github.com/atorus-research/xportr/wiki/Deprecation-Process. It is similar to admiral's process

adchan11 · 2024-09-13T18:44:02Z

Thanks @bms63. I've pushed updates as per the deprecation process you've linked. The CMD checks now pass except one regarding validating links. I reviewed it and the links work for me but I'm not sure why they're not rendering properly in the vignettes. It's only the fda.gov links.

FYI: @bundfussr ready for review if you have any additional feedback.

Please note, I will be OOO for the next 2 weeks and back on Sept 30. I will review again on Sept 30 when I'm back if there are any changes I need to push. FYI @rossfarrugia

bms63 · 2024-09-13T19:49:00Z

R/split.R

 xportr_split <- function(.df, split_by = NULL) {
-  attr(.df, "_xportr.split_by_") <- split_by
+  lifecycle::deprecate_warn(
+    when = "0.5.0",
+    what = "xportr_split()",
+    with = "xportr_write()",
+    details = "Please use the argument `max_size_gb` in the function `xportr_write()` instead."
+  )


love this! ty for doing this!

Usually, the deprecated functionality is still available in Phase 1 (deprecation warning). However, here xportr_split() still works as before but in xportr_write() it is ignored. I.e., the functionality is already removed.

I would either go directly to Phase 2 (deprecation error) or keep the old functionality (in addition to the new functionality) in xportr_write(). I assume xportr_split() is used by only a few users (if at all), so I would lean toward the first option.

@bms63 , what do you think?
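To make the two phases concrete, here is a minimal sketch. It uses base R's `.Deprecated()` and `.Defunct()` rather than the lifecycle package used in this PR, so that it runs without extra dependencies; with lifecycle, the Phase 2 equivalent would be `lifecycle::deprecate_stop()`. The function names are hypothetical stand-ins for xportr_split().

```r
# Hypothetical stand-ins for xportr_split(); base R's .Deprecated() and
# .Defunct() are used here instead of the lifecycle package.

# Phase 1: the call warns, but the old behaviour still works.
split_phase1 <- function(.df, split_by = NULL) {
  .Deprecated(new = "xportr_write", old = "xportr_split")
  attr(.df, "_xportr.split_by_") <- split_by
  .df
}

# Phase 2: the call fails and points to the replacement.
split_phase2 <- function(.df, split_by = NULL) {
  .Defunct(new = "xportr_write")
}
```

In Phase 1 existing scripts keep producing the same result while users are told to migrate; in Phase 2 they stop with an error instead of silently losing the split.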

bms63

This looks good to me - we can finish up when you return @adchan11 unless @rossfarrugia you need this now?

The links are a perpetual pain

rossfarrugia · 2024-09-16T06:48:37Z

Should be fine for us, thanks! FYI @millerg23 - If this goes back to CRAN early Oct then we should be able to squeeze in implementing it with our next admiralroche release. Although if @bundfussr agrees his comments are resolved now then we wouldn't need to wait for Adrian and could push ahead anytime.

bms63

@rossfarrugia if needed, I can try and do final review EOW and we can push to CRAN. @elimillera do you have time to review as well?

Honestly, feels like this is in pretty good shape

rossfarrugia · 2024-09-16T11:34:57Z

thanks @bms63 - if no extra review comments needing Adrian's input then yes i'd agree. that'd help us as we do have one team already lined up that will be wanting to use this new functionality.

bundfussr · 2024-09-17T08:20:35Z

Although if @bundfussr agrees his comments are resolved now then we wouldn't need to wait for Adrian and could push ahead anytime.

The new functionality is OK from my side. Just the deprecation of the old functionality deviates from the usual process.

bms63 · 2024-09-27T01:12:53Z

whew!!! two weeks goes fast. this is a perpetual item on my todo list. Still trying to get to it!

rossfarrugia · 2024-09-27T08:10:38Z

@bms63 - @bundfussr has offered to make a commit to make the final updates for his review comments. Hopefully this'll help in pushing the release out and getting it off to CRAN

bundfussr · 2024-09-27T08:42:34Z

@bms63, I have a commit ready for finalizing the deprecation. Could you grant me write access to the repo? Then I can push it.

bundfussr · 2024-09-27T11:07:20Z

@bms63 , should we create a .lycheeignore file to fix the failing links check or should we just ignore it?

bms63 · 2024-09-27T13:07:46Z

Bummer looks like those Phuse sites are dead now...they were very helpful.

first draft

c47bbbd

adchan11 linked an issue Aug 20, 2024 that may be closed by this pull request

Feature Request: xportr_split() enhancement for easier user experience #268

Closed

updates to fix cmd checks

49e242e

Adrian Chan added 2 commits August 20, 2024 14:19

updates

00a5b28

update

dde841b

adchan11 requested a review from bms63 August 20, 2024 14:32

bms63 reviewed Aug 21, 2024

tests/testthat.R (outdated)

removed library

60b3880

chore: fix broken test due to different dir path in windows

eb84a51

vedhav reviewed Aug 23, 2024

R/write.R (outdated)

chore: fix spellcheck ci

e1213d4

rossfarrugia suggested changes Sep 4, 2024

DESCRIPTION

NEWS.md

bundfussr requested changes Sep 5, 2024

updates

8c5a2c7

bms63 reviewed Sep 6, 2024

updates

b9211c8

updates

791d032

updates for deprecated function

7967a51

updates

a18cda9

bms63 reviewed Sep 13, 2024

bms63 approved these changes Sep 13, 2024

bms63 reviewed Sep 16, 2024

bundfussr added 3 commits September 27, 2024 08:27

#268 split: reactivate old split functionality

c57c69a

#268 split: style files

dbce55e

#268 split: update documentation

a15a5fb

bundfussr approved these changes Sep 27, 2024

chore: add .lycheeignore to ignore bad links

3a801e2

bms63 changed the title ~~Closes #268~~ Closes #268 easier user experience for splitting datasets Sep 27, 2024

chore: #268 links again!

d4ba694

bms63 merged commit 5df5c45 into main Sep 27, 2024
14 of 15 checks passed

bms63 deleted the issue_268_split branch September 27, 2024 13:10
