Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allotted length: maximum length of the variable #91

Closed
2 tasks
cpiraux opened this issue Apr 20, 2023 · 14 comments · Fixed by #194 or #213
Closed
2 tasks

Allotted length: maximum length of the variable #91

cpiraux opened this issue Apr 20, 2023 · 14 comments · Fixed by #194 or #213
Assignees

Comments

@cpiraux
Copy link
Collaborator

cpiraux commented Apr 20, 2023

The length for the submitted data should be set to the maximum length of the variable across all datasets per FDA Study Data Technical Conformance Guide.

In the CDISC webinar ‘Define-XML Office Hours’, they also mentioned that the length in metadata (planned length) might be different than the one in dataset (actual length)

image

How xportr_length() could take this into account? See below discsussion

Definition of Done

  • xportr_length() Provides the option to select either metadata length or data length. If the length is set to 'data,' each variable's length will be the maximum length of the data, regardless of the value in the metadata.
  • Include a message in xport_length() to alert the user when the data length is shorter than the length specified in the metadata. This will allow the user to update the metadata length if they wish to reduce the file size.
@bms63
Copy link
Collaborator

bms63 commented Apr 20, 2023

  • Check and review warning around length > 200?

Look in xpt_validate() https://github.com/atorus-research/xportr/blob/cd4da91e9e05e5d846567bdeb6fdad43a8b61fec/R/utils-xportr.R

@kaz462
Copy link
Collaborator

kaz462 commented Apr 27, 2023

How xportr_length() could take this into account?

Maybe add documentation on how the two sources of length (data-driven and spec-driven) is prioritized?

  • Data-driven length: it seems write_xpt/xportr_write will automate the length if width attribute is not applied. I think the automated length is the max length of a character variable - need further confirmation.

  • Spec-driven length: from the define.xml

  • If spec length != data-driven length, the longer length will be applied

Users can either specify the length in define.xml or leave it blank for data-driven length.

@bms63
Copy link
Collaborator

bms63 commented Apr 30, 2023

I like the idea of highlighting this in the documentation for the function. It is pretty minimal.
image

@cpiraux
Copy link
Collaborator Author

cpiraux commented May 2, 2023

I did some testing by creating xpt files and opening them with SAS.

• When attribute ‘width’ is missing or width = NA then the maximum length of the variable is used for length

• When the length in attribute ‘width’ is shorter than the maximum length of the variable, there is a warning message from haven and the maximum length is taken.

When variable have Y and NA, ‘haven’ consider that the length of the variable is 2. Should I add a GitHub issue for this to not consider NA for length?

image
image

Should we keep the warning message from haven or have our own check for this?

• I did a test with a variable containing a value longer than 200 characters and an xpt was created without any issue and it was not truncated in SAS.
image

The maximum length variable should be 200, I suggest adding a check to make sure that length of variable is <= 200

• Per define.xml specification, for SDTM date variable --DTC the length should be missing
image

I suggest keeping attribute ‘width’ missing so the length is handled by haven or to set the length to 19. What do you think?

• I also added a Github issue in Metatools package to suggest a function to check/update metadata taking in consideration the maximum length of variable in data. pharmaverse/metatools#53

@kaz462
Copy link
Collaborator

kaz462 commented May 11, 2023

When variable have Y and NA, ‘haven’ consider that the length of the variable is 2. Should I add a GitHub issue for this to not consider NA for length?

This issue has been fixed: tidyverse/haven#699

It would be great to have the max length warning as @bms63 @cpiraux mentioned:

Check and review warning around length > 200?

The maximum length variable should be 200, I suggest adding a check to make sure that length of variable is <= 200

@cpiraux
Copy link
Collaborator Author

cpiraux commented May 23, 2023

When variable have Y and NA, ‘haven’ consider that the length of the variable is 2. Should I add a GitHub issue for this to not consider NA for length?

This issue has been fixed: tidyverse/haven#699

I have tested this with the latest version of haven, and I no longer encounter the issue with NA

Additional suggestion:
For character variables, the length is defaulted to 200, and for numeric variables, it is defaulted to 8, only if the variable is missing in the metadata. If the length is missing in the metadata, the 'width' attribute is set to NA. I suggest adding in the message the value to which the length is imputed.

Picture1

Additionally, we could consider adding an option to allow users to choose whether or not to impute the length. If the length is not imputed, haven can provide the maximum length of the value in that case.

@siye6
Copy link

siye6 commented Nov 17, 2023

if there are special characters in the value of variable, the length of the variable will exceed 200. would like to echo @bms63 @cpiraux

Check and review warning around length > 200?
The maximum length variable should be 200, I suggest adding a check to make sure that length of variable is <= 200

@cpiraux
Copy link
Collaborator Author

cpiraux commented Nov 17, 2023

if there are special characters in the value of variable, the length of the variable will exceed 200. would like to echo @bms63 @cpiraux

Check and review warning around length > 200?
The maximum length variable should be 200, I suggest adding a check to make sure that length of variable is <= 200

That's a good question. The limit of XPT v5 is 200 characters for the variable, which probably corresponds to 200 bytes. If special characters are used, then we need to take into account the number of bytes per character.

@bms63
Copy link
Collaborator

bms63 commented Nov 17, 2023

@siye6 could you make a reproducible example of this please? We are in active development so can most likely incorporate this into the next release.

@siye6
Copy link

siye6 commented Nov 18, 2023

Hi @bms63 , thanks for asking, sure, please see below simple example, xportr_write() call haven::write_xpt so I just use haven::write_xpt in the example.

text199 <- stringi::stri_dup('x',199)
text200 <- paste0(text199,'‘')

nchar(text200)
nchar(text200, type = 'bytes')

t <- data.frame( test = text200, id = c(1))
haven::write_xpt(t, "t.xpt", version = 5, name = "t")
> nchar(text200)
[1] 200
> nchar(text200, type = 'bytes')
[1] 202
> 
> t <- data.frame( test = text200, id = c(1))
> haven::write_xpt(t, "t.xpt", version = 5, name = "t")

And the screenshot of t.xpt
image

@bms63
Copy link
Collaborator

bms63 commented Nov 30, 2023

From discussion in a PR

I see that the issue #91 is also tagged but there are other elements to consider regarding the length in this issue. We could have another PR for it or do it in this one; I am not sure of the best approach, but I prefer the issue not to be closed until the following points are handled:
There are two types of length:

Metadata length
Data length (maximum value length)
What to do in case these lengths are different:

Data length > metadata length: This will cause the truncation of the data. This case is handled by Haven. The length is the one from the data, and a message is given.
Data length < metadata length: In that case, to reduce the file size, the FDA requests trimming the variable length. In this case, we could give a message so the user can update the metadata length.
Metadata length is missing: For character variables, should we assign the data length or 200? In order to trim the variable, I suggest using the data length. Per define.xml specification, the --DTC variables should have the length missing in define.xml, so this scenario is probable.
Note:
In theory, the metadata length corresponds to the planned length and might be different from the data length. In practice, the metadata and the data length are usually similar.

The FDA requests trimming the variable across datasets. For example, if AVISIT data length = 40 in ADLB and AVISIT length = 30 in ADVS, the AVISIT length should be 40 for all datasets. In this case, the metadata length will be different from the data length. I don’t think we can check this scenario, but maybe we could mention this point in the documentation.

@kaz462
Copy link
Collaborator

kaz462 commented Dec 7, 2023

Metadata length is missing: For character variables, should we assign the data length or 200? In order to trim the variable, I suggest using the data length.

Agree with @cpiraux to use the data length when metadata length is missing, currently it's imputed as 200

@bms63 bms63 reopened this Dec 15, 2023
@cpiraux
Copy link
Collaborator Author

cpiraux commented Dec 18, 2023

I have listed below some possibilities to implement the maximum length in xportr:

  • Provide the option to select either metadata length or data length in xport_length(). If the length is set to 'data,' each variable's length will be the maximum length of the data, regardless of the value in the metadata.
  • Include a message in xport_length() to alert the user when the data length is shorter than the length specified in the metadata. This will allow the user to update the metadata length if they wish to reduce the file size.
  • Create xpt length with maximum length derived from the data regardless of the metadata. Provide a dataframe with the calculated length, allowing the user to update the metadata length accordingly.
  • Implement a separate function to calculate the maximum length, enabling the user to update the metadata. The updated metadata length will be used in xport_length().

@cpiraux
Copy link
Collaborator Author

cpiraux commented Dec 18, 2023

As an example, the process for length that I have already experienced is:

  1. Length from the metadata repository.
  2. Assign this length to the variables in SAS during programming.
  3. At the end of the program or before submission package (all XPTs created in one run), trim the variables to the maximum length.
  4. Automatically update the specifications for define.xml with the maximum length.

@cpiraux cpiraux linked a pull request Jan 10, 2024 that will close this issue
14 tasks
bms63 added a commit that referenced this issue Jan 28, 2024
bms63 added a commit that referenced this issue Jan 29, 2024
Merge remote-tracking branch 'origin/main' into 91-max-length

# Conflicts:
#	R/length.R
#	man/metadata.Rd
#	man/xportr_length.Rd
bms63 added a commit that referenced this issue Jan 29, 2024
bms63 added a commit that referenced this issue Jan 29, 2024
bms63 added a commit that referenced this issue Feb 11, 2024
Closes #91 length attribute from max data length
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants