Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for NA_units_ #163

Open
billdenney opened this issue Jul 25, 2018 · 18 comments
Open

Support for NA_units_ #163

billdenney opened this issue Jul 25, 2018 · 18 comments

Comments

@billdenney
Copy link
Contributor

billdenney commented Jul 25, 2018

I just had a bug in my code, but it was hard to track down due to an unclear error.

What I just did was, in essence, the following:

> set_units(NA_real_, NA_character_, mode="standard")
Error: is.language(x) is not TRUE

I thought that I was getting the error because I wasn't using mode="standard", but it really was due to the NA_character_.

I have two thoughts:

  1. If both the x and value options to set_units are NA, could NA be returned? (That could easily be a bad idea for several reasons, and I don't have a strong opinion if the answer is a resounding no.)
  2. Could the Error: is.language(x) is not TRUE error be clarified? I thought I was making it clear that I wasn't sending in a language object when I gave NA_character_, so the error didn't make sense to me. (Admittedly, I wasn't intentionally sending in NA_character_.)

A simple fix for option 2 above could possibly be changing the stopifnot(is.character(x), length(x) == 1) at the top of as_units.character to stopifnot(is.character(x), length(x) == 1, is.na(x)).

@billdenney
Copy link
Contributor Author

As I was fuzzing the interface after the above error, I found that this error could also be clarified:

set_units(NA_real_, NA, mode="standard")
Error in if (any(unclass(value) != 1)) warning(paste("numeric value",  : 
  missing value where TRUE/FALSE needed

@Enchufa2
Copy link
Member

Enchufa2 commented Jul 25, 2018

IMO, a missing value should be valid; a missing unit... has no meaning. So I would like to get an error, but definitively a more informative one. NAs and NULLs are tricky when it comes to if statements. Thanks for spotting it!

@edzer
Copy link
Member

edzer commented Jul 28, 2018

(sorry, committed this in a branch; will soon be merged)

@billdenney
Copy link
Contributor Author

As I'm working with more real data, I'm finding that this is a common issue for me. I will relatively often have a vector where there are instances of combined NA as both the magnitude and the units.

What do you think about defining a new constant, NA_units_ (similar to what is done in lubridate for dates).

In effect, NA_units_ would allow for NA as the value for the unit part while normal units would not allow that. It would require that the magnitude part is NA. All mathematical operations (+, -, *, /, and the % relatives) with NA_units_ would return NA_units_.

@Enchufa2
Copy link
Member

Enchufa2 commented Aug 1, 2018

Could you provide an example? Are there any instances of a non-NA number with a NA unit? Because, otherwise, we could simply support NA in Ops.units by returning NA if any operand is NA.

@billdenney
Copy link
Contributor Author

What I'm working with right now is clinical laboratory data in a clinical study. When a test isn't done for some reason (for example if the patient didn't come in for a scheduled visit), then the value and units for the test are explicitly missing (like d_missing below).

Immediately, I'm just trying to setup the data, and it looks like the following:

d_no_missing <-
  data.frame(Patient_ID=1:3,
             Lab_Name=c("Glucose", "Cholesterol", "Cholesterol"),
             Lab_Value=c(5, 100, 200),
             Lab_Units=c("mmol/L", "mg/dL", "mg/dL"),
             stringsAsFactors=FALSE)
d_missing <- d_no_missing
d_missing[2, c("Lab_Value", "Lab_Units")] <- NA

d_no_missing$Value_with_Units <-
  mixed_units(d_no_missing$Lab_Value,
              d_no_missing$Lab_Units,
              mode="standard")
d_missing$Value_with_Units <-
  mixed_units(d_missing$Lab_Value,
              d_missing$Lab_Units,
              mode="standard")

@Enchufa2
Copy link
Member

Enchufa2 commented Aug 1, 2018

I see. Then one path would be to enable missing units. It requires some rework of the functions that manage the units attribute, but everything else would work seamlessly. Another path would be to return plain NAs and make units work with them. Let's see what @edzer thinks.

@edzer
Copy link
Member

edzer commented Aug 4, 2018

Great thought. I see the convenience for your use case, but I'm not convinced we should require for NA_units_ units that the magnitude is only allowed to be NA. I see numeric values in units computations as actual values with implicitly missing units. If we allow for NA_units_,

> set_units(1, km/h) * 2
2 [km/h]

should actually return 2 [NA].

@Enchufa2
Copy link
Member

Enchufa2 commented Aug 4, 2018

I'm not convinced we should require for NA_units_ units that the magnitude is only allowed to be NA.

Me neither. If NA_units_ is allowed, we should go for the whole pack, and allow this to be just another (missing) unit. But

set_units(1, km/h) * 2 should actually return 2 [NA].

I'm not convinced about this. It is quite a change. Where is the boundary between unitless and missing units? It is not clear to me. I wouldn't change the behaviour based on a fuzzy boundary just to support a corner case. Moreover, unitless is a units concept, but missing is an R concept; so unitless is more general for this topic. Therefore, I would default to unitless (current behaviour) unless NA is explicitly specified.

In other words, for me, 2 (isolated, by default) has no units. Unless I have more information; unless I know that there is some unit, but I don't know which one it is, so I explicitly should have to specify set_units(2, NA).

@edzer
Copy link
Member

edzer commented Aug 4, 2018

I agree that would be quite a change, and I was thinking about making this optional. This

> set_units(1, km/h) * 1
1 [km/h]
> set_units(1, km/h) + 1
Error in Ops.units(set_units(1, km/h), 1) : 
  both operands of the expression should be "units" objects

however annoys me. unitless is a unit; no unit implies a missing unit. Having the option to catch (and break on) the case where plain numbers are elevated to unitless would be good. See here.

@Enchufa2
Copy link
Member

Enchufa2 commented Aug 4, 2018

unitless is a programming construct that is needed to deal with a programming issue: the case of losing units as a result of operations with units that cancel out. So it is not a unit, conceptually; it literally means no units.

Missing implies different things. Missing means that you know, actually, that there is some unit associated with that data, but it is not known or went missing for some reason. It is quite different to me.

@Enchufa2
Copy link
Member

Enchufa2 commented Aug 4, 2018

Oh, and I think that the error from set_units(1, km/h) + 1 should be the same as what you get from set_units(1, km/h) + set_units(1, 1), i.e.: Error: cannot convert 1 into km/h.

@billdenney
Copy link
Contributor Author

billdenney commented Aug 4, 2018

The conversation morphed a bit. I think that there are a few different items here:

  1. Values that have no units (like the number 2 as the example, 2 * 1 km = 2 km).
  2. Values with an unknown magnitude (of NA_real_) and known units, like NA km.
  3. Values with an unknown magnitude and unknown units.

The first is supported in the current library as is, and it should not change in my opinion.

The second is also already supported in the current library, too, and in my opinion, it should not change.

The third is not supported in the current library, and it supports several operations— mainly around the combination of mixed units. The example I gave above is my typical use case. Another example use case is c() with a mixed_units object and NA_real_ (or probably any NA).

I don’t think that it’s sensible to have a non-missing magnitude and an NA unit. For example, set_units(2, NA, mode=“standard”) seems like it should be an error, as it currently is. (There could be a use case that I’m not thinking of, but to me, that is what numeric classes are for.)

@edzer edzer closed this as completed in 8c6bda8 Aug 21, 2018
@Enchufa2
Copy link
Member

Automatically closed by 8c6bda8, which doesn't allow missing units, so I'm reopening this.

@paulponcet
Copy link

I would like to add the following use case to support the implementation of something like NA_unit_. If I read data in a table in a database, the table may (unfortunately, but that is real life) not be well documented, hence I may know the physical meaning of each column and not know the effective unit of each column. In that case, I would like to specify a missing unit for these columns, while waiting for more information from the database holder.

@paulponcet
Copy link

paulponcet commented Mar 21, 2020

Dear package maintainers, do you see any chance that this proposal be accepted sooner or later? I would see this notion of missing unit NA_unit as a valuable feature for this package, and it seems to me that the implementation would not be too difficult.

In my view it should first be clear that:

  • "unitless" and "missing unit" are two different things, as highlighted in some comments above.
  • a vector with a missing unit can actually be unitless.

Now, say we have a vector x (unitless, or with a non-missing unit, or with a missing unit) and a vector y with missing unit. Then:

  • x * y, y * x, x/y, y/x, x^y, y^x should have a missing unit;
  • x + y and x - y should not be allowed, since without a known unit we don't know whether x and y have the same physical dimension.

@Enchufa2
Copy link
Member

This is still open for two main reasons:

  1. The implementation is not so straightforward because udunits doesn't support missing units, so they need to be handled as a separate case in R, before pushing stuff to udunits. So it requires some care and additional testing to avoid breaking current behaviour.
  2. Re-reading the thread, I'd say we didn't reach consensus.

However, it's easier to gather consensus around something that works, so we could use a PR to refine the proposal. :)

@Enchufa2 Enchufa2 changed the title Unclear Error With Invalid Input to set_units Support for NA_units_ Mar 23, 2020
@billdenney
Copy link
Contributor Author

billdenney commented Oct 13, 2021

I'm coming back to this issue as I'm working to develop some additional unit conversion tools for laboratory data. I ahve a thought about implementation. What if we define a unit during package load called "NA_units_" as a new base unit:

While that would not require a specific magnitude, it would be a simple way to accomplish the goal of a unit value that is NA. Would that be an acceptable method? The downside I see to the above is that it would be confusing because there would be a unit named "NA_units_" and a constant defined named "NA_units_", but it would still work as expected, as far as I can see:

The next question would be: How should we handle calls to is.na()? Currently, only the magnitude part appears to be examined. Should we define is.na.units() to look for the units to be NA_units_ in addition to an NA magnitude? I'd suggest that we simply keep using the magnitude check and ignore the units part for is.na().

library(units)
#> udunits database from C:/Users/Bill Denney/Documents/R/win-library/4.1/units/share/udunits/udunits2.xml
install_unit(symbol="NA_units_")
NA_units_ <- set_units(NA_real_, "NA_units_", mode="standard")

set_units(1, "NA_units_")
#> 1 [NA_units_]
set_units(2, NA_units_)
#> 2 [NA_units_]

Created on 2021-10-13 by the reprex package (v2.0.1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants