Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deprecate install_conversion_constant #89

Closed
edzer opened this issue Jan 18, 2018 · 22 comments
Closed

deprecate install_conversion_constant #89

edzer opened this issue Jan 18, 2018 · 22 comments

Comments

@edzer
Copy link
Member

edzer commented Jan 18, 2018

It is clear from #71, #84, and my experiments in the udunits branch that user-defined units are better handled by the udunits2 C library than by R. I removed the R layer of installing user-defined units; this removes the flexibility to define any R function for conversion, which has the advantage that we don't have to verify that functions are meaningful (monotonous), which we didn't. Also, the udunits2 C api does give, I think, all meaningful operations, which we can now directly interface.

I propose to drop install_conversion_constant in favor of install_conversion_scale, which complements install_conversion_offset. The problem with _constant is that it doesn't make clear whether the constant refers to a scaling or an addition of a constant; _scale and _offset are used in the C API.

I also propose the new semantics:

install_conversion_scale("m", "thousandm", 1000)

creates thousandm where X thousandm equal 1000 X m. This is exactly opposite to what install_conversion_constant now does. Also, thousandm should not already be bound to a unit, and m should be an existing unit.

Similarly:

install_conversion_offset("K", "my_temp_C", 273.15)

creates the new my_temp_C as values in K + 273.15.

The function profile is (from,to,value), and returns to=from*value or to=from+value.

@Enchufa2
Copy link
Member

Are you planning to drop the udunits2 dependency in the next release?

@edzer
Copy link
Member Author

edzer commented Jan 18, 2018

If you and @t-kalinowski both agree this is a good idea, yes.

@Enchufa2
Copy link
Member

Completely agree. I would like to redesign the C/C++ part, but I may not have time depending on your roadmap for the next release. Anyway, as it is, it works, so the redesign can wait.

@edzer
Copy link
Member Author

edzer commented Jan 18, 2018

Agreed; something possibly more urgent now: I managed to get memory leaks to zero, but still see, when units (udunits branch) is loaded:

edzer@gin-pebesma:~/git$ R --no-restore --no-save -q -d valgrind
==24394== Memcheck, a memory error detector
==24394== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==24394== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==24394== Command: /usr/lib/R/bin/exec/R --no-restore --no-save -q
==24394== 
> library(units)

Attaching package: ‘units’

==24394== Conditional jump or move depends on uninitialised value(s)
==24394==    at 0x40300A9: ???
==24394==    by 0x142FE99F: ???
==24394==    by 0x142FE99F: ???
==24394==    by 0x142FE9A2: ???
==24394==    by 0xFFEFF8FCF: ???
==24394==  Uninitialised value was created by a heap allocation
==24394==    at 0x4C2DB2F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24394==    by 0x4F7D2E3: ??? (in /usr/lib/R/lib/libR.so)
==24394==    by 0x4F7ECC1: Rf_allocVector3 (in /usr/lib/R/lib/libR.so)
==24394==    by 0x4F2A73A: Rf_mkCharLenCE (in /usr/lib/R/lib/libR.so)
==24394==    by 0x4F808C7: Rf_install (in /usr/lib/R/lib/libR.so)
==24394==    by 0x4F81004: ??? (in /usr/lib/R/lib/libR.so)
==24394==    by 0x4F73167: setup_Rmainloop (in /usr/lib/R/lib/libR.so)
==24394==    by 0x4F74E28: Rf_mainloop (in /usr/lib/R/lib/libR.so)
==24394==    by 0x10887A: main (in /usr/lib/R/bin/exec/R)
==24394== 
The following object is masked from ‘package:base’:

    %*%

and have no clue what to do to find out what is causing this.

@Enchufa2
Copy link
Member

Interesting. Taking a look.

@edzer
Copy link
Member Author

edzer commented Jan 18, 2018

ac7bce0 simplifies the XPtr structure; neither that, nor removing .onLoad resolve the valgrind error.

@Enchufa2
Copy link
Member

I'd say this is an R problem, because I see the same report with the master branch, where there is no compiled code.

@edzer
Copy link
Member Author

edzer commented Jan 19, 2018

All CRAN versions before 0.5-0 ("overhaul"), or any other packages I'm involved in don't have it.

@t-kalinowski
Copy link
Contributor

t-kalinowski commented Jan 19, 2018

I'll try to post a longer response later this weekend.
My first thought is that it's related to the top-level environment that was introduced
here

edzer added a commit that referenced this issue Jan 19, 2018
@Enchufa2
Copy link
Member

The errors package reports the same on load. The thing in common is the matmult operator: I've removed it, reinstalled, and the conditional jump disappears.

@Enchufa2
Copy link
Member

@edzer Oh, you already found it. :) I'd say this confirms that this is an R problem. Is suppressing support for matrix multiplication the best approach?

@edzer
Copy link
Member Author

edzer commented Jan 20, 2018

I guess it was a false positive that would have gone away when I had built R instrumented with valgrind (see here), since CRAN results also doesn't list it.

I uncommented %*% for now but we can add it back later. I'm not fond of overwriting base functions and of the warning messages it generates, an alternative could be to introduce %u*% or so; I also don't know whether the way we now have a single unit for a complete matrix is what we want - thinking of covariance matrices.

@edzer
Copy link
Member Author

edzer commented Jan 20, 2018

Confirmed: when I build R after ./configure --with-valgrind-instrumentation=2 then it doesn't give the valgrind error we see above.

edzer added a commit that referenced this issue Jan 20, 2018
@t-kalinowski
Copy link
Contributor

Hi @edzer,

Glad the memory leak was diagnosed (and fixed? or it never really existed?).

Responding to the original issue, I really like the idea of removing install_conversion_constant() (and, i'm guessing install_conversion_function()?. However, I don't yet have an opinion about what the best user facing API to replace those functions should be.

I downloaded the udunits branch and I played around with it. I'm excited about the functionality but ambivalent about the interface. I suspect that rather than making a new family of functions install_conversion_{scale, offset, (and in time:..invert,raise,log}, feels both cluttered (lots of functions) and limiting because it doesn't allow for any composition (e.g, if you want to apply a scale and an offset, you need an intermediate "tmp" unit). Ultimately, I think that the cleanest interface might be to roll this all into install_symbolic_unit(), and accept a compact lambda syntax for defining units. I'm imagining something like

install_symbolic_unit(Kelvin = degree_C - 273.15)
install_symbolic_unit(Newton = m*kg / s^2)

With the example above, there are still quite a few details to work out, but I'm hoping for something that simple.

In order to get started on this, what do you think about making a set of functions internal to package:units that are thin wrappers around the udunits C API (with the naming convetion from package:udunits maintained). Scanning through the API these jump out as being useful primitives:

ud.new.dimensionless.unit()
ud.new.base.unit() # i'm guessing this will help solve the ud.are.convertable("cell", "rad") problem?

I think having this set of R functions that accept and return pointers to a ut_unit functions would be powerful.

ud.scale()
ud.offset()
ud.invert()
ud.raise()
ud.root()
ud.log()

and a set of functions to go from pointer <--> character

ud.parse()
ud.format()  # also, this looks like it'll be nice to replace 
# as.character.units(), allowing to select output in unit symbols or names
ud.map.unit.to.name() or ud.map.name.to.unit() # I don't get the difference

With this set of primitives, the examples above might be something like this:

install_symbolic_unit(Kelvin = degree_C - 273.15) # would in essence do: 
ud.parse("degree_C @ 273.15") %>% ud.map.unit.to.name("Kelvin")

install_symbolic_unit(Newton = m*kg / s^2)  # would in essence do: 
ud.parse("m*kg / s^2") %>% ud.map.unit.to.name("Newton")

@t-kalinowski
Copy link
Contributor

Actually, I wonder if we even need the full set of primitives I listed above. We could probably just get by with:

ud.parse(chr) # returns a pointer to a ut_unit
ud.map.unit.to.name(ut_unit_ptr, chr) 
ud.new.base.unit(chr)
ud.new.dimensionless.unit(chr)

ud.parse should allow us to offload all potential uses of the ud.scale, ud.offset, ud.invert... to the C library.

@t-kalinowski
Copy link
Contributor

so, I think this is much simpler than I originally though if we depend on ut_parse()
I now get:

> ptr <- R_ut_parse("degC @ 274")
> R_ut_map_name_to_unit("mykelvin", ptr)
<pointer: 0x4991420>
> 
> set_units(5, mykelvin) %>% set_units(degC)
279 °C
> 
> 
> ptr <- R_ut_parse("m*kg / s^2")
> R_ut_map_name_to_unit("myNewton", ptr)
<pointer: 0x5594030>
> set_units(5, myNewton) + set_units(5, Newton)
10 N

my corresponding cpp function is very simple:

// [[Rcpp::export]]
XPtrUT R_ut_map_name_to_unit( CharacterVector name, SEXP inunit) {
  ut_unit *unit = ut_unwrap(inunit);
  if (ut_map_name_to_unit(name[0], enc, unit) != UT_SUCCESS)
    handle_error("R_ut_add_scale");
  return ut_wrap(unit);
}

@t-kalinowski
Copy link
Contributor

t-kalinowski commented Jan 21, 2018

Here is a quick mockup of a potential implementation and interface with this approach.
Basic and most common usage examples:

install_symbolic_unit("cell") # new base unit by default
install_symbolic_unit("CFU", "cell * 4")

I'm hoping that any valid syntax accepted by the C library will work and we can just reproduce this table in the documentation: https://www.unidata.ucar.edu/software/udunits/udunits-2.0.4/udunits2lib.html#Examples
e.g., install_symbolic_unit("CFU", "4 cell")

# more complex examples with name and symbol aliases
install_symbolic_unit(name = c("cell", "cells"))
install_symbolic_unit(name = "colony_forming_unit",
                      def = "4 cells",
                      symbol = c("CFU", "cfu"))


# more complex unit definition
install_symbolic_unit("my_degC")
install_symbolic_unit("my_degF", "(my_degC * 9/5) + 32", "myF")

implementation mockup:

is_nullable_scalar_character <- function(x) {
    is.null(x) || is.character(x) && identical(length(x), 1L)
}
is_nullable_character <- function(x) {
    is.null(x) || is.character(x)
}


install_symbolic_unit(symbol = NULL, def = NULL, name = NULL) {

  stopifnot(is_nullable_scalar_character(def),
            is_nullable_character(symbol),
            is_nullable_character(name))

  if(is.null(def)) {
    ut_unit <- R_ut_new_base_unit()
  } else if (identical(def, "unitless")) {
    ut_unit <- R_ut_new_unitless_unit()
  } else {
    # convert offsets to valid udunits syntax that uses '@'
    def <- gsub("+", "@", def, fixed = TRUE) 
    def <- gsub("([^+*/-]\\s*)(-)", "\\1@ -", def, fixed = TRUE) 
    # this is just need to be more robust
    # perhaps some more processing for log() -> lg()/ln()/lb()
    stopifnot(R_ut_is_parseable(def))
    ut_unit <- R_ut_parse(def)
  }

  if(is.null(name) && is.null(symbol))
    stop("name or symbol must be specified")

  for (nm in seq_along(name))
    R_ut_map_unit_to_name(ut_unit, nm)

  for (sym in seq_along(symbol))
    R_ut_map_unit_to_symbol(ut_unit, sym)

  invisible(TRUE)
}

Supporting both symbols and names will enable supporting a more powerful
as.character.units() and format.units() powered by R_ut_format

@edzer
Copy link
Member Author

edzer commented Jan 21, 2018

@t-kalinowski yes, the memory "error" was a valgrind false positive.

I had only started with simple examples of what you can do with the udunits C pointer, but it is indeed pretty powerful. Some ideas:

  • have a shallow Rcpp interface, and have all the functions ut_someting interfaced by an R_ut_something with corresponding function parameter type and order. At the C++ level, the functions then need to do the necessary error handling.
  • have a single constructor function for creating symbolic_units objects that adds the external pointer to the ut_unit object, so we never have to reparse.
  • have a print method that adds (optionally?) the R_ut_format output, or prints only that format, for verification.
  • have a larger test suite that verifies that our R representation and the udunits internal representation are semantically identical.

An option would be to get away with the R representation altogether, but see #86.

I have been thinking about R_ut_new_base_unit for new units we don't want to be convertable with unitless (like g/g?), but it feels a bit like a hack. Maybe a good one?

@t-kalinowski
Copy link
Contributor

t-kalinowski commented Jan 21, 2018

All good idea, although I'm not sure the 2nd bullet point is worth it since parsing is quick enough, and implementing a cache is often more headache than it's worth (see frequently cited quote).

Just a quick response to the last point: as I've been mulling over on #91, I think the cleanest way to deal with units that are ratios composed of the same base unit, is to require users to create custom unique units that will inhibit unwanted unit simplification or conversion. for example:

install_symbolic_unit("g_arsenic")
install_symbolic_unit("kg_soil")
make_units(g_arsenic / kg_soil)

@edzer
Copy link
Member Author

edzer commented Jan 21, 2018

I think it is not acceptable when units does not properly cope with g/kg, mm/m etc. units 0.4-6 gives

> library(units)
> set_units(1:10, g/kg)
Units: 1
 [1] 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.010

which is the best we could do given that symbolic_units doesn't carry a multiplying constant.

@t-kalinowski
Copy link
Contributor

I find 0.4-6 behavior dangerous, as it can lead to unexpected and wrong behavior. Consider this use case: user has mass of arsenic and mass of soil and wants to calculate a concentration:

soil_mass <- set_units(1, kg)
arsenic_mass <- set_units(10, ug)
(concentration <- soil_mass / arsenic_mass)
# 1e+08 1

The dangerous part is that any subsequent calculations with concentration will be meaningless and probably wrong.
Consider the simple case of converting to another compatible unit
ug_arsenic / kg soil -> g arsenic / kg soil

set_units(concentration, g/kg) # user intending grams Arsenic / kg soil
# 1e+11 (g/kg)

Or, consider using concentration to calculate the total arsenic mass in a different mass of soil

arsenic_mass <- concentration * set_units(5, kg)
set_units(arsenic_mass, ug) # should be 50 ug Arsenic
# 5e+17 ug

The only way around this issue is to prevent unit simplification. However, there are many cases where you do want the units to simplify. Even in the example above, multiplying concentration * set_units(5, kg), you want kg to cancel out the denominator (kg soil), but not the numerator (As mass).
Unfortunately, without introducing new base units, there is no way for the package to infer when the user want units simplified, and when not.
Also, without introducing base units, there is lots of mental tracking requiring of the user to remember which mass unit corresponds to which compound.

The more I think about it, the more I think the way to solve this problem is through the example I posted above, using install_symbolic_unit()

@edzer
Copy link
Member Author

edzer commented Jan 21, 2018

What you want is a new feature, what I want is to remove a bug introduced in 0.5-0. Let's do that first.

edzer added a commit that referenced this issue Jan 22, 2018
t-kalinowski pushed a commit to t-kalinowski/units that referenced this issue Jan 23, 2018
With this implementation, the symbolic units are preserved so they can be printed correctly, and assigning a unit like "mg/kg" no longer changes the numeric value of the target vector
@edzer edzer closed this as completed in 6a30598 Dec 15, 2020
edzer added a commit that referenced this issue Dec 15, 2020
close #89: new install_unit and remove_unit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants