
Commit

Merge branch 'rc_datawizard_0.13.0' of https://github.com/easystats/datawizard into rc_datawizard_0.13.0
etiennebacher committed Oct 3, 2024
2 parents f090bde + be756b6 commit 2fdf02e
Showing 9 changed files with 176 additions and 46 deletions.
10 changes: 10 additions & 0 deletions NEWS.md
@@ -9,12 +9,22 @@ BREAKING CHANGES

* Removed deprecated arguments `group` and `na.rm` in multiple functions. Use `by` and `remove_na` instead (#546).

* The default value for the argument `dummy_factors` in `to_numeric()` has
changed from `TRUE` to `FALSE` (#544).

CHANGES

* The `pattern` argument in `data_rename()` can also be a named vector. In this
case, names are used as values for the `replacement` argument (i.e. `pattern`
can be a character vector using `<new name> = "<old name>"`).

* `categorize()` gains a new `breaks` argument, to decide whether breaks are
inclusive or exclusive (#548).

* The `labels` argument in `categorize()` gets two new options, `"range"` and
`"observed"`, to use the range of categorized values as labels (i.e. factor
levels) (#548).

* Minor additions to `reshape_ci()` to work with forthcoming changes in the
`{bayestestR}` package.
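
The named-vector form of `pattern` maps new names to old ones. The following is a minimal base-R sketch of that mapping only; it mimics the `<new name> = "<old name>"` convention and is not `data_rename()`'s implementation. The selected columns and the new names are made up for illustration.

```r
# Named vector in the form <new name> = "<old name>" (names are illustrative)
pattern <- c(miles_per_gallon = "mpg", cylinders = "cyl")

# The names play the role of `replacement`; the elements are the old names
replacement <- names(pattern)

df <- head(mtcars[c("mpg", "cyl", "hp")])
names(df)[match(pattern, names(df))] <- replacement

names(df)
```

Going by the NEWS entry, passing the same named vector as `pattern` to `data_rename()` should give the same column names, without a separate `replacement` argument.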

69 changes: 52 additions & 17 deletions R/categorize.R
@@ -31,10 +31,18 @@
#' for numeric variables, the minimum of the original input is preserved. For
#' factors, the default minimum is `1`. For `split = "equal_range"`, the
#' default minimum is always `1`, unless specified otherwise in `lowest`.
#' @param breaks Character, indicating whether breaks for categorizing data are
#' `"inclusive"` (values indicate the _upper_ bound of the _previous_ group or
#' interval) or `"exclusive"` (values indicate the _lower_ bound of the _next_
#' group or interval to begin). Use `labels = "range"` to make this behaviour
#' easier to see.
#' @param labels Character vector of value labels. If not `NULL`, `categorize()`
#' will return factors instead of numeric variables, with `labels` used
#' for labelling the factor levels. Can also be `"mean"` or `"median"` for a
#' factor with labels as the mean/median of each groups.
#' for labelling the factor levels. Can also be `"mean"`, `"median"`,
#' `"range"` or `"observed"` for a factor with labels as the mean/median,
#' the requested range (even if not all values of that range are present in
#' the data) or observed range (range of the actual recoded values) of each
#' group. See 'Examples'.
#' @param append Logical or string. If `TRUE`, recoded or converted variables
#' get new column names and are appended (column bind) to `x`, thus returning
#' both the original and the recoded variables. The new columns get a suffix,
@@ -53,7 +61,7 @@
#'
#' # Splits and breaks (cut-off values)
#'
#' Breaks are in general _exclusive_, this means that these values indicate
#' Breaks are by default _exclusive_; this means that these values indicate
#' the lower bound of the next group or interval to begin. Take a simple
#' example, a numeric variable with values from 1 to 9. The median would be 5,
#' thus the first interval ranges from 1-4 and is recoded into 1, while 5-9
@@ -63,6 +71,9 @@
#' from 1 to 3 belong to the first interval and are recoded into 1 (because
#' the next interval starts at 3.67), 4 to 6 into 2 and 7 to 9 into 3.
#'
#' The opposite behaviour can be achieved using `breaks = "inclusive"`, in which
#' case the break values indicate the upper bound of the previous group or
#' interval.
#'
#' # Recoding into groups with equal size or range
#'
#' `split = "equal_length"` and `split = "equal_range"` try to divide the
@@ -119,6 +130,13 @@
#' x <- sample(1:10, size = 30, replace = TRUE)
#' categorize(x, "equal_length", n_groups = 3, labels = "mean")
#' categorize(x, "equal_length", n_groups = 3, labels = "median")
#'
#' # cut numeric into groups with the requested range as a label name
#' # each category has the same range, and labels indicate this range
#' categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "range")
#' # in this example, each category has the same range, but labels only refer
#' # to the ranges of the actual values (present in the data) inside each group
#' categorize(mtcars$mpg, "equal_length", n_groups = 5, labels = "observed")
#' @export
categorize <- function(x, ...) {
UseMethod("categorize")
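
The difference between `labels = "range"` and `labels = "observed"` can be sketched with base R alone. This is an illustration under assumptions: the data and break points are invented, and the label-building lines are copied from the `.original_x_to_factor()` changes in this commit rather than calling `categorize()` itself.

```r
# Invented data with gaps, so requested and observed ranges differ
x <- c(1, 2, 5, 6, 9)
groups <- cut(x, breaks = c(1, 4, 7, 9), include.lowest = TRUE, right = FALSE)

# "range": the interval labels that cut() produces for the requested breaks
range_labels <- levels(groups)

# "observed": minimum and maximum of the values actually present per group
temp <- stats::aggregate(x, list(groups), FUN = range, na.rm = TRUE)$x
observed_labels <- apply(
  temp, 1,
  function(i) paste0("(", paste(as.vector(i), collapse = "-"), ")")
)

range_labels     # "[1,4)" "[4,7)" "[7,9]"
observed_labels  # "(1-2)" "(5-6)" "(9-9)"
```

The first group is requested as `[1,4)` but only holds the values 1 and 2, which is why the observed label is narrower than the range label.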
@@ -142,6 +160,7 @@ categorize.numeric <- function(x,
n_groups = NULL,
range = NULL,
lowest = 1,
breaks = "exclusive",
labels = NULL,
verbose = TRUE,
...) {
@@ -152,6 +171,9 @@ categorize.numeric <- function(x,
if (identical(split, "equal_length")) split <- "length"
if (identical(split, "equal_range")) split <- "range"

# check for valid values
breaks <- match.arg(breaks, c("exclusive", "inclusive"))

# save
original_x <- x

@@ -169,9 +191,9 @@ categorize.numeric <- function(x,
}

if (is.numeric(split)) {
breaks <- split
category_splits <- split
} else {
breaks <- switch(split,
category_splits <- switch(split,
median = stats::median(x),
mean = mean(x),
length = n_groups,
@@ -182,15 +204,18 @@ categorize.numeric <- function(x,
}

# complete ranges, including minimum and maximum
if (!identical(split, "length")) breaks <- unique(c(min(x), breaks, max(x)))
if (!identical(split, "length")) {
category_splits <- unique(c(min(x), category_splits, max(x)))
}

# recode into groups
out <- droplevels(cut(
x,
breaks = breaks,
breaks = category_splits,
include.lowest = TRUE,
right = FALSE
right = identical(breaks, "inclusive")
))
cut_result <- out
levels(out) <- 1:nlevels(out)

# fix lowest value, add back into original vector
@@ -201,7 +226,7 @@ categorize.numeric <- function(x,
original_x[!is.na(original_x)] <- out

# turn into factor?
.original_x_to_factor(original_x, x, labels, out, verbose, ...)
.original_x_to_factor(original_x, x, cut_result, labels, out, verbose, ...)
}
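
The new `breaks` argument is passed through to the `right` argument of `cut()`. A minimal base-R sketch of the two behaviours, using the 1 to 9 example from the documentation with a median split, might look like this. It reproduces only the `cut()` call and none of the surrounding `categorize()` logic.

```r
x <- 1:9

# Same completion step as in categorize.numeric(): minimum, split point, maximum
category_splits <- unique(c(min(x), stats::median(x), max(x)))  # 1, 5, 9

# breaks = "exclusive" corresponds to right = FALSE: 5 opens the second group
exclusive <- cut(x, breaks = category_splits, include.lowest = TRUE, right = FALSE)

# breaks = "inclusive" corresponds to right = TRUE: 5 closes the first group
inclusive <- cut(x, breaks = category_splits, include.lowest = TRUE, right = TRUE)

table(exclusive)  # [1,5): 4 values, [5,9]: 5 values
table(inclusive)  # [1,5]: 5 values, (5,9]: 4 values
```

With the exclusive default the value 5 is recoded into the second group; with inclusive breaks it stays in the first.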


@@ -223,6 +248,7 @@ categorize.data.frame <- function(x,
n_groups = NULL,
range = NULL,
lowest = 1,
breaks = "exclusive",
labels = NULL,
append = FALSE,
ignore_case = FALSE,
@@ -260,6 +286,7 @@ categorize.data.frame <- function(x,
n_groups = n_groups,
range = range,
lowest = lowest,
breaks = breaks,
labels = labels,
verbose = verbose,
...
@@ -276,6 +303,7 @@ categorize.grouped_df <- function(x,
n_groups = NULL,
range = NULL,
lowest = 1,
breaks = "exclusive",
labels = NULL,
append = FALSE,
ignore_case = FALSE,
@@ -319,6 +347,7 @@ categorize.grouped_df <- function(x,
n_groups = n_groups,
range = range,
lowest = lowest,
breaks = breaks,
labels = labels,
select = select,
exclude = exclude,
@@ -375,20 +404,26 @@ categorize.grouped_df <- function(x,
}


.original_x_to_factor <- function(original_x, x, labels, out, verbose, ...) {
.original_x_to_factor <- function(original_x, x, cut_result, labels, out, verbose, ...) {
if (!is.null(labels)) {
if (length(labels) == length(unique(out))) {
original_x <- as.factor(original_x)
levels(original_x) <- labels
} else if (length(labels) == 1 && labels %in% c("mean", "median")) {
} else if (length(labels) == 1 && labels %in% c("mean", "median", "range", "observed")) {
original_x <- as.factor(original_x)
no_na_x <- original_x[!is.na(original_x)]
if (labels == "mean") {
labels <- stats::aggregate(x, list(no_na_x), FUN = mean, na.rm = TRUE)$x
} else {
labels <- stats::aggregate(x, list(no_na_x), FUN = stats::median, na.rm = TRUE)$x
}
levels(original_x) <- insight::format_value(labels, ...)
out <- switch(labels,
mean = stats::aggregate(x, list(no_na_x), FUN = mean, na.rm = TRUE)$x,
median = stats::aggregate(x, list(no_na_x), FUN = stats::median, na.rm = TRUE)$x,
# labels basically like what "cut()" returns
range = levels(cut_result),
# range based on the values that are actually present in the data
{
temp <- stats::aggregate(x, list(no_na_x), FUN = range, na.rm = TRUE)$x
apply(temp, 1, function(i) paste0("(", paste(as.vector(i), collapse = "-"), ")"))
}
)
levels(original_x) <- insight::format_value(out, ...)
} else if (isTRUE(verbose)) {
insight::format_warning(
"Argument `labels` and levels of the recoded variable are not of the same length.",
18 changes: 9 additions & 9 deletions R/to_numeric.R
@@ -17,11 +17,11 @@
#' @inheritParams extract_column_names
#' @inheritParams categorize
#'
#' @note By default, `to_numeric()` converts factors into "binary" dummies, i.e.
#' @note When factors should be converted into multiple "binary" dummies, i.e.
#' each factor level is converted into a separate column filled with a binary
#' 0-1 value. If only one column is required, use `dummy_factors = FALSE`. If
#' you want to preserve the original factor levels (in case these represent
#' numeric values), use `preserve_levels = TRUE`.
#' 0-1 value, set `dummy_factors = TRUE`. If you want to preserve the original
#' factor levels (in case these represent numeric values), use
#' `preserve_levels = TRUE`.
#'
#' @section Selection of variables - `select` argument:
#' For most functions that have a `select` argument the complete input data
@@ -34,12 +34,12 @@
#'
#' @examples
#' to_numeric(head(ToothGrowth))
#' to_numeric(head(ToothGrowth), dummy_factors = FALSE)
#' to_numeric(head(ToothGrowth), dummy_factors = TRUE)
#'
#' # factors
#' x <- as.factor(mtcars$gear)
#' to_numeric(x, dummy_factors = FALSE)
#' to_numeric(x, dummy_factors = FALSE, preserve_levels = TRUE)
#' to_numeric(x)
#' to_numeric(x, preserve_levels = TRUE)
#' # same as:
#' coerce_to_numeric(x)
#'
@@ -69,7 +69,7 @@ to_numeric.default <- function(x, verbose = TRUE, ...) {
to_numeric.data.frame <- function(x,
select = NULL,
exclude = NULL,
dummy_factors = TRUE,
dummy_factors = FALSE,
preserve_levels = FALSE,
lowest = NULL,
append = FALSE,
@@ -191,7 +191,7 @@ to_numeric.POSIXlt <- to_numeric.Date

#' @export
to_numeric.factor <- function(x,
dummy_factors = TRUE,
dummy_factors = FALSE,
preserve_levels = FALSE,
lowest = NULL,
verbose = TRUE,
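
To make the changed `dummy_factors` default concrete, here is a base-R sketch of the two output shapes the documentation describes: a single numeric column versus one binary 0/1 column per factor level. This is a stand-in written with `as.numeric()` and `sapply()`, not the code `to_numeric()` runs, and the example factor is invented.

```r
x <- factor(c("a", "b", "c", "a"))

# Shape of the new default (dummy_factors = FALSE): one numeric column
single <- as.numeric(x)

# Shape with dummy_factors = TRUE: one binary 0/1 column per factor level
dummies <- sapply(levels(x), function(lvl) as.integer(x == lvl))

single   # 1 2 3 1
dummies  # 4 x 3 matrix with columns "a", "b", "c"
```

Code that relied on the old default and expects one column per level would now need to request it explicitly with `dummy_factors = TRUE`.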
27 changes: 24 additions & 3 deletions man/categorize.Rd


16 changes: 8 additions & 8 deletions man/to_numeric.Rd

