diff --git a/articles/coded-data.html b/articles/coded-data.html
index f685f76..4ab67cf 100644
--- a/articles/coded-data.html
+++ b/articles/coded-data.html
@@ -124,15 +124,15 @@

Numeric codes with neg #> 11,10,-98

Where missing reasons are:

-

-99: N/A

-

-98: REFUSED

-

-97: OMITTED

+

-99: N/A

+

-98: REFUSED

+

-97: OMITTED

And colors are coded:

-

1: BLUE

-

2: RED

-

3: YELLOW

+

1: BLUE

+

2: RED

+

3: YELLOW

This format gives you the ability to load everything as a numeric type:
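
As a rough sketch of what "everything as a numeric type" means in practice, the base R snippet below splits one numeric column into a value channel and a missing reason channel. The vector `age_raw` and its contents are invented for illustration; only the three codes come from the list above, and none of this is interlacer's own API.

```r
# Hypothetical raw column: positive numbers are ages, negative numbers are
# missing reason codes (values invented for illustration).
age_raw <- c(31, -98, 27, -99, 45, -97)

# Code-to-label lookup, taken from the list of missing reasons above.
reason_labels <- c("-99" = "N/A", "-98" = "REFUSED", "-97" = "OMITTED")

# Every code is negative, so a sign check separates the two channels.
is_reason <- age_raw < 0

# Value channel: real ages, with NA wherever a reason code was found.
age_values <- ifelse(is_reason, NA_real_, age_raw)

# Reason channel: a label where a code was found, NA otherwise.
age_reasons <- unname(reason_labels[as.character(age_raw)])

mean(age_values, na.rm = TRUE)
table(age_reasons)
```

The sign check only works because every code here is negative and every real value is positive; a variable that can legitimately be negative would need an explicit lookup instead.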

@@ -165,7 +165,7 @@

Numeric codes with neg
     age = if_else(age > 0, age, NA)
   ) |>
   summarize(
-    mean_age = mean(age, na.rm=T),
+    mean_age = mean(age, na.rm = TRUE),
     n = n(),
     .by = favorite_color
   ) |>
@@ -192,7 +192,7 @@

Numeric codes with neg
     # age = if_else(age > 0, age, NA)
   ) |>
   summarize(
-    mean_age = mean(age, na.rm=T),
+    mean_age = mean(age, na.rm = TRUE),
     n = n(),
     .by = favorite_color
   ) |>
@@ -302,7 +302,7 @@

Numeric codes with neg
 df_decoded |>
   summarize(
-    mean_age = mean(age, na.rm=TRUE),
+    mean_age = mean(age, na.rm = TRUE),
     n = n(),
     .by = favorite_color
   ) |>
@@ -363,9 +363,9 @@ 

Numeric codes wi

Here, the same value codes are used as the previous example, except the missing reasons are coded as follows:

-

“.”: N/A

-

“.a”: REFUSED

-

“.b”: OMITTED

+

".": N/A

+

".a": REFUSED

+

".b": OMITTED

To handle these missing reasons without interlacer, columns must be loaded as character vectors:
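
A minimal base R sketch of that character-vector workflow follows. The vector `age_raw` is a made-up stand-in for a column loaded as character; the three codes are the ones listed above, and the helper names are assumptions for illustration, not part of interlacer.

```r
# Hypothetical column as it would arrive when loaded as character
# (values invented for illustration).
age_raw <- c("31", ".a", "27", ".", "45", ".b")

# Code-to-label lookup, taken from the list of missing reasons above.
reason_labels <- c("." = "N/A", ".a" = "REFUSED", ".b" = "OMITTED")

# Flag entries that are reason codes rather than numbers.
is_reason <- age_raw %in% names(reason_labels)

# Value channel: convert only the non-code entries, so as.numeric()
# never sees a code and emits no coercion warning.
age_values <- rep(NA_real_, length(age_raw))
age_values[!is_reason] <- as.numeric(age_raw[!is_reason])

# Reason channel: a label where a code was found, NA otherwise.
age_reasons <- unname(reason_labels[age_raw])

mean(age_values, na.rm = TRUE)
```

Converting the whole column with `as.numeric()` gives the same values, but it warns about NAs introduced by coercion and cannot tell a known reason code apart from a genuinely malformed entry.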

@@ -397,7 +397,7 @@

Numeric codes wi
     age = if_else(!is.na(as.numeric(age)), as.numeric(age), NA)
   ) |>
   summarize(
-    mean_age = mean(age, na.rm=T),
+    mean_age = mean(age, na.rm = TRUE),
     n = n(),
     .by = favorite_color
   ) |>
diff --git a/articles/interlacer.html b/articles/interlacer.html
index 8a573d7..ce9a305 100644
--- a/articles/interlacer.html
+++ b/articles/interlacer.html
@@ -151,7 +151,7 @@

Aggregations with missing reasons
 df_simple |>
   summarize(
-    mean_age = mean(age, na.rm = T),
+    mean_age = mean(age, na.rm = TRUE),
     n = n(),
     .by = favorite_color
   ) |>
@@ -201,7 +201,7 @@

Aggregations with missing reasons
     age_values = as.numeric(if_else(age %in% reasons, NA, age)),
   ) |>
   summarize(
-    mean_age = mean(age_values, na.rm=T),
+    mean_age = mean(age_values, na.rm = TRUE),
     n = n(),
     .by = favorite_color
   ) |>
@@ -291,7 +291,7 @@

The interlacer approach
 df |>
   summarize(
-    mean_age = mean(age, na.rm=T),
+    mean_age = mean(age, na.rm = TRUE),
     n = n(),
     .by = favorite_color
   ) |>
@@ -671,7 +671,7 @@ 

Next stepsvignette("na-column-types")

+vignette("na-column-types").

diff --git a/articles/na-column-types.html b/articles/na-column-types.html
index c539140..6ff68ff 100644
--- a/articles/na-column-types.html
+++ b/articles/na-column-types.html
@@ -146,10 +146,10 @@

This is useful when you have missing reasons that only apply to particular items as opposed to the file as a whole. For example, say we had a measure with the following two items:

-
  1. What is your current stress level?
+
  1. Low
  2. Moderate
  3.
@@ -157,10 +157,12 @@

  4. I don’t know
  5. I don’t understand the question
+
  1. How well do you feel you manage your time and responsibilities today?
+
  1. Poorly
  2. Fairly well
  3.
diff --git a/articles/other-approaches.html b/articles/other-approaches.html
index b442b39..066c73f 100644
--- a/articles/other-approaches.html
+++ b/articles/other-approaches.html
@@ -175,7 +175,7 @@

    “Labelled” missing value
         )
       ) |>
       summarize(
    -    mean_age = mean(age_values, na.rm=T),
    +    mean_age = mean(age_values, na.rm = TRUE),
         n = n(),
         .by = favorite_color_missing_reasons
       )
    @@ -217,7 +217,7 @@

    “Labelled” missing value codes. This creates a lot more type gymnastics and potential errors when you’re manipulating them.

    Reason 2: Even when the missing values are labelled in the
    -labelled_spss type, aggregations and other math operatiosn
    +labelled_spss type, aggregations and other math operations
     are not protected. If you forget to take out your missing values, you get incorrect results / corrupted data:

    @@ -228,7 +228,7 @@ 

    “Labelled” missing value
         )
       ) |>
       summarize(
    -    mean_age = mean(age, na.rm=T),
    +    mean_age = mean(age, na.rm = TRUE),
         n = n(),
         .by = favorite_color_missing_reasons
       )
    @@ -298,10 +298,10 @@

    “Tagged” missing values (#> [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE

     
    -mean(df_stata$age, na.rm=TRUE)
    +mean(df_stata$age, na.rm = TRUE)
     #> [1] 25.375

    Unfortunately, you can’t group by them, because
    -dplyr::group_by() is not missing tag-aware :(
    +dplyr::group_by() is not tag-aware. :(

     df_stata |>
       mutate(
    @@ -310,7 +310,7 @@ 

    “Tagged” missing values (
         )
       ) |>
       summarize(
    -    mean_age = mean(age, na.rm=T),
    +    mean_age = mean(age, na.rm = TRUE),
         n = n(),
         .by = favorite_color_missing_reasons
       )
    @@ -365,7 +365,6 @@

    declared#> [1] "declared" "numeric"

     
    -
     # The data stored has actual NA values, so it works as you would expect
     # with summary stats like `mean()`, etc.
     attributes(dcl) <- NULL
    @@ -375,7 +374,7 @@ 

    declared
     dcl <- declared(c(1, 2, 3, -99, -98), na_values = c(-99, -98))
     
    -sum(dcl, na.rm=TRUE)
    +sum(dcl, na.rm = TRUE)
     #> [1] 6

+
+

1. Be fully generic: Add a missing value channel to any +vector type +

As mentioned above, haven::labelled_spss() only works with numeric and character types, and haven::tagged_na() only works with numeric
@@ -438,15 +437,18 @@

interlacer#> [1] 1 2 3 NA NA

This data structure drives their functional API, described in (3) below.

-
    -
  1. Provide functions for reading / writing interlaced CSV files (not -just SPSS / SAS / Stata files)
  2. -
-

(See interlacer::read_interlaced_csv(), etc.)

-
    -
  1. Provide a functional API that integrates well into tidy -pipelines
  2. -
+ +
+

2. Provide functions for reading / writing interlaced CSV files (not +just SPSS +

+

/ SAS / Stata files)

+

See interlacer::read_interlaced_csv(), etc.

+
+
+

3. Provide a functional API that integrates well into tidy +pipelines +

interlacer provides functions to facilitate working with the interlaced type as a Result type, a well-understood abstraction in functional programming. The functions
@@ -489,12 +491,13 @@

interlacer

Questions for the future

-
    -
  1. More flexible missing reason channel types?
  2. -
+
+

1. More flexible missing reason channel types? +

Earlier versions allowed arbitrary types to occupy the missing reason channel (i.e. it was a fully generic Result<Value, Missing> type). I ended up constricting the missing reason channel to only allow
@@ -507,13 +510,13 @@

Questions for the future

double and character ones, so for now I’ve made the executive decision to only allow integer and factor types.

-
    -
  1. A better na_cols() specification?
  2. -
-

Right now, missing values are supplied in na a separate -argument from col_types. This means custom missing values -get pretty far separated from their col_type -definitions:

+

+
+

2. A better na_cols() specification? +

+

Right now, missing values are supplied in a separate argument from +col_types. This means custom missing values get pretty far +separated from their col_type definitions:

diff --git a/index.html b/index.html
index cdfed3d..10b858c 100644
--- a/index.html
+++ b/index.html
@@ -181,7 +181,7 @@

Usage #> NA levels: REFUSED OMITTED N/A

Computations automatically operate on values:

-mean(ex$age, na.rm=TRUE)
+mean(ex$age, na.rm = TRUE)
 #> [1] 25.375

But the missing reasons are still there! To indicate a value should be treated as a missing reason instead of a regular value, you can use the na() function. The following, for example, will filter the data set for all individuals that REFUSED to give their favorite color:

@@ -197,7 +197,7 @@ 

Usage
 ex |>
   summarize(
-    mean_age = mean(age, na.rm=T),
+    mean_age = mean(age, na.rm = TRUE),
     n = n(),
     .by = favorite_color
   ) %>%
@@ -267,7 +267,7 @@ 

Known Issues
  • Performance with large data sets
  • -

    You may notice that on large datasets interlacer runs significantly slower than readr / vroom. Although interlacer uses vroom under the hood to load delimited data, it is not able to take advantage of many of its optimizations because vroom does not does not currently support column-level missing values. As soon as vroom supports column-level missing values, I will be able to remedy this!

    +

    You may notice that on large datasets interlacer runs significantly slower than readr / vroom. Although interlacer uses vroom under the hood to load delimited data, it is not able to take advantage of many of its optimizations because vroom does not currently support column-level missing values. As soon as vroom supports column-level missing values, I will be able to remedy this!