Merge pull request #4 from peterdesmet/main
Typo fixes + opinionated styling choices
khusmann authored Jun 7, 2024
2 parents c07ebde + de00986 commit 6cd6863
Showing 6 changed files with 42 additions and 44 deletions.
6 changes: 3 additions & 3 deletions README.Rmd
@@ -135,7 +135,7 @@ ex$age
Computations automatically operate on values:

```{r}
-mean(ex$age, na.rm=TRUE)
+mean(ex$age, na.rm = TRUE)
```

But the missing reasons are still there! To indicate a value should be treated
@@ -156,7 +156,7 @@ reason:
```{r}
ex |>
summarize(
-mean_age = mean(age, na.rm=T),
+mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color
) %>%
@@ -202,7 +202,7 @@ ex |>
You may notice that on large datasets `interlacer` runs significantly slower
than `readr` / `vroom`. Although `interlacer` uses `vroom` under the hood to load
delimited data, it is not able to take advantage of many of its optimizations
-because `vroom` does not
+because `vroom`
[does not currently support](https://github.com/tidyverse/vroom/issues/532)
column-level missing values. As soon as `vroom` supports column-level
missing values, I will be able to remedy this!
6 changes: 3 additions & 3 deletions README.md
@@ -172,7 +172,7 @@ ex$age
Computations automatically operate on values:

``` r
-mean(ex$age, na.rm=TRUE)
+mean(ex$age, na.rm = TRUE)
#> [1] 25.375
```

@@ -199,7 +199,7 @@ missing reason:
``` r
ex |>
summarize(
-mean_age = mean(age, na.rm=T),
+mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color
) %>%
@@ -282,7 +282,7 @@ ex |>
You may notice that on large datasets `interlacer` runs significantly
slower than `readr` / `vroom`. Although `interlacer` uses `vroom` under
the hood to load delimited data, it is not able to take advantage of
-many of its optimizations because `vroom` does not [does not currently
+many of its optimizations because `vroom` [does not currently
support](https://github.com/tidyverse/vroom/issues/532) column-level
missing values. As soon as `vroom` supports column-level missing values,
I will be able to remedy this!
29 changes: 13 additions & 16 deletions vignettes/coded-data.Rmd
@@ -47,19 +47,19 @@ read_file(

Where missing reasons are:

-> -99: N/A
+> `-99`: N/A
>
-> -98: REFUSED
+> `-98`: REFUSED
>
-> -97: OMITTED
+> `-97`: OMITTED
And colors are coded:

-> 1: BLUE
+> `1`: BLUE
>
-> 2: RED
+> `2`: RED
>
-> 3: YELLOW
+> `3`: YELLOW
This format gives you the ability to load everything as a numeric type:

@@ -80,7 +80,7 @@ df_coded |>
age = if_else(age > 0, age, NA)
) |>
summarize(
-mean_age = mean(age, na.rm=T),
+mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color
) |>
@@ -102,7 +102,7 @@ df_coded |>
# age = if_else(age > 0, age, NA)
) |>
summarize(
-mean_age = mean(age, na.rm=T),
+mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color
) |>
@@ -169,7 +169,7 @@ keep cross-referencing your codebook to know what values mean:
```{r}
df_decoded |>
summarize(
-mean_age = mean(age, na.rm=TRUE),
+mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color
) |>
@@ -185,8 +185,6 @@ df_decoded |>
)
```



## Numeric codes with character missing reasons (SAS, Stata)

Like SPSS, SAS and Stata will encode factor levels as numeric values, but
@@ -203,12 +201,11 @@ read_file(
Here, the same value codes are used as the previous example, except the missing
reasons are coded as follows:

-> ".": N/A
+> `"."`: N/A
>
-> ".a": REFUSED
+> `".a"`: REFUSED
>
-> ".b": OMITTED
+> `".b"`: OMITTED
To handle these missing reasons without interlacer, columns must be loaded as
character vectors:
@@ -229,7 +226,7 @@ df_coded_char |>
age = if_else(!is.na(as.numeric(age)), as.numeric(age), NA)
) |>
summarize(
-mean_age = mean(age, na.rm=T),
+mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color
) |>
8 changes: 4 additions & 4 deletions vignettes/interlacer.Rmd
@@ -63,7 +63,7 @@ library(dplyr, warn.conflicts = FALSE)
df_simple |>
summarize(
-mean_age = mean(age, na.rm = T),
+mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color
) |>
@@ -98,7 +98,7 @@ df_with_missing |>
age_values = as.numeric(if_else(age %in% reasons, NA, age)),
) |>
summarize(
-mean_age = mean(age_values, na.rm=T),
+mean_age = mean(age_values, na.rm = TRUE),
n = n(),
.by = favorite_color
) |>
@@ -169,7 +169,7 @@ the unique missing reasons, rather than being lumped into a single `NA`:
```{r}
df |>
summarize(
-mean_age = mean(age, na.rm=T),
+mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color
) |>
@@ -392,4 +392,4 @@ In all the examples in this vignette, column types were automatically detected.
To explicitly specify value and missing column types, (and specify individual
missing reasons for specific columns), interlacer extends
`readr`'s `collector()` system. This will be covered in the next vignette,
-`vignette("na-column-types")`
+`vignette("na-column-types")`.
8 changes: 5 additions & 3 deletions vignettes/na-column-types.Rmd
@@ -54,14 +54,16 @@ This is useful when you have missing reasons that only apply to particular items
as opposed to the file as a whole. For example, say we had a measure with the
following two items:

-> 1. What is your current stress level?
+1. What is your current stress level?
+
> a. Low
> b. Moderate
> c. High
> d. I don't know
> e. I don't understand the question
>
-> 2. How well do you feel you manage your time and responsibilities today?
+2. How well do you feel you manage your time and responsibilities today?
+
> a. Poorly
> b. Fairly well
> c. Well
29 changes: 14 additions & 15 deletions vignettes/other-approaches.Rmd
@@ -78,7 +78,7 @@ df_spss |>
)
) |>
summarize(
-mean_age = mean(age_values, na.rm=T),
+mean_age = mean(age_values, na.rm = TRUE),
n = n(),
.by = favorite_color_missing_reasons
)
@@ -103,7 +103,7 @@ This creates a lot more type gymnastics and potential errors when you're
manipulating them.

Reason 2: Even when the missing values are labelled in the `labelled_spss` type,
-aggregations and other math operatiosn are not protected. If you forget
+aggregations and other math operations are not protected. If you forget
to take out your missing values, you get incorrect results / corrupted data:

```{r}
@@ -114,7 +114,7 @@ df_spss |>
)
) |>
summarize(
-mean_age = mean(age, na.rm=T),
+mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color_missing_reasons
)
@@ -151,11 +151,11 @@ character "tag" (usually a letter from a-z). This means that they work with
```{r}
is.na(df_stata$age)
-mean(df_stata$age, na.rm=TRUE)
+mean(df_stata$age, na.rm = TRUE)
```

Unfortunately, you can't group by them, because `dplyr::group_by()` is not
-missing tag-aware :(
+tag-aware. :(

```{r}
df_stata |>
@@ -165,7 +165,7 @@ df_stata |>
)
) |>
summarize(
-mean_age = mean(age, na.rm=T),
+mean_age = mean(age, na.rm = TRUE),
n = n(),
.by = favorite_color_missing_reasons
)
@@ -195,7 +195,6 @@ of the object:
# All the missing reason info is tracked in the attributes
attributes(dcl)
# The data stored has actual NA values, so it works as you would expect
# with summary stats like `mean()`, etc.
attributes(dcl) <- NULL
@@ -207,15 +206,15 @@ This means aggregations work exactly as you would expect!
```{r}
dcl <- declared(c(1, 2, 3, -99, -98), na_values = c(-99, -98))
-sum(dcl, na.rm=TRUE)
+sum(dcl, na.rm = TRUE)
```

## interlacer

interlacer builds on the ideas of haven, labelled, and declared with following
goals:

-1. Be fully generic: Add a missing value channel to *any* vector type.
+### 1. Be fully generic: Add a missing value channel to *any* vector type

As mentioned above, `haven::labelled_spss()` only works with `numeric`
and `character` types, and `haven::tagged_na()` only works with `numeric` types.
@@ -250,12 +249,12 @@ int

This data structure drives their functional API, described in (3) below.

-2. Provide functions for reading / writing interlaced CSV files (not just SPSS
+### 2. Provide functions for reading / writing interlaced CSV files (not just SPSS
/ SAS / Stata files)

-(See `interlacer::read_interlaced_csv()`, etc.)
+See `interlacer::read_interlaced_csv()`, etc.

-3. Provide a functional API that integrates well into tidy pipelines
+### 3. Provide a functional API that integrates well into tidy pipelines

interlacer provides functions to facilitate working with the `interlaced` type
as a [Result type](https://en.wikipedia.org/wiki/Result_type),
@@ -292,7 +291,7 @@ plays nicely with all the packages in the tidyverse.

## Questions for the future

-1. More flexible missing reason channel types?
+### 1. More flexible missing reason channel types?

Earlier versions allowed arbitrary types to occupy
the missing reason channel (i.e. it was a fully generic Result<Value, Missing>
@@ -305,9 +304,9 @@ tell, in 99.9% of the time, it is preferable to use `integer` and `factor`
missing reason channels over `double` and `character` ones, so for now I've
made the executive decision to only allow `integer` and `factor` types.

-2. A better `na_cols()` specification?
+### 2. A better `na_cols()` specification?

-Right now, missing values are supplied in `na` a separate argument from
+Right now, missing values are supplied in a separate argument from
`col_types`. This means custom missing values get pretty far separated from
their `col_type` definitions:
