-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
155 lines (113 loc) · 5.46 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# modulartabler
<!-- badges: start -->
[![R-CMD-check](https://github.com/inpowell/modulartabler/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/inpowell/modulartabler/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
modulartabler categorises records in a dataset and presents counts in each category in tabular format. Each table of counts is described by a mapping table, which can convert raw data to tabular format. Mapping tables can correspond to a single dimension, or be combined to form a multi-dimensional table.
When presenting tables, sometimes cells with small counts must be suppressed, and other cells might need to be suppressed to prevent back-calculation. Mapping tables in modulartabler record the relationships within its table, and the package implements an algorithm to solve the cell suppression problem.
## Installation
You can install the development version of modulartabler like so:
``` r
remotes::install_github('inpowell/modulartabler')
```
## Example
In this example, we will use modulartabler to examine efficiency of vehicles in the `mtcars` dataset.
```{r example}
library(modulartabler)
data("mtcars")
```
First, we want to classify efficiency using the `mpg` column. To do this, we construct a `RangeMappingTable` that lists the efficiency in miles per galleon. We will categorise these as less than 20 mpg (<20.0), at least 20 but less than 25 mpg (20.0-24.9), and at least 25 mpg (25.0+). We will also add a subtotal for cars with at least 20 mpg (20.0+).
```{r case-status}
MPG <- RangeMappingTable$new(
'Miles per galleon', 'mpg',
"<20.0" = c(0, 20),
"20.0-24.9" = c(20, 25),
"25.0+" = c(25, Inf),
"20.0+" = c(20, Inf),
.other = "Unknown",
.total = 'Total'
)
```
Now we can use this mapping table to count the number of cars by their mileage.
```{r case-table}
MPG$count_aggregate(mtcars)
```
We may also be interested in whether cars in this dataset are more efficient depending on whether the are automatic or manual. We can construct a mapping table for transmission using `BaseMappingTable`.
```{r am-map}
am_map <- data.frame(
Transmission = factor(c('Automatic', 'Manual', rep('Total', 2L))),
am = c( 0, 1, 0, 1 )
)
Transmission <- BaseMappingTable$new(
am_map, raw_cols = 'am', table_cols = 'Transmission'
)
```
Using this mapping table, we can see that a bit over half the cars in our data are automatic.
```{r am-table}
Transmission$count_aggregate(mtcars)
```
However, to answer our question about whether efficiency is affected by mileage, we need the cross-tabulation of these two dimensions. We can create a mapping table to get this cross-tabulation using a `MultiMappingTable`.
```{r cross-map}
CrossMpgTransmission <- MultiMappingTable$new(Transmission, MPG)
```
Then, we are able to calculate counts in each category.
```{r cross-table}
crosstab <- CrossMpgTransmission$count_aggregate(mtcars)
crosstab
```
While this table has all the information we need, it would be more appropriate to have it in wide format for readablity. We can do this using `convert_tabular()`.
```{r cross-wide}
convert_tabular(crosstab, `Miles per galleon` ~ Transmission)
```
This shows all the information we need, including that manual cars tend to have better mileage.
If this table referred to people, then it might be prudent to suppress small cells. In particular, we might want to suppress the cell with count 3.
```{r}
crosstab$suppress <- 0L < crosstab$n & crosstab$n <= 3L
crosstab$n.primary <- crosstab$n
crosstab$n.primary[crosstab$suppress] <- NA
```
When we look at this table though, it is clear we can easily recover the value of the suppressed cell. Indeed, the cell's row total is 18, and $18 - 15 = 3$. Also, its column total is 13, and the complementary cell for 20+ mpg is 10, and $13 - 10 = 3$ as well.
```{r}
convert_tabular(
crosstab,
`Miles per galleon` ~ Transmission,
measures = 'n.primary'
)
```
So, we need to conduct secondary suppression. The relationships of a table are available from its `MappingTable` object in the `nullspace` binding.
```{r}
str(CrossMpgTransmission$nullspace)
```
Each row of this matrix describes a different relationship within the table. We can use this information to conduct secondary suppression using the `suppress_secondary()` function.
```{r}
crosstab$suppress2ary <- suppress_secondary(
N = crosstab$n,
suppress = crosstab$suppress,
# NB crosstab must have same order as CrossMpgTransmission$mtab
nullspace = CrossMpgTransmission$nullspace
)
```
When we look at the table following this suppression, there is no way to back-calculate the primary-suppressed cell exactly.
```{r}
crosstab$n.secondary <- crosstab$n
crosstab$n.secondary[crosstab$suppress2ary] <- NA
convert_tabular(
crosstab,
`Miles per galleon` ~ Transmission,
measures = 'n.secondary'
)
```
Unfortunately, we have also lost a lot of information in our table. This highlights the need for care in designing tables so that they have sufficient data and small cells are rare.
## Thanks
The authors would like to acknowledge the [NSW Ministry of Health](https://www.health.nsw.gov.au/), which supported the initial development of this package while the authors (Ian Powell, Hafiz Khusyairi, and Peter Moritz) were employed there.