This repository has been archived by the owner on Sep 20, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy patharrow.qmd
291 lines (222 loc) · 9.37 KB
/
arrow.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
---
title: "Notes on saved data files"
jupyter: julia-1.8
---
## The Arrow storage format
The [Arrow storage format](https://arrow.apache.org) provides a language-agnostic storage and memory specification for columnar data tables, which just means "something that looks like a data frame in R". That is, an arrow table is an ordered, named collection of columns, all of the same length.
The columns can be of different types including numeric values, character strings, and factor-like representations - called *DictEncoded*.
An Arrow file can be read or written from R, Python, Julia and many other languages. Somewhat confusingly in R and Python the name `feather`, which refers to an earlier version of the storage format, is used in some function names like `read_feather`.
## The Emotikon data
The [SMLP2021 repository](https://github.com/RePsychling/SMLP2021) contains a version of the data from @Fuehner2021 in `notebooks/data/fggk21.arrow`.
After that file was created there were changes in the master RDS file on the [osf.io site](https://osf.io/2d8rj/) for the project.
We will recreate the Arrow file here then split it into two separate tables, one with a row for each child in the study and one with a row for each test result.
The `Arrow` package for Julia does not export any function names, which means that the function to read an Arrow file must be called as `Arrow.Table`.
It returns a *column table*, as described in the [Tables package](https://github.com/JuliaData/Tables.jl).
This is like a read-only data frame, which can be easily converted to a full-fledged `DataFrame` if desired.
This arrangement allows for the `Arrow` package not to depend on the `DataFrames` package, which is a heavy-weight dependency, but still easily produce a `DataFrame` if warranted.
Load the packages to be used.
```{julia}
#| code-fold: true
using AlgebraOfGraphics
using Arrow
using CairoMakie
using Chain
using DataFrameMacros
using DataFrames
using Downloads
using KernelDensity
using RCall # access R from within Julia
using StatsBase
CairoMakie.activate!(; type="svg")
using AlgebraOfGraphics: density
```
## Downloading and importing the RDS file
This is similar to some of the code shown by Julius Krumbiegel on Monday.
In the data directory of the emotikon project on osf.io under Data, the url for the rds data file is found to be [https://osf.io/xawdb/]. Note that we want version 2 of this file.
```{julia}
fn = Downloads.download("https://osf.io/xawdb/download?version=2");
```
```{julia}
dfrm = rcopy(R"readRDS($fn)")
```
Now write this file as a Arrow file and read it back in.
```{julia}
arrowfn = joinpath("data", "fggk21.arrow")
Arrow.write(arrowfn, dfrm; compress=:lz4)
tbl = Arrow.Table(arrowfn)
```
```{julia}
filesize(arrowfn)
```
```{julia}
df = DataFrame(tbl)
```
## Avoiding needless repetition
One of the principles of relational database design is that information should not be repeated needlessly.
Each row of `df` is determined by a combination of `Child` and `Test`, together producing a `score`, which can be converted to a `zScore`.
The other columns in the table, `Cohort`, `School`, `age`, and `Sex`, are properties of the `Child`.
Storing these values redundantly in the full table takes up space but, more importantly, allows for inconsistency.
As it stands, a given `Child` could be recorded as being in one `Cohort` for the `Run` test and in another `Cohort` for the `S20_r` test and nothing about the table would detect this as being an error.
The approach used in relational databases is to store the information for `score` in one table that contains only `Child`, `Test` and `score`, store the information for the `Child` in another table including `Cohort`, `School`, `age` and `Sex`.
These tables can then be combined to create the table to be used for analysis by *joining* the different tables together.
The maintainers of the `DataFrames` package have put in a lot of work over the past few years to make joins quite efficient in Julia.
Thus the processing penalty of reassembling the big table from three smaller tables is minimal.
It is important to note that the main advantage of using smaller tables that are joined together to produce the analysis table is the fact that the information in the analysis table is consistent by design.
## Creating the smaller table
```{julia}
Child = unique(select(df, :Child, :School, :Cohort, :Sex, :age))
```
```{julia}
length(unique(Child.Child)) # should be 108295
```
```{julia}
filesize(
Arrow.write("./data/fggk21_Child.arrow", Child; compress=:lz4)
)
```
```{julia}
filesize(
Arrow.write(
"./data/fggk21_Score.arrow",
select(df, :Child, :Test, :score);
compress=:lz4,
),
)
```
::: {.callout-note}
A careful examination of the file sizes versus that of `./data/fggk21.arrow` will show that the separate tables combined take up more space than the original because of the compression. Compression algorithms are often more successful when applied to larger files.
:::
Now read the Arrow tables in and reassemble the original table.
```{julia}
Score = DataFrame(Arrow.Table("./data/fggk21_Score.arrow"))
```
At this point we can create the z-score column by standardizing the scores for each `Test`.
The code to do this follows Julius's presentation on Monday.
```{julia}
@transform!(groupby(Score, :Test), :zScore = @c zscore(:score))
```
```{julia}
Child = DataFrame(Arrow.Table("./data/fggk21_Child.arrow"))
```
```{julia}
df1 = disallowmissing!(leftjoin(Score, Child; on=:Child))
```
::: {.callout-note}
The call to `disallowmissing!` is because the join will create columns that allow for missing values but we know that we should not get missing values in the result. This call will fail if, for some reason, missing values were created.
:::
## Discovering patterns in the data
One of the motivations for creating the `Child` table was be able to bin the ages according to the age of each child, not the age of each `Child-Test` combination.
Not all children have all 5 test results.
We can check the number of results by grouping on `:Child` and evaluate the number of rows in each group.
```{julia}
nobsChild = combine(groupby(Score, :Child), nrow => :ntest)
```
Now create a table of the number of children with 1, 2, ..., 5 test scores.
```{julia}
combine(groupby(nobsChild, :ntest), nrow)
```
A natural question at this point is whether there is something about those students who have few observations.
For example, are they from only a few schools?
One approach to examining properties like is to add the number of observations for each child to the :Child table.
Later we can group the table according to this `:ntest` to look at properties of `:Child` by `:ntest`.
```{julia}
gdf = groupby(
disallowmissing!(leftjoin(Child, nobsChild; on=:Child)), :ntest
)
```
Are the sexes represented more-or-less equally?
```{julia}
combine(groupby(first(gdf), :Sex), nrow => :nchild)
```
```{julia}
combine(groupby(last(gdf), :Sex), nrow => :nchild)
```
What about the distribution of ages?
```{julia}
#| eval: false
"""
ridgeplot!(ax::Axis, df::AbstractDataFrame, densvar::Symbol, group::Symbol; normalize=false)
ridgeplot!(f::Figure, args...; pos=(1,1) kwargs...)
ridgeplot(args...; kwargs...)
Create a "ridge plot".
A ridge plot is stacked plot of densities for a given variable (`densvar`) grouped by a different variable (`group`). Because densities can very widely in scale, it is sometimes useful to `normalize` the densities so that each density has a maximum of 1.
The non-mutating method creates a Figure before calling the method for Figure.
The method for Figure places the ridge plot in the grid position specified by `pos`, default is (1,1).
"""
function ridgeplot!(
ax::Axis,
df::AbstractDataFrame,
densvar::Symbol,
group::Symbol;
normalize=false,
)
# `normalize` makes it so that the max density is always 1
# `normalize` works on the density not the area/mass
gdf = groupby(df, group)
dens = combine(gdf, densvar => kde => :kde)
sort!(dens, group)
spacing = normalize ? 1.0 : 0.9 * maximum(dens[!, :kde]) do val
return maximum(val.density)
end
nticks = length(gdf)
for (idx, row) in enumerate(eachrow(dens))
dd = if normalize
row.kde.density ./ maximum(row.kde.density)
else
row.kde.density
end
offset = idx * spacing
lower = Node(Point2f.(row.kde.x, offset))
upper = Node(Point2f.(row.kde.x, dd .+ offset))
band!(ax, lower, upper; color=(:black, 0.3))
lines!(ax, upper; color=(:black, 1.0))
end
ax.yticks[] = (
1:spacing:(nticks * spacing), string.(dens[!, group])
)
ylims!(ax, 0, (nticks + 2) * spacing)
ax.xlabel[] = string(densvar)
ax.ylabel[] = string(group)
return ax
end
```
```{julia}
#| eval: false
function ridgeplot!(f::Figure, args...; pos=(1, 1), kwargs...)
ridgeplot!(Axis(f[pos...]), args...; kwargs...)
return f
end
```
```{julia}
#| eval: false
"""
ridgeplot(args...; kwargs...)
See [ridgeplot!](@ref).
"""
function ridgeplot(args...; kwargs...)
return ridgeplot!(Figure(), args...; kwargs...)
end
```
```{julia}
#| eval: false
ridgeplot(parent(gdf), :age, :ntest)
```
```{julia}
parent(gdf)
```
## Reading Arrow files in other languages
There are Arrow implementations for R (the `arrow` package) and for Python (`pyarrow`).
```{python}
#| eval: false
import pyarrow.feather: read_table
read_table("./data/fggk21.arrow")
```
```{R}
#| eval: false
library("arrow")
fggk21 <- read_feather("./data/fggk21.arrow")
nrow(fggk21)
```
# References
::: {#refs}
:::