This repository has been archived by the owner on Jun 27, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathuseful_packages.qmd
128 lines (76 loc) · 12.4 KB
/
useful_packages.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
---
title: "Useful packages"
author: "Phillip Alday"
---
Unlike R, Julia does not immediately expose a huge number of functions, but instead requires loading packages (whether from the standard library or from the broader package ecosystem) for a lot of relevant functionality for statistical analysis.
There are technical reasons for this, such as the ease of using the Julia package system.
One further motivation is that Julia is aimed at a broader "technical computing" audience (like MATLAB or perhaps Python) and less at a "statistical analysis" audience.
This has two important implications:
1. Even relatively simple programs will often load several packages.
2. Packages are often focused on adding a relatively narrow set of functionality, which means that "basic" functionality (e.g. reading a CSV file and manipulating it as a DataFrame) is often split across multiple packages. In other words, **see the the first point!**
This notebook is not intended to be an exhaustive list of packages, but rather to highlight a few packages that I suspect will be particularly useful.
Before getting onto the packages, I have one final hint: take advantage of how easy and first-class package management in Julia is.
Having good package management makes reproducible analyses much easier and avoids breaking old analyses when you start a new one.
The package-manager REPL mode (activated by typing `]` at the `julia>` prompt) is very useful.
# Data wrangling
## Reading data
- [Arrow.jl](https://arrow.juliadata.org/dev/manual/) a high performance format for data storage, accessible in R via the [`arrow` package](https://arrow.apache.org/docs/r/) and in Python via `pyarrow`. (Confusingly, the function for reading and writing Arrow format files in R is called `read_feather` and `write_feather`, but the modern Arrow format is distinct from the older Feather format provided by the `feather` package.) This is the format that we store the example and test datasets in for MixedModels.jl.
- [CSV.jl](https://csv.juliadata.org/stable/index.html) useful for reading comma-separated values, tab-separated values and basically everything handled by the `read.csv` and `read.table` family of functions in R.
Note that by default both Arrow.jl and CSV.jl do not return a DataFrame, but rather "column tables" -- named tuples of column vectors.
## DataFrames
Unlike in R, DataFrames are not part of the base language, nor the standard library.
[DataFrames.jl](https://dataframes.juliadata.org/stable/) provides the basic infrastructure around DataFrames, as well as its own [mini language](https://bkamins.github.io/julialang/2020/12/24/minilanguage.html) for doing the split-apply-combine approach that underlies R's `dplyr` and much of the tidyverse. The DataFrames.jl documentation is the place for looking at how to e.g. read in a [CSV or Arrow file as a DataFrame](https://dataframes.juliadata.org/stable/man/importing_and_exporting/). Note that DataFrames.jl by default depends on [CategoricalArrays.jl](https://categoricalarrays.juliadata.org/stable/) to handle the equivalent of `factor` in the R world, but there is an alternative package for `factor`-like array type in Julia, [PooledArrays.jl](https://github.com/JuliaData/PooledArrays.jl/). PooledArrays are simpler, but more limited than CategoricalArrays and we (Phillip and Doug) sometimes use them in our examples and simulations.
The tables produced by reading an Arrow file have their own representation of factor-like data as `DictEncoded` arrays.
DataFrame.jl's mini language can be a bit daunting, if you're used to manipulations in the style of base R or the tidyverse. For that, there are several options; recently, we'e had particularly nice experiences with [DataFrameMacros.jl](https://github.com/jkrumbiegel/DataFrameMacros.jl) and [Chain.jl](https://github.com/jkrumbiegel/Chain.jl) for a convenient syntax to connect or "pipe" together successive operations. It's your choice whether and which of these add-ons you want to use! Phillip tends to write his code using raw DataFrames.jl, but Doug really enjoys DataFrameMacros.jl.
The recently added [Tidier](https://github.com/TidierOrg/) collection of Julia packages is popular with those coming from the tidyverse.
## Regression
Unlike in R, neither formula processing nor basic regression are part of the base language or the standard library.
The formula syntax and basic contrast-coding schemes in Julia is provided by [StatsModels.jl](https://juliastats.org/StatsModels.jl/v0.6/). By default, MixedModels.jl re-exports the `@formula` macro and most commonly used contrast schemes from StatsModels.jl, so you often don't have to worry about loading StatsModels.jl directly. The same is true for [GLM.jl](https://juliastats.org/GLM.jl/dev/manual/), which provides basic linear and generalized linear models, such as ordinary least squares (OLS) regression and logistic regression, i.e. the classical, non mixed regression models.
The basic functionality looks quite similar to R, e.g.
```julia
julia > lm(@formula(y ~ 1 + x), data)
julia > glm(@formula(y ~ 1 + x), data, Binomial(), LogitLink())
```
but the more general modelling API (also used by MixedModels.jl) is also supported:
```julia
julia > fit(LinearModel, @formula(y ~ 1 + x), mydata)
julia > fit(
GeneralizedLinearModel,
@formula(y ~ 1 + x),
data,
Binomial(),
LogitLink(),
)
```
(You can also specify your model matrices directly and skip the formula interface, but we don't recommend this as it's easy to mess up in really subtle but very problematic ways.)
## `@formula`, macros and domain-specific languages
As a sidebar: why is `@formula` a macro and not a normal function? Well, that's because formulas are essentially their own domain-specific language (a variant of [Wilkinson-Rogers notation](https://www.jstor.org/stable/2346786)) and macros are used for manipulating the language itself -- or in this case, handling an entirely new, embedded language! This is also why macros are used by packages like [Turing.jl](https://turing.ml/) and [Soss.jl](https://cscherrer.github.io/Soss.jl/stable/) that define a language for Bayesian probabilistic programming like [PyMC3](https://docs.pymc.io/) or [Stan](https://mc-stan.org/).
## Extensions to the formula syntax
There are several ongoing efforts to extend the formula syntax to include some of the "extras" available in R, e.g. [RegressionFormulae.jl](https://github.com/kleinschmidt/RegressionFormulae.jl) to use the caret (`^`) notation to limit interactions to a certain order (`(a+b+c)^2` generates `a + b + c + a&b + a&c + b&c`, but not `a&b&c`).
Note also that Julia uses `&` to express interactions, not `:` like in R.
## Standardizing Predictors
Although function calls such as `log` can be used within Julia formulae, they must act on a rowwise basis, i.e. on observations. Transformations such as z-scoring or centering (often done with `scale` in R) require knowledge of the entire column. [StandardizedPredictors.jl](https://beacon-biosignals.github.io/StandardizedPredictors.jl/stable/) provides functions for centering, scaling, and z-scoring within the formula. These are treated as pseudo-contrasts and computed on demand, meaning that `predict` and `effects` (see next) computations will handle these transformations on new data (e.g. centering new data *around the mean computed during fitting the original data*) correctly and automatically.
## Effects
John Fox's `effects` package in R (and the related `ggeffects` package for plotting these using `ggplot2`) provides a nice way to visualize a model's overall view of the data. This functionality is provided by [`Effects.jl`](https://beacon-biosignals.github.io/Effects.jl/stable/) and works out-of-the-box with most regression model packages in Julia (including MixedModels.jl). Support for formulae with embedded functions (such as `log`) is not yet complete, but we're working on it!
## Estimated Marginal / Least Square Means
[`Effects.jl`](https://beacon-biosignals.github.io/Effects.jl/stable/) provides a subset of the functionality (basic estimated-marginal means and exhaustive pairwise comparisons) of the R package `emmeans` package. However, it is often better to use sensible, hypothesis-driven contrast coding than to compute all pairwise comparisons after the fact. 😃
# Hypothesis Testing
Classical statistical tests such as the t-test can be found in the package [HypothesisTests.jl](https://github.com/JuliaStats/HypothesisTests.jl/).
# Plotting ecosystem
Throughout this course, we have used the Makie ecosystem for plotting, but there are several alternatives in Julia.
## Makie
The [Makie ecosystem](https://makie.org/) is a relatively new take on graphics that aims to be both powerful and easy to use. Makie.jl itself only provides abstract definitions for many components (and is used in e.g. MixedModelsMakie.jl to define plot types for MixedModels.jl). The actual plotting and rendering is handled by a backend package such as CairoMakie.jl (good for Quarto notebooks or rending static 2D images) and GLMakie.jl (good for dynamic, interactive visuals and 3D images). AlgebraOfGraphics.jl builds a grammar of graphics upon the Makie framework. It's a great way to get good plots very quickly, but extensive customization is still best achieved by using Makie directly.
## Plots.jl
[Plots.jl](https://docs.juliaplots.org/latest/) is the original plotting package in Julia, but we often find it difficult to work with compared to some of the other alternatives. [StatsPlots.jl](https://github.com/JuliaPlots/StatsPlots.jl) builds on this, adding common statistical plots, while [UnicodePlots.jl](https://github.com/Evizero/UnicodePlots.jl) renders plots as Unicode characters directly in the REPL.
[PGFPlotsX.jl](https://kristofferc.github.io/PGFPlotsX.jl/stable/) is a very new package that writes directly to PGF (the format used by LaTeX's tikz framework) and can stand alone or be used as a rendering backend for the Plots.jl ecosystem.
## Gadfly
[Gadfly.jl](https://gadflyjl.org/stable/) was the original attempt to create a plotting system in Julia based on the grammar of graphics (the "gg" in `ggplot2`). Development has largely stalled, but some functionality still exceeds AlgebraOfGraphics.jl, which has taken up the grammar of graphics mantle. Notably, the MixedModels.jl documentation still uses Gadfly as of this writing (early September 2021).
## Others
There are many [other graphics packages available in Julia](https://juliapackages.com/c/graphics), often wrapping well-established frameworks such as [VegaLite](https://www.queryverse.org/VegaLite.jl/stable/).
# Connecting to Other Languages
Using Julia doesn't mean you have to leave all the packages you knew in other languages behind. In Julia, it's often possible to even easily and quickly invoke code from other languages *from within Julia*.
[RCall.jl](https://juliainterop.github.io/RCall.jl/stable/gettingstarted/) provides a very convenient interface for interacting with R. [JellyMe4.jl](https://github.com/palday/JellyMe4.jl/) adds support for moving MixedModels.jl and `lme4` models back and forth between the languages (which means that you can use `emmeans`, `sjtools`, `DHARMa`, `car`, etc. to examine MixedModels.jl models!). [RData.jl](https://github.com/JuliaData/RData.jl) provides support for reading `.rds` and `.rda` files from Julia, while [RDatasets.jl](https://github.com/JuliaStats/RDatasets.jl) provides convenient access to many of the standard datasets provided by R and various R packages.
[PyCall.jl](https://github.com/JuliaPy/PyCall.jl/) provides a very convenient way for interacting with Python code and packages. [PyPlot.jl](https://github.com/JuliaPy/PyPlot.jl) builds upon this foundation to provide support for Python's `matplotlib`. Similarly, [PyMNE.jl](https://github.com/beacon-biosignals/PyMNE.jl) and [PyFOOOF.jl](https://github.com/beacon-biosignals/pyfooof.jl) provide some additional functionality to make interacting with MNE-Python and FOOOF from within Julia even easier than with vanilla PyCall. More recently, [PythonCall.jl](https://github.com/cjdoris/PythonCall.jl) has proven to be a populat alternative to PyCall.jl.
For MATLAB users, there is also [MATLAB.jl](https://github.com/JuliaInterop/MATLAB.jl)
[Cxx.jl](https://juliainterop.github.io/Cxx.jl/stable/) provides interoperability with C++. It also provides a C++ REPL mode, making it possible to treating C++ much more like a dynamic language than the traditional compiler toolchain would allow.
Support for calling C and Fortran is [part of the Julia standard library](https://docs.julialang.org/en/v1/manual/calling-c-and-fortran-code/).