-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
139 lines (103 loc) · 5.38 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = FALSE,
comment = "##",
fig.path = "man/figures/README-"
)
```
# HSDS <img src="man/figures/logo.svg" align="right" height="140"/>
The goal of [HSDS](https://github.com/ABohynDOE/HSDS) is to make all the datasets of the book ["A Handbook of Small Data Sets"](https://www.routledge.com/A-Handbook-of-Small-Data-Sets/Hand-Daly-McConway-Lunn-Ostrowski/p/book/9780367449667) (1994) of David J. Hand available.
These data sets are especially useful for demonstrating statistical methods, testing functions, or teaching statistics and R programming.
While the individual datasets are already available in a [separate repository](https://github.com/JedStephens/Handbook-of-Small-Data-Sets/tree/master).
they are not formatted for immediate use in R and lack documentation.
This package addresses these issues by providing clean and fully documented datasets ready for analysis.
Do you like this package and want to support its development ?
[!["Buy Me A Coffee"](https://www.buymeacoffee.com/assets/img/custom_images/orange_img.png)](https://www.buymeacoffee.com/abohyn)
## Installation
To install the development version of HSDS from GitHub, use the following command:
``` r
devtools::install_github("ABohynDOE/HSDS")
```
## Available data sets
```{r update-data-index, echo=FALSE, warning=FALSE, message=FALSE}
source("data-raw/data_index.R")
n_done <- nrow(data_index)
completed <- n_done / nrow(raw_data_index)
compl_perc <- scales::percent(completed)
bar_url <- glue::glue("https://progress-bar.xyz/{round(completed*100)}/?width=400")
```
The book contains over 500 datasets.
Currently, only `r n_done` datasets (`r compl_perc`) have been processed and included in this package.
![](`r bar_url`)
The table below summarizes 10 randomly selected datasets included so far, with details on their names, descriptions, structures, and variable types.
```{r data-table, results='asis', echo=FALSE}
# Load the current data index
load(here::here("R/sysdata.rda"))
# Reference url of the pkgdown website
refurl <- "http://abohyndoe.github.io/HSDS/reference/"
# Sample ten data sets
set.seed(123)
id <- sample(seq_len(nrow(data_index)), 10, replace = FALSE)
# Select data sets that are already present
data_index |>
dplyr::arrange(num) |>
dplyr::filter(num %in% id) |>
dplyr::mutate(url_name = glue::glue("[`{name}`]({refurl}{name}.html)")) |>
dplyr::select(url_name, description, size, col_types) |>
gt::gt() |>
gt::fmt_markdown() |>
gt::cols_align(align = "left", columns = gt::everything()) |>
gt::cols_label(
url_name = "Name",
description = "Description",
size = "Structure",
col_types = "Variable types"
) |>
gt::as_raw_html()
```
## Example
Here’s a simple example demonstrating how to use one of the datasets to create a visualization:
``` r
library(hsds)
library(ggplot2)
ggplot(germin, aes(x = water, y = seeds, color = box)) +
geom_boxplot(na.rm = T) +
theme_bw()
```
```{r example, echo=FALSE, out.width="500px"}
library(ggplot2)
load(here::here("data/germin.rda"))
ggplot(germin, aes(x = water, y = seeds, color = box)) +
geom_boxplot(na.rm = TRUE) +
theme_bw(12) +
ggeasy::easy_labs()
```
## Contributing to the package
We are far from reaching the goal of 500 datasets, so your contributions are more than welcome!
If you'd like to help, all raw datasets are already available in the repository under `data-raw/data-files`.
Feel free to clean one or more datasets and submit your contributions.
To simplify the contributing process, the package provides two helper functions:
1. **`data_list()`**
Use this function to list the datasets that have already been processed and identify the next datasets that need to be processed.
This ensures efficient collaboration and avoids duplication of effort.
2. **`data_setup(data)`**
This function sets up all the necessary files for processing a new dataset.
When you run `data_setup(data)`, it generates three files, all named `data.R`, but placed in different locations:
- **`inst/examples/`**: Contains an example of usage for the dataset.
- **`data-raw/`**: Includes a script to process the raw dataset.
- **`R/`**: Documents the dataset for use in the package.
When contributing, please also follow these guidelines:
- **Dataset Naming**
Name each dataset based on the data structure index provided in the book. The index is available [here](https://github.com/JedStephens/Handbook-of-Small-Data-Sets/blob/master/data_structure_index_HSDS.pdf) or in the Excel file `data-raw/raw_data_index.xlsx`.
- **Variable Labelling**
Ensure that all variables in the dataset are properly labelled. Labels don’t have to be long but should be meaningful to a newcomer. You can use the [`labelled`](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html) package or a similar tool to add these labels.
- **Documentation**
Document each dataset using the corresponding text from the book to maintain consistency and provide clear context.
- **Examples of Usage**
Add examples of how to use the datasets to your code. These examples should be saved as separate files in the `inst/examples` directory.
Your contributions will help us expand this resource and make it even more valuable for the community. Thank you for your support!