README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  comment = "##",
  fig.path = "man/figures/README-"
)
```

# HSDS <img src="man/figures/logo.svg" align="right" height="140"/>

The goal of [HSDS](https://github.com/ABohynDOE/HSDS) is to make all the datasets of the book ["A Handbook of Small Data Sets"](https://www.routledge.com/A-Handbook-of-Small-Data-Sets/Hand-Daly-McConway-Lunn-Ostrowski/p/book/9780367449667) (1994) of David J. Hand available.
These data sets are especially useful for demonstrating statistical methods, testing functions, or teaching statistics and R programming.

While the individual datasets are already available in a [separate repository](https://github.com/JedStephens/Handbook-of-Small-Data-Sets/tree/master).
they are not formatted for immediate use in R and lack documentation.
This package addresses these issues by providing clean and fully documented datasets ready for analysis.

Do you like this package and want to support its development ?
[!["Buy Me A Coffee"](https://www.buymeacoffee.com/assets/img/custom_images/orange_img.png)](https://www.buymeacoffee.com/abohyn)

## Installation

To install the development version of HSDS from GitHub, use the following command:

``` r
devtools::install_github("ABohynDOE/HSDS")
```

## Available data sets

```{r update-data-index, echo=FALSE, warning=FALSE, message=FALSE}
source("data-raw/data_index.R")
n_done <- nrow(data_index)
completed <- n_done / nrow(raw_data_index)
compl_perc <- scales::percent(completed)
bar_url <- glue::glue("https://progress-bar.xyz/{round(completed*100)}/?width=400")
```

The book contains over 500 datasets. 
Currently, only `r n_done` datasets (`r compl_perc`) have been processed and included in this package.

![](`r bar_url`)

The table below summarizes 10 randomly selected datasets included so far, with details on their names, descriptions, structures, and variable types.


```{r data-table, results='asis', echo=FALSE}
# Load the current data index
load(here::here("R/sysdata.rda"))

# Reference url of the pkgdown website
refurl <- "http://abohyndoe.github.io/HSDS/reference/"

# Sample ten data sets
set.seed(123)
id <- sample(seq_len(nrow(data_index)), 10, replace = FALSE)

# Select data sets that are already present
data_index |>
  dplyr::arrange(num) |>
  dplyr::filter(num %in% id) |>
  dplyr::mutate(url_name = glue::glue("[`{name}`]({refurl}{name}.html)")) |>
  dplyr::select(url_name, description, size, col_types) |>
  gt::gt() |>
  gt::fmt_markdown() |>
  gt::cols_align(align = "left", columns = gt::everything()) |>
  gt::cols_label(
    url_name = "Name",
    description = "Description",
    size = "Structure",
    col_types = "Variable types"
  ) |> 
  gt::as_raw_html()
```

## Example

Here’s a simple example demonstrating how to use one of the datasets to create a visualization:

``` r
library(hsds)
library(ggplot2)

ggplot(germin, aes(x = water, y = seeds, color = box)) +
  geom_boxplot(na.rm = T) +
  theme_bw()
```

```{r example, echo=FALSE, out.width="500px"}
library(ggplot2)
load(here::here("data/germin.rda"))
ggplot(germin, aes(x = water, y = seeds, color = box)) +
  geom_boxplot(na.rm = TRUE) +
  theme_bw(12) +
  ggeasy::easy_labs()
```

## Contributing to the package

We are far from reaching the goal of 500 datasets, so your contributions are more than welcome!
If you'd like to help, all raw datasets are already available in the repository under `data-raw/data-files`.
Feel free to clean one or more datasets and submit your contributions.

To simplify the contributing process, the package provides two helper functions:

1. **`data_list()`**  
   Use this function to list the datasets that have already been processed and identify the next datasets that need to be processed.
   This ensures efficient collaboration and avoids duplication of effort.

2. **`data_setup(data)`**  
   This function sets up all the necessary files for processing a new dataset.
   When you run `data_setup(data)`, it generates three files, all named `data.R`, but placed in different locations:
   - **`inst/examples/`**: Contains an example of usage for the dataset.
   - **`data-raw/`**: Includes a script to process the raw dataset.
   - **`R/`**: Documents the dataset for use in the package.

When contributing, please also follow these guidelines:

- **Dataset Naming**  
   Name each dataset based on the data structure index provided in the book. The index is available [here](https://github.com/JedStephens/Handbook-of-Small-Data-Sets/blob/master/data_structure_index_HSDS.pdf) or in the Excel file `data-raw/raw_data_index.xlsx`.

- **Variable Labelling**  
   Ensure that all variables in the dataset are properly labelled. Labels don’t have to be long but should be meaningful to a newcomer. You can use the [`labelled`](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html) package or a similar tool to add these labels.

- **Documentation**  
   Document each dataset using the corresponding text from the book to maintain consistency and provide clear context.

- **Examples of Usage**  
   Add examples of how to use the datasets to your code. These examples should be saved as separate files in the `inst/examples` directory.

Your contributions will help us expand this resource and make it even more valuable for the community. Thank you for your support!