Commit f68179f

Merge pull request #37 from antarctica/documentation_improvements

Incorporate reviewer feedback

thomaszwagerman authored Jan 14, 2025
2 parents a74a6ed + 08b7556
Showing 7 changed files with 75 additions and 22 deletions.
2 changes: 1 addition & 1 deletion R/catch.R
@@ -49,7 +49,7 @@ catch <- function(
bullet_col = "orange"
)
} else {
# By using an inner join, we drop any row which does not match in
# By using an anti join, we drop any row which does not match in
# df_previous.
df_rows_changed_from_previous <- suppressMessages(
dplyr::anti_join(
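The comment fix above is worth a concrete illustration. A minimal sketch of `dplyr::anti_join()` semantics, using hypothetical data frames rather than the package's internals:

```r
library(dplyr)

# Two versions of the same tiny dataset: row 2 has changed, row 3 is new.
df_previous <- data.frame(time = 1:2, count = c(10, 20))
df_current  <- data.frame(time = 1:3, count = c(10, 21, 30))

# anti_join() keeps only rows of df_current with no match in df_previous,
# i.e. unchanged rows are dropped, while changed and new rows remain.
anti_join(df_current, df_previous, by = c("time", "count"))
```

Here the unchanged row (`time = 1`) is dropped, and the changed and new rows survive.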
8 changes: 7 additions & 1 deletion R/loupe.R
@@ -11,7 +11,13 @@
#' (usually "time" or "datetime" in a timeseries).
#'
#' It informs the user of new (unmatched) rows which have appeared, and then
#' returns a `waldo::compare()` call to give a detailed breakdown of changes.
#' returns a `waldo::compare()` call to give a detailed breakdown of changes. If
#' you are not familiar with `waldo::compare()`, it is an expanded and more
#' verbose alternative to base R's `all.equal()`.
#'
#' `loupe()` will then return `TRUE` if there are no changes to previous data,
#' or `FALSE` if there are unexpected changes. If you want to extract changes as
#' a dataframe, use `catch()`, or if you want to drop them, use `release()`.
#'
#' The main assumption is that `df_current` and `df_previous` are newer and
#' older versions of the same data, and that the `datetime_variable` variable
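The return-value behaviour documented above can be sketched with the package's bundled dummy data (a sketch, assuming the argument names used elsewhere in this changeset):

```r
library(butterfly)

# Compare a newer and an older version of the same dummy dataset.
unchanged <- butterfly::loupe(
  butterflycount$february,
  butterflycount$january,
  datetime_variable = "time"
)

if (!unchanged) {
  # Extract the changed rows as a dataframe...
  df_caught <- butterfly::catch(
    butterflycount$february,
    butterflycount$january,
    datetime_variable = "time"
  )
  # ...or drop them, keeping expected new rows.
  df_released <- butterfly::release(
    butterflycount$february,
    butterflycount$january,
    datetime_variable = "time"
  )
}
```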
12 changes: 12 additions & 0 deletions README.Rmd
@@ -194,3 +194,15 @@ Other functions include `all.equal()` (base R) or [dplyr](https://github.com/tid
## `butterfly` in production

Read more about how `butterfly` is [used in an operational data pipeline](https://thomaszwagerman.github.io/butterfly/articles/butterfly_in_pipeline.html) to verify a continually updated **and** published dataset.

## Contributing

For full guidance on contributions, please refer to `.github/CONTRIBUTING.md`.

### Without write access
Corrections, suggestions and general improvements are welcome as issues.

You can also suggest changes by forking this repository, and opening a pull request. Please target your pull requests to the main branch.

### With write access
You can push directly to main for small fixes. Please use PRs to main for discussing larger updates.
19 changes: 18 additions & 1 deletion README.md
@@ -4,6 +4,7 @@
# butterfly <a href="https://thomaszwagerman.github.io/butterfly/"><img src="man/figures/logo.png" align="right" height="139" alt="butterfly website" /></a>

<!-- badges: start -->

[![R-CMD-check](https://github.com/thomaszwagerman/butterfly/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/thomaszwagerman/butterfly/actions/workflows/R-CMD-check.yaml)
[![Codecov test
coverage](https://codecov.io/gh/thomaszwagerman/butterfly/branch/main/graph/badge.svg)](https://app.codecov.io/gh/thomaszwagerman/butterfly?branch=main)
@@ -13,7 +14,6 @@ stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://
state and is being actively
developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![pkgcheck](https://github.com/thomaszwagerman/butterfly/workflows/pkgcheck/badge.svg)](https://github.com/thomaszwagerman/butterfly/actions?query=workflow%3Apkgcheck)
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/676_status.svg)](https://github.com/ropensci/software-review/issues/676)
<!-- badges: end -->

The goal of butterfly is to aid in the verification of continually
@@ -375,3 +375,20 @@ Other functions include `all.equal()` (base R) or
Read more about how `butterfly` is [used in an operational data
pipeline](https://thomaszwagerman.github.io/butterfly/articles/butterfly_in_pipeline.html)
to verify a continually updated **and** published dataset.

## Contributing

For full guidance on contributions, please refer to
`.github/CONTRIBUTING.md`.

### Without write access

Corrections, suggestions and general improvements are welcome as issues.

You can also suggest changes by forking this repository, and opening a
pull request. Please target your pull requests to the main branch.

### With write access

You can push directly to main for small fixes. Please use PRs to main
for discussing larger updates.
8 changes: 7 additions & 1 deletion man/loupe.Rd

Some generated files are not rendered by default.

20 changes: 12 additions & 8 deletions vignettes/articles/butterfly_in_pipeline.Rmd
@@ -96,7 +96,9 @@ This is what will be submitted to the PDC.

### Configuration

Firstly let's look at our configuration, which is stored in an `ENVS` file. This determines the locations of our input data, our output data and where we will eventually publish our data, among other useful parameters:
Firstly let's look at our configuration, which is stored in an `ENVS` file. This determines the locations of our input data, our output data and where we will eventually publish our data, among other useful parameters.

If you are not familiar with an `ENVS` file, this is a text file which exports environment variables that can subsequently be used in a Bash shell script. Using an `ENVS` file is useful, as it allows us to quickly change pipeline parameters without altering the code of the pipeline itself.
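A minimal sketch of such an `ENVS` file, with illustrative values rather than the pipeline's real configuration:

```shell
# ENVS -- illustrative sketch, not the pipeline's real configuration.
# Each line exports a variable that later pipeline steps can read.
export DATA_DIR="data"
export OUTPUT_DIR="output"
export RSYNC_LOCATION="/tmp/pdc_submission"
export CURRENT_YEAR=$(date +%Y)
export FILE_IDENTIFIER=${CURRENT_YEAR}
```

The pipeline script then picks these variables up with `source ENVS` before referring to `$DATA_DIR`, `$OUTPUT_DIR` and the rest.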

```bash
## Directories
@@ -138,6 +140,8 @@ export FILE_IDENTIFIER=${CURRENT_YEAR}

Now that we have set our configuration, let's inspect the shell script which actually runs our pipeline, `run_asli_pipeline.sh`.

This script would be run in a Bash shell, for example by calling `bash run_asli_pipeline.sh` in a terminal.

```bash
#!/bin/bash
set -e
@@ -161,7 +165,7 @@ do
done
```

The above concerns setting up our pipeline with input and output directories, as well as fetching all environmental variables.
The above concerns setting up our pipeline with input and output directories, as well as fetching all environmental variables, by sourcing `ENVS`.

Next is the calculation step, using the functionality from the `asli` package:

@@ -178,18 +182,18 @@ asli_data_era5 $DATA_ARGS_ERA5
asli_calc $DATA_DIR/era5_mean_sea_level_pressure_monthly_*.nc -o $OUTPUT_DIR/asli_calculation_$FILE_IDENTIFIER.csv
```

Lovely, we now have our calculations ready in `$OUTPUT_DIR`, to rsync to a location given to us by the PDC. To do so for the first time, we will run:
Lovely, we now have our calculations ready in `$OUTPUT_DIR`, to rsync to a location given to us by the UK Polar Data Centre (UK PDC). To do so for the first time, we will run:

```bash
rsync $OUTPUT_DIR/*.csv $RSYNC_LOCATION
echo "Writing to $RSYNC_LOCATION."
```

Let's pretend this was our first submission to the PDC. For any subsequent submission, we will want to use `butterfly` to compare our new results with the file we have just submitted to the `$RSYNC_LOCATION`, to make sure previous values have not changed.
Let's pretend that this was our first submission to the UK PDC. For any subsequent submission, we will want to use `butterfly` to compare our new results with the file we have just submitted to the `$RSYNC_LOCATION`, to make sure previous values have not changed.

#### Incorporate R and `butterfly` into a shell-scripted pipeline

We are going to implement this in an R script called `quality_control.R`, but we will have to provide it with our new calculations and the calculations we did previously and transferred to `$RSYNC_LOCATION`, like:
We are going to implement this in an R script called `quality_control.R`, but we will have to provide it with our new results and the results we submitted to the UK PDC, in the `$RSYNC_LOCATION`:

```bash
Rscript quality_control.R "$OUTPUT_DIR/asli_calculation_$FILE_IDENTIFIER.csv" "$RSYNC_LOCATION/asli_calculation_$FILE_IDENTIFIER.csv"
@@ -198,7 +202,7 @@ Rscript quality_control.R "$OUTPUT_DIR/asli_calculation_$FILE_IDENTIFIER.csv" "$
Here, `$OUTPUT_DIR/asli_calculation_$FILE_IDENTIFIER.csv` is our most recent calculation, in `quality_control.R` this will be referred to as `args[1]`.
The previous calculation, `$RSYNC_LOCATION/asli_calculation_$FILE_IDENTIFIER.csv`, will be `args[2]`.

Let's have a look at `quality_control.R` now. We started off with making this script executable by the shell, provide the user with some instructions on how to use the script, and by obtaining the arguments it was given in `args`.
Let's have a look at `quality_control.R` now. We start off by making this script executable by the shell, providing the user with some instructions on how to use the script, and obtaining the arguments it was given in `args`.

```R
#!/usr/bin/env Rscript
@@ -214,7 +218,7 @@ Next, we will test if those arguments were actually provided, and if so we read
# Test if there are two arguments: the output and previous file
if (length(args)!=2) {
stop(
"Please provide the output file, and the file it is being compared to", call.=FALSE
"Please provide the output file, and the file it is being compared to", call. = FALSE
)
} else {

@@ -231,7 +235,7 @@ existing_file <- readr::read_csv(

Great! Now that the files have been read in, we can start our verification using `butterfly`.

In this case, we will use `butterfly::loupe()` to give us our report, and return either TRUE (previous data has not changed, we are happy to proceed) or FALSE (a change in previous data has been detected, and we should abort data transfer).
In this case, we will use `butterfly::loupe()` to give us our report, and return either `TRUE` (previous data has not changed, we are happy to proceed) or `FALSE` (a change in previous data has been detected, and we should abort data transfer).
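A sketch of how this final check could gate the pipeline (variable names follow the script above; `datetime_variable = "time"` is an assumption):

```r
# Sketch: abort the transfer if previously published data has changed.
unchanged <- butterfly::loupe(
  output_file,
  existing_file,
  datetime_variable = "time"
)

if (!unchanged) {
  # stop() makes Rscript exit with a non-zero status, which the
  # calling shell script (run with set -e) treats as a failure.
  stop("Previous data has changed - aborting data transfer.")
}
```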

```R
# Use butterfly to check there are no changes to past data
28 changes: 18 additions & 10 deletions vignettes/butterfly.Rmd
@@ -18,8 +18,12 @@ The goal of butterfly is to aid in the verification of continually updating time

Unnoticed changes in previous data could have unintended consequences, such as invalidating DOIs, or altering future predictions if used as input in forecasting models.

Other unnoticed changes could include a jump in time or measurement frequency, due to instrument failure or software updates.

This package provides functionality that can be used as part of a data pipeline, to check and flag changes to previous data to prevent changes going unnoticed.

You can provide butterfly with a timeseries dataset to check for continuity, with `timeline()`, and if it is not fully continuous as expected, split it into continuous chunks with `timeline_group()`. To check for changes to previous data, you can provide two versions of the *same* dataset, and `loupe()` will check if there are changes to matching rows, and tell you which rows are new. You can use `catch()` and `release()` to extract or remove rows with changes. Full examples of functionality are provided below.

## Data

This package includes a small dummy dataset, `butterflycount`, which contains a list of monthly dataframes of butterfly counts for a given date.
@@ -29,7 +33,7 @@ library(butterfly)
butterflycount
```

This dataset is entirely fictional, and merely included to aid demonstrating butterfly's functionality.
This dataset is entirely fictional, and merely included to aid in demonstrating butterfly's functionality.

Another dummy dataset, `forestprecipitation`, also contains a list of monthly dataframes, but for fictional rainfall data. This dataset is intended to illustrate an instance of instrument failure leading to timesteps being recorded out of sync.

@@ -57,13 +61,15 @@ butterfly::loupe(

`butterfly::loupe()` uses `dplyr::semi_join()` to match the new and old objects using a common unique identifier, which in a timeseries will be the timestep. `waldo::compare()` is then used to compare these and provide a detailed report of the differences.
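The two-step mechanics described here can be sketched with hypothetical data frames:

```r
library(dplyr)

df_previous <- data.frame(time = 1:2, count = c(10, 20))
df_current  <- data.frame(time = 1:3, count = c(10, 21, 30))

# 1. semi_join() keeps only rows of df_current whose timestep also
#    exists in df_previous -- the overlap that should be identical.
df_matched <- dplyr::semi_join(df_current, df_previous, by = "time")

# 2. waldo::compare() then reports any differences in that overlap
#    (here, count 21 vs 20 at time = 2).
waldo::compare(df_matched, df_previous)
```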

In addition to a report, `loupe()` also returns `TRUE` if there are no differences and `FALSE` when there are differences. This is especially useful when using butterfly in a pipeline that runs in a shell environment, allowing a check for differences to fail gracefully.

`butterfly` follows the `waldo` philosophy of erring on the side of providing too much information, rather than too little. It will give a detailed feedback message on the status between two objects.

### Additional arguments from `waldo::compare()`

You have the flexibility to pass further arguments that `waldo::compare()` accepts, to any butterfly function, for instance to specify the tolerance.
You have the flexibility to pass further arguments accepted by `waldo::compare()` to any of `loupe()`, `catch()` or `release()`.

If we add a tolerance of 2 to the previous example, no differences should be returned:
One such argument is `tolerance`. If we add a tolerance of 2 to the previous example, no differences should be returned:

```{r tolerance_example}
butterfly::loupe(
Expand All @@ -74,11 +80,11 @@ butterfly::loupe(
)
```

Call `?waldo::compare()` to see the full list of arguments.
See `?waldo::compare()` for the full list of arguments.

## Extracting unexpected changes: `catch()`

You might want to return changed rows as a dataframe. For this `butterfly::catch()`is provided.
You might want to return changed rows as a dataframe, instead of returning `TRUE`/`FALSE`. For this, `butterfly::catch()` is provided.

`butterfly::catch()` only returns rows which have **changed** from the previous version. It will not return new rows.

@@ -94,7 +100,7 @@ df_caught

## Dropping unexpected changes: `release()`

Conversely, `butterfly::release()` drops all rows which had changed from the previous version. Note it retains new rows, as these were expected.
Conversely, `butterfly::release()` drops all rows which have changed from the previous version. Note it retains new rows, as these were expected.

```{r butterfly_release}
df_released <- butterfly::release(
@@ -127,7 +133,7 @@ automatically mean it is also continuous.

Measuring instruments can have different behaviours when they fail. For
example, during power failure an internal clock could reset to "1970-01-01",
or the manufacturing date (say, "2021-01-01"). This leads to unpredictable
or the manufacturing date (e.g. "2021-01-01"; this is common behaviour for Raspberry Pis). This leads to unpredictable
ways of checking if a dataset is continuous.

```{r rain_gauge_data}
@@ -211,11 +217,13 @@ To prevent writing different ways of checking for this depending on the instrume

### Variable measurement frequencies

In other cases, a non-continuous timeseries is intentional, for example when there is temporal variability in the measurements taken depending on events. At BAS, we collect data from a penguin weighbridge on weighbridge on Bird Island, South Georgia. This weighbridge measure weight on two different load cells (scales) to determine penguin weight and direction.
In other cases, a non-continuous timeseries is intentional, for example when there is temporal variability in the measurements taken depending on events. At BAS, we collect data from a penguin weighbridge on Bird Island, South Georgia. This weighbridge measures weight on two different load cells (scales), one on the colony side and one on the ocean side, to determine penguin weight and the direction they came from. The idea is that we may be able to derive information on their diet, as they return from the ocean to the colony.

You can read about this work in more detail in [Afanasyev et al. (2015)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0126292), but the important point here is that the weighbridge does not collect continuous measurements. In a remote, off-grid location this would drain batteries far too quickly.

You can read about this work in more detail in [Afanasyev et al. (2015)](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0126292), but the important point here is that the weighbridge does not collect continuous measurement. When no weight is detected on the load cells, it only samples at 1hz, but as soon as any change in weight is detected it will start collecting data at 100hz. This is of course intentional, to reduce the sheer volume of data we need to process, but also has another benefit in isolating (or attempting to) individual crossings.
Therefore, when no weight is detected on the load cells it only samples at 1 Hz, but as soon as any change in weight is detected it will start collecting data at 100 Hz. This is of course intentional, to reduce the sheer volume of data we need to process, but it also has the benefit of isolating (or attempting to isolate) individual crossings.

The individual crossings are the most valuables pieces of data, as these allow us to deduce some sort of information like weight, direction (from colony to sea, or sea to colony) and hopefully ultimately, diet.
The individual crossings are the most valuable pieces of data, as these allow us to deduce information on weight, direction and ultimately, diet.

In this case separating distinct, but continuous segments of data is required. This is the reasoning behind `timeline_group()`. This function allows us to split our timeseries in groups of individual crossings.

