Method for hierarchical and chase clustering of biogeographic datasets in R

BWBrook/chase-clustering

chase-clustering

This repository contains data and R scripts for clustering analyses of Quaternary mammal communities, focusing on both a novel iterative “chase” algorithm and a spatially constrained Ward’s hierarchical approach. With these scripts, you can replicate the results of:

Brook, B.W. et al. (2025). Late Pleistocene faunal community patterns disrupted by Holocene human impacts. EcoEvoRxiv.

Contents

  • data/
    CSV files storing the fossil mammal occurrence data and site coordinates:

    • pleistocene_all.csv
    • holocene_all.csv
    • holocene_wild.csv
    • all_site_coords.csv
    • pleistocene_paired.csv
    • holocene_paired.csv
    • paired_site_coords.csv
  • src/
    Five R scripts are available to replicate all analyses described in the paper:

    1. qm_clustchase_func.r – Function library implementing the chase clustering algorithm, which iteratively improves cluster assignments based purely on compositional similarity.
    2. qm_clustchase_main.r – Runs the chase clustering algorithm, including data filtering, on the datasets.
    3. qm_clustgeo.r – Applies Ward’s hierarchical clustering with spatial constraints (via ClustGeo) to illustrate the role of geography in shaping community structure.
    4. qm_domesticates.r – Uses the adjusted Rand index (ARI) and summary statistics to examine the impact of domesticated species on Holocene composition.
    5. qm_turnover_paired_sites.r – Calculates site-level turnover for paired localities that have both Late Pleistocene and Holocene faunal data, displaying shifts in species composition through time.
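
For readers unfamiliar with the ARI mentioned for qm_domesticates.r, the following is a minimal base-R sketch of the adjusted Rand index between two cluster assignments. It is not taken from the repository scripts; the function name `adjusted_rand_index` and the example labels are hypothetical, and the scripts may compute the index differently (for example via a package).

```r
# Hypothetical helper (not from the repository): adjusted Rand index
# between two cluster label vectors of equal length.
adjusted_rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)                       # contingency table of labels
  n <- length(a)
  pairs <- function(x) sum(choose(x, 2))   # number of within-group pairs

  sum_cells <- pairs(tab)
  sum_rows  <- pairs(rowSums(tab))
  sum_cols  <- pairs(colSums(tab))

  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2

  # Undefined (NaN) when both partitions place every item in a single cluster
  (sum_cells - expected) / (max_index - expected)
}

# Identical partitions with different labels agree perfectly
adjusted_rand_index(c(1, 1, 2, 2, 3, 3), c("x", "x", "y", "y", "z", "z"))
```

A value of 1 indicates identical partitions, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than expected by chance.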

Requirements

  • R version: 4.0 or higher (tested on 4.4.2)
  • R packages:
    • dplyr (data wrangling)
    • magrittr (pipe operator %>%)
    • tibble (row/column manipulation)
    • ggplot2 (plotting)
    • sf or sp/rgdal (for mapping; rgdal has been retired from CRAN, so sf is the easier option on current R installations)
    • maps or equivalent (for quick world map polygons)
    • ClustGeo (for Ward’s hierarchical + spatial constraints)
    • Additional packages used may be loaded via import::from statements in the scripts

Make sure these packages are installed, for example:

install.packages(c("dplyr", "magrittr", "tibble", "ggplot2", "sf", "maps", "ClustGeo"))

(This list may not be exhaustive; check the script headers for any others.)
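
To see which of the listed packages are missing before installing anything, a small check along these lines may help. The helper name `missing_packages` is hypothetical and is not part of the repository; the package vector simply repeats the list above.

```r
# Packages named in the Requirements list
required <- c("dplyr", "magrittr", "tibble", "ggplot2", "sf", "maps", "ClustGeo")

# Hypothetical helper: return the subset of `pkgs` that cannot be loaded
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
} else {
  message("All listed packages are available.")
}
```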

Usage

  1. Clone or download this repository to your local machine.

  2. Open R (or RStudio) and set your working directory to the cloned folder:

    setwd("path/to/chase-clustering")  # adapt as needed
  3. Run the scripts in the /src folder in whichever order suits your interest.

  4. Inspect outputs: each script produces console output, plots (in the R graphics window), and optionally saves figures to disk (.png or .pdf).

You may edit script parameters (e.g., max_clusters, min_sp, etc.) to explore different settings.
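
As a sketch of step 3, a script can be run from the R console with `source()`. The wrapper `run_script` below is a hypothetical convenience, not part of the repository; it assumes the working directory is the repository root and that the scripts live in src/ under the names listed in Contents. Whether qm_clustchase_main.r loads qm_clustchase_func.r itself is not stated here, so check the script header.

```r
# Hypothetical helper: source one script from src/ if it is present
run_script <- function(name, dir = "src") {
  path <- file.path(dir, name)
  if (!file.exists(path)) {
    message("Script not found: ", path, " (check your working directory)")
    return(invisible(FALSE))
  }
  source(path, echo = FALSE)
  invisible(TRUE)
}

# Script name taken from the Contents list
run_script("qm_clustgeo.r")
```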

Data Description

All CSV files in data/ are read by the R scripts to replicate the analyses in the forthcoming paper:

  • pleistocene_all.csv, holocene_all.csv and holocene_wild.csv: large presence/absence (or count) matrices (rows = species, columns = sites).
  • all_site_coords.csv: coordinates (longitude, latitude) and other metadata for each site in the full dataset.
  • pleistocene_paired.csv, holocene_paired.csv, paired_site_coords.csv: smaller subset of 34 “paired” sites that have data in both periods, used for turnover analyses.
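
The species-by-site layout can be illustrated with a toy file. This sketch writes a made-up matrix to a temporary CSV and reads it back; it assumes the first column holds species names, which may not match the actual files, so inspect them before adapting the call.

```r
# Toy stand-in for a file such as data/pleistocene_all.csv (values are made up)
toy <- data.frame(
  species = c("sp_a", "sp_b", "sp_c"),
  site_1  = c(1, 0, 1),
  site_2  = c(1, 1, 0),
  site_3  = c(0, 0, 1)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Assumption: the first column holds species names
occ <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))

# Convert counts (if any) to presence/absence
pa <- (occ > 0) * 1

dim(pa)       # species x sites
colSums(pa)   # species richness per site
rowSums(pa)   # number of sites occupied by each species
```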

Replicating the Study

  1. Data Filtering: scripts automatically exclude sites with fewer than a threshold number of species (usually 5) to reduce noise.
  2. Clustering:
    • The chase method (purely compositional) is iterative and stochastic, so results can vary slightly from run to run. Use a large number of max_starts and max_shuffles to stabilise them.
    • Ward’s method merges sites hierarchically, minimizing a criterion that blends compositional and geographic dissimilarity (as implemented in ClustGeo).
  3. Turnover: For the paired sites, a straightforward index of compositional change between Late Pleistocene and Holocene is computed, and a map is generated to illustrate which sites underwent the greatest faunal shifts.
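
The filtering and spatially constrained clustering steps can be sketched on toy data as follows. This is not the repository code. The toy matrix, the use of Jaccard dissimilarity, the Euclidean distance on raw coordinates, and the values of `alpha` and `k` are all assumptions made for illustration; `min_sp` borrows the parameter name mentioned under Usage on the assumption that it is the species threshold. The ClustGeo call runs only if the package is installed.

```r
# Toy presence/absence matrix: 10 species (rows) x 6 sites (columns)
richness <- c(7, 6, 3, 8, 2, 5)   # made-up species counts per site
start    <- c(0, 1, 4, 5, 2, 6)   # made-up offsets so sites differ
pa <- matrix(0, nrow = 10, ncol = 6,
             dimnames = list(paste0("sp_", 1:10), paste0("site_", 1:6)))
for (j in 1:6) {
  pa[(start[j] + seq_len(richness[j]) - 1) %% 10 + 1, j] <- 1
}

# Toy coordinates in the same site order
coords <- data.frame(lon = c(0, 5, 10, 20, 25, 30),
                     lat = c(40, 42, 45, 50, 52, 55),
                     row.names = colnames(pa))

# Step 1: drop sites with fewer than min_sp species
min_sp <- 5
keep   <- colSums(pa) >= min_sp
pa     <- pa[, keep, drop = FALSE]
coords <- coords[keep, , drop = FALSE]

# Step 2: compositional and geographic distance matrices between sites
D0 <- dist(t(pa), method = "binary")  # "binary" in stats::dist is Jaccard
D1 <- dist(coords)                    # crude: Euclidean distance in degrees

# Spatially constrained Ward clustering, if ClustGeo is installed
if (requireNamespace("ClustGeo", quietly = TRUE)) {
  tree   <- ClustGeo::hclustgeo(D0, D1, alpha = 0.3)  # alpha chosen arbitrarily
  groups <- cutree(tree, k = 2)                       # k chosen arbitrarily
  print(groups)
}
```

In `hclustgeo`, `alpha = 0` uses composition only and larger values give more weight to geography; ClustGeo also provides `choicealpha()` to help pick a value.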

Refer to the inline comments in each script for detailed steps and explanations.
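
The turnover step can be illustrated with one common index of compositional change, Jaccard dissimilarity between the two periods at each site. This README does not say which index qm_turnover_paired_sites.r uses, so the choice of Jaccard, the function name `jaccard_turnover`, and the toy matrices are assumptions.

```r
# Toy presence/absence matrices (species x sites) for the two periods
species <- paste0("sp_", 1:5)
sites   <- c("site_a", "site_b")
pleistocene <- matrix(c(1, 1, 1, 0, 0,
                        1, 1, 0, 0, 1), nrow = 5,
                      dimnames = list(species, sites))
holocene    <- matrix(c(1, 0, 1, 1, 0,
                        1, 1, 0, 0, 1), nrow = 5,
                      dimnames = list(species, sites))

# Hypothetical helper: Jaccard turnover per site,
# 1 - (species shared by both periods) / (species present in either period)
jaccard_turnover <- function(before, after) {
  stopifnot(identical(dimnames(before), dimnames(after)))
  shared <- colSums(before > 0 & after > 0)
  either <- colSums(before > 0 | after > 0)
  ifelse(either > 0, 1 - shared / either, NA_real_)
}

turnover <- jaccard_turnover(pleistocene, holocene)
turnover  # site_a = 0.5 (2 of 4 species shared), site_b = 0 (no change)
```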

Citation

If you use this repository, code, or data in your own work, please cite:

  • The forthcoming paper:
    Brook, B.W. et al. (2025). Late Pleistocene faunal community patterns disrupted by Holocene human impacts. EcoEvoRxiv. (Exact details will be added here once published.)

  • This repository:

    @misc{brook2025chaseclustering,
      author = {Brook, B.W. and others},
      title = {chase-clustering},
      howpublished = {\url{https://github.com/bwbrook/chase-clustering}},
      year = {2025}
    }
    

License

This project is released under the CC0-1.0 License. See LICENSE for details. Data and code are provided “as is,” without warranty of any kind.

Contributing

Pull requests and issue reports are welcome. For major changes, please open an issue first to discuss proposed modifications.

Contact

For questions or collaboration inquiries, please contact Barry Brook at [email protected].


Please explore, adapt, and share. We hope these methods facilitate further research into Quaternary mammal community patterns and beyond.
