clustering_analysis.qmd

---
title: "Cluster Analysis Experiments"
lang: en-GB
author: "Jacqueline Grout"
date: last-modified
date-format: "YYYY-MM-DD"
format:
  html:
    grid:
      sidebar-width: 250px
      body-width: 800px
      margin-width: 250px
      gutter-width: 1.5rem
    embed-resources: true
    smooth-scroll: true
    theme: cosmo
    fontcolor: black
    toc: true
    toc-location: left
    toc-title: Contents
    toc-depth: 3
editor: visual
execute:
  echo: false
  message: false
  warning: false
  freeze: auto
editor_options: 
  chunk_output_type: console
---

## 

```{r}
#| echo: false
# Read in the analysis RData file ----
library(factoextra)
library(cluster)
library(dplyr)
library(targets)
library(GGally)
library(gt)
library(tidyverse)
library(targets)

load("cluster_analysis_v1.Rdata")
```

## Looking for the "elbow"

### All Categories

```{r}
#| echo: false
full_cats_plot
```

::: panel-tabset
## Actual Numbers

```{r}
#| echo: false
full_cats_plot
```

## Percent

```{r}
#| echo: false
full_cats_percents_plot
```
:::

### Five Categories

```{r}
#| echo: false
five_cats_plot
```

::: panel-tabset
## Actual numbers

```{r}
#| echo: false
five_cats_plot
```

## Percent

```{r}
#| echo: false
five_cats_percents_plot
```
:::

## All Categories Gap Plot

Plot to show number of clusters vs. gap statistic for 15 clusters maximum after 50 iterations

```{r}
#| echo: false
full_cats_gap_plot
```

## Five Categories Gap Plot

Plot to show number of clusters vs. gap statistic for 15 clusters maximum after 50 iterations

```{r}
#| echo: false
five_cats_gap_plot
```

## Correlation Matrix

```{r}
#Correlation Matrix
ggcorr(full_cats,
       method = "pairwise",
       nbreaks = 10,
       hjust = 0.9,
       label = TRUE,
       label_size = 3,
       layout.exp = 3)+
  theme(legend.position = c(0.25,0.65))
```

## Scree Plots

::: panel-tabset
## All categories

```{r}
pca <- princomp(scale_full_cats)

screeplot(pca, npcs=14)
```

## Excluding White British

```{r}
pca2 <- princomp(scale_full_cats_percents_ex_whitebritish)

screeplot(pca2, npcs=14)
```
:::

## Results of k-medoids model

All categories with 15 clusters

```{r}
#| echo: false
#plot results of final k-medoids model
fviz_cluster(pams_full_cats, data = full_cats)

```

Visualisation of the ethnic groups of the 15 medoid practices :

![](secret/Picture2.png){fig-align="center"}

Five categories with 4 clusters

```{r}
#| echo: false
#plot results of final k-medoids model
fviz_cluster(pams_five_cats, data = five_cats)
```

Visualisation of the ethnic groups of the 4 medoid practices :

![](secret/Picture1.png){fig-align="center"}

Looking at the 15 clusters visualisation for all categories by eye I wondered if 8 clusters made more sense:

Visualisation of the ethnic groups of the 8 medoid practices :

![](secret/Picture3.png){fig-align="center"}

### Alternative K-Medoid models

#### Experiment with removal of White British completely but leave all practices in the dataset

```{r}
pams_full_cats_percents_ex_whitebritish8_clusters_plot
```

```{r}
final_data_full_cats_percents_ex_whitebritish8_clusters |>
  rownames_to_column(var = 'practice_code') |>
  group_by(cluster)|>
  count("practice_code")|>
  select(cluster,n) |>
  gt()
```

![](secret/Picture4.png){fig-align="center"}

#### Experiment with removal of practice with \> 85% White British manually into their own group

```{r}
pams_full_cats_percents_ex_85whitebritish7_clusters_plot
```

![](secret/Picture5.png){fig-align="center"}

## Five clusters using 14 ethnicities

```{r}
tar_read(elbow_plot)
```

```{r}
tar_read(cluster_plot)
```

```{r}

final_data_full_cats_percent_5_clusters <- tar_read(final_data_full_cats_percent_5_clusters) |> as_tibble()

final_data_full_cats_percent_5_clusters |>
  group_by(cluster)|>
  count("practice_code")|>
  gt()
```

```{r}
tar_read(cluster2_chart)
```

```{r}
tar_read(cluster2_map)
```

## Five clusters using 14 ethnicities with % pop 45+

```{r}
tar_read(elbow_plot_over45)
```

```{r}
tar_read(cluster_plot_over45)
```

```{r}
final_data_full_cats_percent_over45_5_clusters <- tar_read(final_data_full_cats_percent_over45_5_clusters) |> as_tibble()

final_data_full_cats_percent_5_clusters |>
  group_by(cluster)|>
  count("practice_code")|>
  ungroup() |>
  select(cluster,n)|>
  gt()
```

```{r}
tar_read(cluster1_chart)
```

```{r}
tar_read(cluster1_map)
```