This project segments customers with k-means clustering, an unsupervised machine learning technique.
The k-means analysis uncovered 6 distinct segments, each characterized by specific behaviors and attributes. These segments enable targeted marketing strategies tailored to each segment's unique needs.
In the analysis process, I followed three critical steps to ensure robust and accurate clustering results:
- Exploratory Data Analysis (EDA): extracted, transformed, and prepared the data for analysis.
- K-means Clustering Optimization: tuned the hyperparameters with a Scree plot (also known as an Elbow plot), then calculated and visualized Silhouette scores to validate the stability of the clusters.
- Segmentation Results and Actionable Insights: visualized the 6 clusters in a 3D plot, assigned meaningful names to each group, and highlighted key insights and recommendations for the top 4 customer segments.
During EDA, I visualized data distributions, identified outliers, and uncovered initial patterns. These patterns later helped verify the k-means clustering results and establish criteria for effective segmentation.
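The write-up does not say which outlier rule was used during EDA. As one possible illustration, the sketch below applies the common 1.5 × IQR rule to a single numeric column. The function name `iqr_outlier_mask` and the `spend` values are hypothetical and are not taken from the project data.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical spending column with one extreme value.
spend = np.array([120, 135, 128, 140, 132, 125, 138, 900])
mask = iqr_outlier_mask(spend)
print("Flagged values:", spend[mask])
```

Whether flagged rows should be dropped, capped, or kept depends on the business context; k-means is sensitive to extreme values because it minimizes squared distances.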
3.1.2 The potential relationships within our dataset that we expect to observe after applying the k-means clustering model
I used the optimized hyperparameters to perform the clustering. I then visualized the 6 clusters in a 3D plot and assigned a meaningful name to each group. To communicate the findings, I used box plots and bar plots to highlight key insights and recommendations for the top 4 customer segments.
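A minimal sketch of this final fitting step, assuming scikit-learn's `KMeans` and three numeric features. The synthetic `make_blobs` data and `random_state=0` are placeholders, not the project's actual dataset or chosen seed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for three customer features (hypothetical data).
X, _ = make_blobs(n_samples=300, centers=6, n_features=3, random_state=42)

# Fit with the chosen number of clusters; the seed here is a placeholder.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Per-cluster size and feature means, a starting point for naming segments.
sizes = np.bincount(labels, minlength=6)
means = np.vstack([X[labels == c].mean(axis=0) for c in range(6)])

for c in range(6):
    print(f"cluster {c}: n={sizes[c]:3d}  feature means={np.round(means[c], 2)}")
```

With three features, the columns of `X` can serve directly as the axes of a 3D scatter plot colored by `labels`.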
I optimized the hyperparameters to achieve the best clustering results. Because this is an unsupervised method with no ground-truth labels, I used indirect metrics to confirm the results:
- Used a Scree plot (also known as an Elbow plot) to determine the optimal number of clusters.
- Calculated Silhouette scores and visualized them in a heatmap to choose the random state.
- Plotted the Silhouette scores to confirm that there were no wide fluctuations, thereby validating the stability of the clusters.
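The indirect metrics listed above can be sketched as follows, assuming scikit-learn. Inertia (the within-cluster sum of squares) is the quantity usually drawn on an Elbow plot, and `silhouette_score` gives the mean Silhouette value for a fitted labeling. The helper `scan_cluster_counts`, the synthetic data, and the candidate range of 2 to 10 clusters are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def scan_cluster_counts(X, k_values, random_state=0):
    """Return (inertias, silhouettes) for each candidate number of clusters."""
    inertias, silhouettes = [], []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        inertias.append(model.inertia_)  # y-axis of the Elbow plot
        silhouettes.append(silhouette_score(X, model.labels_))
    return inertias, silhouettes


# Synthetic stand-in for the scaled customer features (hypothetical data).
X_raw, _ = make_blobs(n_samples=300, centers=6, n_features=3, random_state=42)
X = StandardScaler().fit_transform(X_raw)

k_values = list(range(2, 11))
inertias, silhouettes = scan_cluster_counts(X, k_values)

for k, inertia, sil in zip(k_values, inertias, silhouettes):
    print(f"k={k:2d}  inertia={inertia:9.2f}  silhouette={sil:.3f}")
```

Plotting `inertias` against `k_values` gives the Elbow plot; the "elbow" is the point after which adding clusters reduces inertia only marginally.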
3.2.2 Generate a heatmap to visualize the Silhouette Score against different numbers of clusters and random states
So far, we have confirmed that n_clusters=6 is the optimal value, but many random_state values yield the same Silhouette score. Therefore, we calculate Silhouette scores for different numbers of clusters and random states, then visualize them to choose the parameters whose Silhouette score is as close to 1 as possible, since the score ranges from -1 to 1.
(The definition of Silhouette is explained in the plot)
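A minimal sketch of the grid behind such a heatmap, assuming scikit-learn and NumPy. The helper `silhouette_grid`, the candidate values, and the synthetic data are hypothetical. `n_init=1` is set here as an assumption so that each random state corresponds to a single initialization and the seed visibly affects the result.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_grid(X, k_values, seeds):
    """Silhouette scores; rows are cluster counts, columns are random states."""
    grid = np.empty((len(k_values), len(seeds)))
    for i, k in enumerate(k_values):
        for j, seed in enumerate(seeds):
            labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
            grid[i, j] = silhouette_score(X, labels)
    return grid


# Synthetic stand-in for the customer features (hypothetical data).
X, _ = make_blobs(n_samples=300, centers=6, n_features=3, random_state=42)

k_values = [4, 5, 6, 7, 8]
seeds = list(range(5))
grid = silhouette_grid(X, k_values, seeds)

# Locate the (n_clusters, random_state) pair with the highest score.
best_i, best_j = np.unravel_index(np.argmax(grid), grid.shape)
best_k, best_seed = k_values[best_i], seeds[best_j]
print(f"best n_clusters={best_k}, random_state={best_seed}, score={grid[best_i, best_j]:.3f}")
```

The `grid` array can be passed to a heatmap function such as `seaborn.heatmap` or `matplotlib.pyplot.imshow`, with `seeds` and `k_values` as the axis tick labels.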
3.2.3 Ensure consistency and confirm that there are no wide fluctuations with the chosen hyperparameters
We will validate our choice by plotting Silhouette score against kmeans.labels_. The plot will be examined under these conditions: