Uncovering Market-Insights for Charles & Keith: K-Means Clustering

From Analytics Practicum

Because the data provided by our sponsor is confidential, we show only the methods of analysis and not the results. Authorised stakeholders may refer to our report for a more in-depth analysis with charts and descriptions.


Approach

To view the stores with more business insight, we performed a clustering analysis that groups the stores by their customers' purchasing patterns (e.g. whether a customer buys a single bag or two shoe items in one transaction). The various item combinations serve as the clustering variables. This gives us a deeper understanding of the characteristics of each store than the original grouping, which was based solely on geographical region.

K-means clustering was selected for this analysis because it is a partitioning method that is efficient and robust on large datasets. Its algorithm iterates over a set of clusters: it places each point into the closest cluster, updates the clusters, and repeats until the assignments stabilise. Hierarchical clustering was not chosen because its time complexity is quadratic in the number of observations, O(n²), whereas K-means is linear, O(n), for a fixed number of clusters and iterations.
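The iteration described above can be sketched in a few lines of plain Python. This is only an illustration, not the tool or configuration used in the project: the function names, the `initial_centroids` parameter, the random initialisation and the sample store profiles are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Cluster `points` into `k` groups; returns (labels, centroids)."""
    if initial_centroids is None:
        # Assumption: start from k distinct observations chosen at random.
        initial_centroids = random.Random(seed).sample(list(points), k)
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: place each point into the closest cluster.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-variable store profiles (percentages), not sponsor data.
stores = [(60.0, 10.0), (62.0, 9.0), (58.0, 11.0),
          (20.0, 45.0), (22.0, 44.0), (18.0, 46.0)]
labels, centroids = k_means(
    stores, k=2, initial_centroids=[(50.0, 0.0), (0.0, 50.0)]
)
```

Each pass touches every point once per cluster, which is why the cost grows linearly with the number of observations when the number of clusters and iterations is fixed.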


Data Preparation

The initial dataset has been processed into a format suitable for clustering. The clustering variables are the various combinations of items purchased in a transaction. The total number of transactions in each store is then tabulated by combination.

The 9 clustering variables are:

  1. % Shoes (1): % of transactions with 1 shoe item
  2. % Bags (1): % of transactions with 1 bag
  3. % Shoes (2): % of transactions with 2 shoe items
  4. % Bags (2): % of transactions with 2 bags
  5. % Shoes (1) and Bags (1): % of transactions with 1 shoe item and 1 bag
  6. % Shoes (3): % of transactions with 3 shoe items
  7. % Shoes (2) and Bags (1): % of transactions with 2 shoe items and 1 bag
  8. % Shoes (1) and Bags (2): % of transactions with 1 shoe item and 2 bags
  9. % Bags (3): % of transactions with 3 bags
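As an illustration of how transactions could be tabulated into these nine combinations, here is a minimal sketch. The representation of a transaction as a list of category strings, the names `COMBINATIONS`, `classify_transaction` and `tabulate_store`, and the choice to skip transactions that match none of the nine combinations are all assumptions for the example; the source does not describe how such transactions were handled.

```python
from collections import Counter

# (number of shoe items, number of bags) -> clustering variable
COMBINATIONS = {
    (1, 0): "% Shoes (1)",
    (0, 1): "% Bags (1)",
    (2, 0): "% Shoes (2)",
    (0, 2): "% Bags (2)",
    (1, 1): "% Shoes (1) and Bags (1)",
    (3, 0): "% Shoes (3)",
    (2, 1): "% Shoes (2) and Bags (1)",
    (1, 2): "% Shoes (1) and Bags (2)",
    (0, 3): "% Bags (3)",
}


def classify_transaction(items):
    """Return the clustering variable for one transaction, or None.

    `items` is a list of category strings such as ["shoes", "bags"].
    Assumption: transactions with other categories, or with item counts
    outside the nine combinations, are not assigned to any variable.
    """
    shoes = items.count("shoes")
    bags = items.count("bags")
    if shoes + bags != len(items):
        return None
    return COMBINATIONS.get((shoes, bags))


def tabulate_store(transactions):
    """Count one store's transactions per clustering variable."""
    counts = Counter({name: 0 for name in COMBINATIONS.values()})
    for items in transactions:
        name = classify_transaction(items)
        if name is not None:
            counts[name] += 1
    return counts


# Hypothetical transactions for a single store.
example_counts = tabulate_store([["shoes"], ["shoes"], ["bags", "bags"]])
```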

The data is then transformed using the formula below:

% Variable A = (No. of transactions for Variable A in Store S / Total no. of transactions in Store S) x 100%

This transformation is required to put stores with different levels of performance and years of operation on a common scale. Every value lies between 0 and 100.
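The transformation can be sketched as a small helper. The function name `percent_shares` and the sample counts are hypothetical; the store's total number of transactions is passed in separately so that the example makes no assumption about which transactions count towards the denominator.

```python
def percent_shares(counts, total_transactions):
    """Convert per-variable transaction counts for one store to percentages.

    counts: dict mapping a clustering variable to its transaction count.
    total_transactions: total number of transactions in the same store.
    """
    if total_transactions <= 0:
        raise ValueError("a store needs at least one transaction")
    return {
        name: 100.0 * count / total_transactions
        for name, count in counts.items()
    }


# Hypothetical counts for one store with 200 transactions in total.
shares = percent_shares({"% Shoes (1)": 50, "% Bags (1)": 25}, 200)
```

Because each count is divided by that store's own total, a large store and a small store with the same purchasing mix end up with the same profile.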


Analysis

The results indicate that the optimal number of clusters is 7, based on the lowest cubic clustering criterion (CCC), which is -4.2442. The 7 clusters were profiled by taking into account the similarities among the members of each cluster and the differences between clusters. Z-scores and parallel plots were used to aid the profiling.
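The source does not state exactly how the Z-scores were computed. One common approach, sketched here purely as an assumption, compares each cluster's mean for a variable with the mean across all stores, in units of the overall standard deviation. The function name, the use of the population standard deviation and the sample values are all illustrative.

```python
from statistics import mean, pstdev


def cluster_z_scores(values, labels):
    """Z-score of each cluster's mean for a single clustering variable.

    values: one value per store (e.g. its "% Shoes (1)").
    labels: the cluster label of each store, in the same order.
    Assumption: z = (cluster mean - overall mean) / overall population SD.
    """
    overall_mean = mean(values)
    overall_sd = pstdev(values)
    if overall_sd == 0:
        raise ValueError("the variable has no variation across stores")
    scores = {}
    for cluster in sorted(set(labels)):
        members = [v for v, lab in zip(values, labels) if lab == cluster]
        scores[cluster] = (mean(members) - overall_mean) / overall_sd
    return scores


# Hypothetical values for four stores in two clusters.
z = cluster_z_scores([10.0, 20.0, 30.0, 40.0], [1, 1, 2, 2])
```

Under this definition, a Z-score above 1 means that the cluster's average for that variable sits more than one standard deviation above the average across all stores.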


[Figure: Summary of K-means clusters]


Cluster 1: The stores in C1 perform well in transactions that contain shoes, regardless of the number of items in the transaction (i.e. single- or multiple-item transactions). This is seen in the high Z-scores of the clustering variables involving shoes: 1.96 for “% Shoes (1)”, 1.71 for “% Shoes (2)” and 1.46 for “% Shoes (3)”. The parallel plot for C1 has clear peaks at the variables that indicate shoe purchases.


Cluster 2: The stores in C2 perform better than the other clusters in transactions that contain bags, regardless of the number of items in the transaction (i.e. single- or multiple-item transactions). The Z-scores for “% Bags (1)” and “% Bags (2)” are both above 1.


Cluster 3: The stores in C3 performed well only in multiple-item transactions that contain shoes (e.g. 2-item transactions with 2 shoe items). The most prominent clustering variables are “% Shoes (2)” and “% Shoes (3)”, each with a Z-score of approximately 1.1.


Cluster 4: The stores in C4 performed well only in single-item transactions that contain shoes. The Z-scores of all the clustering variables are negative or near 0, except for “% Shoes (1)”, which has a Z-score of 1.12. This shows that this particular variable contributed the most to the formation of the cluster.


Cluster 5: The stores in C5 performed reasonably well only in single-item transactions that contain bags, with a Z-score of 0.72 for the clustering variable “% Bags (1)”. The Z-scores of the other variables are negative or near 0.


Cluster 6: The stores in C6 did not stand out in any of the clustering variables, with Z-scores only slightly above 0 for “% Shoes (1)” and “% Bags (1)”. This suggests that the C6 stores are the average performers.


Cluster 7: The stores in C7 performed significantly well in multiple-item transactions, regardless of the categories of the items purchased. The Z-scores of all the clustering variables are above 1, except for “% Shoes (1)” and “% Bags (1)”. This indicates that the C7 stores are the high performers in multiple-item transactions, which the parallel plot also suggests.

Moreover, we have mapped the store clusters against the original geographical classification to observe trends.
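One simple way to carry out such a mapping is a cross-tabulation of cluster labels against the original groups. The sketch below is an assumption about the mechanics only: the function name, the region names and the cluster labels are hypothetical and do not reflect the sponsor's stores.

```python
from collections import Counter


def cross_tabulate(clusters, regions):
    """Count stores for every (cluster, region) pair.

    clusters: cluster label of each store.
    regions: original geographical group of each store, in the same order.
    """
    if len(clusters) != len(regions):
        raise ValueError("clusters and regions must have the same length")
    return Counter(zip(clusters, regions))


# Hypothetical labels for five stores.
table = cross_tabulate(
    [1, 1, 2, 7, 7],
    ["Region A", "Region A", "Region B", "Region A", "Region B"],
)
```

A region whose stores are spread across many clusters would suggest that geography alone does not explain purchasing patterns.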