Uncovering Market-Insights for Charles & Keith: K-Means Clustering

From Analytics Practicum

Revision as of 20:37, 17 April 2016


Because the data provided by our sponsor is confidential, we show only the methods of analysis and not the results. Authorised stakeholders may refer to our report for a more in-depth analysis with charts and descriptions.


Approach

To gain more business insight into the stores, we performed a clustering analysis that groups the stores by their customers' purchasing patterns (e.g., whether a customer purchases a single bag or two shoe items in a transaction). The various item combinations serve as the clustering variables. This gives us a deeper understanding of the characteristics of each store than the original grouping, which was based solely on geographical region.

K-means clustering was selected for this analysis because it is a partitioning method that is efficient with large datasets. Its algorithm iterates over two steps: it maintains a set of clusters and places each point into the closest cluster. Hierarchical clustering was not chosen because its time complexity is at least quadratic in the number of observations, O(n²), whereas each K-means iteration is linear, O(n), for a fixed number of clusters and variables.
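
As an illustration of the iteration described above, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is not the tool used in the project; the function names, the random initialisation, and the two-variable store profiles are assumptions made up for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100, seed=0):
    """Sketch of Lloyd's algorithm: assign points, then update centroids."""
    rng = random.Random(seed)
    # Assumption: start from k distinct observations chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: place each point into the closest cluster.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical store profiles: [% single-bag, % two-shoe transactions].
stores = [
    [80.0, 20.0], [78.0, 22.0], [82.0, 18.0],
    [30.0, 70.0], [28.0, 72.0], [33.0, 67.0],
]
labels, centroids = k_means(stores, k=2)
print(labels)
```

Each pass over the data costs one distance computation per point per cluster, which is where the linear cost per iteration comes from.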

Data Preparation

The initial dataset was processed into a format suitable for clustering. The clustering variables are the various combinations of items purchased in a transaction. For each store, the number of transactions is tabulated for each combination. The counts are then converted into percentages of the store's total transactions using the formula below:

% Variable A = [No. of Transactions for Variable A in Store 1 / Total No. of Transactions in Store 1] × 100%
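
The transformation above can be sketched in a few lines of Python. The store names, combination labels, and counts below are hypothetical and do not come from the sponsor's data; the function name `to_percentages` is likewise only illustrative.

```python
def to_percentages(counts_by_store):
    """Convert per-store transaction counts into percentage shares."""
    shares = {}
    for store, counts in counts_by_store.items():
        total = sum(counts.values())  # total no. of transactions in the store
        if total == 0:
            # Assumption: a store with no transactions gets 0% everywhere.
            shares[store] = {combo: 0.0 for combo in counts}
        else:
            shares[store] = {
                combo: 100.0 * n / total for combo, n in counts.items()
            }
    return shares

# Hypothetical tabulated counts per purchase combination.
counts = {
    "Store 1": {"1 bag": 50, "1 shoe": 30, "2 shoes": 20},
    "Store 2": {"1 bag": 10, "1 shoe": 10, "2 shoes": 20},
}
shares = to_percentages(counts)
print(shares["Store 1"])
```

Expressing each variable as a share of the store's own total puts large and small stores on the same scale, so the clustering reflects the purchase mix rather than the sales volume.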


Analysis

The results indicate that the optimal number of clusters is 7, based on the lowest cubic clustering criterion (CCC) value, which is -4.2442. The 7 clusters were then profiled by examining the similarities among members within each cluster and the differences between clusters. Z-scores and parallel plots were used to aid the profiling.
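
One possible way to compute z-scores for profiling is sketched below: for each cluster, the mean of every variable is expressed in standard deviations from the overall mean. This is an assumption about how such a profile could be built, not a description of the project's actual procedure, and the data and the name `cluster_z_profiles` are hypothetical.

```python
from statistics import mean, pstdev

def cluster_z_profiles(rows, labels):
    """Z-score of each cluster's variable means against the overall data."""
    n_vars = len(rows[0])
    overall_mean = [mean([r[j] for r in rows]) for j in range(n_vars)]
    overall_sd = [pstdev([r[j] for r in rows]) for j in range(n_vars)]
    profiles = {}
    for cluster in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == cluster]
        profiles[cluster] = [
            # A constant variable has no spread, so report 0.0 for it.
            (mean([m[j] for m in members]) - overall_mean[j]) / overall_sd[j]
            if overall_sd[j] > 0 else 0.0
            for j in range(n_vars)
        ]
    return profiles

# Hypothetical store profiles and cluster labels.
rows = [[80.0, 20.0], [78.0, 22.0], [30.0, 70.0], [28.0, 72.0]]
labels = [0, 0, 1, 1]
profiles = cluster_z_profiles(rows, labels)
print(profiles)
```

A strongly positive or negative z-score flags a variable on which a cluster differs from the typical store, which is the kind of contrast a parallel plot makes visible.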

[Figure: summary of K-means results]