Difference between revisions of "REO Project Findings Cluster"

From Analytics Practicum
Revision as of 00:38, 27 February 2018



Having explored the differences between paid and free users, the team performs a cluster analysis using an approach adapted from Punj and Stewart (1983). The approach is:
1. Use the k-nearest neighbour algorithm to detect outliers.
2. Calculate the cubic clustering criterion (CCC) to decide on the number of clusters.
3. Run k-means to obtain the final cluster solution.

The team hopes to identify segments within the paid and free users so that REO can formulate strategies to engage each segment effectively.

The team excludes users who have zero enquiries and zero cobroke requests, because enquiries and cobroke requests are the tangible outputs of the platform. These users have received no output from the platform in the last 6 months and may not be relevant to the cluster analysis. In addition, k-means clustering only uses rows that have no missing values, so the team imputes missing values with zero so that every remaining row enters the analysis.

After filtering, there are 4749 users who have paid for the platform and have received at least one enquiry or cobroke request.
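The filtering and imputation steps described above can be sketched as follows. This is a minimal illustration only: the records, the field names (`sessions`, `listings`, `enquiries`, `cobroke`) and the order of the two steps (impute first, then filter) are assumptions, since the source does not show the team's actual data layout or tooling.

```python
# Hypothetical user records; None marks a missing value.
users = [
    {"user_id": "A", "sessions": 12, "listings": None, "enquiries": 3, "cobroke": 0},
    {"user_id": "B", "sessions": 5, "listings": 2, "enquiries": 0, "cobroke": None},
    {"user_id": "C", "sessions": None, "listings": 7, "enquiries": None, "cobroke": 4},
]

# Assumed names for the numeric activity fields.
NUMERIC_FIELDS = ["sessions", "listings", "enquiries", "cobroke"]


def impute_zero(record):
    """Return a copy of the record with missing numeric values replaced by 0."""
    filled = dict(record)
    for field in NUMERIC_FIELDS:
        if filled.get(field) is None:
            filled[field] = 0
    return filled


def has_output(record):
    """True if the user received at least one enquiry or cobroke request."""
    return record["enquiries"] > 0 or record["cobroke"] > 0


# Impute first so that the comparison in has_output never sees None.
imputed = [impute_zero(u) for u in users]
retained = [u for u in imputed if has_output(u)]

print([u["user_id"] for u in retained])  # → ['A', 'C']
```

User B is dropped because both output counts are zero after imputation, while A and C each keep at least one enquiry or cobroke request.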

Decision on Clustering Variables

The team uses the key activity indicators from the previous analysis: the number of sessions, the number of listings posted, the number of enquiries and the number of cobroke requests received. Sessions and listings are further broken down by weekday and weekend and by individual timeslot.
REO Clustering Variables.png

Examining Clustering Variables

The team conducted a descriptive analysis of all clustering variables to identify outliers and check for skewness.
REO paid Descriptive.png
The team excluded rows 1328, 3229, 4727 and 6442, which are extreme outliers that differ markedly from the other points.
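As an illustration of the skewness check, the sketch below computes the moment-based (Fisher-Pearson) skewness coefficient of a single variable. The source does not say which skewness statistic or software the team used, so the formula choice and the sample values are assumptions.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # a constant variable has no skew
    return m3 / m2 ** 1.5


# Hypothetical counts: many zeroes and one large value give a long right tail.
sample = [0, 0, 0, 0, 10]
print(skewness(sample))      # positive, so the variable is right-skewed
print(skewness([1, 2, 3]))   # symmetric data gives zero
```

A clearly positive value indicates the long right tail that motivates a transformation before clustering.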

Transforming the Variables

Given that the variables are positively skewed, the team applies a cube-root transformation to them. A logarithmic transformation is not used because up to 50% of the entries in the “ORGANIC” and “Cobroke_total” variables are zero, and the logarithm of zero is undefined. Even after the transformation, these two variables remain slightly skewed because of the number of zeroes.
REO paid Transformation.png
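The difference between the two transformations on zero-heavy counts can be seen in a small sketch. The helper name `cube_root` and the sample counts are illustrative assumptions, not the team's code.

```python
import math


def cube_root(x):
    """Real cube root; defined at zero (and for negative values)."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


# Hypothetical activity counts with several zeroes.
counts = [0, 0, 1, 8, 27, 1000]
transformed = [round(cube_root(c), 6) for c in counts]
print(transformed)

# The logarithm cannot handle the zero entries.
try:
    math.log(0)
except ValueError as err:
    print("log(0) failed:", err)
```

The cube root maps zero to zero and compresses large values, so no rows need to be dropped or shifted by a constant before transforming.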

K-Means Clustering

The cubic clustering criterion (CCC), computed with Ward’s minimum within-cluster variance method, is used to estimate the number of clusters.
REO paid CCC.png
Based on the graph above, the team sets the number of clusters to k = 5.
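Once k has been chosen, the k-means step itself can be sketched with Lloyd's algorithm. This is a simplified illustration under stated assumptions: it does not compute the CCC, the starting centroids are drawn at random from the data with a fixed seed, and the function name `k_means` and the toy points are hypothetical. The source does not state which software produced the team's solution.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialisation: random rows
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the solution has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels, centroids


# Toy data: two well-separated groups in two (already transformed) variables.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

In practice the input would be the transformed clustering variables, and the run would be repeated from several starting points because k-means can settle in a local optimum.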

Profiling of Clusters

REO paid parallel.png
The z-score ranking method is used to profile the clusters. For each variable, the z-score measures how many standard deviations the cluster’s mean lies from the overall mean of that variable. A positive z-score means the cluster is above the overall mean, while a negative z-score means it is below.
REO paid zscore.png
REO paid graph.png
REO paid profile.png
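The z-score profiling described above can be sketched for a single variable as follows. The source does not specify which standard deviation is used, so this sketch assumes the population standard deviation of the variable across all clustered users; the function name and the sample values are hypothetical.

```python
import statistics


def cluster_z_scores(values, labels):
    """z-score of each cluster's mean relative to the overall mean of one variable."""
    overall_mean = statistics.fmean(values)
    overall_sd = statistics.pstdev(values)  # assumption: population SD
    if overall_sd == 0:
        raise ValueError("variable is constant; z-scores are undefined")
    by_cluster = {}
    for value, label in zip(values, labels):
        by_cluster.setdefault(label, []).append(value)
    return {
        label: (statistics.fmean(members) - overall_mean) / overall_sd
        for label, members in by_cluster.items()
    }


# Hypothetical transformed session counts and their cluster labels.
sessions = [1, 2, 3, 7, 8, 9]
cluster = [0, 0, 0, 1, 1, 1]
z = cluster_z_scores(sessions, cluster)
print(z)  # cluster 0 is below the overall mean, cluster 1 is above it
```

Repeating this for every clustering variable gives one row of z-scores per cluster, which can then be ranked to describe each segment.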

Free Users

After filtering, there are 5583 users who use the platform for free and have received at least one enquiry or cobroke request.

Decision on Clustering Variables

REO Clustering Variables.png

Examining Clustering Variables

The team conducted a descriptive analysis of all clustering variables to identify outliers and check for skewness.
REO free Descriptive.png
The team excluded row 10583 because it is an extreme outlier.

Transforming the Variables

Similarly, the team performed a cube-root transformation on all clustering variables.
REO free Transformation.png

Screening Outliers

The k-nearest neighbours distance is used to screen for outliers: a row that lies far from its nearest neighbour is isolated from the rest of the data. The team excludes 11 rows whose distance to the nearest neighbour (k = 1) is above 4.0.
REO free Outlier.png
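A minimal sketch of this screening step computes, for every row, the Euclidean distance to its single nearest neighbour and flags rows above the threshold. The 4.0 cutoff comes from the text above; the function names, the toy points and the choice of Euclidean distance are assumptions.

```python
import math


def nearest_neighbour_distances(points):
    """Distance from each point to its closest other point (k = 1)."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def screen_outliers(points, threshold):
    """Indices of points whose nearest-neighbour distance exceeds the threshold."""
    distances = nearest_neighbour_distances(points)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy data: four close points and one isolated point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (9, 9)]
flagged = screen_outliers(points, threshold=4.0)
print(flagged)  # → [4]
```

This brute-force version compares every pair of rows, which is adequate for a few thousand users but would need a spatial index for much larger data.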

K-Means Clustering

REO free CCC.png
Based on the maximum of the cubic clustering criterion, the optimal number of clusters is 7.


Profiling of Clusters

REO free parallel.png
Similarly, the team calculated the z-score of each variable for each cluster.
REO free zscore.png
REO free graph.png
REO free profile.png