Difference between revisions of "REO Project Findings Cluster"

From Analytics Practicum
 
(7 intermediate revisions by the same user not shown)
Line 38: Line 38:
 
| style="padding: 0.25em; font-size: 90%; border-top: 1px solid #cccccc; border-left: 1px solid #cccccc; border-right: 1px solid #cccccc; border-bottom: 1px solid #cccccc; text-align:center; background-color: none; width:30%" | [[REO_Project_Findings_EDA | <font color="#053B6B">Exploratory Data Analysis</font>]]
 
| style="padding: 0.25em; font-size: 90%; border-top: 1px solid #cccccc; border-left: 1px solid #cccccc; border-right: 1px solid #cccccc; border-bottom: 1px solid #cccccc; text-align:center; background-color: none; width:30%" | [[REO_Project_Findings_EDA | <font color="#053B6B">Exploratory Data Analysis</font>]]
  
| style="padding: 0.25em; font-size: 90%; border-top: 1px solid #cccccc; border-left: 1px solid #cccccc; border-right: 1px solid #cccccc; border-bottom: 1px solid #cccccc; text-align:center; background-color: #404040; width:30%" | [[REO_Project_Findings_Cluster | <font color="#ffffff">Cluster Analysis</font>]]
+
| style="padding: 0.25em; font-size: 90%; border-top: 1px solid #cccccc; border-left: 1px solid #cccccc; border-right: 1px solid #cccccc; border-bottom: 1px solid #cccccc; text-align:center; background-color: #404040; width:30%" | [[REO_Project_Findings_Cluster | <font color="#ffffff">Clustering Analysis</font>]]
 
|}
 
|}
  
 
<br>
 
<br>
  
Having explored the differences between paid and free users, the team decides to perform a cluster analysis via an adapted approach suggested by Punj and Stewart (1983). The approach is:
+
==K-Means Clustering==
1. K-nearest neighbour algorithm to detect outliers
+
The k-means clustering algorithm is an “iterative algorithm that partitions the observations”. The user supplies a pre-determined value of k, the number of partitions. The algorithm begins by selecting random observations as the starting centroids, then allocates each observation to its nearest centroid based on Euclidean distance. A new centroid is calculated as the mean of each cluster, and each observation is then reallocated to its nearest new centroid. The algorithm iterates until either convergence occurs, where there is little to no change in cluster membership, or the maximum number of iterations is reached.
2. Calculating Cubic Clustering Criterion (CCC) to decide on number of clusters  
+
<br>The usual process begins with selecting a pre-determined K as the number of clusters. JMP Pro offers the option of specifying a range of cluster counts, and the program picks the optimal K using the Cubic Clustering Criterion (CCC). Based on the CCC, the suggested optimal number of clusters is 9, where the value peaks at 225.113. <br>
3. K-means to obtain final cluster solution
+
[[File:REO_fig6.png|300px]]<br>
The team hopes to identify segments within the paid or free users such that REO can formulate strategies for each segment to effectively engage them.
+
[[File:REO_fig7.png|300px]]
The team decided to exclude entries with zero enquiries and zero cobroke requests, as these are the tangible outputs of the platform. Such users have not received any output from the platform in the last 6 months and so may not be relevant to our cluster analysis.
+
<br>Although the cluster membership appears to be uneven, this is a drastic improvement over the previous iteration. Clusters 3 and 6 are the dominant clusters, each with more than 3000 observations.<br>
In addition, k-means clustering only includes entries with no missing values. As such, the team imputed missing values with zero so that the analysis could run.
+
[[File:REO_fig8.png|300px]]
 +
<br>The biplot shows several overlaps between clusters. Clusters 3 and 6 overlap heavily, while clusters 1, 5 and 8 overlap slightly.
 +
<br>Such overlaps could be due to the unresolved issues of a high proportion of zeroes and extreme outliers. To address the overlapping clusters, the normal mixtures clustering method was used. The optimal K obtained from the regular k-means clustering was used as the input K for normal mixtures. In JMP Pro, the option to identify outliers and classify them as Cluster 0 was enabled.<br>
 +
[[File:REO_fig9.png|800px]]
 +
<br>Compared to the k-means analysis, the normal mixtures method resolved the overlap between clusters 3 and 6 by identifying them as one and the same cluster. However, the resulting clusters are not stable, as seen from the differing results when the same analysis was conducted three times. Furthermore, the membership sizes of the clusters are still largely uneven, and some clusters still overlap. Therefore, further analysis is based on the output of the k-means analysis, as its results are more stable.<br>
 +
[[File:REO_fig10.png|400px]]
 +
<br>The parallel coordinate plot shows the profiles of the segments. The algorithm developed distinct profiles for each cluster except clusters 3 and 6, which appear to exhibit similar characteristics. The earlier biplot did suggest that they might be similar, given their overlap. Despite the slight overlaps among clusters 1, 5 and 8, the parallel coordinate plot shows that these clusters are different. Therefore, it is recommended that clusters 3 and 6 be merged into one cluster.
 +
<br>Aside from clusters 1 and 8, cluster membership also appears to be influenced by the type of membership, as a single class dominates each of the other clusters.<br>
 +
[[File:REO_fig11.png|300px]]
 +
<br><b>Profiling of Clustering Results</b><br>
 +
[[File:REO_table5.png|600px]]
 +
<br>The clustering analysis was able to develop differentiated profiles. However, the distribution of the data was not ideal. The variables are severely skewed because of the large proportion of zeroes in each of them. After filtering out users with no activity over the past six months, only approximately 60% of the initial observations remain. Although it is possible to apply stricter criteria to filter the data and reduce skewness further, the number of observations removed would have been unreasonable. For example, if users without at least one session per month were removed, almost half of all observations would have been eliminated.
 +
<br>In addition, the number of observations in each cluster is not balanced. As seen above, there are two dominant clusters with 30.5% and 38.0% of observations, while two clusters have sparse membership at 0.4% and 1.0%. Cluster sizes should be broadly similar so that segment-targeted strategies have a more substantial effect.
 +
<br>While the k-means clustering algorithm was useful in providing profiles for further business analysis, other types of cluster analysis may segment users more effectively. As such, we explore latent class analysis next.
  
==Paid Users==
+
==Latent Class Analysis==
After filtering, there are 4749 users who have paid for the platform and have received at least one enquiry or cobroke request.
+
<b>Discretization of Continuous Variables</b>
===Decision on Clustering Variables===
+
<br>To prepare for latent class analysis, the continuous variables first need to be converted into discrete variables. After placing the zeroes into a bin of their own, each continuous variable is discretized into quartiles via the Interactive Binning add-in in JMP. As the fifth bin contains the extreme outliers, as seen below, this also helps to ameliorate the problem of skewness.<br>
The team has chosen to use key activity indicators from the previous analysis, which include the number of sessions, number of listings posted, number of enquiries and number of Cobroke requests received. The numbers of sessions and listings are then further broken down into weekday and weekend, and into individual timeslots.
+
[[File:REO_fig12.png|300px]]<br>
[[File:REO_Clustering_Variables.png|500px]]
+
[[File:REO_table6.png|500px]]
 +
<br><b>Choice of Clusters</b><br>
 +
With the binned variables, latent class analysis can be performed to determine the clusters. To choose the number of clusters, models with three to ten clusters were fitted and compared using the Bayesian Information Criterion (BIC); the model with the minimum BIC is taken as the best fit. From the table below, the latent class model with six clusters provided the best fit, with a BIC of 121634. We therefore use the six-cluster solution and profile it in the sections that follow.<br>
 +
[[File:REO_table7.png|300px]]
 +
<br><b>Discussion on Latent Class Analysis Results</b><br>
 +
[[File:REO_fig13.png|600px]]
 +
<br><b>Interpretation of LCA</b><br>
 +
[[File:REO_table8.png|600px]]<br>
 +
[[File:REO_fig14.png|300px]]
 +
<br> There are paid users in every cluster, and Clusters 3, 5 and 6 have significantly more paid users than free users, which is aligned with the interpretation from the table above.
  
===Examining Clustering Variables===
+
===Profiling of Clustering Results===
The team has conducted a descriptive analysis for all clustering variables to identify any outliers and check for skewness.
+
[[File:REO_table9.png|600px]]
[[File:REO_paid_Descriptive.png|700px]]
+
<br><b>Comparison of Observations between K-Means Clusters and LCA Clusters</b>
<br>After identifying extreme outliers that differ markedly from the other points, the team excluded rows 1328, 3229, 4727 and 6442.
+
Using a mosaic plot, we can identify the similarities and differences in how users are assigned to clusters by the two techniques.<br>
 +
[[File:REO_fig15.png|500px]]
 +
<br>The observations were clustered using the two different techniques. Based on the mosaic plot, LCA Clusters 1, 2 and 4 are each dominated by particular K-Means clusters. As the techniques differ in how observations are assigned (Euclidean distance versus probability), differences in cluster assignment were expected.
  
===Transforming the Variables===
+
===Implications===
Given that the variables are positively skewed, the team decided to apply a cube-root transformation. A logarithmic transformation was not used because up to 50% of the entries in the “ORGANIC” and “Cobroke_total” variables are zero, and the logarithm of zero is undefined. However, both of these variables remain slightly skewed after transformation because of the number of zeroes. <br>
+
Although both clustering analyses gave reasonably good clusters, the choice of which to use should depend on the nature of the data and the objective of the research. For REO, the highly skewed distribution of the data made k-means clustering difficult. The binning process used for latent class analysis helped to reduce the impact of the high proportion of zeroes and extreme outliers. By creating more even cluster sizes, latent class analysis helps segment-targeted strategies to have a more substantial effect. Therefore, latent class analysis may be more suitable for developing strategies that reduce the attrition rate and, hopefully, enlarge the potential user base.<br>
[[File:REO_paid_Transformation.png|700px]]
+
<br>Through the identification of clusters, REO can learn the characteristics of each cluster and develop strategies to increase engagement for each one. The results also challenge some of REO’s previous assumptions. For example, REO assumed that most paid users were benefiting from the portal, but a significant number of subscribers are either under-performing or have abandoned the platform. REO was also unaware of agents who log in constantly without using other functions on the portal.
 
+
<br><br>Cluster analysis is helpful in identifying customer segments and developing segment-specific strategies. In this case, these strategies are intended to increase user engagement with the portal. Latent class analysis was more effective than k-means clustering in forming distinct clusters of similar sizes. As the profile of agents could change in the future, REO can continually evolve and adapt this process to classify new users.
===K-Means Clustering===
 
The cubic clustering criterion (CCC) is used to estimate the number of clusters using Ward’s minimum within-cluster variance method.<br>
 
[[File:REO_paid_CCC.png|500px]]
 
<br>Based on the graph above, the number of clusters chosen is k = 5.
 
 
 
===Profiling of Clusters===
 
[[File:REO_paid_parallel.png|500px]]
 
<br>The z-score ranking method is one way to profile the clusters. For each variable, the z-score measures how many standard deviations a cluster’s mean lies from the overall mean of that variable. A positive value means the cluster is above the overall mean, while a negative value means it is below. <br>
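The z-score calculation described above can be sketched as follows. This is an illustrative Python sketch, not the JMP workflow the team used; the function name `cluster_z_scores`, the toy values and the choice of the sample standard deviation are assumptions made here.

```python
import statistics


def cluster_z_scores(values, labels):
    """z-score of each cluster's mean relative to the overall mean and SD."""
    overall_mean = statistics.mean(values)
    overall_sd = statistics.stdev(values)  # sample standard deviation (assumed)
    z = {}
    for cluster in sorted(set(labels)):
        members = [v for v, lab in zip(values, labels) if lab == cluster]
        z[cluster] = (statistics.mean(members) - overall_mean) / overall_sd
    return z


# Invented values of one clustering variable and their cluster labels.
sessions = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [1, 1, 1, 2, 2, 2]
print(cluster_z_scores(sessions, labels))
```

Repeating this for every clustering variable gives one row of z-scores per cluster, which can then be ranked to describe each profile.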
 
[[File:REO_paid_zscore.png|500px]]
 
[[File:REO_paid_graph.png|500px]]
 
[[File:REO_paid_profile.png|700px]]
 
 
 
==Free Users==
 
After filtering, there are 5583 users who use the platform for free and have received at least one enquiry or cobroke request.
 
 
 
===Decision on Clustering Variables===
 
[[File:REO_Clustering_Variables.png|500px]]
 
 
 
===Examining Clustering Variables===
 
The team has conducted a descriptive analysis for all clustering variables to identify any outliers and check for skewness. <br>
 
[[File:REO_free_Descriptive.png|700px]]
 
<br>The team decides to exclude row 10583 because it is an extreme outlier.
 
 
 
===Transforming the Variables===
 
Similarly, we performed a cube-root transformation for all clustering variables.
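As a small illustration of why a cube root suits zero-heavy counts, the sketch below applies it to invented data. The helper `cube_root` and the sample values are assumptions for illustration only and are not taken from the REO dataset.

```python
import math


def cube_root(x):
    """Real cube root, defined for zero and for negative values."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


# Invented counts: half are zero, one is extreme.
counts = [0, 0, 0, 1, 8, 1000]
transformed = [cube_root(c) for c in counts]
print([round(t, 3) for t in transformed])

# math.log(0) raises ValueError, so a log transform would need an
# offset such as log(x + 1) before it could be applied to these counts.
```

The zeroes stay at zero while the extreme value is pulled much closer to the rest, which reduces positive skew without dropping any rows.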
 
[[File:REO_free_Transformation.png|700px]]
 
 
 
===Screening Outliers===
 
The k-nearest neighbours outlier screen measures the distance from each row to its closest neighbours; rows that sit far from every other row are flagged as outliers. The team decided to exclude the 11 rows whose distance to their single closest neighbour exceeds 4.0. <br>
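The screening rule (exclude rows whose distance to the closest neighbour exceeds 4.0) can be sketched as below. This is a brute-force illustration on invented data, not JMP's implementation; the function names and the sample rows are assumptions, and only the 4.0 threshold comes from the text.

```python
import math


def nearest_neighbour_distances(points):
    """Distance from each point to its single closest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def screen_outliers(points, threshold=4.0):
    """Indices of rows whose nearest neighbour is farther than the threshold."""
    distances = nearest_neighbour_distances(points)
    return [i for i, d in enumerate(distances) if d > threshold]


# Invented two-dimensional data with one isolated row at index 4.
rows = [(0.0, 0.0), (0.5, 0.2), (0.3, 0.9), (1.0, 1.0), (9.0, 9.0)]
print(screen_outliers(rows))
```

The pairwise loop is quadratic in the number of rows, which is acceptable for a sketch; a spatial index would be the usual choice for larger tables.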
 
[[File:REO_free_Outlier.png|500px]]
 
 
 
===K-Means Clustering===
 
[[File:REO_free_CCC.png|500px]]
 
<br>Based on the maximum point for cubic clustering criterion, the optimal number of clusters is 7.
 
 
 
 
 
===Profiling of Clusters===
 
[[File:REO_free_parallel.png|700px]]
 
<br>Similarly, we calculated the z-score of each variable for each cluster.
 
[[File:REO_free_zscore.png|500px]]
 
[[File:REO_free_graph.png|500px]]
 
[[File:REO_free_profile.png|700px]]
 

Latest revision as of 15:24, 16 April 2018




K-Means Clustering

The k-means clustering algorithm is an “iterative algorithm that partitions the observations”. The user supplies a pre-determined value of k, the number of partitions. The algorithm begins by selecting random observations as the starting centroids, then allocates each observation to its nearest centroid based on Euclidean distance. A new centroid is calculated as the mean of each cluster, and each observation is then reallocated to its nearest new centroid. The algorithm iterates until either convergence occurs, where there is little to no change in cluster membership, or the maximum number of iterations is reached.
The usual process begins with selecting a pre-determined K as the number of clusters. JMP Pro offers the option of specifying a range of cluster counts, and the program picks the optimal K using the Cubic Clustering Criterion (CCC). Based on the CCC, the suggested optimal number of clusters is 9, where the value peaks at 225.113.
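The iteration described above can be sketched in a few lines of Python. This is a minimal illustration rather than JMP Pro's implementation; the function name `kmeans`, the toy data and the `seed` and `max_iter` parameters are assumptions introduced here.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Start from k randomly chosen observations as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Convergence: stop once no observation changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy data (invented): two well-separated groups in two dimensions.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(data, k=2)
print(labels)
```

Because the starting centroids are random, different seeds can give different partitions on less clearly separated data.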
REO fig6.png
REO fig7.png
Although the cluster membership appears to be uneven, this is a drastic improvement over the previous iteration. Clusters 3 and 6 are the dominant clusters, each with more than 3000 observations.
REO fig8.png
The biplot shows several overlaps between clusters. Clusters 3 and 6 overlap heavily, while clusters 1, 5 and 8 overlap slightly.
Such overlaps could be due to the unresolved issues of a high proportion of zeroes and extreme outliers. To address the overlapping clusters, the normal mixtures clustering method was used. The optimal K obtained from the regular k-means clustering was used as the input K for normal mixtures. In JMP Pro, the option to identify outliers and classify them as Cluster 0 was enabled.
REO fig9.png
Compared to the k-means analysis, the normal mixtures method resolved the overlap between clusters 3 and 6 by identifying them as one and the same cluster. However, the resulting clusters are not stable, as seen from the differing results when the same analysis was conducted three times. Furthermore, the membership sizes of the clusters are still largely uneven, and some clusters still overlap. Therefore, further analysis is based on the output of the k-means analysis, as its results are more stable.
REO fig10.png
The parallel coordinate plot shows the profiles of the segments. The algorithm developed distinct profiles for each cluster except clusters 3 and 6, which appear to exhibit similar characteristics. The earlier biplot did suggest that they might be similar, given their overlap. Despite the slight overlaps among clusters 1, 5 and 8, the parallel coordinate plot shows that these clusters are different. Therefore, it is recommended that clusters 3 and 6 be merged into one cluster.
Aside from clusters 1 and 8, cluster membership also appears to be influenced by the type of membership, as a single class dominates each of the other clusters.
REO fig11.png
Profiling of Clustering Results
REO table5.png
The clustering analysis was able to develop differentiated profiles. However, the distribution of the data was not ideal. The variables are severely skewed because of the large proportion of zeroes in each of them. After filtering out users with no activity over the past six months, only approximately 60% of the initial observations remain. Although it is possible to apply stricter criteria to filter the data and reduce skewness further, the number of observations removed would have been unreasonable. For example, if users without at least one session per month were removed, almost half of all observations would have been eliminated.
In addition, the number of observations in each cluster is not balanced. As seen above, there are two dominant clusters with 30.5% and 38.0% of observations, while two clusters have sparse membership at 0.4% and 1.0%. Cluster sizes should be broadly similar so that segment-targeted strategies have a more substantial effect.
While the k-means clustering algorithm was useful in providing profiles for further business analysis, other types of cluster analysis may segment users more effectively. As such, we explore latent class analysis next.

Latent Class Analysis

Discretization of Continuous Variables
To prepare for latent class analysis, the continuous variables first need to be converted into discrete variables. After placing the zeroes into a bin of their own, each continuous variable is discretized into quartiles via the Interactive Binning add-in in JMP. As the fifth bin contains the extreme outliers, as seen below, this also helps to ameliorate the problem of skewness.
REO fig12.png
REO table6.png
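The binning scheme (a separate bin for zeroes, then quartiles of the remaining values) can be sketched as follows. This is a rough Python analogue, not the JMP Interactive Binning procedure, whose exact cut-point rules may differ; the function name `bin_with_zero_class` and the sample counts are assumptions.

```python
import bisect
import statistics


def bin_with_zero_class(values):
    """Return bin codes: 0 for zero, 1-4 for quartiles of the non-zero values."""
    nonzero = [v for v in values if v != 0]
    # Three cut points (Q1, median, Q3) split the non-zero values into quartiles.
    cuts = statistics.quantiles(nonzero, n=4, method="inclusive")
    return [0 if v == 0 else 1 + bisect.bisect_left(cuts, v) for v in values]


# Invented counts with many zeroes and one extreme value.
sessions = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 400]
print(bin_with_zero_class(sessions))
```

In this toy example the extreme value of 400 simply lands in the top bin alongside 8, so it no longer dominates the scale of the variable.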
Choice of Clusters
With the binned variables, latent class analysis can be performed to determine the clusters. To choose the number of clusters, models with three to ten clusters were fitted and compared using the Bayesian Information Criterion (BIC); the model with the minimum BIC is taken as the best fit. From the table below, the latent class model with six clusters provided the best fit, with a BIC of 121634. We therefore use the six-cluster solution and profile it in the sections that follow.
REO table7.png
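The minimum-BIC rule can be sketched as below, assuming the common definition BIC = -2 ln L + p ln n and an unrestricted latent class model. The log-likelihoods, the number of bins per variable and the sample size are invented for illustration and are not the values behind the table above.

```python
import math


def lca_param_count(n_classes, levels_per_variable):
    """Free parameters of an unrestricted latent class model."""
    class_sizes = n_classes - 1
    item_probs = n_classes * sum(levels - 1 for levels in levels_per_variable)
    return class_sizes + item_probs


def bic(log_likelihood, n_params, n_obs):
    """BIC = -2 ln L + p ln n; smaller is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical fitted log-likelihoods, invented for illustration only.
log_likelihoods = {3: -5200.0, 4: -5050.0, 5: -4980.0, 6: -4970.0}
levels = [5, 5, 5, 5]   # four variables, five bins each (an assumption)
n_obs = 1000            # invented sample size

scores = {
    c: bic(ll, lca_param_count(c, levels), n_obs)
    for c, ll in log_likelihoods.items()
}
best = min(scores, key=scores.get)
print(best, round(scores[best], 1))
```

The penalty term grows with the number of classes, so adding a class only lowers BIC when the gain in log-likelihood outweighs the extra parameters.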
Discussion on Latent Class Analysis Results
REO fig13.png
Interpretation of LCA
REO table8.png
REO fig14.png
There are paid users in every cluster, and Clusters 3, 5 and 6 have significantly more paid users than free users, which is aligned with the interpretation from the table above.

Profiling of Clustering Results

REO table9.png
Comparison of Observations between K-Means Clusters and LCA Clusters
Using a mosaic plot, we can identify the similarities and differences in how users are assigned to clusters by the two techniques.
REO fig15.png
The observations were clustered using the two different techniques. Based on the mosaic plot, LCA Clusters 1, 2 and 4 are each dominated by particular K-Means clusters. As the techniques differ in how observations are assigned (Euclidean distance versus probability), differences in cluster assignment were expected.

Implications

Although both clustering analyses gave reasonably good clusters, the choice of which to use should depend on the nature of the data and the objective of the research. For REO, the highly skewed distribution of the data made k-means clustering difficult. The binning process used for latent class analysis helped to reduce the impact of the high proportion of zeroes and extreme outliers. By creating more even cluster sizes, latent class analysis helps segment-targeted strategies to have a more substantial effect. Therefore, latent class analysis may be more suitable for developing strategies that reduce the attrition rate and, hopefully, enlarge the potential user base.

Through the identification of clusters, REO can learn the characteristics of each cluster and develop strategies to increase engagement for each one. The results also challenge some of REO’s previous assumptions. For example, REO assumed that most paid users were benefiting from the portal, but a significant number of subscribers are either under-performing or have abandoned the platform. REO was also unaware of agents who log in constantly without using other functions on the portal.

Cluster analysis is helpful in identifying customer segments and developing segment-specific strategies. In this case, these strategies are intended to increase user engagement with the portal. Latent class analysis was more effective than k-means clustering in forming distinct clusters of similar sizes. As the profile of agents could change in the future, REO can continually evolve and adapt this process to classify new users.