Difference between revisions of "ANLY482 AY2016-17 T2 Group3: HOME/Final"

From Analytics Practicum
Line 42: Line 42:
 
<div style="height: 1em"></div>
 
<div style="height: 1em"></div>
 
<div><font face="Open Sans">
 
<div><font face="Open Sans">
Add text here!
+
 
 +
=== Update ===
 +
Since the Interim, our team has continued the analysis. This page presents our progress and findings from the Cluster Analysis and Association Analysis that we carried out. For a fuller understanding of the entire project, please refer to our Final Research Paper. Thank you!
 +
<br/><br/>
 +
 
 +
=== Cluster Analysis ===
 +
==== Clustering Variables ====
 +
After getting a clearer picture of the dataset through Exploratory Data Analysis, the next analysis that we focus on is Cluster Analysis. The purpose of this analysis is to segment current customers into distinct clusters with differing characteristics. First, we had to identify potential clustering variables from the Users table. After analyzing the Users table, we felt that existing variables such as age and gender were not suitable for clustering: the age range was not very wide and customers were predominantly female.
 +
 
 +
<br/>
 +
[[File:V_1_Clustering_Variables.png|400px|center]]
 +
<div align="center"> Figure 1. Clustering variables </div>
 +
<br/>
 +
 
 +
Hence, we decided to use the 3 variables from the RFM (recency, frequency, monetary) analysis and generated the columns shown in Figure 1 above for each customer. Only customers with at least 1 booking were considered, to ensure that the clusters formed were distinct enough. booking_recency is the number of weeks between the customer’s last booking and 31 December 2016. For the monetary variable, the total monetary value of all the customer’s bookings is used rather than the average, as the average is not very indicative of how much the customer spent over the last 2 years.
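As an illustration of how these three columns could be derived, here is a minimal sketch in Python. The input layout (a list of `(user_id, booking_date, amount)` tuples) and the function name `rfm_per_customer` are hypothetical, and rounding recency down to whole weeks is an assumption; only the column names and the 31 December 2016 reference date come from the text above.

```python
from datetime import date

REFERENCE_DATE = date(2016, 12, 31)  # reference date stated in the text


def rfm_per_customer(bookings):
    """Aggregate bookings into one RFM row per customer.

    bookings: iterable of (user_id, booking_date, amount) tuples.
    This layout is hypothetical, not the project's actual schema.
    """
    summary = {}
    for user_id, booking_date, amount in bookings:
        row = summary.setdefault(
            user_id, {"last_booking": booking_date, "count": 0, "total": 0.0}
        )
        row["last_booking"] = max(row["last_booking"], booking_date)
        row["count"] += 1
        row["total"] += amount

    result = {}
    for user_id, row in summary.items():
        # Assumption: recency is counted in whole weeks, rounded down.
        weeks_since_last = (REFERENCE_DATE - row["last_booking"]).days // 7
        result[user_id] = {
            "booking_recency": weeks_since_last,
            "booking_frequency": row["count"],
            "booking_monetary_total": row["total"],
        }
    return result
```

Customers without any booking never appear in the result, which mirrors the "at least 1 booking" restriction.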
 +
 
 +
Next, before commencing K-Means Clustering, it is important to understand the distribution of the 3 variables. A highly skewed variable has to be normalized first, as K-Means Clustering is sensitive to outliers and extreme values.
 +
 
 +
<br/>
 +
[[File:V_2_Distribution_of_clulstering_variables.png|500px|center]]
 +
<div align="center"> Figure 2. Distribution of clustering variables </div>
 +
<br/>
 +
 
 +
Based on Figure 2 above, booking_frequency and booking_monetary_total are highly right-skewed, while booking_recency is relatively normal. Hence, a log<sub>10</sub> transformation was applied to the 2 skewed variables in an attempt to make their distributions more normal.
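The effect of such a transformation can be checked numerically as well as visually. The sketch below computes a simple moment-based skewness before and after a log<sub>10</sub> transformation. The sample values are made up for illustration and are not taken from the dataset, and the helper names are hypothetical. It assumes all values are strictly positive, which holds for booking_frequency because every customer considered has at least 1 booking.

```python
import math


def skewness(values):
    """Moment-based sample skewness: positive means a longer right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log10_transform(values):
    """Apply log10 to each value; values must be strictly positive."""
    return [math.log10(v) for v in values]


# Made-up, right-skewed monetary totals (illustrative only).
sample_totals = [10, 20, 30, 50, 100, 1000, 10000]

raw_skew = skewness(sample_totals)
log_skew = skewness(log10_transform(sample_totals))
print(f"skewness before: {raw_skew:.2f}, after log10: {log_skew:.2f}")
```

A skewness closer to 0 after the transformation indicates a more symmetric distribution.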
 +
 
 +
<br/>
 +
[[File:V_3_Distribution_of_clustering_variables_after_log10_transformation.png|500px|center]]
 +
<div align="center"> Figure 3. Distribution of clustering variables after log<sub>10</sub> transformation </div>
 +
<br/>
 +
 
 +
After applying the log<sub>10</sub> transformation, the revised distributions are shown in Figure 3. Interestingly, the transformation only managed to normalize the distribution of booking_monetary_total, while the distribution of booking_frequency remained largely right-skewed. This was most probably because the monetary values are closer to continuous, whereas the frequency takes discrete values. Hence, for the clustering we selected the original (untransformed) values of booking_frequency and booking_recency together with the log<sub>10</sub>-transformed values of booking_monetary_total.
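To make the final variable selection concrete, the following sketch builds the feature vectors as described (raw frequency, raw recency, log<sub>10</sub> of the monetary total) and runs a bare-bones K-Means (Lloyd's algorithm) on them. It is illustrative only: the function names, the number of clusters `k`, the seed and the sample customers are hypothetical, and no additional scaling is applied because the text does not mention any.

```python
import math
import random


def build_features(customers):
    """One vector per customer: raw frequency, raw recency, log10(total)."""
    return [
        [
            c["booking_frequency"],
            c["booking_recency"],
            math.log10(c["booking_monetary_total"]),
        ]
        for c in customers
    ]


def kmeans(points, k, iterations=100, seed=0):
    """Minimal Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up customers: three frequent recent bookers, three lapsed ones.
sample_customers = [
    {"booking_frequency": 20, "booking_recency": 1, "booking_monetary_total": 1000.0},
    {"booking_frequency": 22, "booking_recency": 2, "booking_monetary_total": 1200.0},
    {"booking_frequency": 21, "booking_recency": 1, "booking_monetary_total": 1100.0},
    {"booking_frequency": 1, "booking_recency": 50, "booking_monetary_total": 10.0},
    {"booking_frequency": 2, "booking_recency": 52, "booking_monetary_total": 20.0},
    {"booking_frequency": 1, "booking_recency": 51, "booking_monetary_total": 15.0},
]

labels, centroids = kmeans(build_features(sample_customers), k=2)
```

Because K-Means relies on Euclidean distance, variables on larger numeric scales weigh more heavily in the result, so in practice the choice of scaling and of `k` would need to be justified separately.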
 +
 
 
</font></div>
 
</font></div>
  

Revision as of 01:15, 21 April 2017




FINAL PROGRESS
