Difference between revisions of "REO Project Findings"

From Analytics Practicum
Jump to navigation Jump to search
Line 15: Line 15:
  
 
| style="padding:0.3em; font-family:Arimo; font-size:110%; border-bottom:2px solid #053B6B; border-top:2px solid #053B6B; background:#053B6B; text-align:center;" width="10%" |  
 
| style="padding:0.3em; font-family:Arimo; font-size:110%; border-bottom:2px solid #053B6B; border-top:2px solid #053B6B; background:#053B6B; text-align:center;" width="10%" |  
[[REO_Project Proposal|<font face ="Avenir" color="#FFFFFF"><strong>PROJECT PROPOSAL</strong></font>]]
+
[[REO_Project Proposal|<font face ="Lucida Grande" color="#FFFFFF"><strong>PROJECT PROPOSAL</strong></font>]]
 
| style="border-bottom:2px solid #053B6B; border-top:2px solid #053B6B; background:#053B6B;" width="1%" | &nbsp;
 
| style="border-bottom:2px solid #053B6B; border-top:2px solid #053B6B; background:#053B6B;" width="1%" | &nbsp;
  
Line 40: Line 40:
 
===For Exploratory Data Analysis===
 
===For Exploratory Data Analysis===
 
Users: <br>
 
Users: <br>
Created “Members_Days” variable by counting the number of days passed by 31 Dec 2017 since the account was created. With the new variable, the team can analyse whether being a member for a longer time has impact on other factors.
+
* Created “Members_Days” variable by counting the number of days passed by 31 Dec 2017 since the account was created. With the new variable, the team can analyse whether being a member for a longer time has impact on other factors.  
Removal of users who joined after 31st December 2017 as they do not have at least 2 quarters of data available for analysis.
+
* Removal of users who joined after 31st December 2017 as they do not have at least 2 quarters of data available for analysis.  
Created Status variable into a binary variable with 1 representing an “active” user. Users who were previously labelled as “blacklisted” in the dataset are removed from analysis due to their restricted activity level.
+
* Created Status variable into a binary variable with 1 representing an “active” user. Users who were previously labelled as “blacklisted” in the dataset are removed from analysis due to their restricted activity. <br>
  
 
Subscription: <br>
 
Subscription: <br>
Created “Paid” variable as a binary variable with 1 representing paid and 0 representing free. Previously, these statuses were represented with strings.
+
* Created “Paid” variable as a binary variable with 1 representing paid and 0 representing free. Previously, these statuses were represented with strings.
  
 
Sessions: <br>
 
Sessions: <br>
Created “DaysOfWeek” with Sunday to Saturday being labelled from 1 to 7 respectively.
+
* Created “DaysOfWeek” with Sunday to Saturday being labelled from 1 to 7 respectively.
Created “Session_Weekdays” and “Session_Weekends” columns based on the value from “DaysOfWeek”.
+
* Created “Session_Weekdays” and “Session_Weekends” columns based on the value from “DaysOfWeek”.
Created “Hours” to convert the time into their respective hours.
+
* Created “Hours” to convert the time into their respective hours.
Created “WMAE” by categorising time into four exclusive 6 hours slot starting from 0000 hrs. They are represented by 1, 2, 3 and 4 respectively.
+
* Created “WMAE” by categorising time into four exclusive 6 hours slot starting from 0000 hrs. They are represented by 1, 2, 3 and 4 respectively.
Created columns such as “SE_E1”, “SE_E2”, “SE_E3”, “SE_E4”, “SE_D1”, “SE_D2”, “SE_D3” and “SE_D4” to indicate the frequencies of sessions taking place under the specific timeslot.
+
* Created columns such as “SE_E1”, “SE_E2”, “SE_E3”, “SE_E4”, “SE_D1”, “SE_D2”, “SE_D3” and “SE_D4” to indicate the frequencies of sessions taking place under the specific timeslot.
Created the columns “SE_DO” and “SE_EO” to indicate the total number of sessions taking place during weekdays and weekends.  
+
* Created the columns “SE_DO” and “SE_EO” to indicate the total number of sessions taking place during weekdays and weekends.  
Created “SE_TOTAL” to tabulate the sum of user sessions. The purpose of these transformation is to allow for an easier exploration on whether when the users utilise the portal has an impact on lead generation.
+
* Created “SE_TOTAL” to tabulate the sum of user sessions. The purpose of these transformation is to allow for an easier exploration on whether when the users utilise the portal has an impact on lead generation.
  
 
Listing:<br>
 
Listing:<br>
Created “DaysOfWeek” with Sunday to Saturday being labelled from 1 to 7 respectively.
+
* Created “DaysOfWeek” with Sunday to Saturday being labelled from 1 to 7 respectively.
Created “Listing_Weekdays” and “Listing_Weekends” columns based on the value from “DaysOfWeek”.
+
* Created “Listing_Weekdays” and “Listing_Weekends” columns based on the value from “DaysOfWeek”.
Created Hours to convert the time into their respective hours.
+
* Created Hours to convert the time into their respective hours.
Created WMAE by categorising time into four exclusive 6 hours slot starting from 0000 hrs. They are represented by 1, 2, 3 and 4 respectively.
+
* Created WMAE by categorising time into four exclusive 6 hours slot starting from 0000 hrs. They are represented by 1, 2, 3 and 4 respectively.
Created columns such as “LI_E1”, “LI_E2”, “LI_E3”, “LI_E4”, “LI_D1”, “LI_D2”, ”LI_D3” and “LI_D4” to indicate the frequencies of user sessions taking place during each respective time slot.
+
* Created columns such as “LI_E1”, “LI_E2”, “LI_E3”, “LI_E4”, “LI_D1”, “LI_D2”, ”LI_D3” and “LI_D4” to indicate the frequencies of user sessions taking place during each respective time slot.
Created the columns “LI_DO” and “LI_EO” to indicate the total number of sessions taking place during weekdays and weekends.  
+
* Created the columns “LI_DO” and “LI_EO” to indicate the total number of sessions taking place during weekdays and weekends.  
Created “LI_TOTAL” to tabulate the total sum of listings.
+
* Created “LI_TOTAL” to tabulate the total sum of listings.
Created two binary variables - Sync and Organic to replace the original variable represented with strings.
+
* Created two binary variables - Sync and Organic to replace the original variable represented with strings.
  
 
Enquiries and Cobroke: <br>
 
Enquiries and Cobroke: <br>
Enquiries and Cobroke are treated as the dependant variables.
+
* Enquiries and Cobroke are treated as the dependant variables.
These two variables were broken down in a similar manner to observe for any insights on the usage behaviours of home buyers.
+
* These two variables were broken down in a similar manner to observe for any insights on the usage behaviours of home buyers.
These variables were binned to represent the days they were created with “DaysofWeek”. They are then sorted into the respective bins which reflect if the day is a weekday or weekend. The bins are “CO_DO”, “CO_EO”, “EN_DO” and “EN_EO”.
+
* These variables were binned to represent the days they were created with “DaysofWeek”. They are then sorted into the respective bins which reflect if the day is a weekday or weekend. The bins are “CO_DO”, “CO_EO”, “EN_DO” and “EN_EO”.
Hours were binned into 6 hours slot – beginning from 0000 hrs. They are represented with the numbers 1 (0000hrs to 0600hrs), 2 (0600hrs to 1200hrs), 3 (1200hrs to 1800hrs) and 4 (1800hrs to 0000hrs). Created columns such as “EN_E1”, “EN_E2”, “EN_E3”, “EN_E4”, “EN_D1”, “EN_D2”, “EN_D3” and “EN_D4” to indicate the frequencies of user sessions taking place during each respective time slot. This treatment was not given to Cobroke because the team focuses on the leads acquire from primary home buyers.  
+
* Hours were binned into 6 hours slot – beginning from 0000 hrs. They are represented with the numbers 1 (0000hrs to 0600hrs), 2 (0600hrs to 1200hrs), 3 (1200hrs to 1800hrs) and 4 (1800hrs to 0000hrs). Created columns such as “EN_E1”, “EN_E2”, “EN_E3”, “EN_E4”, “EN_D1”, “EN_D2”, “EN_D3” and “EN_D4” to indicate the frequencies of user sessions taking place during each respective time slot. This treatment was not given to Cobroke because the team focuses on the leads acquire from primary home buyers.  
Created the columns “CO_DO”, “CO_EO”, “EN_DO” and “EN_EO” to indicate the total number of enquiries and cobroke request received during weekdays and weekends.  
+
* Created the columns “CO_DO”, “CO_EO”, “EN_DO” and “EN_EO” to indicate the total number of enquiries and cobroke request received during weekdays and weekends.  
Created “EN_TOTAL” and “CO_TOTAL” by summing “DO” and “EO” variables to reflect the sum of enquiries and cobroke requests each user receives.
+
* Created “EN_TOTAL” and “CO_TOTAL” by summing “DO” and “EO” variables to reflect the sum of enquiries and cobroke requests each user receives.
  
 
===For Construction of Dashboard===
 
===For Construction of Dashboard===
Line 80: Line 80:
  
 
The features included on the draft version of the dashboard are:
 
The features included on the draft version of the dashboard are:
Trend analysis: to observe the movement of trends and identify periods with spikes/dips in activity.
+
* Trend analysis: to observe the movement of trends and identify periods with spikes/dips in activity.
Patterns identification: to observe for any user pattern based on hours/days.
+
* Patterns identification: to observe for any user pattern based on hours/days.
User analysis: to understand the makeup of users at a quick glance with metrics such as membership length and proportion of paid/free users.
+
* User analysis: to understand the makeup of users at a quick glance with metrics such as membership length and proportion of paid/free users.
  
The team expects to make amendments to the available features in future iterations.
+
The team expects to make amendments to the available features in future iterations. <br>
 +
<br>
  
 
<u>Steps and Rationale</u> <br>
 
<u>Steps and Rationale</u> <br>
Line 94: Line 95:
 
The team first decided to merge the data in terms of date and hours to observe the trends of the activities conducted by users:
 
The team first decided to merge the data in terms of date and hours to observe the trends of the activities conducted by users:
  
Sessions, Listings, Enquiries, Cobroke files: <br>
+
Sessions, Listings, Enquiries, Cobroke: <br>
Created a “Date” variable to only capture the date portion from the “created_date”’s format of <dd:mm:yyyy hh:mm:ss>.  
+
* Created a “Date” variable to only capture the date portion from the “created_date”’s format of <dd:mm:yyyy hh:mm:ss>.  
Each data files are tabulated with columns “Date”, “Hours”, “# Sum of variables”.
+
* Each data files are tabulated with columns “Date”, “Hours”, “# Sum of variables”.
Starting with the data table tabulated from the Sessions file, files are being merged by matching columns “Data” and “Hours”.  
+
* Starting with the data table tabulated from the Sessions file, files are being merged by matching columns “Data” and “Hours”.  
For missing values found in some of the rows were all recode as value 0. As there were no activity, the user log did not contain any value, thus, it may appear to be missing values.  
+
* For missing values found in some of the rows were all recode as value 0. As there were no activity, the user log did not contain any value, thus, it may appear to be missing values.  
The final merged file [dashboard1.0] contains columns “Date”, “Hours”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”.  
+
* The final merged file [dashboard1.0] contains columns “Date”, “Hours”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”.  
  
 
The above data file does not capture any users’ characteristics such as their subscription plan or their membership length with REO due to the inability to upload the full dataset on Google Data Studio. Therefore, the team explored alternative ways to represent all the required data in a single sheet. The sheet captures the activity level of each user on separate dates. At a single glance, one can tell when a user is active and the total activity count for that specific date.  
 
The above data file does not capture any users’ characteristics such as their subscription plan or their membership length with REO due to the inability to upload the full dataset on Google Data Studio. Therefore, the team explored alternative ways to represent all the required data in a single sheet. The sheet captures the activity level of each user on separate dates. At a single glance, one can tell when a user is active and the total activity count for that specific date.  
  
 
Users: <br>
 
Users: <br>
Created “Mem_Bin” variable which bin the “Members_Days” by years (<1 year, <2 years, <3 years, >3 years), to study the difference in the activity level by users of different membership length.
+
* Created “Mem_Bin” variable which bin the “Members_Days” by years (<1 year, <2 years, <3 years, >3 years), to study the difference in the activity level by users of different membership length.
  
 
Sessions, Listings, Enquiries, Cobroke: <br>
 
Sessions, Listings, Enquiries, Cobroke: <br>
Each data files are tabulated with columns “user_id”, “Date”, “#Sum of variable”  
+
* Each data files are tabulated with columns “user_id”, “Date”, “#Sum of variable”  
For Users, Subscription, and Sessions files, they were merged if the “user_id” are present on all 3 sheets. Users lacking data in these areas were eliminated for further analysis as the team is not able to obtain a complete view of such users. For example, a user without the subscription variable cannot be included under paid nor free users.
+
* For Users, Subscription, and Sessions files, they were merged if the “user_id” are present on all 3 sheets. Users lacking data in these areas were eliminated for further analysis as the team is not able to obtain a complete view of such users. For example, a user without the subscription variable cannot be included under paid nor free users.
For missing values found in activities rows except sessions were all recode as value 0, as the missing values represented no activity being made by the particular user at that date.   
+
* For missing values found in activities rows except sessions were all recode as value 0, as the missing values represented no activity being made by the particular user at that date.   
  
 
The final merged file [dashboard5.0] contains columns “user_id”, “Paid”, “Mem_bin”, “Member_Days”, “Date”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”.  
 
The final merged file [dashboard5.0] contains columns “user_id”, “Paid”, “Mem_bin”, “Member_Days”, “Date”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”.  
Line 120: Line 121:
  
 
==Data Description==
 
==Data Description==
 +
===Correlation across all variables===
 +
[[File:corelation.png]]
 +
Some of the variables are highly correlated with each other, notably the total number of listings and total number of sessions as well as the number of organic listings and total number of listings. The former suggests that the more frequent a user logs onto the portal, the more posting the user posts. The latter relationship suggests that most of the listings posted are organic compared to synced.<br>
 +
 +
===Paid vs Free users===
 +
[[File:session.png]]
 +
The distribution of the total number of sessions as seen is positively skewed so the measures for the distribution regarding central tendency and dispersion are not very accurate. This is replicated across all variables created. As such, the team explores the option of splitting the users into paid and free. Based on our sponsor meeting and secondary research, we have also gathered that free users have restrictions regarding the number of listings they can post which could influence the number of sessions they logged onto the portal and number of enquiries they receive as well. There are 7536 free users and 5331 paid users in the dataset. <br>
 +
 +
====Number of Days since Registration====
 +
[[File:paidfree.png]]
 +
The average number of days since registration for paid users is 823.2 which is around 2 years 3 months. It is also interesting to note that there is a spike of paid users (1110 users) joining between 975 days and 999 days. This could be matched with the period where there is publicity for REO due to the circulation of articles regarding REO receiving funding from Facebook co-founder, Sequoia Capital. Being associated with a well-known organisation could help them to gain users. On the other hand, the average number of days joined for free users is 812 which is around the same as paid users. The pattern seems similar as the paid users where the spike takes place between 975 days and 999 days.<br>
 +
 +
====Key Activity Indicators====
 +
[[File:keyactivity.png]]
 +
Looking at variables representing activity level of an user (i.e. number of cobrokes and enquiries received, the average number of listings posted as well as the average number of sessions posted), we can see the activity level of a paid user is higher than a free user across all variables. <br>
 +
 +
 +
====Average Activity across Weekday and Weekend====
 +
[[File:wdwe.png]]
 +
To compare the number of sessions on weekdays and weekends, we have applied different weights to the total number of session (0.5 for weekends and 0.2 for weekdays) because of the number of days in the week. There is higher usage on weekends compared to weekdays as seen from the graphic above for both listings and number of sessions logged. This shows that there could be a segment of paid users that are active only on weekends.<br>
 +
 +
 +
====Average Activity across Timeslots====
 +
[[File:Avgact.png]]
 +
The team has previously established that the activity levels for paid and free users are different. As such, when we investigate the number of listings posted and sessions logged in 4 different timeslots, as seen below.
 +
[[File:timeslot.png]]
 +
As the nature of a property agent is flexible, the team hypothesised that the users would display an online pattern of appearing in the afternoon. This is because they are not restrained to a nine-to-six working hours. However, based on the graph above, we can conclude that the activity levels across timeslots 2,3 and 4 are rather similar. <br>
 +
 +
 +
====Type of Listings====
 +
[[File:listing.png]]
 +
As mentioned before, synced listings refer to listings that are imported to REO while organic listings refer to listings that are created on REO. Despite posting similar number of synced listings as free users, paid users post significantly more organic listings than the free users. This is beneficial for REO since it shows that paid users priortise using REO compared to other platforms. This is especially relevant in today’s age where users can be subscribing to multiple platforms to post their listings. <br>
 +
 +
 +
===Cluster Analysis===
 +
Having explored the differences between paid and free users, the team decides to perform a cluster analysis via an adapted approach suggested by Punj and Stewart (1983). The approach is:
 +
1. K-nearest neighbour algorithm to detect outliers
 +
2. Calculating Cubic Clustering Criterion (CCC) to decide on number of clusters
 +
3. K-means to obtain final cluster solution
 +
The team hopes to identify segments within the paid or free users such that REO can formulate strategies for each segment to effectively engage them.
 +
The team decides to exclude entries that have zero enquiries and cobroke requests as enquiries and cobroke requests are the tangible outputs from the platform. Thus, these users have not received any outputs from the platform for the last 6 months which may not be relevant for our cluster analysis.
 +
In addition, the technique of k-means clustering only include entries that are fully filled. As such, the team needs to perform missing value imputation by replacing missing data with zero for the analysis to run.
 +
 +
====Paid Users====
 +
<u>Decision on Clustering Variables</u>
 +
The team has chosen to use key activity indicators from the previous analysis which include the number of sessions, number of listings posted, number of enquiries and number of Cobroke requests received. The number of sessions and listings are then further broken down into weekday and weekend as well as each individual timeslot.
 +
[[File:clustering.png]]
 +
 +
<u>Examining Clustering Variables</u>
 +
The team has conducted a descriptive analysis for all clustering variables to identify any outliers and check for skewness.
 +
[[File:clustering 1.png]]
 +
Through identifying extreme outliers that are significantly different from the other points, the team has excluded rows 1328, 3229, 4727 and 6442.
 +
 +
 +
<u>Transforming the Variables</u>
 +
Given that the variables are positively skewed, the team has decided to transform the variable through applying cube root. Logarithmic transformation is not used because there are up to 50% of zero entry in “ORGANIC” and “Cobroke_total” variables. However, we noticed that both variables mentioned previously are still slightly skewed because of the number of zeroes.
 +
[[File:clubroot.png]]
 +
 +
<u>K-Means Clustering</u>
 +
The cubic clustering criterion (CCC) is used to estimate the number of clusters using Ward’s minimum within-cluster variance method.
 +
[[File:ccc.png]]
 +
Based on c5 clusters are chosen
 +
 +
 +
<u>Profiling of Clusters</u>
 +
[[File:profile.png]]
 +
The z-score ranking method is a way to profile the clusters. Z-score is calculated to understand how many standard deviation the cluster’s variable mean is away from the variable mean. A positive number refers to being above the variable mean while a negative number refers to being below the variable mean.
 +
[[File:paidzscore.png]]
 +
[[File:paidzscore.png]]
 +
[[File:paidcluster.png]]
 +
 +
====Free Users====
 +
After filtering out, there are 5583 users who use the platform for free but receive at least one enquiry or cobroke request.
 +
 +
<u>Decision on Clustering Variables</u>
 +
[[File:clustering.png]]
 +
 +
<u>Examining Clustering Variables</u>
 +
The team has conducted a descriptive analysis for all clustering variables to identify any outliers and check for skewness.
 +
 +
The team decides to exclude row 10583 because it is an extreme outlier.
 +
 +
<u>Transforming the Variables</u>
 +
Similarly, we performed a cube-root transformation for all clustering variables.
 +
 +
<u>Screening Outliers </u>
 +
K-nearest neighbours algorithm is a form of hierarchical clustering to identify any outliers. The team decides to exclude 11 rows with the distance to 1 closest above 4.0.
 +
 +
<u>K-Means Clustering </u>
 +
 +
Based on the maximum point for cubic clustering criterion, the optimal number of clusters is 7.
 +
 +
 +
<u>Profiling of Clusters </u>
 +
 +
Similarly, we calculated the z-score for individual variable for each cluster.
 +
 +
 +
==Managerial Recommendation==
 +
1. REO should market the platform better to home seekers
 +
*Variables representing agents’ activity level 0.0 Pearson correlation to enquiries and cobroke. Sync listings and sessions have the highest score of 0.33 and 0.29 respectively. This suggests that agents’ activity level can possibly be independent from enquiries received. Therefore, if REO wants to improve the amount of leads for their subscribed users, they should consider increasing total visitors count onto their platform.
 +
 +
2. REO should investigate the profile of certain users
 +
*The exploratory clustering analysis has indicated that a group of free users are heavily utilising the platform – as indicated by the high z-score of 7.79 in organic listings and 3.88 in sessions. REO should investigate these accounts to uncover the reason behind their ability to bypass the limit of having only a total of 5 posts.
 +
 +
3. REO should encourage agents to upload new listings as early as possible
 +
*Based on the trendline of enquires, home seekers are actively sending out enquiries between late morning till late evening, with the highest usage at 9pm. By uploading listings early and checking for enquiries at night, agents would be able to ensure that their listings have greater potential to be read while reducing their response time to home seekers.
 +
 +
4. REO should study the listings of new users
 +
*By studying the dashboard, users who are on the platform for less than a year are receiving an average of 5.5 enquiries per user. This number of enquiries is impressive when expressed to the proportion of listings posted. In short, it can be said that the listings of new users are more effective at lead generation as compared to those from the more experienced users. Therefore, REO should try to understand the underlying cause and if possible, promote the findings as best practises to the other agents.<br>
 +
 +
 +
==Roadmap for the future==
 +
During the construction of the dashboard, the team observed several issues. For example:
  
[[File:metadata.png
+
# There can potentially be tradeoffs between scalability and accuracy of the dashboard. The team is aiming to produce a dashboard with the ease of updating being an important feature. However, misleading figures might be produced due to the presence of outliers. One observable example would be the free users’ average daily listings creation count of 10. This is illogical as an unpaid user can only maintain a maximum of 5 listings in total. To allow for the production of more reliable visualisation, additional steps such as removal of outliers are inevitable. Therefore, the team will discuss with the project sponsor if there is a required or acceptable degree of trade-offs.
 +
# Unexplained spikes affecting both free and paid users’ sessions. On a few occasions, the total user sessions increased sharply before returning to its’ original level. One instance recorded an increase of over 400% in a single day, from 10,000 sessions to 50,000 sessions. The team decides to clarify with the project sponsor as the underlying cause for this phenomenon is unclear.
 +
# The amount of leads that agents received are largely disproportionate to the total amount of 12.9 thousand users. At a glance, the visualisation indicated that during the busiest hour, only 82.5 enquiries were generated in a single hour. This visualisation might frame REO as an ineffective platform. The team decided to clarify with the client if the amount of leads generated is satisfactory for the agents. Additionally, the team will explore the possibility of benchmarking the amount of leads to the total number of residential real estate transaction in order to better understand the proportion of REO’s slice of pie.

Revision as of 03:34, 26 February 2018


Back to ANLY482 AY2017-18 Home Page

HOME

 

ABOUT US

 

PROJECT PROPOSAL

 

PROJECT FINDINGS

 

PROJECT MANAGEMENT

 

DOCUMENTATION

 


Data Preparation

As the data were received in separated sheets, it is necessary to perform some transformation before they are ready for deeper analysis. The team only has either aggregated counts such as total number of enquiries per quarter for each user or system logs. The team are not able to analyse these data as the total counts have do not have sufficient meaning and the system logs are too messy.

Therefore, the objective of the data preparation is to convert the system logs into an intermediary stage where it analysis of value can be done easily while retaining sufficient meaning.

For Exploratory Data Analysis

Users:

  • Created “Members_Days” variable by counting the number of days passed by 31 Dec 2017 since the account was created. With the new variable, the team can analyse whether being a member for a longer time has impact on other factors.
  • Removal of users who joined after 31st December 2017 as they do not have at least 2 quarters of data available for analysis.
  • Created Status variable into a binary variable with 1 representing an “active” user. Users who were previously labelled as “blacklisted” in the dataset are removed from analysis due to their restricted activity.

Subscription:

  • Created “Paid” variable as a binary variable with 1 representing paid and 0 representing free. Previously, these statuses were represented with strings.

Sessions:

  • Created “DaysOfWeek” with Sunday to Saturday being labelled from 1 to 7 respectively.
  • Created “Session_Weekdays” and “Session_Weekends” columns based on the value from “DaysOfWeek”.
  • Created “Hours” to convert the time into their respective hours.
  • Created “WMAE” by categorising time into four exclusive 6 hours slot starting from 0000 hrs. They are represented by 1, 2, 3 and 4 respectively.
  • Created columns such as “SE_E1”, “SE_E2”, “SE_E3”, “SE_E4”, “SE_D1”, “SE_D2”, “SE_D3” and “SE_D4” to indicate the frequencies of sessions taking place under the specific timeslot.
  • Created the columns “SE_DO” and “SE_EO” to indicate the total number of sessions taking place during weekdays and weekends.
  • Created “SE_TOTAL” to tabulate the sum of user sessions. The purpose of these transformation is to allow for an easier exploration on whether when the users utilise the portal has an impact on lead generation.

Listing:

  • Created “DaysOfWeek” with Sunday to Saturday being labelled from 1 to 7 respectively.
  • Created “Listing_Weekdays” and “Listing_Weekends” columns based on the value from “DaysOfWeek”.
  • Created Hours to convert the time into their respective hours.
  • Created WMAE by categorising time into four exclusive 6 hours slot starting from 0000 hrs. They are represented by 1, 2, 3 and 4 respectively.
  • Created columns such as “LI_E1”, “LI_E2”, “LI_E3”, “LI_E4”, “LI_D1”, “LI_D2”, ”LI_D3” and “LI_D4” to indicate the frequencies of user sessions taking place during each respective time slot.
  • Created the columns “LI_DO” and “LI_EO” to indicate the total number of sessions taking place during weekdays and weekends.
  • Created “LI_TOTAL” to tabulate the total sum of listings.
  • Created two binary variables - Sync and Organic to replace the original variable represented with strings.

Enquiries and Cobroke:

  • Enquiries and Cobroke are treated as the dependant variables.
  • These two variables were broken down in a similar manner to observe for any insights on the usage behaviours of home buyers.
  • These variables were binned to represent the days they were created with “DaysofWeek”. They are then sorted into the respective bins which reflect if the day is a weekday or weekend. The bins are “CO_DO”, “CO_EO”, “EN_DO” and “EN_EO”.
  • Hours were binned into 6 hours slot – beginning from 0000 hrs. They are represented with the numbers 1 (0000hrs to 0600hrs), 2 (0600hrs to 1200hrs), 3 (1200hrs to 1800hrs) and 4 (1800hrs to 0000hrs). Created columns such as “EN_E1”, “EN_E2”, “EN_E3”, “EN_E4”, “EN_D1”, “EN_D2”, “EN_D3” and “EN_D4” to indicate the frequencies of user sessions taking place during each respective time slot. This treatment was not given to Cobroke because the team focuses on the leads acquire from primary home buyers.
  • Created the columns “CO_DO”, “CO_EO”, “EN_DO” and “EN_EO” to indicate the total number of enquiries and cobroke request received during weekdays and weekends.
  • Created “EN_TOTAL” and “CO_TOTAL” by summing “DO” and “EO” variables to reflect the sum of enquiries and cobroke requests each user receives.

For Construction of Dashboard

As mentioned in the earlier section, the purpose of the dashboard is meant for internal communication within REO. Therefore, it is essential that key trends on agents’ behaviour can be observed at a glance.

For the purpose of co-creating with the project sponsor, the team had decided to prepare a draft for the interim presentation. This is to ensure that the team can have a firmer grasp of the sponsor’s requirement. Most importantly, this is to eliminate any possibility of disagreement during final delivery.

The features included on the draft version of the dashboard are:

  • Trend analysis: to observe the movement of trends and identify periods with spikes/dips in activity.
  • Patterns identification: to observe for any user pattern based on hours/days.
  • User analysis: to understand the makeup of users at a quick glance with metrics such as membership length and proportion of paid/free users.

The team expects to make amendments to the available features in future iterations.

Steps and Rationale

As the team was unfamiliar with Google Data Studio, they experienced many difficulties from working with the tool. They are the inability to upload certain data files with constraints such as format (requires UTF-8 encoded file) and size. Additionally, the team must relearn the features and functions of the software to achieve the desired visualisation that best represent the data. Furthermore, the process of validating the accuracy of the results aggregated by the software was a tedious job. Any discrepancy was an indication that the uploaded merged data file was not tabulated correctly. There were several iterations being made before the data was finally able to be used effectively by the software.

The team could not simply combine all the data together in its rawest form onto the Data Studio all together because the combined file was too large with more than 2 million rows. The team then decided to reduce the file size while ensuring that the uploaded data are sufficient to construct the required visualisations.

The team first decided to merge the data in terms of date and hours to observe the trends of the activities conducted by users:

Sessions, Listings, Enquiries, Cobroke:

  • Created a “Date” variable to only capture the date portion from the “created_date”’s format of <dd:mm:yyyy hh:mm:ss>.
  • Each data files are tabulated with columns “Date”, “Hours”, “# Sum of variables”.
  • Starting with the data table tabulated from the Sessions file, files are being merged by matching columns “Data” and “Hours”.
  • For missing values found in some of the rows were all recode as value 0. As there were no activity, the user log did not contain any value, thus, it may appear to be missing values.
  • The final merged file [dashboard1.0] contains columns “Date”, “Hours”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”.

The above data file does not capture any users’ characteristics such as their subscription plan or their membership length with REO due to the inability to upload the full dataset on Google Data Studio. Therefore, the team explored alternative ways to represent all the required data in a single sheet. The sheet captures the activity level of each user on separate dates. At a single glance, one can tell when a user is active and the total activity count for that specific date.

Users:

  • Created “Mem_Bin” variable which bin the “Members_Days” by years (<1 year, <2 years, <3 years, >3 years), to study the difference in the activity level by users of different membership length.

Sessions, Listings, Enquiries, Cobroke:

  • Each data files are tabulated with columns “user_id”, “Date”, “#Sum of variable”
  • For Users, Subscription, and Sessions files, they were merged if the “user_id” are present on all 3 sheets. Users lacking data in these areas were eliminated for further analysis as the team is not able to obtain a complete view of such users. For example, a user without the subscription variable cannot be included under paid nor free users.
  • For missing values found in activities rows except sessions were all recode as value 0, as the missing values represented no activity being made by the particular user at that date.

The final merged file [dashboard5.0] contains columns “user_id”, “Paid”, “Mem_bin”, “Member_Days”, “Date”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”.

While the data file [dashboard 5.0] were accurately merged and represented on Google Data Studio, the aggregated average score of each activity for different profile of the users are inaccurately represented. For example, having multiple and uneven amount of entries belonging to the same user had caused the dashboard to wrongly compute statistics such as proportion of free to paid users. This was due to the nature of the data which has “users_id” duplicated for each “Date” where activity happened. Therefore, a tabulated file was made to remove the “Date” information and collate the activity level by each user over the same period.

The new tabulated file [dashboard5.2] contains columns “user_id”, “Paid”, “Mem_bin”, “Member_Days”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”.

With this new additional data set, average scores of each activity per user can be tabulated correctly. Furthermore, average scores of each activity per user can be broken down to observe the difference by different subscription plans and the length of membership with REO.

Data Description

Correlation across all variables

File:Corelation.png Some of the variables are highly correlated with each other, notably the total number of listings and total number of sessions as well as the number of organic listings and total number of listings. The former suggests that the more frequent a user logs onto the portal, the more posting the user posts. The latter relationship suggests that most of the listings posted are organic compared to synced.

File:Session.png The distribution of the total number of sessions as seen is positively skewed so the measures for the distribution regarding central tendency and dispersion are not very accurate. This is replicated across all variables created. As such, the team explores the option of splitting the users into paid and free. Based on our sponsor meeting and secondary research, we have also gathered that free users have restrictions regarding the number of listings they can post which could influence the number of sessions they logged onto the portal and number of enquiries they receive as well. There are 7536 free users and 5331 paid users in the dataset.

Number of Days since Registration

File:Paidfree.png The average number of days since registration for paid users is 823.2 which is around 2 years 3 months. It is also interesting to note that there is a spike of paid users (1110 users) joining between 975 days and 999 days. This could be matched with the period where there is publicity for REO due to the circulation of articles regarding REO receiving funding from Facebook co-founder, Sequoia Capital. Being associated with a well-known organisation could help them to gain users. On the other hand, the average number of days joined for free users is 812 which is around the same as paid users. The pattern seems similar as the paid users where the spike takes place between 975 days and 999 days.

Key Activity Indicators

File:Keyactivity.png Looking at variables representing activity level of an user (i.e. number of cobrokes and enquiries received, the average number of listings posted as well as the average number of sessions posted), we can see the activity level of a paid user is higher than a free user across all variables.

Average Activity across Weekday and Weekend

File:Wdwe.png To compare the number of sessions on weekdays and weekends, we have applied different weights to the total number of session (0.5 for weekends and 0.2 for weekdays) because of the number of days in the week. There is higher usage on weekends compared to weekdays as seen from the graphic above for both listings and number of sessions logged. This shows that there could be a segment of paid users that are active only on weekends.


Average Activity across Timeslots

File:Avgact.png The team has previously established that the activity levels for paid and free users are different. As such, when we investigate the number of listings posted and sessions logged in 4 different timeslots, as seen below. File:Timeslot.png As the nature of a property agent is flexible, the team hypothesised that the users would display an online pattern of appearing in the afternoon. This is because they are not restrained to a nine-to-six working hours. However, based on the graph above, we can conclude that the activity levels across timeslots 2,3 and 4 are rather similar.

Type of Listings

File:Listing.png As mentioned before, synced listings refer to listings that are imported to REO while organic listings refer to listings that are created on REO. Despite posting similar number of synced listings as free users, paid users post significantly more organic listings than the free users. This is beneficial for REO since it shows that paid users priortise using REO compared to other platforms. This is especially relevant in today’s age where users can be subscribing to multiple platforms to post their listings.

Cluster Analysis

Having explored the differences between paid and free users, the team decides to perform a cluster analysis via an adapted approach suggested by Punj and Stewart (1983). The approach is: 1. K-nearest neighbour algorithm to detect outliers 2. Calculating Cubic Clustering Criterion (CCC) to decide on number of clusters 3. K-means to obtain final cluster solution The team hopes to identify segments within the paid or free users such that REO can formulate strategies for each segment to effectively engage them. The team decides to exclude entries that have zero enquiries and cobroke requests as enquiries and cobroke requests are the tangible outputs from the platform. Thus, these users have not received any outputs from the platform for the last 6 months which may not be relevant for our cluster analysis. In addition, the technique of k-means clustering only include entries that are fully filled. As such, the team needs to perform missing value imputation by replacing missing data with zero for the analysis to run.

Decision on Clustering Variables The team has chosen to use key activity indicators from the previous analysis which include the number of sessions, number of listings posted, number of enquiries and number of Cobroke requests received. The number of sessions and listings are then further broken down into weekday and weekend as well as each individual timeslot. File:Clustering.png

Examining Clustering Variables The team has conducted a descriptive analysis for all clustering variables to identify any outliers and check for skewness. File:Clustering 1.png Through identifying extreme outliers that are significantly different from the other points, the team has excluded rows 1328, 3229, 4727 and 6442.

Transforming the Variables Given that the variables are positively skewed, the team has decided to transform the variable through applying cube root. Logarithmic transformation is not used because there are up to 50% of zero entry in “ORGANIC” and “Cobroke_total” variables. However, we noticed that both variables mentioned previously are still slightly skewed because of the number of zeroes. File:Clubroot.png

K-Means Clustering The cubic clustering criterion (CCC) is used to estimate the number of clusters using Ward’s minimum within-cluster variance method. File:Ccc.png Based on c5 clusters are chosen

Profiling of Clusters File:Profile.png The z-score ranking method is a way to profile the clusters. Z-score is calculated to understand how many standard deviation the cluster’s variable mean is away from the variable mean. A positive number refers to being above the variable mean while a negative number refers to being below the variable mean. File:Paidzscore.png File:Paidzscore.png File:Paidcluster.png

Free Users

After filtering out, there are 5583 users who use the platform for free but receive at least one enquiry or cobroke request.

Decision on Clustering Variables File:Clustering.pngExamining Clustering Variables The team has conducted a descriptive analysis for all clustering variables to identify any outliers and check for skewness.

The team decides to exclude row 10583 because it is an extreme outlier.

Transforming the Variables Similarly, we performed a cube-root transformation for all clustering variables.

Screening Outliers K-nearest neighbours algorithm is a form of hierarchical clustering to identify any outliers. The team decides to exclude 11 rows with the distance to 1 closest above 4.0.

K-Means Clustering

Based on the maximum point for cubic clustering criterion, the optimal number of clusters is 7.


Profiling of Clusters

Similarly, we calculated the z-score for individual variable for each cluster.


Managerial Recommendation

1. REO should market the platform better to home seekers

  • Variables representing agents’ activity level 0.0 Pearson correlation to enquiries and cobroke. Sync listings and sessions have the highest score of 0.33 and 0.29 respectively. This suggests that agents’ activity level can possibly be independent from enquiries received. Therefore, if REO wants to improve the amount of leads for their subscribed users, they should consider increasing total visitors count onto their platform.

2. REO should investigate the profile of certain users

  • The exploratory clustering analysis has indicated that a group of free users are heavily utilising the platform – as indicated by the high z-score of 7.79 in organic listings and 3.88 in sessions. REO should investigate these accounts to uncover the reason behind their ability to bypass the limit of having only a total of 5 posts.

3. REO should encourage agents to upload new listings as early as possible

  • Based on the trendline of enquires, home seekers are actively sending out enquiries between late morning till late evening, with the highest usage at 9pm. By uploading listings early and checking for enquiries at night, agents would be able to ensure that their listings have greater potential to be read while reducing their response time to home seekers.

4. REO should study the listings of new users

  • By studying the dashboard, users who are on the platform for less than a year are receiving an average of 5.5 enquiries per user. This number of enquiries is impressive when expressed to the proportion of listings posted. In short, it can be said that the listings of new users are more effective at lead generation as compared to those from the more experienced users. Therefore, REO should try to understand the underlying cause and if possible, promote the findings as best practises to the other agents.

  �

Roadmap for the future

During the construction of the dashboard, the team observed several issues. For example:

  1. There can potentially be tradeoffs between scalability and accuracy of the dashboard. The team is aiming to produce a dashboard with the ease of updating being an important feature. However, misleading figures might be produced due to the presence of outliers. One observable example would be the free users’ average daily listings creation count of 10. This is illogical as an unpaid user can only maintain a maximum of 5 listings in total. To allow for the production of more reliable visualisation, additional steps such as removal of outliers are inevitable. Therefore, the team will discuss with the project sponsor if there is a required or acceptable degree of trade-offs.
  2. Unexplained spikes affecting both free and paid users’ sessions. On a few occasions, the total user sessions increased sharply before returning to its’ original level. One instance recorded an increase of over 400% in a single day, from 10,000 sessions to 50,000 sessions. The team decides to clarify with the project sponsor as the underlying cause for this phenomenon is unclear.
  3. The amount of leads that agents received are largely disproportionate to the total amount of 12.9 thousand users. At a glance, the visualisation indicated that during the busiest hour, only 82.5 enquiries were generated in a single hour. This visualisation might frame REO as an ineffective platform. The team decided to clarify with the client if the amount of leads generated is satisfactory for the agents. Additionally, the team will explore the possibility of benchmarking the amount of leads to the total number of residential real estate transaction in order to better understand the proportion of REO’s slice of pie.