As the data were received in separate sheets, some transformation is necessary before they are ready for deeper analysis. The team only has either aggregated counts, such as the total number of enquiries per quarter for each user, or raw system logs. Neither is suitable for direct analysis: the aggregated counts do not carry sufficient meaning, while the system logs are too specific and messy.

Therefore, the objective of the data preparation is to convert the system logs into an intermediary form in which analysis of value can be done easily while retaining sufficient meaning.
 
  
 
==For Exploratory Data Analysis==
Users: <br>

* Created “Members_Days” variable by counting the number of days that had passed between the account creation date and 31 Dec 2017. With this new variable, the team can analyse whether being a member for a longer time has an impact on other factors.

* Removed users who joined after 31st December 2017, as they do not have at least 2 quarters of data available for analysis.

* Converted the “Status” variable into a binary variable, with 1 representing an “active” user. 152 users (0.66%) who were previously labelled as “blacklisted” in the dataset were removed from the analysis due to their restricted activity level. A sketch of these steps is given below.
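A minimal pandas sketch of these steps is shown below. The toy rows and the raw column names (“created_date”, “status”) are illustrative assumptions rather than the project’s actual schema.

<syntaxhighlight lang="python">
import pandas as pd

# Toy rows standing in for the Users sheet (column names are assumptions).
users = pd.DataFrame({
    "user_id":      [1, 2, 3, 4],
    "created_date": pd.to_datetime(["2015-03-01", "2017-11-20", "2018-01-05", "2016-06-30"]),
    "status":       ["active", "active", "active", "blacklisted"],
})

CUTOFF = pd.Timestamp("2017-12-31")

# "Members_Days": number of days between account creation and 31 Dec 2017.
users["Members_Days"] = (CUTOFF - users["created_date"]).dt.days

# Drop users who joined after the cut-off date as well as blacklisted users,
# then recode the remaining status strings into a binary flag (1 = "active").
users = users[(users["created_date"] <= CUTOFF) & (users["status"] != "blacklisted")]
users["Status"] = (users["status"] == "active").astype(int)
</syntaxhighlight>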
 
 
Subscription: <br>
 
* Created “Paid” variable as a binary variable with 1 representing paid and 0 representing free. Previously, these statuses were represented with strings.
 
 
 
Sessions: <br>

* Created “DaysOfWeek”, labelling Sunday to Saturday from 1 to 7 respectively.

* Created “Session_Weekdays” and “Session_Weekends” columns based on the value of “DaysOfWeek”.

* Created “Hours” to convert the time into their respective hours.

* Created “WMAE” by categorising time into four exclusive 6-hour slots starting from 0000 hrs, represented by 1, 2, 3 and 4 respectively.

* Created the columns “SE_E1”, “SE_E2”, “SE_E3”, “SE_E4”, “SE_D1”, “SE_D2”, “SE_D3” and “SE_D4” to indicate the frequency of sessions taking place in each specific timeslot.

* Created the columns “SE_DO” and “SE_EO” to indicate the total number of sessions taking place during weekdays and weekends.

* Created “SE_TOTAL” to tabulate the sum of user sessions. The purpose of these transformations is to allow easier exploration of whether the time at which users utilise the portal has an impact on lead generation. A sketch of this binning is given below.
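The day-of-week and 6-hour-slot binning could be expressed in pandas roughly as follows; the same pattern applies to the Listing, Enquiries and Cobroke files with their respective prefixes. The toy rows and raw column names are assumptions, and the slot numbering follows the 0000 hrs-based definition used throughout this section.

<syntaxhighlight lang="python">
import pandas as pd

# Toy session log standing in for the Sessions file (column names are assumptions).
sessions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "created_date": pd.to_datetime([
        "2017-07-02 08:10",   # Sunday, slot 2
        "2017-07-05 21:30",   # Wednesday, slot 4
        "2017-07-08 13:00",   # Saturday, slot 3
        "2017-07-10 02:15",   # Monday, slot 1
        "2017-07-12 17:45",   # Wednesday, slot 3
    ]),
})

ts = sessions["created_date"]
# "DaysOfWeek": 1 (Sunday) to 7 (Saturday); pandas uses 0 = Monday, so remap.
sessions["DaysOfWeek"] = (ts.dt.dayofweek + 1) % 7 + 1
sessions["is_weekend"] = sessions["DaysOfWeek"].isin([1, 7])
sessions["Hours"] = ts.dt.hour
# Four exclusive 6-hour slots starting from 0000 hrs (called "WMAE" in the text).
sessions["slot"] = ts.dt.hour // 6 + 1

# Count sessions per user in each weekend (E) / weekday (D) and slot bucket,
# producing the SE_E1..SE_E4 and SE_D1..SE_D4 columns.
sessions["bucket"] = (sessions["is_weekend"].map({True: "SE_E", False: "SE_D"})
                      + sessions["slot"].astype(str))
counts = sessions.pivot_table(index="user_id", columns="bucket",
                              aggfunc="size", fill_value=0)

counts["SE_DO"] = counts.filter(like="SE_D").sum(axis=1)   # weekday total
counts["SE_EO"] = counts.filter(like="SE_E").sum(axis=1)   # weekend total
counts["SE_TOTAL"] = counts["SE_DO"] + counts["SE_EO"]
</syntaxhighlight>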
 
 
 
Listing: <br>

* Created “DaysOfWeek”, labelling Sunday to Saturday from 1 to 7 respectively.

* Created “Listing_Weekdays” and “Listing_Weekends” columns based on the value of “DaysOfWeek”.

* Created “Hours” to convert the time into their respective hours.

* Created “WMAE” by categorising time into four exclusive 6-hour slots starting from 0000 hrs, represented by 1, 2, 3 and 4 respectively.

* Created columns such as “LI_E1”, “LI_E2”, “LI_E3”, “LI_E4”, “LI_D1”, “LI_D2”, “LI_D3” and “LI_D4” to indicate the frequency of listings created during each respective time slot.

* Created the columns “LI_DO” and “LI_EO” to indicate the total number of listings created during weekdays and weekends.

* Created “LI_TOTAL” to tabulate the total sum of listings.

* Created two binary variables, Sync and Organic, to replace the original variable that was represented with strings.
 
 
 
Enquiries and Cobroke: <br>

* Enquiries and Cobroke are treated as the dependent variables.

* These two variables were broken down in a similar manner to observe any insights into the usage behaviour of home buyers.

* These variables were binned by the day on which they were created using “DaysOfWeek”. They are then sorted into the respective bins which reflect whether the day is a weekday or a weekend. The bins are “CO_DO”, “CO_EO”, “EN_DO” and “EN_EO”.

* Hours were binned into 6-hour slots beginning from 0000 hrs, represented by the numbers 1 (0000 hrs to 0600 hrs), 2 (0600 hrs to 1200 hrs), 3 (1200 hrs to 1800 hrs) and 4 (1800 hrs to 0000 hrs). Created columns such as “EN_E1”, “EN_E2”, “EN_E3”, “EN_E4”, “EN_D1”, “EN_D2”, “EN_D3” and “EN_D4” to indicate the frequency of enquiries received during each respective time slot. This treatment was not given to Cobroke because the team focuses on the leads acquired from primary home buyers.

* Created the columns “CO_DO”, “CO_EO”, “EN_DO” and “EN_EO” to indicate the total number of enquiries and cobroke requests received during weekdays and weekends.

* Created “EN_TOTAL” and “CO_TOTAL” by summing the “DO” and “EO” variables to reflect the total number of enquiries and cobroke requests each user receives.
 
 
 
Merged file [Final Dataset]:
 
All the intermediary files were first tabulated with “User_id” as the first column before merging all the files together with the fields stated in the table below.
 
[[File:REO_merge_file.png|500px]]
 
* Created columns “SE_E_Average” and “SE_D_Average” to find the average sessions on weekends and weekdays respectively.

* Created columns “SE_1_Average”, “SE_2_Average”, “SE_3_Average” and “SE_4_Average” to find the average sessions in each 6-hour slot.

* Created columns “LI_E_Average” and “LI_D_Average” to find the average listings on weekends and weekdays respectively.

* Created columns “LI_1_Average”, “LI_2_Average”, “LI_3_Average” and “LI_4_Average” to find the average listings in each 6-hour slot (a sketch of the merge that produces this file is given below).
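A sketch of the merge that produces the final dataset, assuming each intermediary file has already been collapsed to one row per “User_id”; the toy tables and the left-join direction are illustrative assumptions.

<syntaxhighlight lang="python">
import pandas as pd
from functools import reduce

# Toy per-user tables standing in for the intermediary files (one row per user_id).
users     = pd.DataFrame({"user_id": [1, 2, 3], "Members_Days": [400, 120, 850]})
sessions  = pd.DataFrame({"user_id": [1, 3], "SE_TOTAL": [25, 7]})
listings  = pd.DataFrame({"user_id": [1, 2], "LI_TOTAL": [10, 3]})
enquiries = pd.DataFrame({"user_id": [3], "EN_TOTAL": [4]})

# Merge every intermediary table onto the Users table by "user_id".
final = reduce(lambda left, right: left.merge(right, on="user_id", how="left"),
               [users, sessions, listings, enquiries])

# A user absent from an activity file simply had no such activity,
# so the resulting NaNs are recoded as 0 (mirroring the later dashboard preparation).
activity_cols = ["SE_TOTAL", "LI_TOTAL", "EN_TOTAL"]
final[activity_cols] = final[activity_cols].fillna(0)
</syntaxhighlight>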
 
 
 
 
 
In addition, per-user tabulations were prepared from the individual data files:

<br>For Data File <u>Sessions</u>,
* Tabulated the data by each “user_id”.
* Created “hours” column to convert the “created_date” into their respective hours (0 as 12am to 23 as 11pm).
* Created “month” to convert the “created_date” into their respective month, from July to December 2017.
* Created “day_of_week” to convert the “created_date” into 1 to 7, representing Sunday to Saturday respectively.
* Included the “plan” column from the Subscription file.

<br>For Data File <u>Listings</u>,
* Tabulated the data by each “user_id”.
* Created “hours” column to convert the “created_date” into their respective hours (0 as 12am to 23 as 11pm).
* Created “month” to convert the “created_date” into their respective month, from July to December 2017.
* Created “day_of_week” to convert the “created_date” into 1 to 7, representing Sunday to Saturday respectively.
* Tabulated the data by each “user_id” with “source_type” as columns.
* Included the “plan” column from the Subscription file.

<br>For Data File <u>Cobroke</u>,
* Tabulated the data by each “user_id”.
* Included the “plan” column from the Subscription file.
* Created “month” to convert the “created_date” into their respective month, from July to December 2017.

<br>For Data File <u>Enquiries</u>,
* Tabulated the data by each “user_id”.
* Tabulated the data by each “user_id” with “enquiries_type” as columns.
* Included the “plan” column from the Subscription file.
* Created “month” to convert the “created_date” into their respective month, from July to December 2017.


==For Construction of Dashboard==
 
As mentioned in the earlier section, the dashboard is meant for internal communication within REO. Therefore, it is essential that key trends in agents’ behaviour can be observed at a glance.
 
 
 
For the purpose of co-creating with the project sponsor, the team decided to prepare a draft for the interim presentation. This ensures that the team has a firmer grasp of the sponsor’s requirements and, most importantly, eliminates any possibility of disagreement during the final delivery.



The features included in the draft version of the dashboard are:

* Trend analysis: to observe the movement of trends and identify periods with spikes/dips in activity.

* Pattern identification: to observe any user patterns based on hours/days.

* User analysis: to understand the makeup of users at a quick glance with metrics such as membership length and the proportion of paid/free users.
 
 
 
The team expects to make amendments to the available features in future iterations. <br>
 
<br>
 
  
<u>Steps and Rationale</u> <br>

As the team was unfamiliar with Google Data Studio, they experienced many difficulties while working with the tool, such as the inability to upload certain data files due to constraints on format (a UTF-8 encoded file is required) and size. Additionally, the team had to relearn the features and functions of the software to achieve the visualisations that best represent the data. Furthermore, validating the accuracy of the results aggregated by the software was a tedious job: any discrepancy was an indication that the uploaded merged data file had not been tabulated correctly. Several iterations were made before the data could finally be used effectively by the software.


The team could not simply combine all the data in its rawest form onto Data Studio because the combined file was too large, with more than 2 million rows. The team then decided to reduce the file size while ensuring that the uploaded data remained sufficient to construct the required visualisations.


The team first decided to merge the data by date and hours to observe the trends in the activities conducted by users:


Sessions, Listings, Enquiries, Cobroke: <br>

* Created a “Date” variable to capture only the date portion from “created_date”, which is in the format <dd:mm:yyyy hh:mm:ss>.

* Each data file is tabulated with the columns “Date”, “Hours” and “# Sum of variables”.

* Starting with the data table tabulated from the Sessions file, the files are merged by matching the columns “Date” and “Hours”.

* Missing values found in some of the rows were all recoded as 0. As there was no activity, the user log did not contain any value for those rows, so they appear as missing values.

* The final merged file [dashboard1.0] contains the columns “Date”, “Hours”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”. A sketch of this merge is given below.
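A hedged sketch of how such a date-and-hour table could be built and merged in pandas; the toy logs, the hourly_counts helper and the outer-join choice are illustrative assumptions rather than the team’s exact procedure.

<syntaxhighlight lang="python">
import pandas as pd
from functools import reduce

def hourly_counts(log, name):
    """Collapse a raw event log (one row per event) into a count per Date and Hours."""
    out = log.assign(Date=log["created_date"].dt.date,
                     Hours=log["created_date"].dt.hour)
    return out.groupby(["Date", "Hours"]).size().rename(name).reset_index()

# Toy event logs standing in for the Sessions and Enquiries files.
sessions  = pd.DataFrame({"created_date": pd.to_datetime(
    ["2017-07-01 08:10", "2017-07-01 08:45", "2017-07-01 13:00"])})
enquiries = pd.DataFrame({"created_date": pd.to_datetime(["2017-07-01 13:20"])})

tables = [hourly_counts(sessions, "Sessions"), hourly_counts(enquiries, "Enquiries")]

# Merge the hourly tables on Date and Hours. Hours with no activity in a given file
# appear as NaN and are recoded as 0, as described above; an outer join keeps hours
# that appear in only one file (the team's exact join direction may differ).
dashboard1 = reduce(lambda left, right: left.merge(right, on=["Date", "Hours"], how="outer"),
                    tables)
dashboard1 = dashboard1.fillna(0).sort_values(["Date", "Hours"]).reset_index(drop=True)
</syntaxhighlight>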
 
  
The above data file does not capture any users’ characteristics, such as their subscription plan or their membership length with REO, due to the inability to upload the full dataset onto Google Data Studio. Therefore, the team explored alternative ways to represent all the required data in a single sheet. The sheet captures the activity level of each user on separate dates. At a single glance, one can tell when a user was active and the total activity count for that specific date.
 
  
Users: <br>
* Created the “Mem_Bin” variable, which bins “Members_Days” by years (<1 year, <2 years, <3 years, >3 years), to study the difference in activity level between users of different membership lengths. A sketch of this binning follows.
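A small sketch of this binning using pandas; the 365-day band edges are an assumption about how the year boundaries were defined.

<syntaxhighlight lang="python">
import pandas as pd

members_days = pd.Series([120, 400, 800, 1500], name="Members_Days")  # toy values

# "<1 year", "<2 years", "<3 years", ">3 years" bands, assuming 365-day years.
mem_bin = pd.cut(members_days,
                 bins=[0, 365, 730, 1095, float("inf")],
                 labels=["<1 year", "<2 years", "<3 years", ">3 years"],
                 right=False)
</syntaxhighlight>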

Sessions, Listings, Enquiries, Cobroke: <br>

* Each data file is tabulated with the columns “user_id”, “Date” and “# Sum of variable”.

* The Users, Subscription and Sessions files were merged only where the “user_id” is present on all 3 sheets. Users lacking data in any of these areas were excluded from further analysis, as the team is not able to obtain a complete view of such users. For example, a user without the subscription variable cannot be counted under either paid or free users.

* Missing values in the activity rows (except sessions) were all recoded as 0, as a missing value represents no activity by the particular user on that date.
  
The final merged file [dashboard5.0] contains the columns “user_id”, “Paid”, “Mem_bin”, “Member_Days”, “Date”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”.
  
While the data file [dashboard5.0] was accurately merged and represented on Google Data Studio, the aggregated average score of each activity for different profiles of users was inaccurately represented. For example, multiple and uneven numbers of entries belonging to the same user caused the dashboard to wrongly compute statistics such as the proportion of free to paid users. This was due to the nature of the data, in which “user_id” is duplicated for each “Date” on which activity happened. Therefore, a tabulated file was made to remove the “Date” information and collate the activity level of each user over the same period.
 
  
The new tabulated file [dashboard5.2] contains the columns “user_id”, “Paid”, “Mem_bin”, “Member_Days”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”. A sketch of this collapsing step is shown below.
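A sketch of this collapsing step, assuming the dashboard5.0 layout described above; only a few toy rows and columns are shown, and the remaining profile and activity columns would be handled the same way.

<syntaxhighlight lang="python">
import pandas as pd

# Toy rows in the dashboard5.0 shape: one row per user_id per Date.
d50 = pd.DataFrame({
    "user_id":   [1, 1, 2],
    "Paid":      [1, 1, 0],
    "Mem_bin":   ["<2 years", "<2 years", "<1 year"],
    "Date":      ["2017-07-01", "2017-07-02", "2017-07-01"],
    "Sessions":  [3, 1, 2],
    "Listing":   [1, 0, 0],
    "Enquiries": [0, 2, 1],
})

# Drop the Date dimension: keep each user's profile columns once and sum the activity
# columns, so every user contributes exactly one row (the dashboard5.2 shape).
d52 = (d50.groupby("user_id", as_index=False)
          .agg({"Paid": "first", "Mem_bin": "first",
                "Sessions": "sum", "Listing": "sum", "Enquiries": "sum"}))
</syntaxhighlight>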
  
With this new additional dataset, the average scores of each activity per user can be tabulated correctly. Furthermore, these averages can be broken down to observe the differences across subscription plans and lengths of membership with REO.


==For Clustering Analysis==
Step 1: <br>For the “Users” file, the date joined was converted into membership length with REO based on the number of days. For the other original files, which recorded activity levels, the logs were collapsed into an aggregated count per user. For the “Listing” and “Enquiries” files, a user can post both forms of listings as well as receive different kinds of enquiries, so unlike the subscription plan these categories are not mutually exclusive. As such, the “Listing” and “Enquiries” files were each separated into two different variables.

<br>Step 2: <br>
[[File:REO_fig2.png|300px]]<br>
The files were merged into rows identified by a unique user_id. First, each user’s membership length and subscription plan status were recorded. All the other activity counts for each user were subsequently merged, giving a total of 20,762 users. The merged dataset contains these variables: sessions, organic listings, synced listings, cobroke, enquiries from agents and enquiries from consumers.
<br>[[File:REO_table1.png|400px]]

<br>Step 3: <br>
[[File:REO_table2.png|300px]]<br>
As missing values indicate a lack of recorded activity, they can be represented by “0”. Recording missing values as zero is crucial because the cluster analysis would otherwise exclude rows with missing values.

<b>Choice of Users</b>
The choice of users is not narrowed down by the type of plan they use because, depending on their behaviour, users have the potential to be retained if they are paid users or converted if they are free users. As the objective is to identify segments to improve engagement, paid and free users are not differentiated in the clustering analysis.

===Variables Selection for Clustering Analysis===
[[File:REO_table3.png|400px]]<br>
As clustering analysis is an unsupervised method, it is necessary to consider all the initial variables and to eliminate any of them as inputs only with justification. Multivariate analysis was conducted for all variables. “Sessions” and “Organic listings” (r = 0.9065), and “Enquiries_agent” and “Enquiries_consumer” (r = 0.7344), have correlation coefficients above 0.7, meaning that these two pairs of variables are strongly correlated. From a statistical perspective, one variable from each offending pair needs to be dropped to prevent overrepresentation in the cluster analysis. <br>
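A sketch of this correlation screen in pandas. The DataFrame below is a toy stand-in for the merged per-user dataset from Step 2 (with Organic deliberately tied to Sessions so that the 0.7 threshold flags at least one pair); it is not the project data.

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
sessions = rng.poisson(5, n)

# Toy stand-in for the merged per-user dataset described in Step 2.
df = pd.DataFrame({
    "Sessions":           sessions,
    "Organic":            sessions + rng.poisson(1, n),   # deliberately correlated
    "Synced":             rng.poisson(2, n),
    "Cobroke":            rng.poisson(1, n),
    "Enquiries_agent":    rng.poisson(1, n),
    "Enquiries_consumer": rng.poisson(2, n),
})

# Flag variable pairs whose absolute correlation exceeds 0.7.
corr = df.corr()
pairs = [(a, b, round(corr.loc[a, b], 4))
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if abs(corr.loc[a, b]) > 0.7]
print(pairs)  # on the project data, this screen flagged Sessions with Organic and the two enquiry variables
</syntaxhighlight>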
The Organic variable was dropped in favour of the Sessions variable, as Organic is even more skewed, as indicated by the quantile values in the figure below.
<br>[[File:REO_fig3.png|400px]]
<br>Enquiries from Agents were likewise dropped in favour of Enquiries from Consumers. This decision was arrived at with input from the sponsor, who indicated that a significant proportion of agent-to-agent enquiries were unwanted or “spam”. From a business perspective, this variable might therefore be a poorer representation of actual commercial interest than enquiries from consumers. <br>
The Member_Days variable was not included in the clustering analysis as the sponsor mentioned that accounts could be mass purchased; such accounts share the same membership length, which is therefore not indicative of the number of days each agent has actually spent on the portal. <br>
Therefore, the final four variables selected for the cluster analysis are: Sessions, Synced, Cobroke and Enquiries from Consumers. These clustering variables are specific and observable. Their descriptions are as follows:
<br>[[File:REO_table4.png|600px]]
<br>Before the clustering analysis was conducted, observations with no sessions over the past six months were removed. These users are likely to have abandoned the use of the portal, and meaningful observations are unlikely to surface from studying them through the current dataset. With their removal using the data filter, 7,895 observations were excluded from the initial 20,762. These observations will be grouped under Cluster Number 0 as abandoned users.
<br>[[File:REO_fig4.png|600px]]
<br>The distributions of all continuous variables were reviewed to check the spread of the data. As seen above, the variables are highly skewed. Therefore, a transformation was conducted through the global Johnson transformation option in the cluster analysis. As the ranges of values for the variables were reasonably close, there was no need for standardization.
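A rough sketch of the zero-session filter and of the transformation idea. The “global Johnson transformation” option belongs to the clustering tool the team used and selects a Johnson family automatically; the code below only approximates it by fitting the Johnson SU family with SciPy on toy skewed counts, so it is not the exact procedure.

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
# Toy, heavily skewed per-user counts standing in for the four clustering variables.
df = pd.DataFrame({
    "Sessions":           rng.lognormal(2.0, 1.0, 500).round(),
    "Synced":             rng.lognormal(0.5, 1.2, 500).round(),
    "Cobroke":            rng.lognormal(0.3, 1.0, 500).round(),
    "Enquiries_consumer": rng.lognormal(0.8, 1.1, 500).round(),
})

# Remove users with no sessions over the period (kept aside as "Cluster Number 0").
active = df[df["Sessions"] > 0].copy()

# Approximate the Johnson transformation with a Johnson SU fit per variable:
# if X ~ JohnsonSU(a, b, loc, scale), then a + b * arcsinh((X - loc) / scale) ~ N(0, 1).
for col in list(active.columns):
    a, b, loc, scale = stats.johnsonsu.fit(active[col])
    active[col + "_J"] = a + b * np.arcsinh((active[col] - loc) / scale)
</syntaxhighlight>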
<br>[[File:REO_fig5.png|600px]]
<br>After conducting the global transformation, the observations appear to follow more of a normal distribution curve. However, it is evident that the Global Johnson Transformation did not fully resolve the outlier issues.
