Difference between revisions of "REO Project Findings"

From Analytics Practicum
Jump to navigation Jump to search
Line 38: Line 38:
 
| style="padding: 0.25em; font-size: 90%; border-top: 1px solid #cccccc; border-left: 1px solid #cccccc; border-right: 1px solid #cccccc; border-bottom: 1px solid #cccccc; text-align:center; background-color: none; width:30%" | [[REO_Project_Findings_EDA | <font color="#053B6B">Exploratory Data Analysis</font>]]
 
| style="padding: 0.25em; font-size: 90%; border-top: 1px solid #cccccc; border-left: 1px solid #cccccc; border-right: 1px solid #cccccc; border-bottom: 1px solid #cccccc; text-align:center; background-color: none; width:30%" | [[REO_Project_Findings_EDA | <font color="#053B6B">Exploratory Data Analysis</font>]]
  
| style="padding: 0.25em; font-size: 90%; border-top: 1px solid #cccccc; border-left: 1px solid #cccccc; border-right: 1px solid #cccccc; border-bottom: 1px solid #cccccc; text-align:center; background-color: none; width:30%" | [[REO_Project_Findings_Clustering | <font color="#053B6B">Clustering Analysis</font>]]
+
| style="padding: 0.25em; font-size: 90%; border-top: 1px solid #cccccc; border-left: 1px solid #cccccc; border-right: 1px solid #cccccc; border-bottom: 1px solid #cccccc; text-align:center; background-color: none; width:30%" | [[REO_Project_Findings_Cluster | <font color="#053B6B">Cluster Analysis</font>]]
 
|}
 
|}
  

Revision as of 23:01, 26 February 2018


Back to ANLY482 AY2017-18 Home Page

HOME

 

ABOUT US

 

PROJECT PROPOSAL

 

PROJECT FINDINGS

 

PROJECT MANAGEMENT

 

DOCUMENTATION

 


Data Preparation Exploratory Data Analysis Cluster Analysis


Data Preparation

As the data were received in separated sheets, it is necessary to perform some transformation before they are ready for deeper analysis. The team only has either aggregated counts such as total number of enquiries per quarter for each user or system logs. The team are not able to analyse these data as the total counts have do not have sufficient meaning and the system logs are too messy.

Therefore, the objective of the data preparation is to convert the system logs into an intermediary stage where it analysis of value can be done easily while retaining sufficient meaning.

For Exploratory Data Analysis

Users:

  • Created “Members_Days” variable by counting the number of days passed by 31 Dec 2017 since the account was created. With the new variable, the team can analyse whether being a member for a longer time has impact on other factors.
  • Removal of users who joined after 31st December 2017 as they do not have at least 2 quarters of data available for analysis.
  • Created Status variable into a binary variable with 1 representing an “active” user. Users who were previously labelled as “blacklisted” in the dataset are removed from analysis due to their restricted activity.

Subscription:

  • Created “Paid” variable as a binary variable with 1 representing paid and 0 representing free. Previously, these statuses were represented with strings.

Sessions:

  • Created “DaysOfWeek” with Sunday to Saturday being labelled from 1 to 7 respectively.
  • Created “Session_Weekdays” and “Session_Weekends” columns based on the value from “DaysOfWeek”.
  • Created “Hours” to convert the time into their respective hours.
  • Created “WMAE” by categorising time into four exclusive 6 hours slot starting from 0000 hrs. They are represented by 1, 2, 3 and 4 respectively.
  • Created columns such as “SE_E1”, “SE_E2”, “SE_E3”, “SE_E4”, “SE_D1”, “SE_D2”, “SE_D3” and “SE_D4” to indicate the frequencies of sessions taking place under the specific timeslot.
  • Created the columns “SE_DO” and “SE_EO” to indicate the total number of sessions taking place during weekdays and weekends.
  • Created “SE_TOTAL” to tabulate the sum of user sessions. The purpose of these transformation is to allow for an easier exploration on whether when the users utilise the portal has an impact on lead generation.

Listing:

  • Created “DaysOfWeek” with Sunday to Saturday being labelled from 1 to 7 respectively.
  • Created “Listing_Weekdays” and “Listing_Weekends” columns based on the value from “DaysOfWeek”.
  • Created Hours to convert the time into their respective hours.
  • Created WMAE by categorising time into four exclusive 6 hours slot starting from 0000 hrs. They are represented by 1, 2, 3 and 4 respectively.
  • Created columns such as “LI_E1”, “LI_E2”, “LI_E3”, “LI_E4”, “LI_D1”, “LI_D2”, ”LI_D3” and “LI_D4” to indicate the frequencies of user sessions taking place during each respective time slot.
  • Created the columns “LI_DO” and “LI_EO” to indicate the total number of sessions taking place during weekdays and weekends.
  • Created “LI_TOTAL” to tabulate the total sum of listings.
  • Created two binary variables - Sync and Organic to replace the original variable represented with strings.

Enquiries and Cobroke:

  • Enquiries and Cobroke are treated as the dependant variables.
  • These two variables were broken down in a similar manner to observe for any insights on the usage behaviours of home buyers.
  • These variables were binned to represent the days they were created with “DaysofWeek”. They are then sorted into the respective bins which reflect if the day is a weekday or weekend. The bins are “CO_DO”, “CO_EO”, “EN_DO” and “EN_EO”.
  • Hours were binned into 6 hours slot – beginning from 0000 hrs. They are represented with the numbers 1 (0000hrs to 0600hrs), 2 (0600hrs to 1200hrs), 3 (1200hrs to 1800hrs) and 4 (1800hrs to 0000hrs). Created columns such as “EN_E1”, “EN_E2”, “EN_E3”, “EN_E4”, “EN_D1”, “EN_D2”, “EN_D3” and “EN_D4” to indicate the frequencies of user sessions taking place during each respective time slot. This treatment was not given to Cobroke because the team focuses on the leads acquire from primary home buyers.
  • Created the columns “CO_DO”, “CO_EO”, “EN_DO” and “EN_EO” to indicate the total number of enquiries and cobroke request received during weekdays and weekends.
  • Created “EN_TOTAL” and “CO_TOTAL” by summing “DO” and “EO” variables to reflect the sum of enquiries and cobroke requests each user receives.

For Construction of Dashboard

As mentioned in the earlier section, the purpose of the dashboard is meant for internal communication within REO. Therefore, it is essential that key trends on agents’ behaviour can be observed at a glance.

For the purpose of co-creating with the project sponsor, the team had decided to prepare a draft for the interim presentation. This is to ensure that the team can have a firmer grasp of the sponsor’s requirement. Most importantly, this is to eliminate any possibility of disagreement during final delivery.

The features included on the draft version of the dashboard are:

  • Trend analysis: to observe the movement of trends and identify periods with spikes/dips in activity.
  • Patterns identification: to observe for any user pattern based on hours/days.
  • User analysis: to understand the makeup of users at a quick glance with metrics such as membership length and proportion of paid/free users.

The team expects to make amendments to the available features in future iterations.

Steps and Rationale

As the team was unfamiliar with Google Data Studio, they experienced many difficulties from working with the tool. They are the inability to upload certain data files with constraints such as format (requires UTF-8 encoded file) and size. Additionally, the team must relearn the features and functions of the software to achieve the desired visualisation that best represent the data. Furthermore, the process of validating the accuracy of the results aggregated by the software was a tedious job. Any discrepancy was an indication that the uploaded merged data file was not tabulated correctly. There were several iterations being made before the data was finally able to be used effectively by the software.

The team could not simply combine all the data together in its rawest form onto the Data Studio all together because the combined file was too large with more than 2 million rows. The team then decided to reduce the file size while ensuring that the uploaded data are sufficient to construct the required visualisations.

The team first decided to merge the data in terms of date and hours to observe the trends of the activities conducted by users:

Sessions, Listings, Enquiries, Cobroke:

  • Created a “Date” variable to only capture the date portion from the “created_date”’s format of <dd:mm:yyyy hh:mm:ss>.
  • Each data files are tabulated with columns “Date”, “Hours”, “# Sum of variables”.
  • Starting with the data table tabulated from the Sessions file, files are being merged by matching columns “Data” and “Hours”.
  • For missing values found in some of the rows were all recode as value 0. As there were no activity, the user log did not contain any value, thus, it may appear to be missing values.
  • The final merged file [dashboard1.0] contains columns “Date”, “Hours”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”.

The above data file does not capture any users’ characteristics such as their subscription plan or their membership length with REO due to the inability to upload the full dataset on Google Data Studio. Therefore, the team explored alternative ways to represent all the required data in a single sheet. The sheet captures the activity level of each user on separate dates. At a single glance, one can tell when a user is active and the total activity count for that specific date.

Users:

  • Created “Mem_Bin” variable which bin the “Members_Days” by years (<1 year, <2 years, <3 years, >3 years), to study the difference in the activity level by users of different membership length.

Sessions, Listings, Enquiries, Cobroke:

  • Each data files are tabulated with columns “user_id”, “Date”, “#Sum of variable”
  • For Users, Subscription, and Sessions files, they were merged if the “user_id” are present on all 3 sheets. Users lacking data in these areas were eliminated for further analysis as the team is not able to obtain a complete view of such users. For example, a user without the subscription variable cannot be included under paid nor free users.
  • For missing values found in activities rows except sessions were all recode as value 0, as the missing values represented no activity being made by the particular user at that date.

The final merged file [dashboard5.0] contains columns “user_id”, “Paid”, “Mem_bin”, “Member_Days”, “Date”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”.

While the data file [dashboard 5.0] were accurately merged and represented on Google Data Studio, the aggregated average score of each activity for different profile of the users are inaccurately represented. For example, having multiple and uneven amount of entries belonging to the same user had caused the dashboard to wrongly compute statistics such as proportion of free to paid users. This was due to the nature of the data which has “users_id” duplicated for each “Date” where activity happened. Therefore, a tabulated file was made to remove the “Date” information and collate the activity level by each user over the same period.

The new tabulated file [dashboard5.2] contains columns “user_id”, “Paid”, “Mem_bin”, “Member_Days”, “Sessions”, “Organic”, “Synced”, “Listing”, “Enquiries”, “Cobroke”.

With this new additional data set, average scores of each activity per user can be tabulated correctly. Furthermore, average scores of each activity per user can be broken down to observe the difference by different subscription plans and the length of membership with REO.