REO Project Proposal

From Analytics Practicum
Revision as of 22:38, 15 April 2018 by Gtong.2014

Project Motivation

Although REO is well placed to collect usage data from both commercial users and house seekers, it does not fully utilise the large amount of data it has collected. This is due to the lack of data cleaning and preparation for further analysis. The problem is compounded by the fact that REO has been operating since 2014, which means it has four years’ worth of unexplored data. Meanwhile, REO faces strong competition both from similar sites with greater scale and from smaller sites with a strong niche, e.g. those specialising in new condominiums.
Recent news articles have explored how the property market is going to get ‘hot’ due to an increase in demand. According to URA’s Real Estate Information System (REALIS), the residential property price index has increased for the first time since the fourth quarter of 2013, and the planned supply of property four years from now is expected to increase as well. Given the increase in both demand and supply, REO needs to capitalise on the growing market for its own growth.

Project Objectives

Business Objectives
As REO’s revenue model relies on subscription fees and the premium features used, this project focuses on enhancing engagement with commercial users by identifying user segments and developing segment targeting strategies. With better engagement, REO hopes to reduce its attrition rate and grow its potential user base.

Technical Objectives
To use data analytics tools and statistical methods to study the data and derive insights that may help to achieve the business objectives. Specifically:

  • To understand the data domains
  • To understand the usage rate and activities performed on the portals by the commercial users
  • To identify patterns or trends in the commercial users’ behaviour
  • To identify customer segments through clustering analysis

Project Data

REO provided six separate datasets: Users, Subscription, Sessions, Enquiries, Cobroke and Listings.
Data Dictionary
Metadata.png

Project Methodology

Data Preparation
The data currently sit in six distinct sheets. The variables in each sheet will be matched on the unique customer ID and combined into a single spreadsheet.
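The matching step can be sketched as follows. This is a minimal illustration, not the team's actual procedure: the column name `user_id`, the sheet names and the sample rows are assumptions made for the example, and each activity sheet is reduced to a simple per-user record count.

```python
from collections import defaultdict

def combine_by_user(users, activity_sheets):
    """Build one row per user with a record count from each activity sheet.

    users: list of dicts, each with a "user_id" key (hypothetical column name).
    activity_sheets: dict mapping a sheet name to a list of dicts with "user_id".
    """
    # Count how many records each user has in each sheet.
    counts = {name: defaultdict(int) for name in activity_sheets}
    for name, rows in activity_sheets.items():
        for row in rows:
            counts[name][row["user_id"]] += 1

    combined = []
    for user in users:
        row = dict(user)
        for name in activity_sheets:
            # Users with no records in a sheet get 0 rather than a missing value.
            row[f"n_{name}"] = counts[name].get(user["user_id"], 0)
        combined.append(row)
    return combined

# Illustrative rows only; the real sheets have many more columns.
users = [{"user_id": 1, "subscribed": True}, {"user_id": 2, "subscribed": False}]
sheets = {
    "sessions": [{"user_id": 1}, {"user_id": 1}, {"user_id": 2}],
    "enquiries": [{"user_id": 1}],
}
table = combine_by_user(users, sheets)
```

Keeping every user in the output, even those with no activity, avoids silently dropping inactive accounts, which may matter later when comparing subscribers with non-subscribers.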

Data Transformation
The descriptive statistics suggest that the variables are right-skewed, as their arithmetic means are larger than their medians. Depending on the analysis requirements, the team may trim the outliers and/or standardise the values.
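A small sketch of the skew check and the two treatments mentioned above. The choice of the 1.5 × IQR upper fence for trimming and of z-scores for standardising is an assumption for illustration; the proposal does not fix either rule, and the sample values are made up.

```python
import statistics

def is_right_skewed(values):
    """Rough check used in the text: the mean exceeds the median."""
    return statistics.mean(values) > statistics.median(values)

def trim_upper_outliers(values):
    """Drop values above Q3 + 1.5 * IQR (an assumed trimming rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return [v for v in values if v <= upper_fence]

def standardise(values):
    """Convert values to z-scores using the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Made-up session counts with one extreme user.
session_counts = [1, 2, 2, 3, 3, 4, 50]
skewed = is_right_skewed(session_counts)
trimmed = trim_upper_outliers(session_counts)
z_scores = standardise(trimmed)
```

Standardising after trimming keeps a single extreme user from inflating the standard deviation and compressing everyone else's scores.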

Data Exploration
Distribution analysis will be conducted on all the data provided. An independent-samples t-test, or an equivalent test, will be used to check for differences between REO’s two groups of commercial users: subscribers (paid users) and non-subscribers (free users).
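As an illustration of the group comparison, the sketch below computes the test statistic and degrees of freedom for Welch's unequal-variance t-test, one common variant of the independent t-test. Choosing Welch's version is an assumption, as are the sample values; turning the statistic into a p-value would need a t-distribution, which a statistics package provides.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 denominator) divided by the sample sizes.
    se_a = statistics.variance(sample_a) / len(sample_a)
    se_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Made-up weekly session counts for the two user groups.
paid_users = [12, 15, 11, 14, 13]
free_users = [5, 7, 6, 4, 8]
t_stat, dof = welch_t(paid_users, free_users)
```

Because the variables are right-skewed, a rank-based test may be a more suitable "equivalent test" than a t-test on the raw values.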

Clustering Analysis
Cluster analysis groups observations into distinct, relatively homogeneous clusters based on several measurements. The resulting clusters can be interpreted as market segments and used to shape segment targeting strategies. Clustering algorithms fall into two broad types: hierarchical and non-hierarchical. Given the size of the dataset, non-hierarchical methods are more suitable. The most widely used non-hierarchical method is k-means clustering, an iterative algorithm that partitions n observations into k clusters so that each observation belongs to the cluster with the nearest mean. The clustering variables should not be highly correlated with one another; otherwise the aspects they share will be overrepresented in the cluster solution. Each clustering variable should also have a range similar to the others and should not be heavily skewed, so that no single variable dominates the distance calculation and the clusters are homogeneous within and heterogeneous across. A reasonable number of clusters usually cannot be determined visually, so the choice depends on the nature of the data and the results required.
The number of clusters can be chosen using the Cubic Clustering Criterion (CCC), computed through Ward’s minimum variance method. Larger positive CCC values indicate a better solution, as they show a larger departure from a uniform distribution. However, extreme outliers and skewed distributions can make the relationship between CCC and the number of clusters erratic, and the problem is worse if CCC is negative at the candidate number of clusters. As such, for datasets with several highly skewed variables, k-means clustering may not be the best method for cluster analysis.
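The iterative assign-and-update loop of k-means can be sketched as below. This is a bare-bones version of the standard algorithm for illustration only; a statistical package would normally be used in practice. The random initialisation, the fixed seed and the sample points are assumptions, and the inputs are taken to be already standardised.

```python
import math
import random

def k_means(points, k, iterations=100, seed=0):
    """Partition points (tuples of floats) into k clusters."""
    rng = random.Random(seed)
    # Start from k observations chosen at random as the initial means.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each observation joins the cluster with the nearest mean.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: recompute each mean from its current members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                # Keep the old mean if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Means stopped moving, so the assignment is stable.
        centroids = new_centroids
    return labels, centroids

# Made-up two-variable observations forming two obvious groups.
observations = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = k_means(observations, k=2)
```

Because every assignment depends on Euclidean distance to a mean, a variable with a much larger range or a long right tail will dominate the result, which is why similar ranges and low skew matter.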
Unlike k-means clustering, Latent Class Analysis (LCA) is a statistical method for finding subtypes of related cases (latent classes) from multivariate categorical data. The results of LCA can be used to classify cases into their most likely latent classes. LCA is commonly used in social science fields such as psychology and sociology, as well as in health research and marketing research. It offers several advantages over other clustering approaches such as k-means. First, it assigns each data point a probability of membership in each cluster instead of relying on distances to cluster means, which can be biased. Second, it provides diagnostic information, such as common statistics, the Bayesian Information Criterion (BIC) and p-values, to help determine the number of clusters and the significance of the variables’ effects. BIC is preferred over the Akaike Information Criterion (AIC) here because BIC is more useful for selecting the correct model, while AIC is more appropriate for finding the best model for predicting observations.
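To make the BIC and AIC comparison concrete, the sketch below applies the standard definitions (AIC = 2k − 2 ln L, BIC = k ln n − 2 ln L, lower is better) to a set of candidate models. The log-likelihoods, parameter counts and sample size are hypothetical numbers invented for the example, not results from the REO data.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fitted LCA models: (number of classes, log-likelihood, free parameters).
candidates = [(2, -5210.0, 21), (3, -5100.0, 32), (4, -5085.0, 43)]
n_obs = 1000  # hypothetical number of users

best_by_bic = min(candidates, key=lambda m: bic(m[1], m[2], n_obs))
best_by_aic = min(candidates, key=lambda m: aic(m[1], m[2]))
```

With these made-up inputs BIC selects the three-class model while AIC selects the four-class one, because BIC's penalty per parameter grows with the sample size.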

Scope of Work

Our dataset contains records only for 2017 Q3 and Q4, which limits any time-series analysis.
The dataset only contains the behaviour of commercial users on the portal.

Phase 0: Context learning

  • Understanding the business model of REO, industry and competitors
  • Verifying facts/assumptions
  • Mapping out the user’s journey through the portal
  • Defining what constitutes a successful ‘conversion’ for the portal
  • Understanding the variables of the dataset

Phase 1: Data Preparation

  • Refer to “Project Methodology”

Phase 2: Data Exploration

  • Studying the distribution of the variables
  • Identifying and treating outliers
  • Comparing data among different groups
  • Classifying variable data for further analysis

Phase 3: Data Analysis

  • Conduct data segmentation using Clustering Analysis
  • Use of K-Means and Latent Class Analysis