ANLY482 AY2017-18T2 Group01: Project Management





Project Outline

We went through the following steps to execute our project:

Project ScopeEatigo.png


Understanding the Business Problem

We first wanted to gather all information related to the booking journey of customers and the key factors influencing it. To do this, we carried out the following steps:
1. Spoke with eatigo employees beyond our sponsor (Sales Lead and Marketing Lead)
2. Interviewed eatigo customers to understand their typical priorities and their customer journey. Based on our understanding, we mapped out the customer journey as follows:

Journey final.png


Data Preparation and Cleaning

The next step involved zooming into the data sheets provided to us, understanding each variable, its sufficiency and relevance in helping us solve the business problem, and then preparing the data for analysis. These were the steps in Data Preparation:

Eatigodataprep.png


The details of our work in each step have been summarized as follows:

DataCleaningEatigo.png


After this, we conducted an exploratory data analysis. However, for confidentiality reasons, we are not detailing the visualizations here. If required, please feel free to contact us for more information [(Contact Details)].

Clustering Methodology


We followed the method below for our clustering:

ClusteringMethodEatigo.png



Choosing the Clustering Method:

Literature Review:

Deciding on K-Means:

Based on our understanding of the literature and in consultation with our sponsor and supervisor, we decided on K-means clustering as the appropriate clustering technique for the following reasons:
i) K-means clustering is suitable for large data sets
ii) K-means is less affected by outliers and by the presence of irrelevant variables


K-Means Clustering


How K-Means Works

The K-means algorithm is a data partitioning method used for clustering. K stands for the number of clusters, and the algorithm identifies clusters by their means. K-means clustering works on the fundamental concept of Euclidean distance, a distance measure based on the location of points in a space. A distinction can be made here from non-Euclidean distances, which are based on properties of points in a space. Like other clustering algorithms, the ultimate aim of K-means clustering is to group objects in such a way that they are homogeneous within a cluster and heterogeneous across clusters. It ensures this by minimizing within-cluster variation.

K-means clustering is an iterative process. It starts by randomly assigning objects to a pre-specified number of clusters based on the Euclidean distance of the points from the cluster centres, or centroids. In the iterations that follow, objects are reassigned to other clusters so as to minimize the within-cluster variation: if reassigning an object reduces the within-cluster variation, that object is reallocated to that cluster.
1. Initialize:
Pre-specify the number of clusters and arbitrarily choose K objects as initial cluster centres. Let clusters be represented by k = 1, 2, …, K. This is the 0th iteration, i.e. h = 0.

2. Assign Clusters:
Assign object i to cluster k, the cluster with the closest mean based on Euclidean distance. The Euclidean distance between object i and the cluster mean μ_k^h in iteration h is calculated as:

Assigning Clusters.png
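Since the formula sits in the image above, a plausible LaTeX reconstruction from the surrounding definitions is:

<math>d\left(x_i, \mu_k^{h}\right) = \sqrt{\sum_{j=1}^{n} \left(x_{ij} - \mu_{kj}^{h}\right)^{2}}</math>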

where j = 1, …, n indexes the clustering variables. Object i is assigned to the cluster that satisfies the following:

Gi.png
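A plausible reconstruction of the assignment rule shown above, using the distance just defined:

<math>g_i^{h} = \underset{k \in \{1, \dots, K\}}{\arg\min}\; d\left(x_i, \mu_k^{h}\right)</math>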


The argument ensures that the Euclidean distance from the object i to the group g is minimized.

3. Compute Cluster Means:
With the cluster assignments fixed, calculate the cluster means for each of the variables:

MeanDistance.png
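A plausible reconstruction of the update shown above, where <math>n_k^{h}</math> denotes the number of objects assigned to cluster k at iteration h:

<math>\mu_{kj}^{h+1} = \frac{1}{n_k^{h}} \sum_{i \,:\, g_i^{h} = k} x_{ij}</math>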


Therefore, the new estimate of the cluster mean is simply the average of all observations assigned to it.

4. Compute SSE:
The Sum of Squared Errors (SSE) measures the total within-cluster variation. The error is the distance between each observation and its closest cluster mean.

SSE.png
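A plausible reconstruction of the SSE shown above, summing squared deviations over all clusters, their members and the clustering variables:

<math>SSE = \sum_{k=1}^{K} \sum_{i \,:\, g_i = k} \sum_{j=1}^{n} \left(x_{ij} - \mu_{kj}\right)^{2}</math>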


5. Loop:
The steps from Step 2 onwards are repeated to reassign objects to clusters by comparing the Euclidean distances across clusters. Since the cluster centres' positions change with each iteration, the clustering solution changes as well. This continues until the convergence criterion is satisfied, which is usually that there is no change in cluster affiliations, i.e. the change in SSE is small.
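To make Steps 1 to 5 concrete, below is a minimal NumPy sketch of the algorithm as described above. This is an illustration only, not the JMP Pro implementation we used; it also assumes no cluster becomes empty during the iterations:

<syntaxhighlight lang="python">
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means following Steps 1 to 5; X has one row per object."""
    rng = np.random.default_rng(seed)
    # Step 1 (Initialize): arbitrarily choose K objects as cluster centres
    centres = X[rng.choice(len(X), size=K, replace=False)]
    prev_sse = np.inf
    for _ in range(max_iter):
        # Step 2 (Assign): each object joins the cluster with the closest
        # mean by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (Means): new centre = simple average of assigned objects
        centres = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 4 (SSE): total within-cluster variation
        sse = ((X - centres[labels]) ** 2).sum()
        # Step 5 (Loop): converge once the change in SSE is small
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return labels, centres, sse
</syntaxhighlight>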

Data Preparation for K-Means
1. Integrating the Data Sets
To decide which variables would be appropriate for clustering, we relied on data availability and intuition. We used the integrated data sheet to extract the 136,177 users and the variables that would be most relevant to our clustering. JMP Pro was then used to create a clustering data sheet: first, separate tables were created for each variable by grouping the data by User ID, and then the tables were joined together by matching User ID. This sheet was further filtered to exclude the following set of users:

Users Excluded.png


After filtering, the clustering sheet consists of 78,483 user records, with each row representing a unique user and each column corresponding to a variable that captures users' characteristics.
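As an illustration of the grouping-and-joining logic, here is a pandas sketch of the same steps. The actual work was done in JMP Pro, and the file, column and exclusion details here are assumptions:

<syntaxhighlight lang="python">
import pandas as pd

# Hypothetical booking-level sheet: one row per booking
bookings = pd.read_csv("bookings.csv", parse_dates=["booking_date"])

# Create a separate table per variable by grouping on User ID ...
freq = bookings.groupby("user_id").size().rename("n_bookings").reset_index()
last = (bookings.groupby("user_id")["booking_date"].max()
                .rename("last_booking").reset_index())

# ... then join the tables together by matching User ID
users = freq.merge(last, on="user_id")

# Exclude the user groups listed in the table above
excluded_ids = set()  # to be populated from the exclusion criteria
clustering_sheet = users[~users["user_id"].isin(excluded_ids)]
</syntaxhighlight>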

2. Preparing Variables
Step 1: Variable Creation
New variables: To segment customers along the Recency, Frequency and Monetary (RFM) dimensions, we created appropriate variables using the data available to us. We also created variables which allowed us to classify users based on their variety-seeking behaviour when it comes to restaurants and cuisines. These are shown below:

New Variables.png
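Continuing the pandas sketch above, RFM and variety-seeking variables of this kind can be derived from booking-level data as follows. The column names are hypothetical; our actual variables are the ones listed in the figure above:

<syntaxhighlight lang="python">
# Assumes bookings has user_id, booking_id, booking_date, bill_amount,
# restaurant_id and cuisine columns
snapshot = bookings["booking_date"].max()

rfm = bookings.groupby("user_id").agg(
    recency_days=("booking_date", lambda d: (snapshot - d.max()).days),
    frequency=("booking_id", "count"),
    monetary=("bill_amount", "sum"),
)

# Variety seeking: distinct restaurants and cuisines tried per user
variety = bookings.groupby("user_id").agg(
    n_restaurants=("restaurant_id", "nunique"),
    n_cuisines=("cuisine", "nunique"),
)
</syntaxhighlight>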


Step 2: Converting Categorical Variables into Continuous
Most of the variables in our dataset are categorical, such as Booking Discount (10-15, 20-25, 30-35, 40-45, 50) and Restaurant Tier (Tier 1, Tier 2, Tier 3, Tier 4, Tier 5). As discussed earlier, K-means uses cluster means to calculate the Euclidean distance. Since the mean is not a valid measure of central tendency for categorical variables, the K-means algorithm cannot be applied to categorical variables in their raw form. Therefore, we first converted categorical variables into user-centric continuous variables by splitting each variable into its different levels and then calculating them as proportions (by dividing by the total number of bookings) to make the variables more meaningful for clustering. For some of the categorical variables, such as Tier and Meal Time, one level dominated the others, and therefore the proportions were binned to ensure that the variables would sufficiently differentiate the segments. [5] An example of how we treated categorical variables is shown below:

Categorical Variables.png
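The split-then-proportion treatment can be sketched in pandas as follows, using Restaurant Tier as the example (column names are assumptions, continuing the earlier sketch):

<syntaxhighlight lang="python">
# Count each user's bookings per tier level ...
tier_counts = pd.crosstab(bookings["user_id"], bookings["tier"])

# ... then divide by the user's total bookings to obtain proportions,
# e.g. columns Tier 1 to Tier 5, with each row summing to 1
tier_props = tier_counts.div(tier_counts.sum(axis=1), axis=0)
</syntaxhighlight>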


Step 3: Tests for Collinearity
A high correlation amongst clustering variables can lead to misrepresentation of clusters, since the variables are not sufficiently unique to identify different clusters. If variables are highly correlated (~90%), aspects of these variables will be overrepresented, as the clustering technique will not be able to differentiate between them. To avoid perfect multicollinearity, we excluded one level of each categorical variable; for example, with booking day split into weekday and weekend, only weekday was used as an input variable (weekend being its complement). We then ran a MANOVA (Multivariate Analysis of Variance) test in JMP Pro to check for correlation amongst our variables and obtained the following results:

MANOVA Table.png


From the output we see that no two variables exhibit a very high correlation, and they are therefore suitable for clustering in this respect.
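For readers who prefer code, an equivalent pairwise check can be sketched in pandas. We used a MANOVA in JMP Pro rather than this; the ~0.9 threshold follows the text above, and clustering_sheet is the hypothetical table from the earlier sketch:

<syntaxhighlight lang="python">
# Absolute pairwise correlations among the numeric clustering variables
corr = clustering_sheet.select_dtypes("number").corr().abs()

# List any pair above the ~0.9 threshold discussed above
cols = corr.columns
flagged = [(a, b, round(corr.loc[a, b], 2))
           for i, a in enumerate(cols) for b in cols[i + 1:]
           if corr.loc[a, b] > 0.9]
print(flagged)  # an empty list means no pair is problematic
</syntaxhighlight>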

Step 4: Checking for Skewness and Variables Transformation
As the next step, we mapped out the variable distributions to check for outliers and skewness. In these distributions, the X axis represents the range of the variable and the height of the bars represents the number of users.

SkewandTransform.PNG


We confirmed the outliers with our sponsor and did not remove them, so as not to misrepresent user behaviour. As for skewness, most of our variables were right-skewed, or positively skewed, i.e. the mean was greater than the median. Since it is important to remove skewness and normalize the data before inputting it into the K-means algorithm, we used the Johnson transformation to transform our variables' distributions to normal distributions. The Johnson system of transformations has the general form of:
[[File:Johnson Su.png|centre|200px]]
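Since the exact form sits in the image above, here is a reconstruction: the Johnson system maps a variable x to an approximately standard normal z via z = γ + δ·g((x − ξ)/λ), where the function g determines the family. For the unbounded S_U family, g is the inverse hyperbolic sine:

<math>z = \gamma + \delta \,\sinh^{-1}\!\left(\frac{x - \xi}{\lambda}\right)</math>

Below is a minimal Python sketch of this transformation using SciPy's Johnson S_U distribution. This is an illustration only, not our actual workflow (we applied the transformation in JMP Pro), and the variable name is hypothetical:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

x = clustering_sheet["frequency"].to_numpy()  # hypothetical variable column

# Fit Johnson S_U parameters (a ~ gamma, b ~ delta, loc ~ xi, scale ~ lambda)
a, b, loc, scale = stats.johnsonsu.fit(x)

# Map to an approximately standard normal score
z = a + b * np.arcsinh((x - loc) / scale)
</syntaxhighlight>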

Work Plan


We have prepared a work plan for our project as follows.

03.png

Our discussions with the sponsor, professor and amongst ourselves