ANLY482 AY2017-18T2 Group01: Project Management


Revision as of 21:56, 14 April 2018



Project Outline

We went through the following steps to execute our project:

Project ScopeEatigo.png


Understanding the Business Problem

We first wanted to gather all information related to the booking journey of customers and the key influencing factors. To do this, we carried out the following steps:
1. Spoke with eatigo employees beyond our sponsor (the Sales Lead and Marketing Lead)
2. Interviewed eatigo customers to understand their typical priorities and their customer journey. Based on our understanding, we mapped out the customer journey as follows:

Journey final.png


Data Preparation and Cleaning

The next step involved zooming into the data sheets provided to us to understand each variable, its sufficiency and relevance in helping us solve the business problem, and then preparing the data for analysis. These were the steps in Data Preparation:

Eatigodataprep.png


The details of our work in each step have been summarized as follows:

DataCleaningEatigo.png


After this, we conducted an exploratory data analysis. However, for confidentiality reasons, we are not detailing the visualizations. If required, please feel free to contact us for more information [(Contact Details)] .

Clustering Methodology


We followed this method for our clustering:

ClusteringMethodEatigo.png



Choosing the Clustering Method:

Literature Review:

Deciding on K-Means:

Based on our understanding of the literature and in consultation with our sponsor and supervisor, we decided on K-Means clustering as the appropriate clustering technique for the following reasons:

i) K-means clustering is suitable for large data sets
ii) K-means is less affected by outliers and the presence of irrelevant variables


K-Means Clustering


How K-Means Works

The K-means algorithm is a data-partitioning method used for clustering. K stands for the number of clusters, and the algorithm identifies clusters by their means. K-means clustering works on the fundamental concept of Euclidean distance, a distance measure based on the location of points in a space. A distinction can be made here from non-Euclidean distances, which are based on properties of points in a space.

Like other clustering algorithms, the ultimate aim of K-means clustering is to group objects in such a way that they are homogeneous within a cluster and heterogeneous across clusters. It ensures this by minimizing within-cluster variation. K-means clustering is an iterative process. It starts by randomly assigning objects to a pre-specified number of clusters based on the Euclidean distance of the points from the cluster centres, or centroids. In the iterations that follow, objects are reassigned to other clusters so as to minimize the within-cluster variation: if reassigning an object reduces the within-cluster variation, that object is reallocated to that cluster.
1. Initialize:
Pre-specify the number of clusters and arbitrarily choose objects as cluster centres. Let clusters be represented by k=1,2,…,K. This is the 0th iteration i.e. h=0
2. Assign Clusters:
Assign object i to cluster k: the cluster with the closest mean based on Euclidean distance. The Euclidean distance between object i and cluster mean μ_k^h in iteration h is calculated as:

Assigning Clusters.png
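The formula shown in the image above is not reproduced here; as a reconstruction in the notation of the text (assuming x_ij denotes the value of clustering variable j for object i), the standard Euclidean distance would be:

```latex
d_{ik}^{h} = \sqrt{\sum_{j=1}^{n} \left( x_{ij} - \mu_{kj}^{h} \right)^{2}}
```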

Where j = 1, …, n represents the clustering variables. Object i is assigned to the cluster that satisfies the following:

g_i^h = argmin_k (d_ik^h)

Where argmin_k ensures that g_i^h gives the cluster k for which d_ik^h is minimized.

3. Compute Cluster Means
With the cluster assignments fixed, calculate the cluster means for each of the variables.
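The cluster-mean update in Step 3 has a standard form; as a reconstruction in the same notation, with C_k^h denoting the set of objects assigned to cluster k in iteration h:

```latex
\mu_{kj}^{h+1} = \frac{1}{\lvert C_k^h \rvert} \sum_{i \in C_k^h} x_{ij}
```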

Therefore, the new estimate of a cluster mean is the simple average of all observations assigned to it.

4. Compute SSE
The Sum of Squared Errors (SSE) measures the total within-cluster variation. The error gives the distance between each observation and its closest cluster mean.
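In the same notation, the standard SSE would be written as follows (a reconstruction; g_i^h is the cluster to which object i is assigned in iteration h):

```latex
SSE^{h} = \sum_{i} \sum_{j=1}^{n} \left( x_{ij} - \mu_{g_i^h j}^{h} \right)^{2}
```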


5. Loop
The steps from Step 2 onwards are repeated to reassign objects to clusters by comparing the Euclidean distance across clusters. Since the positions of the cluster centres keep changing with each iteration, the clustering solution keeps changing as well. This is repeated until the convergence criterion is satisfied; the criterion is usually that there is no change in cluster affiliations, i.e. the change in SSE is negligibly small.
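The five steps above can be sketched in plain Python. This is a minimal illustration under our own naming (the `kmeans` function, its arguments, and the seeded random initialization are assumptions for the sketch, not the team's actual implementation):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """K-means clustering following Steps 1-5 above.

    points: list of rows, one per object; columns are the clustering variables.
    Returns (cluster assignment per object, cluster centres, SSE).
    """
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance over the clustering variables j = 1..n.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Step 1. Initialize: arbitrarily choose k objects as cluster centres (h = 0).
    centres = [list(p) for p in rng.sample(points, k)]
    assign = None

    for _ in range(max_iter):
        # Step 2. Assign clusters: g_i = argmin_k d_ik (the closest centre).
        new_assign = [min(range(k), key=lambda c: dist2(p, centres[c]))
                      for p in points]

        # Step 5. Loop until convergence: no change in cluster affiliations.
        if new_assign == assign:
            break
        assign = new_assign

        # Step 3. Compute cluster means: each centre becomes the simple
        # average of the observations currently assigned to it.
        for c in range(k):
            members = [p for p, g in zip(points, assign) if g == c]
            if members:  # keep the old centre if a cluster has emptied out
                centres[c] = [sum(col) / len(members) for col in zip(*members)]

    # Step 4. Compute SSE: the total within-cluster variation.
    sse = sum(dist2(p, centres[g]) for p, g in zip(points, assign))
    return assign, centres, sse
```

For example, four points forming two well-separated groups, such as [[0, 0], [0, 1], [10, 10], [10, 11]], converge to two clusters whose centres are the group averages.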

Work Plan


We have prepared a work plan for our project as follows.

03.png

[Our discussions with the sponsor, professor and amongst ourselves]