ANLY482 AY2016-17 T2 Group10 Project Overview: Methodology

From Analytics Practicum
Revision as of 00:53, 16 January 2017 by Jxsim.2013 (talk | contribs)


Data Collection

The data provided by GSK are mainly flat files (Excel). Each file contains one or more sheets with many columns, so the data are high in dimensionality. Metadata is not yet available, but the column headers and our conversation with the sponsor give us an idea of which columns will be most relevant to us. These include sales information, the competency and results of sales staff, and data on the methods the salespeople use. GSK has promised to provide these data. To discover potential insights through spatial clustering analysis of sales territories, we also intend to collect spatial data on the vertical industries GSK serves: hospitals, clinics and retail pharmacies. These data can be collected from Singapore’s public data website, Data.gov.sg, in SHP or KML formats.


Data Preparation

The data preparation stage (or data wrangling, sometimes described as data preparation taken to the next level) will employ ETL (Extract, Transform, Load) techniques to build an Analytics Sandbox for further exploratory analysis. To better support later analysis, we will run the ETL process and exploratory data analysis cyclically: if the exploratory results are not satisfactory, we will go back and revise the ETL steps. The entire data preparation process will be carried out in JMP Pro 13, which we will use in place of SAS Enterprise Guide and SAS Enterprise Miner and which offers the descriptive and predictive modelling capabilities our team requires.

Data Cleaning & Transformation

The next step is to clean the data. We will explore the data iteratively to identify anomalous patterns, which we can then correct or eliminate. For example, many different versions of a record may all refer to the same thing: “GSK”, “GlaxoSmithKline” and “GlaxoSmithKline plc” all refer to the same entity. Techniques such as fuzzy matching and if-else rules can be used to standardize such values. Missing values will also be handled at this stage. The exact approach will be decided once we have examined the data, based on factors such as which data are missing and in what proportion. We can omit rows with missing data from our analysis, or interpolate or impute the missing values with estimates. New derived variables (data columns) can also be created to aid understanding and make further analysis more efficient.
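The standardization and imputation ideas above can be sketched in a few lines. The project itself plans to use JMP Pro, so this Python snippet is only an illustration: the canonical name list, the alias table, the 0.8 similarity threshold and both function names are assumptions made for the example, not part of the team's actual workflow.

```python
from difflib import SequenceMatcher

# Hypothetical canonical names; the real list would come from the sponsor's data.
CANONICAL = ["GlaxoSmithKline", "Pfizer"]
# Explicit if-else style rules for abbreviations that fuzzy matching cannot catch.
ALIASES = {"gsk": "GlaxoSmithKline"}


def standardize(name, threshold=0.8):
    """Map a raw name to a canonical one, or return it unchanged."""
    key = name.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    best, best_score = None, 0.0
    for candidate in CANONICAL:
        # ratio() is a similarity score between 0 and 1.
        score = SequenceMatcher(None, key, candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else name.strip()


def impute_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]
```

Mean imputation is only one of the options mentioned above; whether to impute or to drop rows would still depend on which data are missing and in what proportion.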

Data Reduction

We will also need to determine which columns to focus our analysis on. This will be done in conversation with our sponsor as we seek to understand the data. Once we understand the metadata, we will be able to extract the sales and other relevant data and begin exploratory data analysis. We select only a portion of the data because the high dimensionality would strain computer hardware and slow the analysis, and because much of the data falls outside the scope of our project. We will focus on sales methods and results. To streamline analysis and reduce runtime, we will create a separate data mart for each type of analysis we carry out.

Exploratory Data Analysis

A descriptive analytics dashboard will be created in JMP Pro to uncover patterns and anomalies. We will use scatter plots and histograms to identify trends. For example, if we find that certain teams have very few face-to-face interactions with customers, those teams may require more confidence training, or the clients they have been assigned may be less receptive to face-to-face meetings. Any assumptions we hold, whether from our own preconceived notions or passed to us by GSK, will also be tested in this phase.


Methods of Analysis

Correlations

Questions we hope to answer include what the business should invest in to achieve higher efficiency and growth, and which sales method is the most efficient. To address these, we could examine correlations between sales revenue and sales inputs. While correlation does not imply causation, it can be highly suggestive.
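As a rough illustration of the correlation step, the Pearson coefficient between a sales input and revenue could be computed as below. This is a minimal Python sketch rather than the JMP Pro workflow the team plans to use, and the function name and the idea of pairing one input column with one revenue column are assumptions for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)
```

A value near +1 or -1 for, say, client visits against revenue would flag that input as worth a closer look; it would not by itself show that the input drives revenue.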

Cluster Analysis + Machine Learning (Artificial Neural Networks)

Depending on the quality of the data and future conversations with the sponsor, we also hope to build a machine learning model for predictive analytics, for example to predict how performance would vary if an input resource were changed. We could cluster the client data and then, for each client cluster, train an artificial neural network (ANN) on the sales inputs, client characteristics and resulting revenue, and thereby predict results from sales inputs. This would give a predictive model for each type of client. After clustering, we could also compare revenue against sales input to identify the more efficient teams or methods, and recommend that GSK analyze them in future to uncover the reasons behind their efficiency and spread them as best practices through the organization.
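The cluster-then-model pipeline described above might look roughly like the following Python sketch. Two simplifications are assumed here and are not the team's stated method: clustering uses a plain k-means with centroids started at the first k points, and a least-squares line stands in for the per-cluster ANN so that the example stays short and dependency-free. All function names are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of numbers; centroids start at the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


def fit_line(xs, ys):
    """Least-squares slope and intercept; a stand-in for the per-cluster ANN."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def models_per_cluster(clients, inputs, revenues, k):
    """Cluster clients, then fit one revenue-vs-input model per cluster."""
    labels, _ = kmeans(clients, k)
    models = {}
    for c in set(labels):
        xs = [x for x, lab in zip(inputs, labels) if lab == c]
        ys = [y for y, lab in zip(revenues, labels) if lab == c]
        models[c] = fit_line(xs, ys)
    return labels, models
```

Fitting a separate model per cluster lets each client type have its own input-to-revenue relationship; replacing `fit_line` with an ANN trainer would follow the same structure.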

Survival Analysis

Survival analysis is a statistical technique for analyzing the expected duration of time until an event occurs, and it is one of the cornerstones of customer analytics. In our project, an event can be customer attrition (where existing customers switch to other companies) or inventory depletion (where the stock of certain pharmaceutical products runs out). Understanding when a customer is most likely to leave, or when inventory needs to be replenished, enables GSK to plan churn prevention efforts in advance and engage in proactive customer communication to improve sales.
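One common way to estimate such time-to-event curves is the Kaplan-Meier estimator. The page does not say which survival method the team will use, so the Python sketch below is only an assumed illustration; the function name and the input layout (one duration and one observed flag per customer) are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] for each time with at least one event.

    durations: time until the event (e.g. churn) or until last observation.
    observed:  True if the event was seen, False if the record is censored.
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # Customers still being followed just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Customers whose event happened exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
    return curve
```

Censored records (customers who had not churned when observation ended) still count in the at-risk group until their last observed time, which is what distinguishes this estimate from a simple churn rate.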