ANLY482 AY2016-17 T2 Group10 Project Overview: Methodology

From Analytics Practicum
</div>
 
</div>
 
<!------- End of Secondary Navigation Bar---->
 
<!-- Body -->
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; margin-bottom:5px"><font color= #000000><strong>Tools Used</strong></font></div>==

<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family:Eras ITC, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">

[[File:Jmppro.png|200px|center]]

SAS JMP Pro 13 was chosen as our primary tool for data preparation, exploratory analysis and further analysis. It is analytical software that can perform most statistical analyses on large datasets and generate results with interactive visualizations, which data scientists use to manipulate data selections on the go. Furthermore, tutorials and guides are widely available online for us to learn JMP Pro 13’s different techniques and functions.<br/>

More importantly, its easy-to-use built-in tools enable us to conduct analysis of variance to determine the relationship between interactions and sales revenue.

</div>
  
 
<!-- Body -->
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; margin-bottom:5px"><font color= #000000><strong>Data Preparation</strong></font></div>==
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">

Data preparation took us through time-consuming and tedious procedures to obtain high-quality data. Though seemingly unrewarding, a set of high-quality data allows for more accurate, reliable and consistent analysis results. It is therefore imperative to invest substantial time and effort in this step to avoid drawing false conclusions for our hypothesis.
Firstly, using SAS JMP Pro, the data tables are scanned for anomalies such as missing values or outliers. We corrected them appropriately, using imputation for missing values and omission for extreme outliers. This step ensures that the data we are using will not give us misleading insights.
<br/><br/>
Next, to determine the causal relationship between interactions and sales revenue, we need to join the Call Details and Invoice Details tables. This requires us to understand the variables in both tables to find out 1) which variables match and whether their formats are alike, and 2) what granularity each row in each table represents. Upon fully understanding them, we performed aggregations, standardized formats, and included the HCP and HCO tables to serve as links. Furthermore, we added the dimension of employees’ teams from the Employee table, as it is useful in describing the relationship mentioned. Finally, we integrated the relevant tables into a consolidated table and loaded it into the JMP server for analysis.
<br/><br/>
<div style="border-radius:5px; display: inline-block; overflow: hidden; border: 1px solid black">[[File:Mattfig1.png|500px]]</div>
<br/>
The diagram above illustrates how the data tables are integrated, which took several stages to achieve.
<br/>
</div>
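As an aside, the overall flow just described (aggregate each table to a common granularity, then join on the matching keys) can be sketched outside JMP. Below is a minimal Python/pandas sketch; the table and column names are illustrative stand-ins, not the actual GSK schema.

```python
import pandas as pd

# Minimal stand-ins for the Invoice Details and Call Details tables.
invoice = pd.DataFrame({
    "Rep Name": ["Tan", "Tan", "Lim"],
    "Clinic": ["Alpha", "Alpha", "Beta"],
    "Year Quarter": ["2016Q1", "2016Q1", "2016Q1"],
    "Amount$": [100.0, 50.0, 80.0],
})
calls = pd.DataFrame({
    "Rep Name": ["Tan", "Tan", "Tan"],
    "Clinic": ["Alpha", "Alpha", "Alpha"],
    "Year Quarter": ["2016Q1", "2016Q1", "2016Q1"],
})

# Aggregate each table to the same granularity (rep x clinic x quarter)...
sales = invoice.groupby(["Rep Name", "Clinic", "Year Quarter"], as_index=False)["Amount$"].sum()
interactions = (calls.groupby(["Rep Name", "Clinic", "Year Quarter"])
                     .size().reset_index(name="Interaction Count"))

# ...then inner join them so each row pairs quarterly sales with an interaction count.
consolidated = sales.merge(interactions, on=["Rep Name", "Clinic", "Year Quarter"], how="inner")
print(consolidated)
```

Note that the inner join drops combinations present in only one table (Lim/Beta has sales but no recorded calls), which is exactly the behaviour we rely on in the integration stage.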
===<div style="background: #1b96fe;padding:0.6em; letter-spacing:0.1em; line-height: 0.7em; border-radius:20px; font-size:15px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; display: inline-block; margin-bottom:10px"><font color= #fff><strong>Data Cleaning</strong></font></div>===
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">

The first stage of data preparation involves cleaning the Invoice Details and Call Details tables.
<br/>
# For missing values under the “Price$” column in Invoice Details, we imputed the value “0”, as these rows are records where free samples were given out to customers.
#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig2.png]]</div>
#* Steps Taken:
#*# Press Ctrl-F to show the “search data tables” interface
#*#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig3.png]]</div>
#*# Enter the following fields:
#*#* Find what: *
#*#* Replace with: 0
#*#* Tick “Match entire cell value”
#*#* Tick “Restrict to selected column” of “Price$”
#*#* Tick “Search data”
#*# Click on “Replace All” to apply the change
# For negative “sales qty” and “amount$” values in Invoice Details, we took no action, as these rows serve as records that void sales that have been cancelled. Upon aggregation by quarters, no negative values will remain.
# For 5-digit postal codes in Invoice Details, we created a new column to store the converted postal code, with the data type changed from numerical to categorical and a formula used to add the missing leading ‘0’.
#* Steps Taken:
#*# Right click on the Postal Code header -> Select “Column Info” -> Change the data type to “Character”
#*# Right click on the Postal Code header -> Select “Insert Columns”
#*# Right click on the new column -> Select “Formula” -> Enter the formula as shown below
#*#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig4.png]]</div>
#*# Click on “OK” to apply the change
# To facilitate understanding the quarterly performance of sales revenue in Invoice Details and Call Details, we added a “year-quarter” column for use during aggregation.
#* Steps Taken:
#*# Right click on “Invoice Date” in Invoice Details / “Date” in Call Details
#*# Select “New Formula Column” -> “Date Time” -> “Year Quarter” to create the new column
# For differences in “Product Name” between Invoice Details and Call Details, we standardized a common format using the Recode function. (No screenshots due to the confidentiality clause)
#* Steps Taken:
#*# Click on “Product Name” in Invoice Details / “Product” in Call Details
#*# Go to “Cols” at the top -> Select “Recode”
#*# At the “Recode” interface, input new values in the standardized common format
#*# Click on “Done” and “In Place” to replace the old values

</div>
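For readers who prefer a scripted equivalent, the four cleaning steps above can be sketched in Python/pandas. This is a minimal illustration with made-up values; the column names and product labels are stand-ins, not the confidential GSK data.

```python
import pandas as pd

# Illustrative invoice rows; the real column names in the GSK tables may differ.
invoice = pd.DataFrame({
    "Price$": [12.5, None, 30.0],            # None marks a free sample
    "Postal Code": [238823, 69120, 408600],  # 69120 has lost its leading zero
    "Invoice Date": pd.to_datetime(["2016-01-15", "2016-05-02", "2016-11-30"]),
    "Product Name": ["productA", "Product A 625", "PRODUCTA"],
})

# 1. Impute missing Price$ with 0 (rows recording free samples).
invoice["Price$"] = invoice["Price$"].fillna(0)

# 2. Convert postal codes to 6-character strings, restoring the leading zero.
invoice["Postal Code"] = invoice["Postal Code"].astype(str).str.zfill(6)

# 3. Derive a year-quarter column for later aggregation.
invoice["Year Quarter"] = invoice["Invoice Date"].dt.to_period("Q").astype(str)

# 4. Recode product-name variants into one standardized label.
recode_map = {"productA": "Product A", "Product A 625": "Product A", "PRODUCTA": "Product A"}
invoice["Product Name"] = invoice["Product Name"].replace(recode_map)

print(invoice)
```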
 
<!-- End Body --->
  
===<div style="background: #1b96fe;padding:0.6em; letter-spacing:0.1em; line-height: 0.7em; border-radius:20px; font-size:15px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; display: inline-block; margin-bottom:10px"><font color= #fff><strong>Adding Dimensionalities</strong></font></div>===

<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">

The second stage of data preparation involves adding dimensions to the Invoice Details and Call Details tables using the HCO, HCP and Employee tables.

# Invoice Details and HCO are left outer joined to add the dimension of the clinic’s “name”, needed for further joins, while keeping the records in Invoice Details intact.
#* Steps Taken:
#*# We identified the matching variables to be “CUSTOMER_CODE” in Invoice Details and “ZP Account” in HCO; both are IDs of clinics
#*# Go to “Tables” and “Join”
#*# Enter the fields as shown in the screenshot below
#*#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig5.png]]</div>
#*# Ensure that Left Outer Join is selected and the output columns are populated before clicking “OK” to apply the join
# Call Details and HCP are left outer joined to add the dimension of “primary parent” (the clinic name of individual doctors), needed for further joins, while keeping the records in Call Details intact.
#* Steps Taken:
#*# We identified the common variables to be “Account” in Call Details and “Name” in HCP; both are names of doctors
#*# Go to “Tables” and “Join”
#*# Enter the fields as shown in the screenshot below
#*#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig6.png]]</div>
#*# Ensure that Left Outer Join is selected and the output columns are populated before clicking “OK” to apply the join
# Invoice Details and Employee are left outer joined to add the dimension of “therapy area” (sales teams), needed for further analysis, while keeping the records in Invoice Details intact.
#* Steps Taken:
#*# We identified the matching variables to be “Rep Name” in both Invoice Details and Employee, which are the names of sales reps. Additionally, “year quarter” from both tables is also used, because sales reps may change their “Therapy area” (sales team) on a quarterly basis.
#*# We utilized another function, “Update”, to perform the same left outer join
#*# Go to “Tables” and “Update”
#*# Enter the fields as shown in the screenshot below
#*#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig7.png]]</div>
#*# Click “OK” to apply the update
</div>
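The left outer joins above map directly onto `DataFrame.merge` in pandas. A minimal sketch, with hypothetical stand-in tables (the key column names follow the write-up above, the values are invented):

```python
import pandas as pd

# Stand-ins for Invoice Details and HCO.
invoice = pd.DataFrame({
    "CUSTOMER_CODE": ["C001", "C002", "C003"],
    "Amount$": [150.0, 80.0, 220.0],
})
hco = pd.DataFrame({
    "ZP Account": ["C001", "C002"],
    "Name": ["Alpha Clinic", "Beta Clinic"],
})

# Left outer join: every invoice row is kept, even when no HCO record matches
# (C003 gets NaN for the clinic name), mirroring JMP's Tables -> Join dialog.
# Passing how="inner" instead would keep only the matched rows.
joined = invoice.merge(hco, how="left", left_on="CUSTOMER_CODE", right_on="ZP Account")
print(joined[["CUSTOMER_CODE", "Amount$", "Name"]])
```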
===<div style="background: #1b96fe;padding:0.6em; letter-spacing:0.1em; line-height: 0.7em; border-radius:20px; font-size:15px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; display: inline-block; margin-bottom:10px"><font color= #fff><strong>Data Aggregation</strong></font></div>===

<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">

The third stage of data preparation involves aggregating the Invoice Details and Call Details tables.

# Invoice Details is aggregated by “year-quarter” to derive the additional columns “sum(sales qty)” and “sum(amount$)”, in addition to the variables we are interested in: “channel” (clinic type), “rep name”, “product name”, “name” (clinic’s name) and “therapy area” (sales teams).
#* Steps Taken:
#*# Aggregation is performed using the “Summary” function
#*# Go to “Tables” and “Summary”
#*# Drop the fields into Statistics and Group as shown in the screenshot below
#*#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig8.png]]</div>
#*# Click “OK” to summarize the table
# Call Details is aggregated by “year-quarter” and “primary parent” to derive the additional column “no. of rows” (interaction count), in addition to the variables we are interested in: “call: owner name” (sales rep’s name) and “product”.
#* Steps Taken:
#*# Aggregation is performed using the “Summary” function
#*# Go to “Tables” and “Summary”
#*# Drop the fields into Statistics and Group as shown in the screenshot below
#*#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig9.png]]</div>
#*# Click “OK” to summarize the table
</div>
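JMP’s “Summary” aggregation corresponds to a pandas `groupby`. A minimal sketch with illustrative data, which also shows how a cancelled sale (negative quantity) nets out once rows are summed by quarter, and how a row count yields the interaction count:

```python
import pandas as pd

# Illustrative invoice rows; the -2 / -20.0 row voids part of a cancelled sale.
invoice = pd.DataFrame({
    "Year Quarter": ["2016Q1", "2016Q1", "2016Q2"],
    "Rep Name": ["Tan", "Tan", "Tan"],
    "Product Name": ["ProductA", "ProductA", "ProductA"],
    "Sales Qty": [10, -2, 5],
    "Amount$": [100.0, -20.0, 50.0],
})

# Group by the dimensions of interest and sum, mirroring JMP's Tables -> Summary.
summary = (invoice
           .groupby(["Year Quarter", "Rep Name", "Product Name"], as_index=False)
           [["Sales Qty", "Amount$"]].sum())

# For Call Details, the "N Rows" statistic per group is the interaction count.
calls = pd.DataFrame({
    "Year Quarter": ["2016Q1", "2016Q1", "2016Q2"],
    "Primary Parent": ["Alpha Clinic", "Alpha Clinic", "Alpha Clinic"],
})
interactions = (calls.groupby(["Year Quarter", "Primary Parent"])
                     .size().reset_index(name="Interaction Count"))

print(summary)
print(interactions)
```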
 
<!-- End Body --->
  
===<div style="background: #1b96fe;padding:0.6em; letter-spacing:0.1em; line-height: 0.7em; border-radius:20px; font-size:15px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; display: inline-block; margin-bottom:10px"><font color= #fff><strong>Data Integration</strong></font></div>===

<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">

The final stage of data preparation involves joining the Invoice Details and Call Details tables.

# Invoice Details and Call Details are inner joined by sales rep’s name, clinic’s name, product name and year-quarter. The other variables present in the final table are “channel”, “therapy area”, “interaction count”, “sum(sales qty)” and “sum(amount$)”.
#* Steps Taken:
#*# We identified the following matching variables from Call Details and Invoice Details
#*#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig9b.png]]</div>
#*# We utilized the Join function again to perform the inner join
#*# Go to “Tables” and “Join”
#*# Enter the fields as shown in the screenshot below
#*#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig10.png]]</div>
#*# Ensure that Inner Join is selected and the output columns are populated before clicking “OK” to finish the join

</div>

<!-- End Body --->


<!-- Body -->

==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; margin-bottom:5px"><font color= #000000><strong>ACTUAL METHOD: Analysis of Variance (ANOVA) using Fit Y by X</strong></font></div>==

<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">

Analysis of Variance (ANOVA) is a statistical method used to analyze differences among group means by comparing the variation between groups with the variation within groups. It is also a form of statistical hypothesis testing that assesses whether the differences between pairs of group means are significant.
<br/><br/>
Prior to using ANOVA, we attempted linear regression to generalize the relationship between the number of interactions and sales revenue. However, the low R-squared values we obtained suggested a weak correlation and a model that does not fit the data, which prompted us to carry out a similar analysis using ANOVA instead.
<br/><br/>
The first step in carrying out ANOVA is to discretize our explanatory variable, “interaction count”, into bins, thereby converting it from a numerical to a categorical variable. We discretize because we wish to understand whether these interaction bins differ significantly from one another in terms of sales revenue (the response).
<br/><br/>
To define the range of interaction counts for the “Low”, “Medium” and “High” bins, we consulted our sponsor, who proposed that “Low” covers an interaction count less than or equal to 1, “Medium” covers interaction counts from 2 to 4, and “High” covers interaction counts of 5 and above.
<br/><br/>
The steps taken to discretize interaction counts into bins are as follows:
# Insert a new column to the right of the Interaction column and name it Interaction Bin
# Right click the header of Interaction Bin and select “Formula”
# An interface to formulate the new column is displayed
#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig13.png]]</div>
# Using various Conditional and Comparison functions, enter the following formula proposed by our sponsor
#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig14.png]]</div>
# Upon clicking “OK”, the new column will be populated with the values “low”, “medium” and “high”
<br/><br/>
The next step in conducting ANOVA is to use the Fit Y by X function. Fit Y by X detects whether the response and explanatory variables selected are numerical or categorical, and accordingly carries out a bivariate, oneway, logistic or contingency analysis. In our scenario, the “X, Factor” (explanatory variable) is the interaction bin (categorical) and the “Y, Response” is the sales amount (numerical); thus, the analysis conducted is oneway.
<br/><br/>
The steps taken to use the Fit Y by X function for ANOVA are as follows:
# Go to “Analyze” and “Fit Y by X”
# Drop Sum(Amount$) into “Y, Response” and Interaction Bin into “X, Factor”
# To compare means from the perspective of individual channels or therapy areas, we also drop Channel or Therapy Area into “By”
# Click on “OK” to get a oneway analysis of Sum(Amount$) by interaction bin for individual channels / therapy areas
# To get in-depth details of the quantiles for each interaction bin, select the red triangle (upside-down red arrow) and click on “Quantiles”
#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig15.png]]</div>
# A box plot for each interaction bin will appear
#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig15b.png]]</div>
# To conduct the Tukey-Kramer HSD test for all pairs of interaction bins, select the red triangle again, then click on “Compare Means” and “All Pairs, Tukey HSD”
# A few reports will appear below the graph, but our attention is on the Ordered Differences report, whose p-values show whether the differences between the means of the interaction bins are significant. Fig 16 below is an instance of the output
#: <div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig16.png]]</div>

</div>
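As a cross-check outside JMP, the discretization and one-way ANOVA described above can be sketched with pandas and SciPy. The data below are simulated and the column names are illustrative; JMP’s “All Pairs, Tukey HSD” has scripted counterparts (e.g. in statsmodels), which we omit here.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Simulated aggregated table: interaction counts and quarterly sales amounts.
df = pd.DataFrame({
    "Interaction Count": rng.integers(0, 9, size=300),
    "Sum(Amount$)": rng.normal(1000, 200, size=300).round(2),
})

# Discretize interaction counts using the sponsor's cut-offs:
# <=1 -> Low, 2-4 -> Medium, >=5 -> High.
df["Interaction Bin"] = pd.cut(df["Interaction Count"],
                               bins=[-np.inf, 1, 4, np.inf],
                               labels=["Low", "Medium", "High"])

# One-way ANOVA of sales amount across the three interaction bins.
groups = [g["Sum(Amount$)"].values
          for _, g in df.groupby("Interaction Bin", observed=True)]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

With purely random data the test should usually find no significant difference; on the real consolidated table, a small p-value would indicate that at least one bin’s mean sales revenue differs from the others.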
 
<!-- End Body --->

Latest revision as of 14:01, 21 April 2017

Kesmyjxlogo.png

HOME

ABOUT US

PROJECT OVERVIEW

ANALYSIS & FINDINGS

PROJECT MANAGEMENT

DOCUMENTATION

Overview

Data

Methodology

<< ANLY482 AY2016-17 T2 Projects


Tools Used

Jmppro.png

SAS JMP Pro 13 is chosen as our primary tool for data preparation, exploratory and further analysis. It is an analytical software that can perform most statistical analysis on large datasets and generate results with interactive visualizations used by data scientists to manipulate data selection on the go. Furthermore, tutorials and guides are widely available online for us to learn JMP Pro 13’s different techniques and functions.
More importantly, its easy-to-use built-in tools enable us to conduct analysis of variance to determine relationship between interactions and sales revenue.


Data Preparation

Data preparation took us through time-consuming and tedious procedures to obtain high quality data. Though seemingly unrewarding, a set of high quality data allows for more accurate, reliable and consistent analysis of results. Therefore, it is imperative to invest a lot of time and effort on it to avoid getting false conclusions for our hypothesis. Firstly, using SAS JMP Pro, these tables are scanned for anomalies such as missing values or outliers. We corrected them appropriately, using imputation for missing values and omission for extreme outliers. This step ensures that the data we are using will not give us misleading insights.

Next, to determine the causality relationship between interactions and sales revenue, we need to join Call Details and Invoice Details tables. This leads us to understand variables from both tables to find out 1) which two variables are the same and whether their formats are alike, 2) at which granularity does each row from both tables represents. Upon fully understanding them, we performed aggregations, standardizations of formats and inclusions of HCP and HCO tables to serve as links. Furthermore, we added the dimensionality of employees’ teams from Employee table, as it will be useful in describing the relationship as mentioned. Finally, we integrated relevant tables into a consolidated table and loaded it into the JMP server for analysis.

Mattfig1.png


The above diagram illustrates how data tables are being integrated, which took a few stages to achieve.

Data Cleaning

The first stage of data preparation involves cleaning Invoice Details and Call Details tables.

  1. For missing values under “price$” column in Invoice Details, we imputed their values to “0” as these rows contain records where free samples given out to customers.
Mattfig2.png
    • Steps Taken:
      1. Press Ctrl-F to show “search data tables” interface
Mattfig3.png
      1. Enter the following fields:
        • Find what: *
        • Replace with 0
        • Tick “Match entire cell value”
        • Tick “Restrict to selected column” of “Price$”
        • Tick “Search data”
      2. Click on “Replace All” to apply change
  1. For negative “sales qty” and “amount$” values in Invoice Details, we did not take any action as they serve as records to void any sales that has been cancelled. Upon aggregation by quarters, no negative values will be present.
  1. For 5-digits postal codes in Invoice Details, we created a new column to store the converted ‘postal code’, with data type changed from numerical to categorical and formula function used to add the missing ‘0’.
    • Steps Taken:
      1. Right click on Postal Code’s header -> Select “Column Info” -> Change data type to “Character”
      2. Right click on Postal Code’s header -> Select “Insert Columns”
      3. Right click new column -> Select “Formula” -> Enter the formula as shown below
Mattfig4.png
      1. Click on “OK” to apply change
  1. To facilitate understanding quarterly performance of sales revenue in Invoice Details and Call Details, we added a “year-quarter” column for use during aggregation.
    • Steps Taken:
      1. Right click on “Invoice Date” of Invoice Details / “Date” of Call Details
      2. Select “New Formula Column” -> “Date Time” -> “Year Quarter” to create the new column
  2. For “Product Names” differences in Invoice Details and Call Details, we standardized a common format using the recode function. (No screenshots due to confidentiality clause)
    • Steps Taken:
      1. Click on “Product Name” of Invoice Details / “Product” of Call Details
      2. Go to “Cols” at the top -> Select “Recode”
      3. At the “recode” interface, input new values that are the standardized common format
      4. Click on “Done” and “In Place” to replace the old values

Adding Dimensionalities

The second stage of data preparation involves adding dimensionalities to Invoice Details and Call Details tables using HCO, HCP and Employee tables.

  1. Invoice Details and HCO are left outer joined to add dimensionality of clinic’s “name” needed for further joins, while keeping records in Invoice Details intact.
    • Steps Taken:
      1. We have identified the matching variables to be “CUSTOMER_CODE” in Invoice Details and “ZP Account” in HCO, both are IDs of clinics
      2. Go to “Tables” and “Join”
      3. Enter fields as shown in screenshot below
Mattfig5.png
      1. Ensure that Left Outer Join is selected and output columns are populated before clicking “OK” to apply join
  1. Call Details and HCP are left outer joined to add dimensionality of “primary parent” (clinic name of individual doctors) needed for further joins, while keeping records in Call Details intact.
    • Steps Taken:
      1. We have identified the common variables to be “Account” in Call Detailsand “Name” in HCP, both are name of doctors
      2. Go to “Tables” and “Join”
      3. Enter fields as shown in screenshot below
Mattfig6.png
      1. Ensure that Left Outer Join is selected and output columns are populated before clicking “OK” to apply join
  1. Invoice Details and Employee are left outer joined to add dimensionality of “therapy area” (sales teams) needed for further analysis, while keeping records in Call Details intact.
    • Steps Taken:
      1. We have identified the matching variables to be “Rep Name” in both Invoice Details and HCP, which are names of sales rep. Additionally, “year quarter” from both tables are also identified because sales reps may change their “Therapy area” (sales teams) by quarter basis.
      2. We will utilize another function “Update” to perform the same left outer join
      3. Go to “Tables” and “Update”
      4. Enter fields as shown in screenshot below
Mattfig7.png
      1. Click “OK” to apply update

Data Aggregation

The third stage of data preparation involves aggregating Invoice Details and Call Details

  1. Invoice Details are aggregated by “year-quarter” to derive additional columns “sum(sales qty)” and “sum(amount$)”, addition to variables we are interested at: “channel” (clinic type), “rep name”, “product name”, “name” (clinic’s name) and “therapy area” (sales teams).
    • Steps Taken:
      1. Aggregation will be performed using “Summary” function.
      2. Go to “Tables” and “Summary”
      3. Drop fields to Statistics and Group as shown in screenshot below
[[File:Mattfig8.png|center]]
      4. Click “OK” to summarize the table
  2. Call Details is aggregated by “year-quarter” and “primary parent” to derive the additional column “no. of rows” (interaction count), in addition to the variables of interest: “call: owner name” (sales rep’s name) and “product”.
    • Steps Taken:
      1. Aggregation will be performed using “Summary” function.
      2. Go to “Tables” and “Summary”
      3. Drop fields to Statistics and Group as shown in screenshot below
[[File:Mattfig9.png|center]]
      4. Click “OK” to summarize the table
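Both Summary steps are group-by aggregations: the grouping columns form the key, and each group is reduced to a sum (for sales) or a row count (for interactions). A pure-Python sketch of the two reductions, using invented sample rows and a trimmed-down set of grouping columns:

```python
from collections import defaultdict

# Invoice Details: sum the sales amount per (year-quarter, clinic).
invoice = [
    {"year quarter": "2016Q1", "name": "Clinic A", "amount$": 100.0},
    {"year quarter": "2016Q1", "name": "Clinic A", "amount$": 50.0},
    {"year quarter": "2016Q1", "name": "Clinic B", "amount$": 70.0},
]
sum_amount = defaultdict(float)
for row in invoice:
    sum_amount[(row["year quarter"], row["name"])] += row["amount$"]

# Call Details: count rows per (year-quarter, primary parent) --
# this row count is the "no. of rows" interaction count.
calls = [
    {"year quarter": "2016Q1", "primary parent": "Clinic A"},
    {"year quarter": "2016Q1", "primary parent": "Clinic A"},
    {"year quarter": "2016Q1", "primary parent": "Clinic B"},
]
interaction_count = defaultdict(int)
for row in calls:
    interaction_count[(row["year quarter"], row["primary parent"])] += 1
```

In JMP the full grouping key also includes the rep’s name and the product; the reduction logic is the same, only with a wider key tuple.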

Data Integration

The final stage of data preparation involves joining Invoice Details and Call Details.

  1. Invoice Details and Call Details are inner joined by sales rep’s name, clinic’s name, product name and year-quarter. The other variables present in the final table are “channel”, “therapy area”, “interaction count”, “sum(sales qty)” and “sum(amount$)”.
    • Steps Taken:
      1. We have identified the following matching variables from Call Details and Invoice Details
[[File:Mattfig9b.png|center]]
      2. We will utilize the Join function again to perform the inner join
      3. Go to “Tables” and “Join”
      4. Enter fields as shown in screenshot below
[[File:Mattfig10.png|center]]
      5. Ensure that Inner Join is selected and the output columns are populated before clicking “OK” to finish the join
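Unlike the earlier left outer joins, this integration is an inner join: only (rep, clinic, product, quarter) combinations present in both aggregated tables survive. A sketch with invented rows, keying both tables by the same tuple:

```python
# Aggregated sales, keyed by (rep, clinic, product, year-quarter).
sales = {
    ("Tan", "Clinic A", "Drug X", "2016Q1"): {"sum(amount$)": 150.0},
    ("Tan", "Clinic B", "Drug X", "2016Q1"): {"sum(amount$)": 70.0},
}
# Aggregated interactions, same key. No calls were recorded for
# Clinic B, so its sales row will be dropped by the inner join.
interactions = {
    ("Tan", "Clinic A", "Drug X", "2016Q1"): {"interaction count": 2},
}

# Inner join: keep only keys present in BOTH tables, merging columns.
final = {
    key: {**sales[key], **interactions[key]}
    for key in sales.keys() & interactions.keys()
}
```

The inner join is the right choice here because a row without both a sales figure and an interaction count cannot contribute to the interactions-versus-revenue analysis.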


ACTUAL METHOD: Analysis of Variance (ANOVA) using Fit Y by X

Analysis of Variance (ANOVA) is a statistical method used to analyze differences among group means by comparing the variance between groups with the variance within groups. It is a form of statistical hypothesis testing that determines whether the differences between pairs of group means are significant.

Prior to using ANOVA, we attempted to use linear regression to generalize the relationship between the number of interactions and sales revenue. However, the low R-squared values obtained suggested a weak linear relationship and a poor model fit, which prompted us to carry out a similar analysis using ANOVA instead.

The first step in carrying out ANOVA is to discretize our explanatory variable, “interaction count”, into bins, thereby converting it from a numerical to a categorical variable. We discretize because we wish to understand whether these interaction bins differ significantly from one another in terms of sales revenue (the response).

To define the ranges of interaction counts for the “Low”, “Medium” and “High” interaction bins, we consulted our sponsor, who proposed that “Low” covers interaction counts of 1 or fewer, “Medium” covers counts from 2 to 4, and “High” covers counts of 5 and above.

The steps taken to discretize interaction counts into bins are as follows:

  1. Insert a new column to the right of the interaction count column and name it “Interaction Bin”
  2. Right-click the header of Interaction Bin and select “Formula”
  3. An interface for defining the new column’s formula is displayed
[[File:Mattfig13.png|center]]
  4. Using various Conditional and Comparison functions, enter the following formula proposed by our sponsor
[[File:Mattfig14.png|center]]
  5. Upon clicking “OK”, the new column is populated with the values “low”, “medium” and “high”
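The JMP formula above amounts to a simple conditional using the sponsor’s cut-offs (≤1, 2–4, ≥5). A sketch of the same binning rule as a function:

```python
def interaction_bin(count: int) -> str:
    """Discretize an interaction count using the sponsor's cut-offs:
    <=1 -> low, 2-4 -> medium, >=5 -> high."""
    if count <= 1:
        return "low"
    elif count <= 4:
        return "medium"
    else:
        return "high"
```

Because the thresholds come from domain knowledge rather than the data, changing them (e.g. after a sensitivity check) only means editing this one rule, not the rest of the analysis.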



The next step in conducting ANOVA is to use the Fit Y by X function. Fit Y by X detects whether the selected response and explanatory variables are numerical or categorical, and accordingly carries out a bivariate, oneway, logistic or contingency analysis. In our scenario, the “X, Factor” (explanatory variable) is the interaction bin (categorical) and the “Y, Response” is the sales amount (numerical); thus, the analysis conducted is oneway.

The steps taken to use the Fit Y by X function for ANOVA are as follows:

  1. Go to “Analyze” and “Fit Y by X”
  2. Drop Sum(Amount$) to “Y, Response” and Interaction Bin to “X, Factor”
  3. To look into the perspective of individual channels or therapy areas when comparing their means, we will also drop Channel or Therapy Area to “By”
  4. Click on “OK” to get one way analysis of Sum(Amount$) by interaction bin for individual channels/ therapy areas
  5. To get in-depth details of the quantiles for each interaction bin, click the red triangle menu and select “Quantiles”
[[File:Mattfig15.png|center]]
  6. A red box plot for each interaction bin will appear
[[File:Mattfig15b.png|center]]
  7. To conduct the Tukey-Kramer HSD test for all pairs of interaction bins, click the red triangle menu again and select “Compare Means” and then “All Pairs, Tukey HSD”
  8. Several reports will appear below the graph, but our attention is on the Ordered Differences report, which calculates p-values showing whether the differences between the means of the interaction bins are significant. Fig 16 below is an instance of the output
[[File:Mattfig16.png|center]]
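Behind the oneway report, the F test compares between-bin variance with within-bin variance. A minimal pure-Python computation of the one-way ANOVA F statistic (the sales figures per bin are invented for illustration; JMP additionally reports the p-value and the Tukey-Kramer pairwise comparisons, which this sketch does not reproduce):

```python
def oneway_f(groups):
    """One-way ANOVA F statistic: between-group mean square
    divided by within-group mean square."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand_mean = sum(all_vals) / n
    # Between-group sum of squares (k - 1 degrees of freedom)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares (n - k degrees of freedom)
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Sales amounts per interaction bin (invented sample figures)
low = [40.0, 55.0, 40.0]
medium = [80.0, 95.0, 110.0]
high = [150.0, 170.0, 145.0]
f_stat = oneway_f([low, medium, high])
```

A large F means the bin means are far apart relative to the spread within each bin, which is exactly the evidence of a significant interactions-revenue relationship we are looking for.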