Difference between revisions of "ANLY482 AY2016-17 T2 Group10 Project Overview: Methodology"

From Analytics Practicum
 
Line 43: Line 43:
 
</div>
 
</div>
 
<!------- End of Secondary Navigation Bar---->
 
<!------- End of Secondary Navigation Bar---->
 +
 +
 +
<!-- Body -->
 +
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; margin-bottom:5px"><font color= #000000><strong>Tools Used</strong></font></div>==
 +
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family:Eras ITC, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 +
[[File:Jmppro.png|200px|center]]
 +
SAS JMP Pro 13 was chosen as our primary tool for data preparation, exploratory analysis and further analysis. It is analytical software that can perform most statistical analyses on large datasets and present the results as interactive visualizations, which let analysts adjust the data selection on the go. In addition, tutorials and guides on JMP Pro 13’s techniques and functions are widely available online.<br/>
 +
More importantly, its easy-to-use built-in tools enable us to conduct analysis of variance to determine the relationship between interactions and sales revenue.
 +
</div>
 +
  
 
<!-- Body -->
 
<!-- Body -->
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe"><font color= #000000><strong>Data Preparation</strong></font></div>==
+
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; margin-bottom:5px"><font color= #000000><strong>Data Preparation</strong></font></div>==
 +
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 +
Data preparation involved time-consuming and tedious procedures to obtain high-quality data. Though seemingly unrewarding, high-quality data allows for more accurate, reliable and consistent analysis. It was therefore imperative to invest considerable time and effort in this stage to avoid drawing false conclusions about our hypothesis.
 +
First, using SAS JMP Pro, we scanned the source tables for anomalies such as missing values and outliers. We corrected them appropriately, using imputation for missing values and omission for extreme outliers, so that the data would not give us misleading insights.
 +
<br/><br/>
 +
Next, to determine the relationship between interactions and sales revenue, we needed to join the Call Details and Invoice Details tables. This required understanding the variables in both tables to find out 1) which variables match and whether their formats are alike, and 2) what level of granularity each row represents in each table. Having understood them, we aggregated the data, standardized formats and brought in the HCP and HCO tables to serve as links. We also added the dimension of employees’ teams from the Employee table, as it is useful in describing the relationship. Finally, we integrated the relevant tables into one consolidated table and loaded it into the JMP server for analysis.
 +
<br/><br/>
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden; border: 1px solid black">[[File:Mattfig1.png|500px]]</div>
 +
<br/>
 +
The diagram above illustrates how the data tables were integrated over several stages.
 +
<br/>
 +
</div>
 +
 
 +
===<div style="background: #1b96fe;padding:0.6em; letter-spacing:0.1em; line-height: 0.7em; border-radius:20px; font-size:15px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; display: inline-block; margin-bottom:10px"><font color= #fff><strong>Data Cleaning</strong></font></div>===
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
Data preparation involves cleaning, transformation, and integration, which are standard procedures to standardize data across different datasets for their many formats, errors in data entries and granularity. We will first look at each of the data files, determine best ways to standardize formats and then perform aggregations on more granular data for integration purposes.  
+
The first stage of data preparation involves cleaning Invoice Details and Call Details tables.
 +
<br/>
 +
# For missing values in the “price$” column of Invoice Details, we imputed “0”, as these rows record free samples given out to customers.
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig2.png]]</div>
 +
#* Steps Taken:
 +
#*# Press Ctrl-F to show “search data tables” interface
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig3.png]]</div>
 +
#*# Enter the following fields:
 +
#*#* Find what: *
 +
#*#* Replace with 0
 +
#*#* Tick “Match entire cell value”
 +
#*#* Tick “Restrict to selected column” of “Price$”
 +
#*#* Tick “Search data”
 +
#*# Click on “Replace All” to apply change
 +
 
 +
# For negative “sales qty” and “amount$” values in Invoice Details, we took no action, as they are records that void sales that have been cancelled. After aggregation by quarter, no negative values will be present.
 +
 
 +
# For 5-digit postal codes in Invoice Details, we created a new column to store the converted “postal code”, changing the data type from numerical to categorical and using a formula to add the missing leading “0”.
 +
#* Steps Taken:
 +
#*# Right click on Postal Code’s header -> Select “Column Info” -> Change data type to “Character”
 +
#*# Right click on Postal Code’s header -> Select “Insert Columns”
 +
#*# Right click the new column -> Select “Formula” -> Enter the formula as shown below
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig4.png]]</div>
 +
#*# Click on “OK” to apply change
 +
 
 +
# To facilitate understanding quarterly performance of sales revenue in Invoice Details and Call Details, we added a “year-quarter” column for use during aggregation.
 +
#* Steps Taken:
 +
#*# Right click on “Invoice Date” of Invoice Details / “Date” of Call Details
 +
#*# Select “New Formula Column” -> “Date Time” -> “Year Quarter” to create the new column
 +
# For “Product Names” differences in Invoice Details and Call Details, we standardized a common format using the recode function. (No screenshots due to confidentiality clause)
 +
#* Steps Taken:
 +
#*# Click on “Product Name” of Invoice Details / “Product” of Call Details
 +
#*# Go to “Cols” at the top -> Select “Recode”
 +
#*# At the “recode” interface, input new values that are the standardized common format
 +
#*# Click on “Done” and “In Place” to replace the old values
 
</div>
 
</div>
 
<!-- End Body --->
 
<!-- End Body --->
  
 +
===<div style="background: #1b96fe;padding:0.6em; letter-spacing:0.1em; line-height: 0.7em; border-radius:20px; font-size:15px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; display: inline-block; margin-bottom:10px"><font color= #fff><strong>Adding Dimensionalities</strong></font></div>===
 +
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 +
The second stage of data preparation involves adding dimensionalities to Invoice Details and Call Details tables using HCO, HCP and Employee tables.
 +
 +
# Invoice Details and HCO are left outer joined to add dimensionality of clinic’s “name” needed for further joins, while keeping records in Invoice Details intact.
 +
#* Steps Taken:
 +
#*# We have identified the matching variables to be “CUSTOMER_CODE” in Invoice Details and “ZP Account” in HCO, both of which are clinic IDs
 +
#*# Go to “Tables” and “Join”
 +
#*# Enter fields as shown in screenshot below
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig5.png]]</div>
 +
#*# Ensure that Left Outer Join is selected and output columns are populated before clicking “OK” to apply the join
 +
 +
# Call Details and HCP are left outer joined to add dimensionality of “primary parent” (clinic name of individual doctors) needed for further joins, while keeping records in Call Details intact.
 +
#* Steps Taken:
 +
#*# We have identified the common variables to be “Account” in Call Details and “Name” in HCP, both of which are names of doctors
 +
#*# Go to “Tables” and “Join”
 +
#*# Enter fields as shown in screenshot below
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig6.png]]</div>
 +
#*# Ensure that Left Outer Join is selected and output columns are populated before clicking “OK” to apply the join
 +
 +
# Invoice Details and Employee are left outer joined to add dimensionality of “therapy area” (sales teams) needed for further analysis, while keeping records in Invoice Details intact.
 +
#* Steps Taken:
 +
#*# We have identified the matching variables to be “Rep Name” in both Invoice Details and Employee, which holds the names of sales reps. Additionally, “year quarter” in both tables is matched because sales reps may change their “Therapy area” (sales team) from quarter to quarter.
 +
#*# We will use another function, “Update”, to perform the same left outer join
 +
#*# Go to “Tables” and “Update”
 +
#*# Enter fields as shown in screenshot below
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig7.png]]</div>
 +
#*# Click “OK” to apply update
 +
</div>
 +
 +
===<div style="background: #1b96fe;padding:0.6em; letter-spacing:0.1em; line-height: 0.7em; border-radius:20px; font-size:15px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; display: inline-block; margin-bottom:10px"><font color= #fff><strong>Data Aggregation</strong></font></div>===
 +
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 +
The third stage of data preparation involves aggregating Invoice Details and Call Details.
 +
 +
# Invoice Details is aggregated by “year-quarter” to derive the additional columns “sum(sales qty)” and “sum(amount$)”, in addition to the variables we are interested in: “channel” (clinic type), “rep name”, “product name”, “name” (clinic’s name) and “therapy area” (sales teams).
 +
#* Steps Taken:
 +
#*# Aggregation will be performed using “Summary” function.
 +
#*# Go to “Tables” and “Summary”
 +
#*# Drop fields to Statistics and Group as shown in the screenshot below
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig8.png]]</div>
 +
#*# Click “OK” to summarize the tables
 +
 +
# Call Details is aggregated by “year-quarter” and “primary parent” to derive the additional column “no. of rows” (interaction count), in addition to the variables we are interested in: “call: owner name” (sales rep’s name) and “product”.
 +
#* Steps Taken:
 +
#*# Aggregation will be performed using “Summary” function.
 +
#*# Go to “Tables” and “Summary”
 +
#*# Drop fields to Statistics and Group as shown in the screenshot below
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig9.png]]</div>
 +
#*# Click “OK” to summarize the tables
 +
</div>
 +
<!-- End Body --->
 +
 +
===<div style="background: #1b96fe;padding:0.6em; letter-spacing:0.1em; line-height: 0.7em; border-radius:20px; font-size:15px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; display: inline-block; margin-bottom:10px"><font color= #fff><strong>Data Integration</strong></font></div>===
  
<!-- Body -->
 
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe"><font color= #000000><strong>MCCP</strong></font></div>==
 
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 +
The final stage of data preparation involves joining Invoice Details and Call Details.
 +
 +
# Invoice Details and Call Details are inner joined by sales rep’s name, clinic’s name, product’s name and year-quarter. Other variables present in the final table are “channel”, “therapy area”, “interaction count”, “sum(sales qty)” and “sum(amount$)”.
 +
#* Steps Taken:
 +
#*# We have identified the following matching variables from Call Details and Invoice Details
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig9b.png]]</div>
 +
#*# We will utilize Join function again to perform the inner join
 +
#*# Go to “Tables” and “Join”
 +
#*# Enter fields as shown in screenshot below
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig10.png]]</div>
 +
#*# Ensure that inner join is selected and output columns are populated before clicking “OK” to finish the join
 
</div>
 
</div>
 
<!-- End Body --->
 
<!-- End Body --->
Line 60: Line 179:
  
 
<!-- Body -->
 
<!-- Body -->
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe"><font color= #000000><strong>Invoice Details</strong></font></div>==
+
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; margin-bottom:5px"><font color= #000000><strong>ACTUAL METHOD: Analysis of Variance (ANOVA) using Fit Y by X</strong></font></div>==
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
===<span style="line-height: 0.1em;text-indent: 10px;background-color:#1b96fe;padding:5px;border-radius:5px;font-size:15px"><font color="white">Data Cleaning</font></span>===
+
Analysis of Variance (ANOVA) is a statistical method for analyzing differences among group means by comparing the variance between groups with the variance within groups. It is a form of statistical hypothesis testing that checks whether the differences among group means are significant.
A brief scan of the entire Invoice Details data table led to 3 main areas to be cleaned.
+
<br/><br/>
# Missing values in Price$ column
+
Prior to using ANOVA, we attempted linear regression to generalize the relationship between the number of interactions and sales revenue. However, the resulting R-squared values were low, suggesting a weak correlation and a poorly fitting model, which prompted us to carry out a similar analysis using ANOVA instead.
# Negative values in Sales Qty and Amount$ columns
+
<br/><br/>
# Some Postal Code with only 5 digits (because they start with 0)
+
The first step in carrying out ANOVA is to discretize our explanatory variable, “interaction count”, into bins, converting it from a numerical to a categorical variable. We discretize because we wish to understand whether the interaction bins differ significantly from one another in sales revenue (the response).
<br/>
+
<br/><br/>
====Handling of missing values in Price$ column====
+
To define the range of interaction counts for “Low”, “Medium” and “High” interaction bins, we consulted our sponsor, who proposed that “Low” is for interaction count less than or equal to 1, “Medium” is for interaction count from 2 to 4 and “High” is for interaction count 5 and above.  
The Price$ column determines the unit price of a specific dosage (SKU) of a drug and it can vary across different customers, time for different reasons (marketing, incentive for new purchase, etc). It becomes important for us to know why some of them have missing values because the unit price of any drug is usually defined before any purchase.<br/>
+
<br/><br/>
Upon close inspection on the missing values using data filter, we are made known the following:
+
The steps taken to discretize interaction counts into bins are as follows:
* 2379 rows with missing
+
# Insert new column right of Interaction Column and name it as Interaction Bin
* 1677 rows belong to product E/F
+
# Right click header of Interaction Bin, select “Formula”
* Most records have sales amount which are $0
+
# An interface to formulate the new column is displayed
* Either Bonus Qty or Sample Qty are positive
+
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig13.png]]</div>
This tells us that these rows represent transactions in which drugs were given as samples or bonuses as a gesture of goodwill.<br/>
+
# Using various Conditional and Comparison functions, enter the following formula proposed by our sponsor
Actions taken:
+
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig14.png]]</div>
* We will be assigning a fair value of 0 to the missing values as JMP will ignore rows which have missing values if we were to take into consideration of price in our predictive analysis.
+
# Upon clicking “OK”, the new column will be populated with values of “low”, “medium” and “high”
<br/>
+
<br/><br/>
====Handling of negative values in Sales Qty and Amount$ columns====
+
The next step in conducting ANOVA is to use the Fit Y by X function. Fit Y by X detects whether the selected response and explanatory variables are numerical or categorical, and accordingly carries out bivariate, oneway, logistic or contingency analysis. In our scenario, the “X, Factor” (explanatory variable) is the interaction bin (categorical) and the “Y, Response” is the sales amount (numerical), so the analysis conducted is oneway.
The Sales Qty and Amount$ columns indicate the quantity of drug and total amount involved in the transaction for a dosage (SKU) of a drug. Out of pure curiosity of the presence of negative values, we asked Elaine and she explained that negative values are credit amount which are needed to offset the initial sales.<br/>
+
<br/><br/>
We are made known of the following:
+
The steps taken to use the Fit Y by X function for ANOVA are as follows:
* 1233 rows with both Sales Qty and Amount$ negative
+
# Go to “Analyze” and “Fit Y by X”
** Have corresponding transactions
+
# Drop Sum(Amount$) to “Y, Response” and Interaction Bin to “X, Factor”
* 229 rows with only Amount$ negative, all Sales Qty = 0, all Price$ have missing values
+
# To look into the perspective of individual channels or therapy areas when comparing their means, we will also drop Channel or Therapy Area to “By”
** No corresponding transactions
+
# Click on “OK” to get one way analysis of Sum(Amount$) by interaction bin for individual channels/ therapy areas
Actions taken:
+
# To get in-depth details of quantiles for each interaction bin, select the upside-down red arrow and click on “Quantiles”
* For the 229 rows, there is no indication of what the credit sales could be for and hence, we will filter them out.
+
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig15.png]]</div>
* For the 1233 rows, we can simply ignore them as the corresponding transactions will cancel them out.
+
# Red box plot for each interaction bin will appear
<br/>
+
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig15b.png]]</div>
====Handling of Postal Code with only 5 digits====
+
# To conduct Tukey-Kramer HSD test for all pairs of interaction bins, select the upside-down red arrow again, and click on “Compare Means” and “All Pairs, Tukey HSD”
On initial import, the Postal Code is assumed to be a numerical variable instead of a categorical one, so the leading 0 of some values is dropped. This is an easy fix whereby we only need to perform the recode function available in JMP.
+
# Several reports will appear below the graph, but our attention is on the Ordered Differences report, which gives a p-value showing whether the difference between the means of each pair of interaction bins is significant. Fig 16 below is an example of the output
 +
<div style="border-radius:5px; display: inline-block; overflow: hidden;border: 1px solid black">[[File:Mattfig16.png]]</div>
 
</div>
 
</div>
 
<!-- End Body --->
 
<!-- End Body --->

Latest revision as of 14:01, 21 April 2017



Tools Used

Jmppro.png

SAS JMP Pro 13 was chosen as our primary tool for data preparation, exploratory analysis and further analysis. It is analytical software that can perform most statistical analyses on large datasets and present the results as interactive visualizations, which let analysts adjust the data selection on the go. In addition, tutorials and guides on JMP Pro 13’s techniques and functions are widely available online.
More importantly, its easy-to-use built-in tools enable us to conduct analysis of variance to determine the relationship between interactions and sales revenue.


Data Preparation

Data preparation involved time-consuming and tedious procedures to obtain high-quality data. Though seemingly unrewarding, high-quality data allows for more accurate, reliable and consistent analysis. It was therefore imperative to invest considerable time and effort in this stage to avoid drawing false conclusions about our hypothesis. First, using SAS JMP Pro, we scanned the source tables for anomalies such as missing values and outliers. We corrected them appropriately, using imputation for missing values and omission for extreme outliers, so that the data would not give us misleading insights.

Next, to determine the relationship between interactions and sales revenue, we needed to join the Call Details and Invoice Details tables. This required understanding the variables in both tables to find out 1) which variables match and whether their formats are alike, and 2) what level of granularity each row represents in each table. Having understood them, we aggregated the data, standardized formats and brought in the HCP and HCO tables to serve as links. We also added the dimension of employees’ teams from the Employee table, as it is useful in describing the relationship. Finally, we integrated the relevant tables into one consolidated table and loaded it into the JMP server for analysis.

Mattfig1.png


The diagram above illustrates how the data tables were integrated over several stages.

Data Cleaning

The first stage of data preparation involves cleaning Invoice Details and Call Details tables.

  1. For missing values in the “price$” column of Invoice Details, we imputed “0”, as these rows record free samples given out to customers.
Mattfig2.png
    • Steps Taken:
      1. Press Ctrl-F to show the “search data tables” interface
Mattfig3.png
      2. Enter the following fields:
        • Find what: *
        • Replace with: 0
        • Tick “Match entire cell value”
        • Tick “Restrict to selected column” of “Price$”
        • Tick “Search data”
      3. Click on “Replace All” to apply the change
  2. For negative “sales qty” and “amount$” values in Invoice Details, we took no action, as they are records that void sales that have been cancelled. After aggregation by quarter, no negative values will be present.
  3. For 5-digit postal codes in Invoice Details, we created a new column to store the converted “postal code”, changing the data type from numerical to categorical and using a formula to add the missing leading “0”.
    • Steps Taken:
      1. Right click on Postal Code’s header -> Select “Column Info” -> Change data type to “Character”
      2. Right click on Postal Code’s header -> Select “Insert Columns”
      3. Right click the new column -> Select “Formula” -> Enter the formula as shown below
Mattfig4.png
      4. Click on “OK” to apply the change
  4. To make quarterly sales performance easier to analyze in Invoice Details and Call Details, we added a “year-quarter” column for use during aggregation.
    • Steps Taken:
      1. Right click on “Invoice Date” of Invoice Details / “Date” of Call Details
      2. Select “New Formula Column” -> “Date Time” -> “Year Quarter” to create the new column
  5. For differences in product names between Invoice Details and Call Details, we standardized them to a common format using the Recode function. (No screenshots due to confidentiality clause)
    • Steps Taken:
      1. Click on “Product Name” of Invoice Details / “Product” of Call Details
      2. Go to “Cols” at the top -> Select “Recode”
      3. In the “Recode” interface, enter the new values in the standardized common format
      4. Click on “Done” and “In Place” to replace the old values
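
For readers who prefer code to menu steps, the cleaning rules above can be sketched in Python. This is only an illustration, not the JMP procedure we ran: the function names and the sample row are hypothetical, postal codes are assumed to be 6 digits long, and the “2016Q3” label format is an assumption that may differ from JMP’s “Year Quarter” output.

```python
from datetime import date


def impute_price(price):
    """Treat a missing unit price as 0 (rows for free samples)."""
    return 0.0 if price is None else price


def pad_postal_code(code):
    """Restore the leading zero lost when a postal code was read as a number.

    Assumes postal codes are 6 digits long.
    """
    return str(code).zfill(6)


def year_quarter(d):
    """Label a date with its calendar year and quarter (assumed format)."""
    return f"{d.year}Q{(d.month - 1) // 3 + 1}"


# Hypothetical sample row, not taken from the project data.
row = {"price": None, "postal_code": 48623, "invoice_date": date(2016, 8, 15)}
cleaned = {
    "price": impute_price(row["price"]),
    "postal_code": pad_postal_code(row["postal_code"]),
    "year_quarter": year_quarter(row["invoice_date"]),
}
print(cleaned)
```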

Adding Dimensionalities

The second stage of data preparation involves adding dimensionalities to Invoice Details and Call Details tables using HCO, HCP and Employee tables.

  1. Invoice Details and HCO are left outer joined to add dimensionality of clinic’s “name” needed for further joins, while keeping records in Invoice Details intact.
    • Steps Taken:
      1. We have identified the matching variables to be “CUSTOMER_CODE” in Invoice Details and “ZP Account” in HCO, both of which are clinic IDs
      2. Go to “Tables” and “Join”
      3. Enter fields as shown in the screenshot below
Mattfig5.png
      4. Ensure that Left Outer Join is selected and output columns are populated before clicking “OK” to apply the join
  2. Call Details and HCP are left outer joined to add dimensionality of “primary parent” (clinic name of individual doctors) needed for further joins, while keeping records in Call Details intact.
    • Steps Taken:
      1. We have identified the common variables to be “Account” in Call Details and “Name” in HCP, both of which are names of doctors
      2. Go to “Tables” and “Join”
      3. Enter fields as shown in the screenshot below
Mattfig6.png
      4. Ensure that Left Outer Join is selected and output columns are populated before clicking “OK” to apply the join
  3. Invoice Details and Employee are left outer joined to add dimensionality of “therapy area” (sales teams) needed for further analysis, while keeping records in Invoice Details intact.
    • Steps Taken:
      1. We have identified the matching variables to be “Rep Name” in both Invoice Details and Employee, which holds the names of sales reps. Additionally, “year quarter” in both tables is matched because sales reps may change their “Therapy area” (sales team) from quarter to quarter.
      2. We will use another function, “Update”, to perform the same left outer join
      3. Go to “Tables” and “Update”
      4. Enter fields as shown in the screenshot below
Mattfig7.png
      5. Click “OK” to apply the update
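
The left outer joins in this stage can be illustrated with a short Python sketch. It is not the JMP “Join” or “Update” function itself: left_outer_join is a hypothetical helper, the sample rows are made up, and the sketch assumes each key appears at most once in the lookup table.

```python
def left_outer_join(left_rows, right_rows, left_key, right_key, columns):
    """Keep every left row and attach the chosen columns from the matching right row.

    Left rows without a match keep None in the added columns.
    Assumes right_key values are unique in right_rows.
    """
    lookup = {r[right_key]: r for r in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[left_key])
        extra = {c: (match[c] if match else None) for c in columns}
        joined.append({**row, **extra})
    return joined


# Hypothetical sample rows, not taken from the project data.
invoices = [
    {"CUSTOMER_CODE": "C1", "amount": 100.0},
    {"CUSTOMER_CODE": "C9", "amount": 50.0},  # no matching clinic in HCO
]
hco = [{"ZP Account": "C1", "name": "Clinic A"}]

result = left_outer_join(invoices, hco, "CUSTOMER_CODE", "ZP Account", ["name"])
print(result)
```

Both invoice rows survive the join, which is the reason for choosing a left outer join here: no sales record is dropped just because its clinic is missing from the lookup table.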

Data Aggregation

The third stage of data preparation involves aggregating Invoice Details and Call Details.

  1. Invoice Details is aggregated by “year-quarter” to derive the additional columns “sum(sales qty)” and “sum(amount$)”, in addition to the variables we are interested in: “channel” (clinic type), “rep name”, “product name”, “name” (clinic’s name) and “therapy area” (sales teams).
    • Steps Taken:
      1. Aggregation will be performed using the “Summary” function.
      2. Go to “Tables” and “Summary”
      3. Drop fields to Statistics and Group as shown in the screenshot below
Mattfig8.png
      4. Click “OK” to summarize the table
  2. Call Details is aggregated by “year-quarter” and “primary parent” to derive the additional column “no. of rows” (interaction count), in addition to the variables we are interested in: “call: owner name” (sales rep’s name) and “product”.
    • Steps Taken:
      1. Aggregation will be performed using the “Summary” function.
      2. Go to “Tables” and “Summary”
      3. Drop fields to Statistics and Group as shown in the screenshot below
Mattfig9.png
      4. Click “OK” to summarize the table
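
As a rough illustration of what the “Summary” step computes, the sketch below groups rows by a set of columns, sums the numeric columns and counts the rows in each group. The summarize helper, the column names and the sample rows are hypothetical. The example also shows a voided sale cancelling out against its negative credit row once the quarter is summed.

```python
from collections import defaultdict


def summarize(rows, group_cols, sum_cols):
    """Group rows by group_cols, summing sum_cols and counting rows per group."""
    totals = defaultdict(lambda: {"n_rows": 0, **{c: 0.0 for c in sum_cols}})
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        totals[key]["n_rows"] += 1
        for c in sum_cols:
            totals[key][c] += row[c]
    return dict(totals)


# Hypothetical sample rows: a sale, the credit that voids it, and another sale.
invoice_rows = [
    {"year_quarter": "2016Q3", "rep_name": "Rep A", "sales_qty": 10, "amount": 100.0},
    {"year_quarter": "2016Q3", "rep_name": "Rep A", "sales_qty": -10, "amount": -100.0},
    {"year_quarter": "2016Q3", "rep_name": "Rep A", "sales_qty": 4, "amount": 40.0},
]

summary = summarize(invoice_rows, ["year_quarter", "rep_name"], ["sales_qty", "amount"])
print(summary)
```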

Data Integration

The final stage of data preparation involves joining Invoice Details and Call Details.

  1. Invoice Details and Call Details are inner joined by sales rep’s name, clinic’s name, product’s name and year-quarter. Other variables present in the final table are “channel”, “therapy area”, “interaction count”, “sum(sales qty)” and “sum(amount$)”.
    • Steps Taken:
      1. We have identified the following matching variables from Call Details and Invoice Details
Mattfig9b.png
      2. We will use the Join function again to perform the inner join
      3. Go to “Tables” and “Join”
      4. Enter fields as shown in the screenshot below
Mattfig10.png
      5. Ensure that inner join is selected and output columns are populated before clicking “OK” to finish the join
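
The inner join on four matching variables can be sketched as a lookup on a composite key. This is an illustration only: the inner_join helper and the sample rows are hypothetical, and the sketch assumes the key columns have already been renamed to the same names in both summarized tables.

```python
KEYS = ("rep_name", "clinic_name", "product", "year_quarter")


def inner_join(invoice_rows, call_rows, keys=KEYS):
    """Keep only invoice rows whose composite key also appears in the call rows."""
    calls_by_key = {tuple(c[k] for k in keys): c for c in call_rows}
    joined = []
    for inv in invoice_rows:
        call = calls_by_key.get(tuple(inv[k] for k in keys))
        if call is not None:
            joined.append({**inv, "interaction_count": call["interaction_count"]})
    return joined


# Hypothetical summarized rows, not taken from the project data.
invoice_summary = [
    {"rep_name": "Rep A", "clinic_name": "Clinic A", "product": "P1",
     "year_quarter": "2016Q3", "sum_amount": 40.0},
    {"rep_name": "Rep A", "clinic_name": "Clinic B", "product": "P1",
     "year_quarter": "2016Q3", "sum_amount": 75.0},  # no recorded interactions
]
call_summary = [
    {"rep_name": "Rep A", "clinic_name": "Clinic A", "product": "P1",
     "year_quarter": "2016Q3", "interaction_count": 3},
]

final_table = inner_join(invoice_summary, call_summary)
print(final_table)
```

Unlike the earlier left outer joins, an inner join drops rows without a partner, so the final table contains only rep, clinic, product and quarter combinations that have both sales and interactions.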


ACTUAL METHOD: Analysis of Variance (ANOVA) using Fit Y by X

Analysis of Variance (ANOVA) is a statistical method for analyzing differences among group means by comparing the variance between groups with the variance within groups. It is a form of statistical hypothesis testing that checks whether the differences among group means are significant.

Prior to using ANOVA, we attempted linear regression to generalize the relationship between the number of interactions and sales revenue. However, the resulting R-squared values were low, suggesting a weak correlation and a poorly fitting model, which prompted us to carry out a similar analysis using ANOVA instead.

The first step in carrying out ANOVA is to discretize our explanatory variable, “interaction count”, into bins, converting it from a numerical to a categorical variable. We discretize because we wish to understand whether the interaction bins differ significantly from one another in sales revenue (the response).

To define the range of interaction counts for “Low”, “Medium” and “High” interaction bins, we consulted our sponsor, who proposed that “Low” is for interaction count less than or equal to 1, “Medium” is for interaction count from 2 to 4 and “High” is for interaction count 5 and above.

The steps taken to discretize interaction counts into bins are as follows:

  1. Insert a new column to the right of the Interaction column and name it Interaction Bin
  2. Right click the header of Interaction Bin and select “Formula”
  3. An interface for defining the new column’s formula is displayed
Mattfig13.png
  4. Using the Conditional and Comparison functions, enter the following formula proposed by our sponsor
Mattfig14.png
  5. Upon clicking “OK”, the new column will be populated with the values “low”, “medium” and “high”
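
The binning rule proposed by our sponsor can be written out as a small Python function. This is a sketch of the rule described above, not the JMP formula in the screenshot; the function name is hypothetical.

```python
def interaction_bin(count):
    """Map an interaction count to a bin: <= 1 Low, 2 to 4 Medium, >= 5 High."""
    if count <= 1:
        return "Low"
    if count <= 4:
        return "Medium"
    return "High"


# Bin a few illustrative counts.
for n in (0, 1, 2, 4, 5, 12):
    print(n, interaction_bin(n))
```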



The next step in conducting ANOVA is to use the Fit Y by X function. Fit Y by X detects whether the selected response and explanatory variables are numerical or categorical, and accordingly carries out bivariate, oneway, logistic or contingency analysis. In our scenario, the “X, Factor” (explanatory variable) is the interaction bin (categorical) and the “Y, Response” is the sales amount (numerical), so the analysis conducted is oneway.

The steps taken to use the Fit Y by X function for ANOVA are as follows:

  1. Go to “Analyze” and “Fit Y by X”
  2. Drop Sum(Amount$) to “Y, Response” and Interaction Bin to “X, Factor”
  3. To compare means within individual channels or therapy areas, also drop Channel or Therapy Area to “By”
  4. Click on “OK” to get the oneway analysis of Sum(Amount$) by interaction bin for individual channels or therapy areas
  5. To get the quantiles for each interaction bin, select the upside-down red arrow and click on “Quantiles”
Mattfig15.png
  6. A red box plot for each interaction bin will appear
Mattfig15b.png
  7. To conduct the Tukey-Kramer HSD test for all pairs of interaction bins, select the upside-down red arrow again, and click on “Compare Means” and “All Pairs, Tukey HSD”
  8. Several reports will appear below the graph, but our attention is on the Ordered Differences report, which gives a p-value showing whether the difference between the means of each pair of interaction bins is significant. Fig 16 below is an example of the output
Mattfig16.png
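
To make the oneway analysis less of a black box, the sketch below computes the one-way ANOVA F statistic by hand: the variance between bin means divided by the variance within bins. It is an illustration under stated assumptions, not JMP’s implementation. The function name and the sales figures are made up, and the sketch stops at the F statistic; it does not compute a p-value or the Tukey-Kramer HSD comparisons.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numbers."""
    k = len(groups)                      # number of groups (bins)
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Made-up sales amounts for the Low, Medium and High interaction bins.
low = [1.0, 2.0, 3.0]
medium = [2.0, 3.0, 4.0]
high = [5.0, 6.0, 7.0]

f_stat = one_way_anova_f([low, medium, high])
print(f"F = {f_stat:.2f}")
```

A large F statistic indicates that the bin means differ by more than the spread within each bin would explain, which is the situation in which the pairwise Tukey-Kramer comparisons become worth reading.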