ANLY482 AY2016-17 T2 Group10 Project Overview: Methodology



Tools Used

Jmppro.png

SAS JMP Pro 13 was chosen as our primary tool for data preparation, exploratory analysis and further analysis. It is an analytical software package that can perform most statistical analyses on large datasets and generate results with interactive visualizations, letting us manipulate data selections on the go. Furthermore, tutorials and guides for JMP Pro 13’s different techniques and functions are widely available online. More importantly, its easy-to-use built-in tools enable us to conduct analysis of variance to determine the relationship between interactions and sales revenue.


Data Preparation

Data preparation took us through time-consuming and tedious procedures to obtain high-quality data. Though seemingly unrewarding, a high-quality dataset allows for more accurate, reliable and consistent analysis results, so it is imperative to invest time and effort here to avoid drawing false conclusions about our hypothesis. Firstly, using SAS JMP Pro, the tables were scanned for anomalies such as missing values and outliers. We corrected them appropriately, using imputation for missing values and omission for extreme outliers. This step ensures that the data we use will not give us misleading insights.

Next, to examine the causal relationship between interactions and sales revenue, we need to join the Call Details and Invoice Details tables. This requires understanding the variables in both tables to determine 1) which variables match and whether their formats are alike, and 2) the granularity that each row of each table represents. Having understood them, we performed aggregations, standardized formats, and used the HCP and HCO tables to serve as links between the two. Furthermore, we added the dimension of employees’ teams from the Employee table, as it is useful in describing the relationship mentioned above. Finally, we integrated the relevant tables into a consolidated table and loaded it into the JMP server for analysis.

Mattfig1.png


The diagram above illustrates how the data tables are integrated, which took a few stages to achieve.

Data Cleaning

The first stage of data preparation involves cleaning the Invoice Details and Call Details tables; an illustrative pandas sketch of the equivalent operations follows the list below.

  1. For missing values under the “price$” column in Invoice Details, we imputed the values to “0”, as these rows are records where free samples were given out to customers.
Mattfig2.png
    • Steps Taken:
      1. Press Ctrl-F to show the “search data tables” interface
Mattfig3.png
      2. Enter the following fields:
        • Find what: *
        • Replace with: 0
        • Tick “Match entire cell value”
        • Tick “Restrict to selected column” of “Price$”
        • Tick “Search data”
      3. Click on “Replace All” to apply the change
  2. For negative “sales qty” and “amount$” values in Invoice Details, we did not take any action, as they serve as records that void sales which have been cancelled. After aggregation by quarters, no negative values will remain.
  3. For 5-digit postal codes in Invoice Details, we created a new column to store the converted “postal code”, with the data type changed from numerical to categorical and a formula used to add the missing leading “0”.
    • Steps Taken:
      1. Right click on Postal Code’s header -> Select “Column Info” -> Change data type to “Character”
      2. Right click on Postal Code’s header -> Select “Insert Columns”
      3. Right click the new column -> Select “Formula” -> Enter the formula as shown below
Mattfig4.png
      4. Click on “OK” to apply the change
  4. To facilitate understanding of the quarterly performance of sales revenue in Invoice Details and Call Details, we added a “year-quarter” column for use during aggregation.
    • Steps Taken:
      1. Right click on “Invoice Date” of Invoice Details / “Date” of Call Details
      2. Select “New Formula Column” -> “Date Time” -> “Year Quarter” to create the new column
  5. For “Product Name” differences between Invoice Details and Call Details, we standardized a common format using the Recode function. (No screenshots due to confidentiality clause)
    • Steps Taken:
      1. Click on “Product Name” of Invoice Details / “Product” of Call Details
      2. Go to “Cols” at the top -> Select “Recode”
      3. At the “Recode” interface, input new values that follow the standardized common format
      4. Click on “Done” and “In Place” to replace the old values
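
For readers without JMP, the sketch below shows roughly equivalent cleaning operations in pandas. The column names follow the steps above, but the file name invoice_details.csv and the product recode mapping are made-up placeholders; this is an illustrative sketch, not the project's actual script.

<syntaxhighlight lang="python">
import pandas as pd

# Hypothetical export of the Invoice Details table (placeholder file name)
invoice = pd.read_csv("invoice_details.csv")

# 1. Impute missing price$ values as 0 (rows for free samples)
invoice["price$"] = invoice["price$"].fillna(0)

# 2. Negative "sales qty" / "amount$" rows are kept as-is: they void cancelled
#    sales and net out once the data is aggregated by quarter.

# 3. Restore the missing leading zero of 5-digit postal codes and store as text
#    (assumes the column was read in as numbers with no missing values)
invoice["Postal Code"] = invoice["Postal Code"].astype(int).astype(str).str.zfill(6)

# 4. Derive a year-quarter column (e.g. "2016Q3") from the invoice date
invoice["Invoice Date"] = pd.to_datetime(invoice["Invoice Date"])
invoice["year-quarter"] = invoice["Invoice Date"].dt.to_period("Q").astype(str)

# 5. Recode product names into one common format (mapping values are placeholders)
product_recode = {"PRODUCT A 10MG": "Product A", "Prod A": "Product A"}
invoice["Product Name"] = invoice["Product Name"].replace(product_recode)
</syntaxhighlight>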

Adding Dimensionalities

The second stage of data preparation involves adding dimensionalities to the Invoice Details and Call Details tables using the HCO, HCP and Employee tables; a pandas sketch of the equivalent joins follows the list.

  1. Invoice Details and HCO are left outer joined to add the dimension of the clinic’s “name” needed for further joins, while keeping the records in Invoice Details intact.
    • Steps Taken:
      1. We identified the matching variables to be “CUSTOMER_CODE” in Invoice Details and “ZP Account” in HCO; both are IDs of clinics
      2. Go to “Tables” and “Join”
      3. Enter the fields as shown in the screenshot below
Mattfig5.png
      4. Ensure that Left Outer Join is selected and the output columns are populated before clicking “OK” to apply the join
  2. Call Details and HCP are left outer joined to add the dimension of “primary parent” (the clinic name of each individual doctor) needed for further joins, while keeping the records in Call Details intact.
    • Steps Taken:
      1. We identified the matching variables to be “Account” in Call Details and “Name” in HCP; both are names of doctors
      2. Go to “Tables” and “Join”
      3. Enter the fields as shown in the screenshot below
Mattfig6.png
      4. Ensure that Left Outer Join is selected and the output columns are populated before clicking “OK” to apply the join
  3. Invoice Details and Employee are left outer joined to add the dimension of “therapy area” (sales teams) needed for further analysis, while keeping the records in Invoice Details intact.
    • Steps Taken:
      1. We identified the matching variables to be “Rep Name” in both Invoice Details and Employee, which are names of sales reps. Additionally, “year quarter” from both tables is also used as a key, because sales reps may change their “therapy area” (sales team) on a quarterly basis.
      2. We utilize another function, “Update”, to perform the same left outer join
      3. Go to “Tables” and “Update”
      4. Enter the fields as shown in the screenshot below
Mattfig7.png
      5. Click “OK” to apply the update
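
Below is an illustrative pandas sketch of the three left outer joins described above. Table and column names follow the report; the CSV file names are assumptions, and the actual work was done through JMP's Join and Update dialogs.

<syntaxhighlight lang="python">
import pandas as pd

# Placeholder exports of the cleaned tables
invoice  = pd.read_csv("invoice_details_clean.csv")
calls    = pd.read_csv("call_details_clean.csv")
hco      = pd.read_csv("hco.csv")       # clinics: "ZP Account", "Name", ...
hcp      = pd.read_csv("hcp.csv")       # doctors: "Name", "Primary Parent", ...
employee = pd.read_csv("employee.csv")  # reps: "Rep Name", "year-quarter", "Therapy Area"

# 1. Invoice Details + HCO: add the clinic's name, keep every invoice row
invoice = invoice.merge(hco[["ZP Account", "Name"]],
                        how="left", left_on="CUSTOMER_CODE", right_on="ZP Account")

# 2. Call Details + HCP: add each doctor's primary parent (clinic name)
calls = calls.merge(hcp[["Name", "Primary Parent"]],
                    how="left", left_on="Account", right_on="Name")

# 3. Invoice Details + Employee: add the rep's therapy area (sales team),
#    matching on both rep name and year-quarter since teams change quarterly
invoice = invoice.merge(employee[["Rep Name", "year-quarter", "Therapy Area"]],
                        how="left", on=["Rep Name", "year-quarter"])
</syntaxhighlight>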

Data Aggregation

The third stage of data preparation involves aggregating the Invoice Details and Call Details tables; a pandas sketch of the equivalent aggregations follows the list.

  1. Invoice Details is aggregated by “year-quarter” to derive the additional columns “sum(sales qty)” and “sum(amount$)”, in addition to the variables we are interested in: “channel” (clinic type), “rep name”, “product name”, “name” (clinic’s name) and “therapy area” (sales teams).
    • Steps Taken:
      1. Aggregation is performed using the “Summary” function
      2. Go to “Tables” and “Summary”
      3. Drop the fields into Statistics and Group as shown in the screenshot below
Mattfig8.png
      4. Click “OK” to summarize the table
  2. Call Details is aggregated by “year-quarter” and “primary parent” to derive the additional column “no. of rows” (interaction count), in addition to the variables we are interested in: “call: owner name” (sales rep’s name) and “product”.
    • Steps Taken:
      1. Aggregation is performed using the “Summary” function
      2. Go to “Tables” and “Summary”
      3. Drop the fields into Statistics and Group as shown in the screenshot below
Mattfig9.png
      4. Click “OK” to summarize the table
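
The sketch below continues from the joined invoice and calls frames of the previous sketch and shows a pandas equivalent of JMP's Summary function; the grouping columns follow the steps above.

<syntaxhighlight lang="python">
import pandas as pd

# "invoice" and "calls" are the joined tables from the previous sketch

# Invoice Details: sum sales qty and amount$ per quarter and descriptive columns
invoice_q = (
    invoice.groupby(["year-quarter", "channel", "Rep Name",
                     "Product Name", "Name", "Therapy Area"], as_index=False)
           .agg({"sales qty": "sum", "amount$": "sum"})
           .rename(columns={"sales qty": "sum(sales qty)", "amount$": "sum(amount$)"})
)

# Call Details: count rows per quarter and clinic to get the interaction count
calls_q = (
    calls.groupby(["year-quarter", "Primary Parent",
                   "Call: Owner Name", "Product"], as_index=False)
         .size()
         .rename(columns={"size": "interaction count"})
)
</syntaxhighlight>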

Data Integration

The final stage of data preparation involves joining the Invoice Details and Call Details tables; a pandas sketch of the equivalent inner join follows the list.

  1. Invoice Details and Call Details are inner joined by sales rep’s name, clinic’s name, product name and year-quarter. The other variables present in the final table are “channel”, “therapy area”, “interaction count”, “sum(sales qty)” and “sum(amount$)”.
    • Steps Taken:
      1. We identified the following matching variables from Call Details and Invoice Details
Mattfig9b.png
      2. We utilize the Join function again to perform the inner join
      3. Go to “Tables” and “Join”
      4. Enter the fields as shown in the screenshot below
Mattfig10.png
      5. Ensure that Inner Join is selected and the output columns are populated before clicking “OK” to finish the join
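
A pandas sketch of the final inner join is shown below, continuing from the aggregated invoice_q and calls_q frames. Renaming the Call Details keys to match the Invoice Details keys is a convenience of this sketch, not a step taken in JMP.

<syntaxhighlight lang="python">
import pandas as pd

# Rename the Call Details keys so both tables share the same key names
calls_q = calls_q.rename(columns={"Call: Owner Name": "Rep Name",
                                  "Primary Parent": "Name",
                                  "Product": "Product Name"})

# Inner join on rep name, clinic name, product name and year-quarter
final = invoice_q.merge(calls_q, how="inner",
                        on=["Rep Name", "Name", "Product Name", "year-quarter"])

# Keep the join keys plus the variables used in the analysis
final = final[["year-quarter", "Rep Name", "Name", "Product Name",
               "channel", "Therapy Area", "interaction count",
               "sum(sales qty)", "sum(amount$)"]]
</syntaxhighlight>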


ACTUAL METHOD: Analysis of Variance (ANOVA) using Fit Y by X

Analysis of Variance (ANOVA) is a statistical method used to analyze differences among group means and the variation within and between groups. It is also a form of statistical hypothesis testing, used to test whether differences between pairs of group means are significant.
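
For reference, the quantity behind a one-way ANOVA is the F ratio of between-group to within-group variability (standard textbook form, not taken from the project data):

<math>
F = \frac{\sum_{i=1}^{k} n_i(\bar{Y}_{i} - \bar{Y})^{2} / (k-1)}
         {\sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i})^{2} / (N-k)}
</math>

where k is the number of groups, n_i the number of observations in group i, and N the total number of observations; a large F (small p-value) indicates that at least one group mean differs from the others.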

Prior to using ANOVA, we attempted to use linear regression to generalize the relationship between the number of interactions and sales revenue. However, we obtained low R-squared values, suggesting weak correlation and a poor model fit, and this prompted us to carry out a similar analysis by treating interaction count as a grouping factor and comparing group means with ANOVA instead.
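
As an illustration of this regression check (not the actual project script), the sketch below fits quarterly sales revenue against interaction count on the consolidated table from the data-integration sketch and prints the R-squared value; statsmodels stands in here for JMP's regression platform.

<syntaxhighlight lang="python">
import statsmodels.api as sm

# "final" is the consolidated table from the data-integration sketch
X = sm.add_constant(final["interaction count"])   # predictor plus intercept
y = final["sum(amount$)"]                         # quarterly sales revenue

model = sm.OLS(y, X).fit()
print(model.rsquared)   # a low value here is what pushed us towards ANOVA
</syntaxhighlight>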

The first step in carrying out ANOVA is to discretize our explanatory variable, “interaction count”, into bins, converting it from a numerical to a categorical variable. We discretize because we wish to understand whether these interaction bins differ significantly from one another in terms of sales revenue (the response).

To define the ranges of interaction counts for the “Low”, “Medium” and “High” interaction bins, we consulted our sponsor, who proposed that “Low” covers an interaction count of 1 or less, “Medium” covers interaction counts from 2 to 4, and “High” covers interaction counts of 5 and above.
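
The sketch below illustrates the sponsor's binning rule followed by a one-way ANOVA, again on the consolidated table from the earlier sketches; scipy's f_oneway stands in for what JMP's Fit Y by X platform produces interactively.

<syntaxhighlight lang="python">
import pandas as pd
from scipy import stats

# Bin the interaction count: <=1 -> Low, 2-4 -> Medium, >=5 -> High
final["interaction bin"] = pd.cut(final["interaction count"],
                                  bins=[-float("inf"), 1, 4, float("inf")],
                                  labels=["Low", "Medium", "High"])

# One-way ANOVA of quarterly sales revenue across the three bins
groups = [g["sum(amount$)"].values
          for _, g in final.groupby("interaction bin", observed=True)]
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)   # a small p-value => at least one bin mean differs
</syntaxhighlight>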