Difference between revisions of "ANLY482 AY2016-17 T2 Group10 Project Overview: Methodology"

From Analytics Practicum
Jump to navigation Jump to search
Line 43: Line 43:
 
</div>
 
</div>
 
<!------- End of Secondary Navigation Bar---->
 
<!------- End of Secondary Navigation Bar---->
 +
  
 
<!-- Body -->
 
<!-- Body -->
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe"><font color= #000000><strong>Data Preparation</strong></font></div>==
+
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe"><font color= #000000><strong>Tools Used</strong></font></div>==
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
Data preparation involves cleaning, transformation, and integration, which are standard procedures to standardize data across different datasets for their many formats, errors in data entries and granularity. We will first look at each of the data files, determine best ways to standardize formats and then perform aggregations on more granular data for integration purposes.  
+
SAS JMP Pro 13 is chosen as our primary tool for data preparation, exploratory and further analysis. It is an analytical software that can perform most statistical analysis on large datasets and generate results with interactive visualizations used by data scientists to manipulate data selection on the go. Furthermore, tutorials and guides are widely available online for us to learn JMP Pro 13’s different techniques and functions. More importantly, its easy-to-use built-in tools enable us to conduct analysis of variance to determine relationship between interactions and sales revenue.
 
</div>
 
</div>
 
<!-- End Body --->
 
<!-- End Body --->
Line 53: Line 54:
  
 
<!-- Body -->
 
<!-- Body -->
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe"><font color= #000000><strong>MCCP</strong></font></div>==
+
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe"><font color= #000000><strong>Data Preparation</strong></font></div>==
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 +
Data preparation took us through time-consuming and tedious procedures to obtain high quality data. Though seemingly unrewarding, a set of high quality data allows for more accurate, reliable and consistent analysis of results. Therefore, it is imperative to invest a lot of time and effort on it to avoid getting false conclusions for our hypothesis.
 
</div>
 
</div>
 
<!-- End Body --->
 
<!-- End Body --->
Line 60: Line 62:
  
 
<!-- Body -->
 
<!-- Body -->
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe"><font color= #000000><strong>Invoice Details</strong></font></div>==
+
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe"><font color= #000000><strong>MCCP</strong></font></div>==
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
===<span style="line-height: 0.1em;text-indent: 10px;background-color:#1b96fe;padding:5px;border-radius:5px;font-size:15px"><font color="white">Data Cleaning</font></span>===
 
A brief scan of the entire Invoice Details data table led to 3 main areas to be cleaned.
 
# Missing values in Price$ column
 
# Negative values in Sales Qty and Amount$ columns
 
# Some Postal Code with only 5 digits (because they start with 0)
 
<br/>
 
====Handling of missing values in Price$ column====
 
The Price$ column determines the unit price of a specific dosage (SKU) of a drug and it can vary across different customers, time for different reasons (marketing, incentive for new purchase, etc). It becomes important for us to know why some of them have missing values because the unit price of any drug is usually defined before any purchase.<br/>
 
Upon close inspection on the missing values using data filter, we are made known the following:
 
* 2379 rows with missing
 
* 1677 rows belong to product E/F
 
* Most records have sales amount which are $0
 
* Either Bonus Qty or Sample Qty are positive
 
This tell us that these rows represented transactions that took place when drugs are given as samples or bonuses to serve as goodwill.<br/>
 
Actions taken:
 
* We will be assigning a fair value of 0 to the missing values as JMP will ignore rows which have missing values if we were to take into consideration of price in our predictive analysis.
 
<br/>
 
====Handling of negative values in Sales Qty and Amount$ columns====
 
The Sales Qty and Amount$ columns indicate the quantity of drug and total amount involved in the transaction for a dosage (SKU) of a drug. Out of pure curiosity of the presence of negative values, we asked Elaine and she explained that negative values are credit amount which are needed to offset the initial sales.<br/>
 
We are made known of the following:
 
* 1233 rows with both Sales Qty and Amount$ negative
 
** Have corresponding transactions
 
* 229 rows with only Amount$ negative, all Sales Qty = 0, all Price$ have missing values
 
** No corresponding transactions
 
Actions taken:
 
* For the 229 rows, there is no indication of what the credit sales could be for and hence, we will filter them out.
 
* For the 1233 rows, we can simply ignore them as the corresponding transactions will cancel them out.
 
<br/>
 
====Handling of Postal Code with only 5 digits====
 
Initial import of the data table assumes that the Postal Code is a numerical variable instead of a categorical one. This causes some of the values which starts with 0 to be omitted. This is an easy fix whereby we only need to perform the recode function available in JMP.
 
<br/>
 
 
===<span style="line-height: 0.1em;text-indent: 10px;background-color:#1b96fe;padding:5px;border-radius:5px;font-size:15px"><font color="white">Data Transformation</font></span>===
 
To facilitate integrating the Invoice Details data table with others, we will add new formulated column like Year-Quarter.
 
====Addition of Year-Quarter columns====
 
The Year-Quarter column categorizes the invoice date into Year-Quarter, as businesses usually look at quarterly sales performance.
 
 
</div>
 
</div>
 
<!-- End Body --->
 
<!-- End Body --->
  
  
<!-- Body -->
 
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe"><font color= #000000><strong>Integrated MCCP & Invoice</strong></font></div>==
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
 
===<span style="line-height: 0.1em;text-indent: 10px;background-color:#1b96fe;padding:5px;border-radius:5px;font-size:15px"><font color="white">Data Transformation</font></span>===
 
Data transformation at integration stage involves the following:
 
# Determining linking variables between the two tables
 
# Comparing differences in formats
 
# Aggregating based on granularity of variables
 
# Joining of tables
 
<br />
 
====Determining linking variables between the two tables====
 
====Comparing differences in formats====
 
<u>Sales Rep Name Differences</u><br />
 
All Sales Rep Name in MCCP can be found in Invoice but there are 8 additional names not captured in MCCP. A data filter on these 8 names shows records from all quarters and hence, does not suggest reasons related to departure.<br/><br/>
 
<u>Product Name Differences</u><br />
 
All MCCP product names have a corresponding alias in Invoice albeit differences in format and granularity. For 1-to-1 mapping, 1-to-Many or Many-to-1 mapping, a single, standardized name will be assigned. This will ascribe variants of a product (RandomBrandName, RandomBrandName – Variant) to its main brand (RandomBrandName).
 
Recoding is performed for both data tables in JMP.
 
<br />
 
====Aggregating based on granularity of variables====
 
<u>Aggregating Invoice</u><br />
 
<u>Aggregating MCCP</u>
 
<br />
 
====Joining of tables====
 
</div>
 
 
<!-- End Body --->
 
<!-- End Body --->

Revision as of 11:13, 15 April 2017

Kesmyjxlogo.png

HOME

ABOUT US

PROJECT OVERVIEW

ANALYSIS & FINDINGS

PROJECT MANAGEMENT

DOCUMENTATION

Overview

Data

Methodology

<< ANLY482 AY2016-17 T2 Projects


Tools Used

SAS JMP Pro 13 is chosen as our primary tool for data preparation, exploratory and further analysis. It is an analytical software that can perform most statistical analysis on large datasets and generate results with interactive visualizations used by data scientists to manipulate data selection on the go. Furthermore, tutorials and guides are widely available online for us to learn JMP Pro 13’s different techniques and functions. More importantly, its easy-to-use built-in tools enable us to conduct analysis of variance to determine relationship between interactions and sales revenue.


Data Preparation

Data preparation took us through time-consuming and tedious procedures to obtain high quality data. Though seemingly unrewarding, a set of high quality data allows for more accurate, reliable and consistent analysis of results. Therefore, it is imperative to invest a lot of time and effort on it to avoid getting false conclusions for our hypothesis.


MCCP