Difference between revisions of "Be Customer Wise or Otherwise - Findings"

From Analytics Practicum
Jump to navigation Jump to search
Line 21: Line 21:
 
<!--Content Start-->
 
<!--Content Start-->
 
<div align="left">
 
<div align="left">
 +
==<div style="background: #FFBC35; padding: 10px; font-family:Segoe UI; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 13px; border-left: #ff8a18 solid 30px;"><font color="black">Data Collection and Preparation</font></div>==
 +
<div style="border-left: #ffffff solid 12px; padding: 0px 30px 0px 13px;">
 +
===<div style="margin-top: 10px; text-align:left; font-size: 16px; font-weight: bold;">Merging and Cleaning</div>===
 +
GLC provided us with 3 datasets, namely Metadata, CRM, and Sales, collected over the 12 months of 2008. Metadata contains fields such as ‘Industry Description’ that help to interpret the industry codes used in the other files. The CRM dataset contains information about the accounts, including fields such as ‘New Ac Number’ and ‘Date Opened’. Finally, the Sales dataset contains 2.5 million rows of individual transactions that occurred within the 12 months of 2008. This file contains fields such as ‘Local Revenue’, ‘Destination’, ‘Origin’, ‘Billed Weight’ and ‘Sales Channel’.
 +
 +
Since we require variables from all three datasets, the first step we took was to merge them by joining the tables based on common fields such as ‘New Ac Number’ for the convenience of our subsequent analysis. Thereafter, we derived certain variables that, from our literature review, would be used in our later analysis, such as ‘Date of Last Transaction’ (i.e. Recency), ‘No. of Transactions’ per account (i.e. Frequency) and ‘Total 2008 Revenue’ per account (i.e. Monetary).
 +
 +
We then removed unnecessary and duplicate fields such as those that, for confidentiality purposes, were given an ‘XXX’ value for all rows.
 +
 +
===<div style="margin-top: 10px; text-align:left; font-size: 16px; font-weight: bold;">Missing Data</div>===
 +
Finally, we did a missing data analysis to examine how significant the missing data were and decided how to deal with them. A summary can be seen in the table below.
 +
 +
{| border="1" cellpadding="5" cellspacing="0"
 +
! style="width=100%; text-align: center; font-weight: bold; background: #FFA66B;" | Variable
 +
! style="width=100%; text-align: center; font-weight: bold; background: #FFA66B;" | Count (Missing)
 +
|-
 +
|| description of activities
 +
| 2424057 (95.42%)
 +
|-
 +
|| inbound_contract_code
 +
| 2099942 (82.66%)
 +
|-
 +
|| outbound_contract_code
 +
| 328756 (12.94%)
 +
|-
 +
|| industry_code_CRM
 +
| 265198 (10.44%)
 +
|-
 +
|| site_grouping
 +
| 214763 (8.45%)
 +
|-
 +
|| local revenue
 +
| 24970 (0.98%)
 +
|-
 +
|| zip_code
 +
| 3466 (0.136%)
 +
|}
 +
 +
 +
</div>
 +
 
==<div style="background: #FFBC35; padding: 10px; font-family:Segoe UI; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 13px; border-left: #ff8a18 solid 30px;"><font color="black">Exploratory Data Analysis</font></div>==
 
==<div style="background: #FFBC35; padding: 10px; font-family:Segoe UI; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 13px; border-left: #ff8a18 solid 30px;"><font color="black">Exploratory Data Analysis</font></div>==
 
<div style="border-left: #ffffff solid 12px; padding: 0px 30px 0px 13px;">
 
<div style="border-left: #ffffff solid 12px; padding: 0px 30px 0px 13px;">

Revision as of 23:43, 2 March 2015

HOME

 

PROJECT OVERVIEW

 

FINDINGS

 

PROJECT MANAGEMENT


Data Collection and Preparation

Merging and Cleaning

GLC provided us with 3 datasets, namely Metadata, CRM, and Sales, collected over the 12 months of 2008. Metadata contains fields such as ‘Industry Description’ that help to interpret the industry codes used in the other files. The CRM dataset contains information about the accounts, including fields such as ‘New Ac Number’ and ‘Date Opened’. Finally, the Sales dataset contains 2.5 million rows of individual transactions that occurred within the 12 months of 2008. This file contains fields such as ‘Local Revenue’, ‘Destination’, ‘Origin’, ‘Billed Weight’ and ‘Sales Channel’.

Since we require variables from all three datasets, the first step we took was to merge them by joining the tables based on common fields such as ‘New Ac Number’ for the convenience of our subsequent analysis. Thereafter, we derived certain variables that, from our literature review, would be used in our later analysis, such as ‘Date of Last Transaction’ (i.e. Recency), ‘No. of Transactions’ per account (i.e. Frequency) and ‘Total 2008 Revenue’ per account (i.e. Monetary).

We then removed unnecessary and duplicate fields such as those that, for confidentiality purposes, were given an ‘XXX’ value for all rows.

Missing Data

Finally, we did a missing data analysis to examine how significant the missing data were and decided how to deal with them. A summary can be seen in the table below.

Variable Count (Missing)
description of activities 2424057 (95.42%)
inbound_contract_code 2099942 (82.66%)
outbound_contract_code 328756 (12.94%)
industry_code_CRM 265198 (10.44%)
site_grouping 214763 (8.45%)
local revenue 24970 (0.98%)
zip_code 3466 (0.136%)


Exploratory Data Analysis