Difference between revisions of "Be Customer Wise or Otherwise - Findings"
Line 21: | Line 21: | ||
<!--Content Start--> | <!--Content Start--> | ||
<div align="left"> | <div align="left"> | ||
+ | ==<div style="background: #FFBC35; padding: 10px; font-family:Segoe UI; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 13px; border-left: #ff8a18 solid 30px;"><font color="black">Data Collection and Preparation</font></div>== | ||
+ | <div style="border-left: #ffffff solid 12px; padding: 0px 30px 0px 13px;"> | ||
+ | ===<div style="margin-top: 10px; text-align:left; font-size: 16px; font-weight: bold;">Merging and Cleaning</div>=== | ||
+ | GLC provided us with 3 datasets, namely Metadata, CRM, and Sales, collected over the 12 months of 2008. Metadata contains fields such as ‘Industry Description’ that help to interpret the industry codes used in the other files. The CRM dataset contains information about the accounts, including fields such as ‘New Ac Number’ and ‘Date Opened’. Finally, the Sales dataset contains 2.5 million rows of individual transactions that occurred within the 12 months of 2008. This file contains fields such as ‘Local Revenue’, ‘Destination’, ‘Origin’, ‘Billed Weight’ and ‘Sales Channel’. | ||
+ | |||
+ | Since we require variables from all three datasets, the first step we took was to merge them by joining the tables based on common fields such as ‘New Ac Number’ for the convenience of our subsequent analysis. Thereafter, we derived certain variables that, from our literature review, would be used in our later analysis, such as ‘Date of Last Transaction’ (i.e. Recency), ‘No. of Transactions’ per account (i.e. Frequency) and ‘Total 2008 Revenue’ per account (i.e. Monetary). | ||
+ | |||
+ | We then removed unnecessary and duplicate fields such as those that, for confidentiality purposes, were given an ‘XXX’ value for all rows. | ||
+ | |||
+ | ===<div style="margin-top: 10px; text-align:left; font-size: 16px; font-weight: bold;">Missing Data</div>=== | ||
+ | Finally, we did a missing data analysis to examine how significant the missing data were and decided how to deal with them. A summary can be seen in the table below. | ||
+ | |||
+ | {| border="1" cellpadding="5" cellspacing="0" | ||
+ | ! style="width=100%; text-align: center; font-weight: bold; background: #FFA66B;" | Variable | ||
+ | ! style="width=100%; text-align: center; font-weight: bold; background: #FFA66B;" | Count (Missing) | ||
+ | |- | ||
+ | || description of activities | ||
+ | | 2424057 (95.42%) | ||
+ | |- | ||
+ | || inbound_contract_code | ||
+ | | 2099942 (82.66%) | ||
+ | |- | ||
+ | || outbound_contract_code | ||
+ | | 328756 (12.94%) | ||
+ | |- | ||
+ | || industry_code_CRM | ||
+ | | 265198 (10.44%) | ||
+ | |- | ||
+ | || site_grouping | ||
+ | | 214763 (8.45%) | ||
+ | |- | ||
+ | || local revenue | ||
+ | | 24970 (0.98%) | ||
+ | |- | ||
+ | || zip_code | ||
+ | | 3466 (0.136%) | ||
+ | |} | ||
+ | |||
+ | |||
+ | </div> | ||
+ | |||
==<div style="background: #FFBC35; padding: 10px; font-family:Segoe UI; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 13px; border-left: #ff8a18 solid 30px;"><font color="black">Exploratory Data Analysis</font></div>== | ==<div style="background: #FFBC35; padding: 10px; font-family:Segoe UI; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 13px; border-left: #ff8a18 solid 30px;"><font color="black">Exploratory Data Analysis</font></div>== | ||
<div style="border-left: #ffffff solid 12px; padding: 0px 30px 0px 13px;"> | <div style="border-left: #ffffff solid 12px; padding: 0px 30px 0px 13px;"> |
Revision as of 23:43, 2 March 2015
Contents
Data Collection and Preparation
Merging and Cleaning
GLC provided us with 3 datasets, namely Metadata, CRM, and Sales, collected over the 12 months of 2008. Metadata contains fields such as ‘Industry Description’ that help to interpret the industry codes used in the other files. The CRM dataset contains information about the accounts, including fields such as ‘New Ac Number’ and ‘Date Opened’. Finally, the Sales dataset contains 2.5 million rows of individual transactions that occurred within the 12 months of 2008. This file contains fields such as ‘Local Revenue’, ‘Destination’, ‘Origin’, ‘Billed Weight’ and ‘Sales Channel’.
Since we require variables from all three datasets, the first step we took was to merge them by joining the tables based on common fields such as ‘New Ac Number’ for the convenience of our subsequent analysis. Thereafter, we derived certain variables that, from our literature review, would be used in our later analysis, such as ‘Date of Last Transaction’ (i.e. Recency), ‘No. of Transactions’ per account (i.e. Frequency) and ‘Total 2008 Revenue’ per account (i.e. Monetary).
We then removed unnecessary and duplicate fields such as those that, for confidentiality purposes, were given an ‘XXX’ value for all rows.
Missing Data
Finally, we did a missing data analysis to examine how significant the missing data were and decided how to deal with them. A summary can be seen in the table below.
Variable | Count (Missing) |
---|---|
description of activities | 2424057 (95.42%) |
inbound_contract_code | 2099942 (82.66%) |
outbound_contract_code | 328756 (12.94%) |
industry_code_CRM | 265198 (10.44%) |
site_grouping | 214763 (8.45%) |
local revenue | 24970 (0.98%) |
zip_code | 3466 (0.136%) |