Be Customer Wise or Otherwise - Findings

From Analytics Practicum

Data Collection and Preparation

Merging and Cleaning

GLC provided us with three datasets, namely Metadata, CRM and Sales, covering the 12 months of 2008. Metadata contains fields such as ‘Industry Description’ that help to interpret the industry codes used in the other files. The CRM dataset contains information about the accounts, including fields such as ‘New Ac Number’ and ‘Date Opened’. Finally, the Sales dataset contains 2.5 million rows of individual transactions from the same period, with fields such as ‘Local Revenue’, ‘Destination’, ‘Origin’, ‘Billed Weight’ and ‘Sales Channel’.

Since we require variables from all three datasets, our first step was to merge them by joining the tables on common fields such as ‘New Ac Number’. We then derived the variables that our literature review identified as necessary for the later analysis: ‘Date of Last Transaction’ (Recency), ‘No. of Transactions’ per account (Frequency) and ‘Total 2008 Revenue’ per account (Monetary).
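As an illustration of the merge and the derived Recency, Frequency and Monetary variables, here is a minimal sketch in standard-library Python. It is not the tooling the team used. The snake_case field names (`new_ac_number`, `txn_date`, `local_revenue`) and the sample rows are assumptions made up for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical miniature stand-ins for the CRM and Sales tables.
crm = [
    {"new_ac_number": "A1", "date_opened": date(2005, 3, 1)},
    {"new_ac_number": "A2", "date_opened": date(2007, 9, 15)},
]
sales = [
    {"new_ac_number": "A1", "txn_date": date(2008, 1, 10), "local_revenue": 120.0},
    {"new_ac_number": "A1", "txn_date": date(2008, 6, 2), "local_revenue": 80.0},
    {"new_ac_number": "A2", "txn_date": date(2008, 11, 20), "local_revenue": 300.0},
]


def derive_rfm(crm_rows, sales_rows, key="new_ac_number"):
    """Join each CRM account to its transactions and derive R, F and M."""
    # Group the transactions by account number (the common join field).
    by_account = defaultdict(list)
    for row in sales_rows:
        by_account[row[key]].append(row)

    result = {}
    for account in crm_rows:
        txns = by_account.get(account[key], [])
        result[account[key]] = {
            **account,
            # Recency: date of the most recent transaction (None if no sales).
            "date_of_last_transaction": max(
                (t["txn_date"] for t in txns), default=None
            ),
            # Frequency: number of transactions in the period.
            "no_of_transactions": len(txns),
            # Monetary: total revenue in the period.
            "total_2008_revenue": sum(t["local_revenue"] for t in txns),
        }
    return result


rfm = derive_rfm(crm, sales)
```

Keeping every CRM account in the result (a left join) means accounts without any sales remain visible, with a frequency of zero and no last transaction date.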

The team also removed unnecessary and duplicate fields, such as those that had been given an ‘XXX’ value in every row for confidentiality purposes.
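Dropping the masked fields can be sketched as follows, assuming rows are held as dictionaries. The function name `drop_masked_fields` and the sample field names are hypothetical.

```python
def drop_masked_fields(rows, mask="XXX"):
    """Remove every field whose value equals `mask` in all rows."""
    if not rows:
        return []
    # A field counts as masked only if every single row carries the mask value.
    masked = {
        field for field in rows[0]
        if all(row.get(field) == mask for row in rows)
    }
    return [
        {k: v for k, v in row.items() if k not in masked}
        for row in rows
    ]


# Hypothetical sample: 'customer_name' is fully masked, 'origin' is not.
sample = [
    {"new_ac_number": "A1", "customer_name": "XXX", "origin": "SIN"},
    {"new_ac_number": "A2", "customer_name": "XXX", "origin": "XXX"},
]
cleaned = drop_masked_fields(sample)
```

A field that is masked in only some rows, such as `origin` here, is kept, since it may still carry usable information.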

Missing Data

Finally, we ran a missing data analysis to gauge how much data was missing and to decide how to deal with it. The results are summarised in the table below:

Variable                     Count (Missing)   % Missing
description of activities    2424057           95.42%
inbound_contract_code        2099942           82.66%
outbound_contract_code       328756            12.94%
industry_code_CRM            265198            10.44%
site_grouping                214763            8.45%
local revenue                24970             0.98%
zip_code                     3466              0.14%
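A summary of this kind can be produced with a short routine like the one below. This is only a sketch: it assumes that a missing value is stored as `None` or an empty string, and the function name `missing_summary` and the sample rows are hypothetical.

```python
def missing_summary(rows, fields):
    """Return (field, missing count, missing %) tuples, most missing first."""
    total = len(rows)
    summary = []
    for field in fields:
        # Treat None and the empty string as missing (an assumption).
        missing = sum(1 for row in rows if row.get(field) in (None, ""))
        pct = round(100.0 * missing / total, 2) if total else 0.0
        summary.append((field, missing, pct))
    return sorted(summary, key=lambda item: item[1], reverse=True)


# Hypothetical sample of four merged rows.
merged = [
    {"zip_code": "018956", "local_revenue": 120.0},
    {"zip_code": "", "local_revenue": 80.0},
    {"zip_code": "238859", "local_revenue": None},
    {"zip_code": "049315", "local_revenue": None},
]
report = missing_summary(merged, ["zip_code", "local_revenue"])
```

Sorting by the missing count puts the most sparsely populated fields at the top, which matches the ordering of the table above.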


With the exception of ‘local revenue’, none of these fields were used in our analysis, so their missing values did not affect our subsequent findings. The missing values in ‘local revenue’ come from accounts that are recorded in the CRM dataset but have no sales transactions in 2008 in the Sales dataset. After the merge, these accounts had no transaction data and therefore showed up in the missing data analysis. They are excluded from our subsequent analysis and will be reported separately to our sponsor for further action (e.g. a review of the CRM dataset).
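The accounts that appear in the CRM dataset but have no 2008 transactions can be listed with a simple anti-join. The sketch below assumes both tables share an account key, here given the hypothetical name `new_ac_number`.

```python
def accounts_without_sales(crm_rows, sales_rows, key="new_ac_number"):
    """List CRM account numbers that never appear in the sales rows."""
    # Collect the account numbers that have at least one transaction.
    active = {row[key] for row in sales_rows}
    return [row[key] for row in crm_rows if row[key] not in active]


# Hypothetical sample: account A3 exists in CRM but has no sales.
crm_accounts = [
    {"new_ac_number": "A1"},
    {"new_ac_number": "A2"},
    {"new_ac_number": "A3"},
]
sales_rows = [
    {"new_ac_number": "A1", "local_revenue": 120.0},
    {"new_ac_number": "A2", "local_revenue": 300.0},
    {"new_ac_number": "A1", "local_revenue": 80.0},
]
inactive = accounts_without_sales(crm_accounts, sales_rows)
```

The resulting list is what would be set aside from the analysis and passed to the sponsor for review.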

Feedback for Missing Data

In addition, given the large amount of missing data, we would like to ask management why these fields are so sparsely populated and whether this indicates problems with the data collection process. Too much missing data is not ideal, because the affected fields cannot be used to identify and analyse potential trends and insights. Refining the data collection system would also allow higher-quality analysis in the future.

Exploratory Data Analysis