Difference between revisions of "ANLY482 AY2016-17 T2 Group10 Project Overview: Data"

Revision as of 08:11, 14 January 2017

Data Collection

The data appears to consist of 12 Excel files, each containing one or more sheets with multiple columns, so the dataset is high in dimensionality. Metadata is not yet available, but from the column headers and our conversation with the sponsor, we have an idea of which columns will be most relevant to us. These include sales information, results of the sales staff, and data on the methods used by the salespeople. The sponsor has promised us these data.

We will need to determine which columns to focus our analysis on. We will do this in conversations with our sponsor as we seek to understand the data. Once we understand the metadata, we will be able to extract the sales and other relevant data to begin exploratory data analysis. We will select only a portion of the data for two reasons: the high dimensionality would strain computer hardware and slow the analysis, and much of the data falls outside the scope of our project, which focuses on sales methods and results.
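
As a rough illustration of narrowing a wide sheet to the in-scope columns, the sketch below assumes each sheet has already been loaded as a list of rows keyed by column header. The column names (`rep_id`, `sales_amount`, `call_type`, `office_floor`) and the shortlist are hypothetical, since the metadata is not yet available and the real shortlist would be agreed with the sponsor.

```python
def select_columns(rows, keep):
    """Return new rows containing only the columns listed in `keep`.

    A column that is absent from a row is filled with None so that every
    output row has the same shape.
    """
    return [{col: row.get(col) for col in keep} for row in rows]


# Hypothetical rows standing in for one sheet of one workbook.
rows = [
    {"rep_id": "R01", "sales_amount": 1200.0, "call_type": "face-to-face", "office_floor": 3},
    {"rep_id": "R02", "sales_amount": 950.0, "call_type": "phone", "office_floor": 5},
]

# Hypothetical shortlist of in-scope columns.
in_scope = ["rep_id", "sales_amount", "call_type"]
subset = select_columns(rows, in_scope)
```

Working on the reduced rows keeps memory use proportional to the columns actually analysed rather than to the full width of each sheet.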

Data Preparation

We will need to clean the data, exploring it iteratively to identify anomalous patterns that we can then eliminate. For example, many different versions of a record may all refer to the same thing: “GSK”, “GlaxoSmithKline”, and “GlaxoSmithKline plc” all refer to the same entity.
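
One simple way to collapse such variants is a lookup table applied after basic text normalisation. The sketch below is only an illustration: the alias table holds the three variants named above, the choice of "GlaxoSmithKline" as the canonical form is an assumption, and a real table would be built up as further variants are found in the data.

```python
import re

# Hypothetical alias table; the canonical spelling is an assumption.
ALIASES = {
    "gsk": "GlaxoSmithKline",
    "glaxosmithkline": "GlaxoSmithKline",
    "glaxosmithkline plc": "GlaxoSmithKline",
}


def canonical_name(raw):
    """Map a raw entity string to its canonical form if a known alias matches."""
    # Trim, lowercase, and collapse runs of whitespace before the lookup.
    key = re.sub(r"\s+", " ", raw.strip().lower())
    # Ignore a trailing period, as in "plc."
    key = key.rstrip(".")
    # Unknown names are returned trimmed but otherwise unchanged.
    return ALIASES.get(key, raw.strip())
```

Names that do not match any alias pass through untouched, so they can be reviewed and added to the table in a later cleaning iteration.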

Missing values will also be handled at this stage. The exact approach will be determined once we have examined the data, based on factors such as which data are missing and in what proportion. We may omit rows with missing data from our analysis, or we may try to interpolate the missing values.
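
The two options mentioned above, along with the proportion check that would inform the choice, can be sketched as follows. This is a minimal illustration that assumes missing entries are represented as `None` and that the values are numeric and evenly spaced; the function names are hypothetical, and linear interpolation is just one possible fill method.

```python
def missing_proportion(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def drop_missing(rows, column):
    """Keep only the rows that have a value in `column`."""
    return [row for row in rows if row.get(column) is not None]


def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest known values.

    Leading and trailing gaps are left as None because they have a known
    value on only one side.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled


# Hypothetical monthly sales figures with gaps.
monthly_sales = [10.0, None, None, 40.0, None]
filled_sales = interpolate_linear(monthly_sales)
```

Interpolation only makes sense for ordered numeric series; for categorical columns or unordered records, dropping rows or another strategy would be more appropriate.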

Exploratory Data Analysis

A descriptive analytics dashboard will be created in JMP Pro. We will seek to uncover patterns and anomalies, using scatter plots and histograms to identify trends. For example, if we find that certain teams have very few face-to-face interactions with customers, they may require more confidence training, or the clients they have been assigned may be less receptive to face-to-face meetings. Any assumptions that we hold, whether from our own preconceived notions or passed to us by GSK, will also be tested in this phase.
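
The face-to-face example can be expressed as a simple per-team summary. The sketch below is written in Python purely for illustration (the dashboard itself is planned in JMP Pro); the field names `team` and `channel`, the sample records, and the 25% threshold are all assumptions rather than properties of the actual data.

```python
from collections import defaultdict


def face_to_face_share(interactions):
    """Return {team: fraction of that team's interactions that were face-to-face}."""
    totals = defaultdict(int)
    in_person = defaultdict(int)
    for record in interactions:
        totals[record["team"]] += 1
        if record["channel"] == "face-to-face":
            in_person[record["team"]] += 1
    return {team: in_person[team] / totals[team] for team in totals}


def flag_low_teams(shares, threshold=0.25):
    """List teams whose face-to-face share falls below `threshold`, sorted by name."""
    return sorted(team for team, share in shares.items() if share < threshold)


# Hypothetical interaction log.
interactions = [
    {"team": "A", "channel": "face-to-face"},
    {"team": "A", "channel": "face-to-face"},
    {"team": "A", "channel": "phone"},
    {"team": "A", "channel": "face-to-face"},
    {"team": "B", "channel": "phone"},
    {"team": "B", "channel": "email"},
]

shares = face_to_face_share(interactions)
low_teams = flag_low_teams(shares)
```

A flagged team is only a starting point for investigation, since a low share could reflect either the team's approach or the preferences of its assigned clients.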

