Difference between revisions of "ANLY482 AY2016-17 T2 Group 2 Project Overview Methodology"

From Analytics Practicum

Revision as of 16:27, 8 January 2017




Tools Used

Based on the client's requirements for the project, the programming language we will be using is Python. Python has a mature and growing ecosystem of open-source tools for mathematics and data analysis, and we will use Jupyter Notebook as our development environment. Some of the libraries that we will be exploring are:

  • Python Natural Language Toolkit (NLTK)
  • Scikit-learn
  • TensorFlow
  • pandas

Methodology

Data Collection

Kaisou will provide us with three datasets: musical data, concert data and customer profile data. The datasets consist of transaction records from both the phone booking and internet booking channels, as well as customer detail records. Apart from the data provided, we will also look into collecting external data that may affect our analysis, such as the dates of public holidays.
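As a minimal sketch of how external public-holiday data could be joined onto the transaction records, the snippet below tags each transaction with a holiday flag. The field names and holiday dates are illustrative assumptions, not Kaisou's actual schema:

```python
from datetime import date

# Hypothetical public-holiday dates collected from an external source
# (values are illustrative only).
PUBLIC_HOLIDAYS = {date(2016, 1, 1), date(2016, 2, 8), date(2016, 2, 9)}

# Hypothetical transaction records; field names are assumptions.
transactions = [
    {"txn_id": "T001", "channel": "phone",    "date": date(2016, 1, 1)},
    {"txn_id": "T002", "channel": "internet", "date": date(2016, 1, 4)},
]

def tag_public_holiday(txns, holidays):
    """Annotate each transaction with whether it fell on a public holiday."""
    return [dict(t, on_public_holiday=t["date"] in holidays) for t in txns]

tagged = tag_public_holiday(transactions, PUBLIC_HOLIDAYS)
print([t["on_public_holiday"] for t in tagged])  # -> [True, False]
```

The same flag could later serve as a grouping variable in EDA, for example to compare transaction volumes on holidays against ordinary days.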

Exploratory Data Analysis (EDA)

In the initial stage of this project, we will examine the datasets to gain a better understanding of their various aspects. This will also help us in the next stage, data preparation, by identifying outliers and anomalies, and by spotting fields that require normalization or transformation where they are inconsistent. We will also use EDA to identify important variables for subsequent steps such as correlation analysis.
Some of the analyses we will look at are the frequency of transactions per account holder in relation to the different bet types, as well as the most popular transaction times, transaction types and transaction amounts.
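The frequency counts described above can be sketched with the standard library alone; the record fields and bet types below are hypothetical placeholders for the real schema:

```python
from collections import Counter

# Illustrative transaction records; field names and bet types are
# assumptions, not the actual layout of Kaisou's datasets.
transactions = [
    {"account": "A1", "bet_type": "4D",   "hour": 20, "amount": 10.0},
    {"account": "A1", "bet_type": "TOTO", "hour": 20, "amount": 5.0},
    {"account": "A2", "bet_type": "4D",   "hour": 12, "amount": 8.0},
]

# Frequency of transactions per bet type
by_type = Counter(t["bet_type"] for t in transactions)

# Most popular transaction hour
by_hour = Counter(t["hour"] for t in transactions)

print(by_type.most_common())   # -> [('4D', 2), ('TOTO', 1)]
print(by_hour.most_common(1))  # -> [(20, 2)]
```

On the full datasets we would run the same counts with pandas (one of the libraries listed above) so the results can be sliced by account holder and plotted directly.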

Data Preparation

Before performing any further data analysis, the first step is to prepare the data. We will clean the data to handle outliers and missing values, and perform data normalization and transformation on the given datasets.
For outliers, we will first determine whether the values are due to human or system error. If so, we can safely remove those transactions from our analysis; otherwise, we will analyse these outlier values separately.
For missing values, we will determine how many there are. If the number is significant, we will use prediction techniques to impute these values from the rest of the data set; otherwise, we will remove these transactions from our analysis so that they do not affect our findings.
Lastly, we will perform data normalization and transformation. Some fields in the phone purchasing dataset and the internet purchasing dataset have different scales and values even though they represent the same information. Also, due to system changes in Kaisou's IT infrastructure, there are differences in how the data is stored and named. We will therefore normalize and transform the data to ensure that values are consistent across both datasets before any analysis.
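As a sketch of two of the preparation steps above, the snippet flags candidate outliers with a z-score rule and rescales a field with min-max normalization so that the phone and internet datasets share a common range. The cutoff of 1.5 is an illustrative assumption chosen for this tiny sample; on the real data a stricter threshold would be reviewed case by case:

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=1.5):
    """Flag values whose z-score exceeds the threshold, for manual review.

    The threshold here is illustrative; real data would warrant a larger
    cutoff agreed with the client.
    """
    m, s = mean(values), stdev(values)
    return [abs(v - m) / s > threshold for v in values]

def min_max(values):
    """Rescale values to the 0-1 range, reconciling differently scaled fields."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

amounts = [10.0, 12.0, 11.0, 500.0, 9.0]
print(flag_outliers(amounts))       # -> [False, False, False, True, False]
print(min_max([0.0, 5.0, 10.0]))    # -> [0.0, 0.5, 1.0]
```

Flagged transactions would then be checked against the error criteria above before we decide whether to drop them or analyse them separately.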

Association Rule Mining

Association rule mining is a rule-based method for discovering interesting relations between variables in a dataset. We will analyse the betting transactions to determine betting patterns, known as rules, between customers and the different betting channels. Kaisou can then use these rules as the basis for marketing strategies for their products.
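The two core measures behind such rules, support and confidence, can be sketched in a few lines. The baskets below are toy data with hypothetical channel and product labels; on the real datasets we would likely use a library implementation such as mlxtend's apriori rather than exhaustive counting:

```python
# Toy baskets: each customer's set of (channel:product) purchases.
# Channel and product names are illustrative only.
baskets = [
    {"phone:4D", "phone:TOTO"},
    {"phone:4D", "internet:TOTO"},
    {"phone:4D", "phone:TOTO"},
    {"internet:TOTO"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """P(consequent | antecedent): support of the union over support of the antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print(support({"phone:4D"}, baskets))                     # -> 0.75
print(confidence({"phone:4D"}, {"phone:TOTO"}, baskets))  # two thirds
```

A rule such as "phone:4D implies phone:TOTO" would only be reported if both its support and confidence clear thresholds agreed with the client.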

Correlation Analysis

We will perform correlation analysis to observe how the various variables identified during EDA interact with the bet amount. From the correlation coefficients, we will be able to determine the strength of these relationships and establish whether they correspond to the betting patterns observed on both betting channels.
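The Pearson correlation coefficient underlying this analysis can be computed directly; the two variables paired below (bet amount against account tenure in days) are purely hypothetical examples of the EDA-identified variables:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical pairing: bet amount vs. days since account creation.
bet_amounts = [10, 20, 30, 40]
tenure_days = [100, 180, 260, 400]
print(round(pearson(bet_amounts, tenure_days), 3))  # -> 0.989
```

In practice we would compute a full correlation matrix (e.g. with pandas' corr) and focus on the coefficients involving bet amount, comparing them between the phone and internet channels.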

Dashboard

Following the analysis, we will build a dashboard to aid in the visualization of our findings. The dashboard will showcase the important variables and their interactions with customer purchasing behaviour, giving the customer engagement teams an easy way to explore and understand the specific behaviours of their customers.
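A dashboard chart is ultimately driven by an aggregated summary table; as a sketch under assumed field names, the snippet below totals bet amounts per channel and bet type, the kind of input a bar chart widget would consume:

```python
from collections import defaultdict

# Illustrative transactions; field names are assumptions, not the real schema.
transactions = [
    {"channel": "phone",    "bet_type": "4D",   "amount": 10.0},
    {"channel": "phone",    "bet_type": "4D",   "amount": 15.0},
    {"channel": "internet", "bet_type": "TOTO", "amount": 5.0},
]

# Total amount per (channel, bet type) pair.
totals = defaultdict(float)
for t in transactions:
    totals[(t["channel"], t["bet_type"])] += t["amount"]

for (channel, bet_type), amount in sorted(totals.items()):
    print(f"{channel:10} {bet_type:6} {amount:8.2f}")
```

The same aggregation, filtered interactively by date range or customer segment, is what the dashboard views would expose to the customer engagement teams.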

Recommendations & Insights

From our analysis and the dashboard, we seek to help Kaisou understand the characteristics of their customers. We will propose business strategies and recommendations based on the insights we have uncovered.