Difference between revisions of "ANLY482 AY2016-17 T2 Group 2 Project Overview Methodology"

Latest revision as of 20:09, 19 February 2017


Tools Used

For data preparation and EDA, the team chose JMP because its members are already familiar with it. To facilitate future extension of the project, the client asked us to deliver the final outcome in the R programming language. R has a mature and growing ecosystem of open-source tools for mathematics and data analysis.

Methodology

Data Collection

Kaiso provided us with transaction records on musical and concert data. These records consist of data from both phone booking and internet booking channels. Apart from these transaction datasets, Kaiso also provided us with customer demographics data and sports matches data. In total, we have obtained 9 datasets from them:

Lottery transaction data (lottery15.csv, lotteryAug-Oct.csv, lotteryRB.csv)
Sports transaction data (sports15.csv, sportsAug-Oct.csv, sportsRB.csv)
Sports matches data (Matches_Master.csv, League name.xlsx)
Customer demographics data (data_cst.xlsx)

Literature Review

To gain more domain knowledge, we will read research papers, articles and news related to our topic area, ticketing analytics. We will focus our reading on online ticketing because it forms the basis of our analysis. This reading will also give us the theoretical grounding needed to conduct the analyses.
In this project, we will be conducting comparison analysis on the datasets. We will therefore also explore papers on “cross-sectional analysis” and “longitudinal analysis” to deepen our understanding of these two subjects.

Data Preparation

Before performing any further data analysis, the first step is to prepare the data. We will clean the data to handle outliers and missing values. In addition, we will perform data normalization and transformation on the given dataset.
For outliers, we will first determine whether the values are due to human or system error. If they are, we can safely remove those transactions from our analysis. Otherwise, we will analyse the outlier values separately.
For missing values, we will first count how many values are missing. If the number is significant, we will use prediction techniques to estimate these values from the rest of the dataset. Otherwise, we will remove the affected transactions from our analysis so that they do not distort our findings.
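
The missing-value rule above can be sketched as a simple threshold check. This is only an illustration in Python (the project itself uses JMP and R); the 5% cutoff and the function names are assumptions, since the text does not define what counts as "significant".

```python
from __future__ import annotations

from typing import Optional

# Hypothetical cutoff: the source does not say what share counts as "significant".
SIGNIFICANT_SHARE = 0.05


def missing_value_strategy(values: list[Optional[float]],
                           threshold: float = SIGNIFICANT_SHARE) -> str:
    """Return 'predict' when the share of missing values exceeds the threshold, else 'drop'."""
    if not values:
        return "drop"
    missing_share = sum(v is None for v in values) / len(values)
    return "predict" if missing_share > threshold else "drop"


def drop_missing(rows: list[dict], field: str) -> list[dict]:
    """Keep only the rows in which the given field is present."""
    return [row for row in rows if row.get(field) is not None]


# Illustrative values only, not taken from the Kaiso datasets.
many_missing = [1.0, None, 3.0, 4.0]   # 25% missing
few_missing = [1.0] * 99 + [None]      # 1% missing
rows = [{"amount": 10.0}, {"amount": None}]
cleaned = drop_missing(rows, "amount")
```

With these sample values the first column would be routed to prediction and the second would simply have its incomplete rows dropped.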
Lastly, we will perform data normalization and transformation. Some fields in the phone purchasing dataset and the internet purchasing dataset use different scales and values even though they represent the same information. Also, due to system changes in Kaiso's IT infrastructure, there are some differences in the way the data is stored and named. We will therefore normalize and transform the data so that values are consistent across both datasets before we perform any analysis.
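
As a rough illustration of this harmonization step, the Python sketch below renames fields to shared names and puts amounts on one scale. The column names, the cents-versus-dollars difference and the helper name are all hypothetical; the actual Kaiso schemas are not described here.

```python
# Hypothetical column names and units; the real schemas are not given in the source.
PHONE_FIELDS = {"CUST_ID": "customer_id", "AMT_CENTS": "amount"}
INTERNET_FIELDS = {"customerId": "customer_id", "amount_dollars": "amount"}


def harmonize(record, field_map, amount_divisor, channel):
    """Rename fields to the shared names and express the amount in dollars."""
    unified = {field_map[name]: value
               for name, value in record.items() if name in field_map}
    unified["amount"] = float(unified["amount"]) / amount_divisor
    unified["channel"] = channel
    return unified


# Illustrative rows only: the same purchase as each channel might store it.
phone_row = harmonize({"CUST_ID": "A001", "AMT_CENTS": "1250"},
                      PHONE_FIELDS, 100, "phone")
internet_row = harmonize({"customerId": "A001", "amount_dollars": "12.5"},
                         INTERNET_FIELDS, 1, "internet")
```

After harmonization both rows carry the same field names and the same amount, so they can be compared directly across channels.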

Exploratory Data Analysis (EDA)

In the initial stage of this project, we will examine the datasets to understand their various aspects. We will then perform comparison studies between the datasets to identify any behavioral differences among the customers. We will carry out two studies: a cross-sectional analysis and a longitudinal analysis. For both, the comparisons we will look at include the frequency of transactions per account holder for each ticketing type, the most popular time of transaction, the type of transaction and the amount per transaction.
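
The comparison measures listed above could be computed along the following lines. This is a minimal Python sketch for illustration only; the record fields (account, ticket_type, hour, amount) are assumed names, not the actual dataset columns.

```python
from collections import Counter
from statistics import mean


def summarize(transactions):
    """Return transaction counts per (account, ticket type), counts per hour, and mean amount."""
    per_account_type = Counter((t["account"], t["ticket_type"]) for t in transactions)
    per_hour = Counter(t["hour"] for t in transactions)
    avg_amount = mean(t["amount"] for t in transactions)  # raises on an empty list
    return per_account_type, per_hour, avg_amount


# Illustrative records only.
sample = [
    {"account": "A001", "ticket_type": "sports", "hour": 20, "amount": 10.0},
    {"account": "A001", "ticket_type": "sports", "hour": 20, "amount": 20.0},
    {"account": "B002", "ticket_type": "lottery", "hour": 9, "amount": 30.0},
]
per_account_type, per_hour, avg_amount = summarize(sample)
busiest_hour = per_hour.most_common(1)[0][0]
```

Running the same summary on each group being compared gives figures that can be set side by side.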

Cross-sectional Analysis
In this analysis, we will compare customers over the same time period in 2015 and 2016. We have 2 months of data for 2016 and will subset the 2015 dataset to contain only records from the same period.
Using the same time period for both years removes the effect of any seasonal fluctuations that exist in the datasets.
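
One way to do this subsetting is to take the calendar span covered by the 2016 records and keep only the 2015 records that fall inside it. The Python sketch below is illustrative; the dates are made up, the field name is an assumption, and it assumes the 2016 window does not cross a year boundary.

```python
from datetime import date


def same_period_previous_year(records_2016, records_2015):
    """Keep 2015 records whose (month, day) lies within the span covered by the 2016 records."""
    start = min(r["date"] for r in records_2016)
    end = max(r["date"] for r in records_2016)
    lo, hi = (start.month, start.day), (end.month, end.day)
    # Comparing (month, day) pairs avoids building invalid dates such as 29 Feb 2015.
    return [r for r in records_2015
            if lo <= (r["date"].month, r["date"].day) <= hi]


# Illustrative dates only, not the actual coverage of the Kaiso files.
records_2016 = [{"date": date(2016, 8, 15)}, {"date": date(2016, 10, 14)}]
records_2015 = [
    {"date": date(2015, 7, 1)},
    {"date": date(2015, 9, 1)},
    {"date": date(2015, 10, 14)},
    {"date": date(2015, 12, 25)},
]
matched_2015 = same_period_previous_year(records_2016, records_2015)
```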

Longitudinal Analysis
For this analysis, we will examine how the behavior of old customers, who bought tickets both before and after the launch of the online ticketing channel, has changed. The analysis aims to answer whether customers' purchasing behaviour changed after the launch. Hence, we will filter the data records to keep only old customers (customers who registered before the launch).
We will compare data from the two months before and the two months after the launch of the online ticketing site.
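
The filtering and before/after split can be sketched as follows, again in Python purely for illustration. The launch date is a placeholder, "two months" is approximated as 60 days, and the field names are assumptions.

```python
from datetime import date, timedelta

LAUNCH = date(2016, 8, 1)      # hypothetical launch date, not stated in the source
WINDOW = timedelta(days=60)    # assumed approximation of "two months"


def split_before_after(transactions, registrations, launch=LAUNCH, window=WINDOW):
    """Return (before, after) transactions made by customers registered before the launch."""
    old_customers = {cid for cid, registered in registrations.items() if registered < launch}
    before = [t for t in transactions
              if t["customer_id"] in old_customers
              and launch - window <= t["date"] < launch]
    after = [t for t in transactions
             if t["customer_id"] in old_customers
             and launch <= t["date"] < launch + window]
    return before, after


# Illustrative data only.
registrations = {"A001": date(2015, 3, 1), "B002": date(2016, 9, 1)}
transactions = [
    {"customer_id": "A001", "date": date(2016, 7, 10)},  # old customer, before launch
    {"customer_id": "A001", "date": date(2016, 8, 20)},  # old customer, after launch
    {"customer_id": "A001", "date": date(2016, 1, 5)},   # outside the window
    {"customer_id": "B002", "date": date(2016, 9, 5)},   # registered after launch
]
before, after = split_before_after(transactions, registrations)
```

The two resulting groups can then be compared on the same measures used elsewhere in the EDA.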

Dashboard

Following the analysis, we will build an analytical dashboard to visualize our findings. The dashboard will display the key variables in the data and how they affect customer purchasing behaviour. The customer engagement teams will be able to use the dashboard to view and better understand the differences in customer behaviour before and after the launch of the new system.
The dashboard will use a framework that allows Kaiso to refresh it by uploading each new dataset as it becomes available. Design, statistics and visualization will be our main considerations when building the dashboard, so that users can easily find the differences they are looking for.

Recommendations & Insights

From our analysis and dashboard, we seek to assist Kaiso in understanding the characteristics of their customers. We will be proposing business strategies and recommendations to them based on the insights that we have uncovered.

@@ Line 3: / Line 3: @@
 <!--Header Start-->
 {|style="background-color:#6A8D9D; color: #F5F5F5; padding: 10 0 10 0;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0" |
-| style="padding:0.3em; font-size:100%; background-color:#6A8D9D; text-align:center; color: #F5F5F5" width="10%" |
+| style="padding:0.3em; font-size:100%; background-color:#466675; text-align:center; color: #F5F5F5" width="10%" |
-[[ANLY482 AY2016-17 T2 Group 2|
+[[ANLY482_AY2016-17_T2_Group_2|
 <font color="#F5F5F5" size=2><b>HOME</b></font>]]
-| style="background:none;" width="1%" | &nbsp;
-| style="padding:0.3em; font-size:100%; background-color:#466675; text-align:center; color:#F5F5F5" width="10%" |
-[[ANLY482 AY2016-17 T2 Group 2 Project Overview|
-<font color="#F5F5F5" size=2><b>PROJECT OVERVIEW</b></font>]]
 | style="background:none;" width="1%" | &nbsp;
 | style="padding:0.3em; font-size:100%; background-color:#6A8D9D; text-align:center; color:#F5F5F5" width="10%" |
-[[ANLY482 AY2016-17 T2 Group 2 Findings|
+[[ANLY482_AY2016-17_T2_Group_2 Findings|
 <font color="#F5F5F5" size=2><b>FINDINGS</b></font>]]
 | style="background:none;" width="1%" | &nbsp;
 | style="padding:0.3em; font-size:100%; background-color:#6A8D9D; text-align:center; color:#F5F5F5" width="10%" |
-[[ANLY482 AY2016-17 T2 Group 2 Project Documentation|
+[[ANLY482_AY2016-17_T2_Group_2 Project Documentation|
 <font color="#F5F5F5" size=2><b>PROJECT DOCUMENTATION</b></font>]]
 | style="background:none;" width="1%" | &nbsp;
 | style="padding:0.3em; font-size:100%; background-color:#6A8D9D; text-align:center; color:#F5F5F5" width="10%" |
-[[ANLY482 AY2016-17 T2 Group 2 Project Management|
+[[ANLY482_AY2016-17_T2_Group_2 Project Management|
 <font color="#F5F5F5" size=2><b>PROJECT MANAGEMENT</b></font>]]
+| style="background:none;" width="1%" | &nbsp;
+| style="padding:0.3em; font-size:100%; background-color:#6A8D9D; text-align:center; color:#F5F5F5" width="10%" |
+[[ANLY482_AY2016-17_T2_Group_2 About Us|
+<font color="#F5F5F5" size=2><b>ABOUT US</b></font>]]
+| style="background:none;" width="1%" | &nbsp;
+| style="padding:0.3em; font-size:100%; background-color:#6A8D9D; text-align:center; color:#F5F5F5" width="10%" |
+[[Main_Page|
+<font color="#F5F5F5" size=2><b>ANLY482 HOMEPAGE</b></font>]]
 |}
 <!--Header End-->
@@ Line 47: / Line 52: @@
 ==<div style="background: #6A8D9D; line-height: 0.3em; font-family:helvetica;  border-left: #466675 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF"><strong>Tools Used</strong></font></div></div>==
 <div style="margin:20px; padding: 10px; background: #ffffff; text-align:left; font-size: 95%;-webkit-border-radius: 15px;-webkit-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96); -moz-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);">
-'''Programming Language'''
+<p>For data preparation and EDA, the team chose to use JMP software as they are familiar with the usage of this software. To facilitate future extension of the project, the client requested for us to use R programming language for the final outcome. R has a mature and growing ecosystem of open-source tools for mathematics and data analysis.
-* Python
+</p>
-'''Programming Libraries'''
-* Python Natural Language Toolkit (NLTK)
-* Scikit-learn
-* TensorFlow
-* pandas
-'''IDE'''
-* Jupyter Notebook
 </div>
@@ Line 66: / Line 64: @@
 | style="padding:0.3em; font-family:helvetica; font-size:100%; border-bottom:2px solid #626262; border-left:2px #66FF99; text-align:left;" width="20%" | <font color="#000000" size="3em"><strong>Data Collection</strong><br></font>
 |}
-We are provided with the job description and requirements for January 2016 from Jobsbank.gov.sg which we will use for our initial analysis. Subsequently, we will build a program to scrape for more data from Jobsbank.gov.sg.<br>
+Kaiso provided us with transaction records on musical and concert data. These records consist of data from both phone booking and internet booking channels. Apart from these transaction datasets, Kaiso also provided us with customer demographics data and sports matches data. In total, we have obtained 9 datasets from them:
-Apart from collecting job posting data, we will also use the program to scrape the web for words that are relates to a job’s skills. These words will be used as a whitelist to help us in the construction of labels which are required in subsequent steps.
+#Lottery transaction data (lottery15.csv, lotteryAug-Oct.csv, lotteryRB.csv)
+#Sports transaction data (sports15.csv, sportsAug-Oct.csv, sportsRB.csv)
+#Sports matches data (Matches_Master.csv, League name.xlsx)
+#Customer demographics data (data_cst.xlsx)
 </div>
-<!--Data Preparation Content-->
+<!--Literature Review Content-->
 <div style="margin:20px; padding: 10px; background: #ffffff; text-align:left; font-size: 95%;-webkit-border-radius: 15px;-webkit-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96); -moz-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);">
 {| color:#E6CCFF padding: 1px 0 0 0;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0" |
-| style="padding:0.3em; font-family:helvetica; font-size:100%; border-bottom:2px solid #626262; border-left:2px #66FF99; text-align:left;" width="20%" | <font color="#000000" size="3em"><strong>Data Preparation</strong><br></font>
+| style="padding:0.3em; font-family:helvetica; font-size:100%; border-bottom:2px solid #626262; border-left:2px #66FF99; text-align:left;" width="20%" | <font color="#000000" size="3em"><strong>Literature Review</strong><br></font>
 |}
-The job postings text consists of commas, html tags, etc. which are irrelevant for our text analysis. Thus, we will perform the necessary techniques to tokenize the text by removing the punctuations and html codes and breaking them down into meaningful words. In addition, we will perform word normalization to each word to its base form.
+To gain more domain knowledge, we will seek to read up on research papers, articles and news related to our area of topic which is ticketing analytics. Furthermore, we aim to focus our reading on online ticketing because we will be using it as our basis when we perform our analysis. In addition, this will provide us with sufficient theoretical knowledge to conduct these analyses. <br>
+In this project, we will be conducting comparison analysis on the datasets. Thus, we will also be exploring on papers related to “cross-sectional analysis” and “longitudinal analysis” to aid us in our understanding of this two subjects.
 </div>
-<!--EDA Content-->
+<!--Data Preparation Content-->
 <div style="margin:20px; padding: 10px; background: #ffffff; text-align:left; font-size: 95%;-webkit-border-radius: 15px;-webkit-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96); -moz-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);">
 {| color:#E6CCFF padding: 1px 0 0 0;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0" |
-| style="padding:0.3em; font-family:helvetica; font-size:100%; border-bottom:2px solid #626262; border-left:2px #66FF99; text-align:left;" width="20%" | <font color="#000000" size="3em"><strong>Exploratory Data Analysis (EDA)</strong><br></font>
+| style="padding:0.3em; font-family:helvetica; font-size:100%; border-bottom:2px solid #626262; border-left:2px #66FF99; text-align:left;" width="20%" | <font color="#000000" size="3em"><strong>Data Preparation</strong><br></font>
 |}
-After data preparation, we will do exploratory data analysis on the words. We will use pandas, an open source data analysis python library, to assist us in visualization of the words. <br>
+Before performing any further data analysis, the first step is to prepare the data. We will clean the data to handle outliers and missing values. In addition, we will perform data normalization and transformation on the given dataset. <br>
-Some of the analysis that we will look at are frequencies of words and words that can be use to describe the skills. We will use this information to expand the whitelist that we have previously obtained. This process of EDA will be useful in the next step to identify good features for skills.
+For outliers, we will first determine if the values are due to human or system error. If it is due to human or system error, we can safely remove that transaction from our analysis. Otherwise, we will conduct separate analysis of these outliers values.<br>
+For missing values, we will determine the number of missing values. If the number is significant, we will use prediction techniques to predict these values based on the data set. Otherwise, we will remove these transactions from our analysis so that it will not affect our findings.<br>
+Lastly, we will perform data normalization and transformation. Some fields in the phone purchasing dataset and internet purchasing dataset have different scales and values even though they represent the same information. Also, due to system changes in Kaiso's IT infrastructure, there are some differences in the way the data is stored and named. Therefore, we will perform data normalization and transformation to ensure that values throughout both dataset are consistent before we can perform any analysis.
 </div>
-<!--Labels Content-->
+<!--EDA Content-->
 <div style="margin:20px; padding: 10px; background: #ffffff; text-align:left; font-size: 95%;-webkit-border-radius: 15px;-webkit-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96); -moz-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);">
 {| color:#E6CCFF padding: 1px 0 0 0;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0" |
-| style="padding:0.3em; font-family:helvetica; font-size:100%; border-bottom:2px solid #626262; border-left:2px #66FF99; text-align:left;" width="20%" | <font color="#000000" size="3em"><strong>Generating Labels to Train the Classifier</strong><br></font>
+| style="padding:0.3em; font-family:helvetica; font-size:100%; border-bottom:2px solid #626262; border-left:2px #66FF99; text-align:left;" width="20%" | <font color="#000000" size="3em"><strong>Exploratory Data Analysis (EDA)</strong><br></font>
 |}
-With the tokens generated previously and the EDA conducted, we will attempt to detect the initial set of skills in the job postings. To do this, we select features using the whitelist that we obtained from the data collection and EDA process. These features will help to label true or false for each token in the job description.
+In the initial stage of this project, we will examine the dataset to have a better understanding of the various aspects of the dataset. We will then proceed to perform comparison studies between the datasets. The purpose of the comparison studies is to identify any behavioral differences among the customers. There are two studies which we will be doing - cross-sectional analysis and longitudinal analysis. Some of the comparisons which we will be looking at for both analyses are the frequencies of transactions for account holders in relation to the different ticketing types, the popular time of transaction, type of transaction and amount per transaction.<br>
+<br>
+<u>Cross-sectional Analysis</u><br>
+In this analysis, we will perform comparison study on customers in the same time period of 2015 and 2016. We have 2 months of data for 2016 and will be subsetting the 2015 dataset to contain records from the same time period only. <br>
+The purpose of using the same time period for both years is to eliminate any seasonal fluctuations that exists in the datasets. <br>
+<br>
+<u>Longitudinal Analysis</u><br>
+For this analysis, we will be examining the behavioral change of old customers that bought tickers before and after the launch of the online ticketing channel. This analysis aims to answer the question on whether customers purchasing behaviour changed after the launch. Hence, we will filter out data records to include only old customers (customers who registered before the launch). <br>
+We will be doing comparisons on data two months before and two months after the launch of the online ticketing site.
 </div>
-<!--Model Training Content-->
+<!--Dashboard Content-->
 <div style="margin:20px; padding: 10px; background: #ffffff; text-align:left; font-size: 95%;-webkit-border-radius: 15px;-webkit-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96); -moz-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);">
 {| color:#E6CCFF padding: 1px 0 0 0;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0" |
-| style="padding:0.3em; font-family:helvetica; font-size:100%; border-bottom:2px solid #626262; border-left:2px #66FF99; text-align:left;" width="20%" | <font color="#000000" size="3em"><strong>Generating Labels to Train the Classifier</strong><br></font>
+| style="padding:0.3em; font-family:helvetica; font-size:100%; border-bottom:2px solid #626262; border-left:2px #66FF99; text-align:left;" width="20%" | <font color="#000000" size="3em"><strong>Dashboard</strong><br></font>
 |}
-We will utilize Rule Based, Support Vector Machines, Neural Networks and explore the use of other models for training.We will look into utilizing algorithms including Rule-based, Support Vector Machine and Neural Networks to train our models using the labelled data.<br><br>
+Following the analysis, an analytical dashboard will be built to visualize our findings. The dashboard will display the key variables of the data and how they affect the customer purchasing behaviour. The customer engaging teams would be able to utilize the dashboard to display and better understand the differences between the customer behaviours before and after the launch of the new system.<br>
-'''Rule Based'''<br>
+The dashboard will use a framework that allows Kaiso to update their dashboard by uploading their dataset every time they have a new dataset. Design, statistics and visualization will be our main considerations when building the dashboard so that they can easily unveil the differences that they are looking for.
-The Rule-based classifier makes use of a set of IF-THEN rules for classification. Using the whitelist that we obtained, we will come out with rule antecedents and its corresponding rule consequents. Given the condition holds true for a given tuple, the antecedent will be satisfied.
-<br><br>
-'''Support Vector Machines'''<br>
-The Support Vector Machine (SVM) is a supervised learning model with associated learning algorithms that analyse data and recognize patterns used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories; an SVM training algorithm builds a model that assigns new examples into one category or the other. Given our training dataset, an SVM training algorithm will build a model that assigns new examples into one category or the other based on our whitelist.
-<br>
-<br>
-'''Neural Networks'''<br>
-Neural Networks is a type of supervised learning. It contains several components, called artificial neurons (AN), which are interconnected. Each AN is a logistic regression unit which accepts a number of inputs and outputs the class it believes the item belongs to based on our whitelist.
-<br>
-<br>
 </div>
-<!--Model Validation Content-->
+<!--Recommendations & Insights Content-->
 <div style="margin:20px; padding: 10px; background: #ffffff; text-align:left; font-size: 95%;-webkit-border-radius: 15px;-webkit-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96); -moz-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);">
 {| color:#E6CCFF padding: 1px 0 0 0;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0" |
-| style="padding:0.3em; font-family:helvetica; font-size:100%; border-bottom:2px solid #626262; border-left:2px #66FF99; text-align:left;" width="20%" | <font color="#000000" size="3em"><strong>Model Validation</strong><br></font>
+| style="padding:0.3em; font-family:helvetica; font-size:100%; border-bottom:2px solid #626262; border-left:2px #66FF99; text-align:left;" width="20%" | <font color="#000000" size="3em"><strong>Recommendations & Insights</strong><br></font>
 |}
-For each model, we will validate it by attempting to detect the skills in the text of the test data and see how efficient is the model. Repeat from step 4 to improve the features and annotation until we get a good enough model.
+From our analysis and dashboard, we seek to assist Kaiso in understanding the characteristics of their customers. We will be proposing business strategies and recommendations to them based on the insights that we have uncovered.
 </div>
