AY1617 T5 Team AP Methodology

From Analytics Practicum

<!--Header Start-->
<div style="padding-left:8em;">
<div style="float: left; width:20%;">[[Image:SGAG_HOME_INACTIVE.PNG|170px|link=ANLY482 AY2016-17 Term 2|]]</div>
 
<div style="float: left;  width:20%; padding-top:3.8em; padding-left:1.7em;">[[Image:SGAG_OVERVIEW_INACTIVE.PNG|170px|link=AY1617 T5 Team AP_Overview]]</div>
 
<div style="float: left;  width:20%; padding-top:3.8em;padding-left:1.7em;">[[Image:SGAG_MET_ACTIVE.PNG|170px|link=AY1617 T5 Team AP Methodology]]</div>
 
<div style="float: left;padding-left:1.7em; width:20%;">[[Image:SGAG_DOC_INACTIVE.PNG|170px|link=AY1617 T5 Team AP_Documentation]]</div>
 
<div style="float: left;padding-left:1.7em; width:20%;">[[Image:SGAG_AU_INACTIVE.PNG|170px|link=AY1617 T5 Team About_Us]]</div>
<div style="float: left;padding-left:1.7em; width:20%;">[[Image:SGAG_LOGO.PNG|170px|link=ANLY482 AY2016-17 T2 Group5]]</div>
 
</div>
 
<!--Header End-->
<br><br><br><br><br>
<div style="padding-left:18em;">
<div style="float: left; width:20%; padding-top:0.45em">[[Image:SGAG_INITIAL_ACTIVE.PNG|90px]]</div>
<div style="float: left;  width:20%; padding-left:1.5em;">[[Image:SGAG_MT.PNG|145px|link=AY1617 T5 Team AP Methodology Mid-Term]]</div>
<div style="float: left;  width:20%; padding-left:3.2em;">[[Image:SGAG_FINALS.PNG|100px|link=AY1617 T5 Team AP Methodology Finals]]</div>
</div>
<br><br><br>

==<font face ="Impact" color= #00ADEF size="5" >DATA ANALYTICS LIFE CYCLE</font>==
Our team adopts the data analytics lifecycle as part of the planning process to guide us through the different stages of the project. The diagram below illustrates the key stages our team has adopted.

<div  align="left">[[Image:SGAG LIFECYCLE.png|300px]]</div>

==<font face ="Impact" color= #00ADEF size="5" >LEADING BUSINESS DOMAIN  (DISCOVERY)</font>==
 
Our team aims to understand SGAG by exploring the content it publishes across multiple channels, namely its social network services, website and media content. <br/>
==<font face ="Impact" color= #00ADEF size="5" >BUSINESS PROBLEM  (DISCOVERY)</font>==
 
Through interviews with the founder of the company, our team aims to understand SGAG's business problems and translate them into analytics objectives. <br/>
==<font face ="Impact" color= #00ADEF size="5" >DATA COLLECTION  (DATA PREPARATION)</font>==
 
SGAG will provide us with data on the uploaded video posts, generated from their Facebook and YouTube pages, with the variables listed above. Apart from what is given, we would also like to gather additional information on the videos, including the comments on the videos, the characters involved, the video resolution, and the video type (original or reposted). As SGAG uploads videos of various categories and resolutions (filmed using a phone versus a video camera), we would like to categorise them accurately to further increase the dimensionality of the dataset. <br/>
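To illustrate the phone-versus-camera categorisation described above, here is a minimal Python sketch. The thresholds and labels are our own assumptions, not values supplied by SGAG.

```python
# Illustrative rule: classify a video as phone- or camera-shot from its
# resolution and orientation. Thresholds are assumptions to be tuned.
def camera_category(width: int, height: int) -> str:
    if height > width:
        return "phone (portrait)"    # portrait clips are almost always phone-shot
    return "camera" if width >= 1920 else "phone (landscape)"

label = camera_category(1080, 1920)  # a portrait 1080x1920 clip
```

In practice the rule would be refined against a manually labelled sample of SGAG's videos.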
  
==<font face ="Impact" color= #00ADEF size="5" >DATA PREPARATION</font>==
 
Some of the datasets provided are not structured in a format suitable for analysis (e.g. additional derived columns need to be added, tables need to be traversed, etc.).
 
<br/>
 
Analysing video specifications in relation to audience receptivity will require additional use of open-source tools like ExifTool to extract EXIF (camera) metadata such as video width and height. <br/>
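As a sketch of this extraction step: ExifTool can emit JSON (via `exiftool -j <file>`), which we would then filter down to the fields we need. The tag names below are assumptions; actual ExifTool tags vary by video container.

```python
import json

# Parse ExifTool-style JSON output and keep only the video fields we need.
# Tag names (ImageWidth, ImageHeight, Duration) are illustrative assumptions.
def keep_video_fields(raw_json: str) -> list:
    wanted = ("SourceFile", "ImageWidth", "ImageHeight", "Duration")
    return [{k: rec[k] for k in wanted if k in rec} for rec in json.loads(raw_json)]

sample = ('[{"SourceFile": "clip.mp4", "ImageWidth": 1280, "ImageHeight": 720, '
          '"Duration": "0:01:05", "Make": "Apple"}]')
rows = keep_video_fields(sample)   # irrelevant fields like "Make" are dropped
```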
  
==<font face ="Impact" color= #00ADEF size="5" >INITIAL EXPLORATORY DATA ANALYSIS  (DATA PREPARATION)</font>==
 
Before conducting a more complete exploratory data analysis (EDA), we will first perform a basic EDA to identify outliers that might later affect our analysis results. <br/>
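One common way to flag such outliers, sketched here under the assumption that a simple interquartile-range rule is adequate for a first pass:

```python
from statistics import quantiles

# Flag values outside 1.5 * IQR of the quartiles -- a standard first-pass
# outlier rule; the metric and numbers below are illustrative only.
def iqr_outliers(values):
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

views = [120, 150, 130, 140, 135, 9000]   # one viral video dwarfs the rest
outliers = iqr_outliers(views)            # flags the 9000-view video
```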
  
==<font face ="Impact" color= #00ADEF size="5" >DATA CLEANING  (DATA PREPARATION)</font>==
 
For data cleaning, we will go through the datasets to identify missing data, inconsistent values, inconsistencies between fields and duplicated data. For each row with missing value(s), we will check how many fields are missing to decide whether to omit the row completely or to predict values for the missing field(s) using suitable techniques such as association rule mining and decision trees.
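The row-level decision rule above can be sketched as follows; the cut-off of two missing fields is our assumption and would be tuned on the real dataset.

```python
MAX_MISSING = 2  # assumed cut-off: rows with more missing fields are dropped

# Split rows into those kept (for imputation of any remaining gaps) and
# those dropped because too many fields are missing.
def triage_rows(rows):
    keep, drop = [], []
    for row in rows:
        n_missing = sum(1 for v in row.values() if v is None)
        (drop if n_missing > MAX_MISSING else keep).append(row)
    return keep, drop

rows = [
    {"title": "Video A", "plays": 100, "shares": 5, "length": 60},
    {"title": "Video B", "plays": None, "shares": None, "length": None},
]
keep, drop = triage_rows(rows)   # Video B is dropped, Video A is kept
```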
 
Data extracted with ExifTool will require the removal of irrelevant fields. The required fields from the EXIF data of the videos are:
<ol>
<li>Video height</li>
<li>Video length</li>
<li>Video width</li>
</ol>
  
==<font face ="Impact" color= #00ADEF size="5" >DATA INTEGRATION  (DATA PREPARATION)</font>==
For the videos uploaded on Facebook, we intend to collect three sets of data: the video data on Facebook, the data of the Facebook post used to share the video, and the video specifications (resolution, dimensions, etc.). The Facebook video and post data can be integrated using the full ID (xxxxx_xxxxxx). The video specifications dataset can be integrated using the last 16 digits of the ID (the digits after the underscore).
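The join logic just described can be sketched as follows. All IDs and field names here are made up for illustration; the real datasets would carry Facebook's own identifiers.

```python
# Join post/video records (keyed by a full "pageid_videoid" style ID) with the
# specifications dataset, which is keyed only by the part after the underscore.
def integrate(posts: list, specs: list) -> list:
    spec_by_suffix = {s["id"]: s for s in specs}
    merged = []
    for post in posts:
        suffix = post["id"].split("_", 1)[1]   # digits after the underscore
        extra = {k: v for k, v in spec_by_suffix.get(suffix, {}).items() if k != "id"}
        merged.append({**post, **extra})
    return merged

posts = [{"id": "12345_6789012345678901", "plays": 400}]
specs = [{"id": "6789012345678901", "width": 1280, "height": 720}]
combined = integrate(posts, specs)   # one row with plays, width and height
```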
  
==<font face ="Impact" color= #00ADEF size="5" >INTERACTIVE EXPLORATORY DATA ANALYSIS  (MODEL PLANNING)</font>==
 
After integration, further analysis will be conducted on the data to surface the basic trends. This will bring out insights which may not be directly visible on Facebook or YouTube individually but emerge when the platforms are viewed as a whole. It also includes summary statistics to highlight to our client the key high performers based on the available metrics.
  
 
Outliers may also include viral videos which the client has curated; these may be grouped for further analysis, along with videos with interesting titles. An example of such an outlier could be a video post with a high number of video plays but a low count of "at least 30 seconds view" or "view to 95% length".<br/>
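A minimal sketch of this flagging rule; the column names and the 0.2 retention threshold are our assumptions, not SGAG's metrics.

```python
# Flag videos with many plays but few viewers reaching 30 seconds --
# the "catchy title, weak content" pattern described above.
def low_retention_candidates(videos, min_plays=1000, max_retention=0.2):
    return [
        v for v in videos
        if v["plays"] >= min_plays and v["views_30s"] / v["plays"] < max_retention
    ]

videos = [
    {"title": "Catchy title", "plays": 5000, "views_30s": 400},   # 8% retention
    {"title": "Steady video", "plays": 5000, "views_30s": 3000},  # 60% retention
]
flagged = low_retention_candidates(videos)   # only "Catchy title" is flagged
```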
  
==<font face ="Impact" color= #00ADEF size="5" >ANALYSIS AND MODEL BUILDING  (MODEL BUILDING)</font>==
 
We will then proceed with our analysis of the various components and create analytical models using related variables. For our prediction models, suitable techniques such as stepwise regression can be applied. We will determine which variables are the best predictors using a threshold, and remove variables which do not improve the results.
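The threshold-driven selection described above can be sketched as a forward stepwise loop. The scorer below is a toy stand-in; on the real data it would be a model-fit measure such as adjusted R-squared, and the variable names are invented.

```python
# Greedily add the predictor that most improves a score, stopping when the
# gain falls below a threshold (forward stepwise selection).
def forward_select(candidates, score, min_gain=0.01):
    chosen, best = [], score([])
    while candidates:
        gains = {v: score(chosen + [v]) - best for v in candidates}
        top = max(gains, key=gains.get)
        if gains[top] < min_gain:
            break
        chosen.append(top)
        candidates.remove(top)
        best += gains[top]
    return chosen

# Toy scorer: pretend each variable contributes a fixed, additive amount of fit.
contrib = {"video_length": 0.30, "resolution": 0.15, "post_hour": 0.002}
selected = forward_select(list(contrib), lambda vs: sum(contrib[v] for v in vs))
# "post_hour" is excluded: its gain is below the threshold
```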
 
<br/>
  
==<font face ="Impact" color= #00ADEF size="5" >MODEL VALIDATION (MODEL BUILDING)</font>==
 
Model validation ensures that the models meet the intended requirements with regard to the methods used and the results obtained. Ultimately, we aim to have models that address the business problem and produce results with fairly high accuracy. We will split our datasets into two parts: training data and testing data. After building our models on the training data (decision tree, linear/multiple linear regression, etc.), we will check whether a model is over- or underfitting by running the test dataset through it and comparing the results on the test and training data. For regression models, we will also check the R-squared values to assess the model's accuracy.
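The R-squared check is straightforward to compute; the sketch below shows the statistic we would compare between the training and test sets (the data is illustrative).

```python
# R-squared: 1 - (residual sum of squares / total sum of squares).
# Comparing this value on training vs test data reveals over/underfitting.
def r_squared(actual, predicted):
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

y = [3.0, 4.0, 5.0, 6.0]
perfect = r_squared(y, [3.0, 4.0, 5.0, 6.0])   # exact predictions give 1.0
```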
 
<br/>

Latest revision as of 17:34, 21 February 2017
