AY1617 T5 Team AP Methodology

From Analytics Practicum

Revision as of 02:54, 1 January 2017



Discovery (Learning Business Domain)

Our team aims to understand SGAG by exploring the content it publishes across multiple channels, namely its social network services, website and media content.

Discovery (Business Problem Discovery)

Through interviews with the founder of the company, our team aims to understand SGAG's business problems, identify the key issues and translate them into analytics objectives.

Data Preparation (Data Collection)

SGAG will provide us with data on the uploaded video posts generated from their Facebook and YouTube pages, with the variables listed above. Beyond what is given, we would also like to gather additional information on the videos, including the comments on the videos, the characters involved, the video resolution, and the video type (original or reposted). As SGAG uploads videos of various categories and resolutions (filmed using a phone versus a video camera), we would like to categorise them accurately to further increase the dimensionality of the data set.

Data Preparation (Data Preparation)

Some of the datasets provided are not structured in a format suitable for analysis (i.e. additional derived columns need to be added, tables need to be restructured, etc.).
Analysing video specifications in relation to audience receptivity will additionally require open-source tools such as ExifTool to extract EXIF (camera) metadata like video width and video height.
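
ExifTool can emit its tags as JSON, which makes this extraction scriptable. A minimal Python sketch (assuming the `exiftool` binary is on the PATH; `ImageWidth`, `ImageHeight` and `Duration` are the usual ExifTool tag names for video files):

```python
import json
import subprocess

# EXIF tags we need for the video-specification analysis.
REQUIRED_TAGS = ["ImageWidth", "ImageHeight", "Duration"]

def parse_exiftool_json(raw):
    """Reduce ExifTool's JSON array to one dict of required tags per file."""
    return [
        {tag: entry.get(tag) for tag in ["SourceFile", *REQUIRED_TAGS]}
        for entry in json.loads(raw)
    ]

def extract_video_specs(paths):
    """Run ExifTool in JSON mode and keep only the required tags.

    Assumes the `exiftool` binary is installed and on the PATH.
    """
    raw = subprocess.run(
        ["exiftool", "-json", *("-" + t for t in REQUIRED_TAGS), *paths],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_exiftool_json(raw)
```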

Data Preparation (Initial Exploratory Data Analysis)

Before conducting a more concrete and complete exploratory data analysis (EDA), we would first like to perform a basic EDA to identify outliers that might later affect our analysis results.
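
One simple screen for such outliers is Tukey's interquartile-range rule; a small standard-library sketch (the metric being screened is illustrative):

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]
```

Values flagged here (e.g. a viral video's play count) are set aside for inspection rather than silently dropped, since virality is itself of interest to the client.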

Data Preparation (Data Cleaning)

For data cleaning, we will go through the data sets to identify missing data, inconsistent values, inconsistencies between fields and duplicated records. For each row with missing value(s), we will check how many fields are missing to decide whether to omit the row completely or to predict values for the missing field(s) using suitable techniques such as association rule mining or decision trees. Data extracted with ExifTool will need irrelevant fields removed to discard less important data. The required fields from the EXIF data of videos are:

  1. Video height
  2. Video length
  3. Video width
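
The row-level part of this step can be sketched as follows: keep only the required EXIF fields, pass complete rows through, and set aside rows with few enough gaps to be candidates for imputation (field names are hypothetical):

```python
# Hypothetical field names for the three required EXIF columns.
REQUIRED_EXIF_FIELDS = ("video_height", "video_length", "video_width")

def clean_rows(rows, max_missing=1):
    """Trim each row to the required fields, then route it:
    complete rows are kept, rows missing up to `max_missing`
    fields are set aside for imputation, the rest are dropped."""
    cleaned, needs_imputation = [], []
    for row in rows:
        trimmed = {f: row.get(f) for f in REQUIRED_EXIF_FIELDS}
        missing = sum(1 for v in trimmed.values() if v is None)
        if missing == 0:
            cleaned.append(trimmed)
        elif missing <= max_missing:
            needs_imputation.append(trimmed)
    return cleaned, needs_imputation
```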
 

Data Preparation (Data Integration)

To analyse the performance of a video on both Facebook and YouTube, we first have to integrate the Facebook and YouTube datasets. If the video IDs differ between the two datasets, we will integrate them using other variables such as the title.
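
A title-based join can be sketched as below; the field names are hypothetical, and real titles would likely need fuzzier matching than the case/whitespace normalisation shown:

```python
def normalise_title(title):
    """Case-fold and collapse whitespace so titles match across platforms."""
    return " ".join(title.casefold().split())

def integrate(fb_rows, yt_rows):
    """Inner-join Facebook and YouTube records on normalised title,
    prefixing each side's fields to avoid collisions."""
    yt_by_title = {normalise_title(r["title"]): r for r in yt_rows}
    merged = []
    for fb in fb_rows:
        yt = yt_by_title.get(normalise_title(fb["title"]))
        if yt is not None:
            merged.append({**{"fb_" + k: v for k, v in fb.items()},
                           **{"yt_" + k: v for k, v in yt.items()}})
    return merged
```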

Model Planning (Interactive Exploratory Data Analysis)

After integration, further analysis will be conducted on the data to surface basic trends, bringing out insights that may not be directly visible on Facebook or YouTube individually but emerge when the platforms are viewed as a whole. This also includes summary statistics to highlight to our client the key high performers based on the available metrics.

The outliers in the data may include viral videos which the client has curated; these may be grouped together for further analysis, along with videos with interesting titles. An example would be identifying a trend of video posts with high video plays but low "at least 30 seconds views" or "views to 95% of length".
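
That kind of screen can be sketched as follows, flagging posts with above-median plays but a low share of "at least 30 seconds" views (field names are hypothetical):

```python
import statistics

def low_retention_posts(posts, ratio_threshold=0.2):
    """Return posts with above-median plays whose 30-second-view rate
    falls below `ratio_threshold` (hypothetical fields 'plays', 'views_30s')."""
    median_plays = statistics.median(p["plays"] for p in posts)
    return [
        p for p in posts
        if p["plays"] > median_plays
        and p["views_30s"] / p["plays"] < ratio_threshold
    ]
```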

Model Building (Analysis and Model Building)

We will then proceed with our analysis of the various components and create analytical models using related variables. For our prediction models, suitable techniques such as stepwise regression can be applied. We will then determine which variables are the best predictors using a threshold, and remove variables that do not improve the results.
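
Forward stepwise selection, one common variant of this, can be sketched with NumPy: greedily add the predictor that most improves R-squared, and stop once the gain falls below a threshold (predictor names are illustrative):

```python
import numpy as np

def forward_stepwise(X, y, names, min_gain=0.01):
    """Greedy forward selection by R-squared improvement.

    X: (n_samples, n_features) array; y: (n_samples,) target;
    names: feature names; min_gain: stopping threshold on R² gain.
    """
    selected, remaining = [], list(range(X.shape[1]))
    best_r2 = 0.0
    ss_tot = float((y - y.mean()) @ (y - y.mean()))
    while remaining:
        scores = []
        for j in remaining:
            # Fit OLS with an intercept on the candidate feature set.
            A = np.column_stack([np.ones(len(y))] + [X[:, k] for k in selected + [j]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            scores.append((1 - float(resid @ resid) / ss_tot, j))
        r2, j = max(scores)
        if r2 - best_r2 < min_gain:
            break  # no candidate clears the threshold
        best_r2 = r2
        selected.append(j)
        remaining.remove(j)
    return [names[j] for j in selected], best_r2
```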

Model Building (Model Validation)

Model validation ensures that the models meet the intended requirements with regard to the methods used and the results obtained. Ultimately, we aim to have models that address the business problem and provide results with fairly high accuracy. We will split our data sets into two parts: training data and testing data. After building our models on the training data (decision tree, linear/multiple linear regression, etc.), we need to check whether a model is over- or underfitting by running the test dataset through it and comparing the results on the test and training data. For regression models, we also need to check the R-squared values to assess the model's accuracy.
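
The split-and-compare check can be sketched in plain Python: hold out a test set, then compare R-squared on training versus test data, where a much lower test value suggests overfitting and low values on both suggest underfitting:

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle deterministically, then hold out `test_ratio` for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```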