AY1617 T5 Team AP Methodology
Discovery(Leading Business Domain)
Our team aims to understand SGAG by exploring the content it publishes through multiple channels, namely its social network services, website and media content.
Discovery(Business Problem Discovery)
Through interviews and conversations with the founder of the company, our team aims to understand SGAG's business problems so that we can identify them and translate them into analytics objectives.
Data Preparation(Data Collection)
SGAG will provide us with data on its uploaded video posts, generated from its Facebook and YouTube pages, with the variables listed above. Beyond what is given, we would also like to gather additional information about the videos, including the comments on the videos, the characters involved, the video resolution, and the video type (original or reposted). As SGAG uploads videos of various categories and resolutions (for example, filmed using a phone versus a video camera), we would like to categorise them accurately to add further dimensions to the data set.
Data Preparation(Data Preparation)
Some of the datasets provided are not structured in a format suitable for analysis (e.g. additional derived columns need to be added, tables need to be traversed, etc.).
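As an illustration of what adding a derived column might look like, the sketch below computes an engagement rate per video post. The column names (likes, comments, shares, views) and the metric itself are assumptions made for the example; they are not taken from the data SGAG provides.

```python
from typing import Dict, List


def add_engagement_rate(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return copies of the rows with a derived 'engagement_rate' column.

    Assumes each row has hypothetical 'likes', 'comments', 'shares'
    and 'views' keys; the real data set may use different variables.
    """
    enriched = []
    for row in rows:
        interactions = row["likes"] + row["comments"] + row["shares"]
        # Guard against division by zero for posts with no recorded views.
        rate = interactions / row["views"] if row["views"] else 0.0
        enriched.append({**row, "engagement_rate": rate})
    return enriched
```

The function returns new dictionaries rather than modifying the input, so the raw data stays untouched for later checks.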
Analysing video specifications in relation to audience receptivity requires additional open-source tools such as ExifTool to extract EXIF data (camera metadata) such as video width and video height.
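A minimal sketch of how this extraction could be scripted is shown below, assuming the exiftool command-line program is installed and on the PATH. It relies on ExifTool's -json and -n options and on the ImageWidth, ImageHeight and Duration tags, which ExifTool typically reports for MP4/QuickTime files; other containers may expose different tag names. The function names and output keys are hypothetical.

```python
import json
import subprocess
from typing import Any, Dict, List

# ExifTool tag names assumed to hold the fields of interest.
TAGS = ["ImageWidth", "ImageHeight", "Duration"]


def build_command(paths: List[str]) -> List[str]:
    """Build the exiftool call: JSON output, numeric values, selected tags only."""
    return ["exiftool", "-json", "-n"] + [f"-{tag}" for tag in TAGS] + paths


def parse_output(raw_json: str) -> List[Dict[str, Any]]:
    """Turn exiftool's JSON array into one flat record per file."""
    records = []
    for entry in json.loads(raw_json):
        records.append({
            "file": entry.get("SourceFile"),
            "width": entry.get("ImageWidth"),
            "height": entry.get("ImageHeight"),
            # With -n the duration is reported as a number of seconds.
            "duration_s": entry.get("Duration"),
        })
    return records


def extract_video_specs(paths: List[str]) -> List[Dict[str, Any]]:
    """Run exiftool on the given files and return the parsed records."""
    result = subprocess.run(
        build_command(paths), capture_output=True, text=True, check=True
    )
    return parse_output(result.stdout)
```

Tags that a file does not carry come back as None, which lets the cleaning step treat them as missing values.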
Data Preparation(Initial Exploratory Data Analysis)
Before carrying out a more concrete and complete exploratory data analysis (EDA), we would first like to do a basic EDA to identify outliers that might later affect our analysis results.
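One common way to flag outliers at this stage is the interquartile range (IQR) rule. The sketch below is an illustration only: the text does not specify which outlier technique will be used, and the multiplier of 1.5 is a conventional default rather than a project decision.

```python
import statistics
from typing import List


def iqr_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Return the values lying more than k * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]
```

For example, applied to a hypothetical list of view counts, the function returns the posts whose counts sit far outside the bulk of the data, which can then be inspected by hand before deciding whether to keep them.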
Data Preparation(Data Cleaning)
For data cleaning, we will go through the data sets to identify missing data, inconsistent values, inconsistencies between fields, and duplicated data. For each row with missing value(s), we will check the number of missing fields to decide whether to omit the row completely or to predict a value for the missing field(s) using suitable techniques such as association rule mining and decision trees. Data extracted with ExifTool will also need its irrelevant fields removed. The required fields from the EXIF data of videos are:
- Video height
- Video length
- Video width
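The field filtering and the missing-value decision described above can be sketched as follows. The key names and the threshold of one missing field are assumptions chosen for illustration; the actual cut-off and the imputation technique would be decided during the project.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]

# Hypothetical key names for the three required fields listed above.
REQUIRED_FIELDS = ("video_height", "video_length", "video_width")


def keep_required_fields(record: Row) -> Row:
    """Drop every key that is not one of the required fields."""
    return {field: record.get(field) for field in REQUIRED_FIELDS}


def triage_rows(
    rows: List[Row], max_missing: int = 1
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split rows into (complete, to_impute, dropped) by missing-field count."""
    complete, to_impute, dropped = [], [], []
    for row in rows:
        missing = sum(1 for value in row.values() if value is None)
        if missing == 0:
            complete.append(row)
        elif missing <= max_missing:
            # Few gaps: keep the row and predict the missing value(s) later.
            to_impute.append(row)
        else:
            # Too many gaps: omit the row from the analysis.
            dropped.append(row)
    return complete, to_impute, dropped
```

Rows in the to_impute group would then be passed to whichever prediction technique is selected, while the dropped group can be logged to keep track of how much data was omitted.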