Difference between revisions of "AY1617 T5 Team AP Methodology Mid-Term"


Revision as of 22:46, 17 February 2017

DATA COLLECTION

Our team visited SGAG’s office to extract video data from SGAG’s Facebook page (5 sets of data). In total, 299 video data points were extracted for 2016. However, on reviewing the data set, we realised that some data needed to meet our analysis objectives were missing. We therefore logged the missing data manually by watching one year’s worth of videos, which left us with 7 sets of data.

DATA INTEGRATION

We integrated the 5 data sources to form more than 30 data columns for the main analysis. Our team used JMP Pro for the data integration process. Records were matched on Post ID, an identifier assigned to each post on Facebook. As all the data sets come from Facebook, they can be joined on the Post ID. Because some of the downloaded data sets contain repeated columns, such as Permalink (Permanent Uniform Resource Locator), we selected only the columns we needed from each table during the join.
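
The join itself was done in JMP Pro. As a rough illustration only, the same idea can be sketched in Python: match rows on Post ID and take only the chosen columns from the second table, so that repeated columns such as Permalink are not carried over twice. The function name `join_on_post_id`, the column names other than Post ID and Permalink, and the sample rows are all hypothetical.

```python
def join_on_post_id(left, right, right_columns):
    """Inner-join two lists of row dicts on 'Post ID'.

    Only the columns listed in right_columns are copied from the right
    table, which avoids duplicating columns such as Permalink.
    """
    right_by_id = {row["Post ID"]: row for row in right}
    joined = []
    for row in left:
        match = right_by_id.get(row["Post ID"])
        if match is None:
            continue  # no matching post in the right table
        merged = dict(row)
        for col in right_columns:
            merged[col] = match[col]
        joined.append(merged)
    return joined


# Hypothetical sample rows, not taken from the SGAG data.
posts = [
    {"Post ID": "p1", "Permalink": "https://example.com/p1", "Likes": 120},
    {"Post ID": "p2", "Permalink": "https://example.com/p2", "Likes": 45},
]
video_stats = [
    {"Post ID": "p1", "Permalink": "https://example.com/p1", "Views": 9000},
    {"Post ID": "p2", "Permalink": "https://example.com/p2", "Views": 3100},
]

combined = join_on_post_id(posts, video_stats, right_columns=["Views"])
```

Each row of `combined` keeps a single Permalink column and gains the Views column from the second table.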

DATA CLEANING

Outlier
To clean the data, our group ran a distribution analysis on all 299 data points to check for missing or abnormal data. We observed one particularly high data point that could skew the analysis. After checking that data point, we excluded it from our analysis.
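
The team spotted the outlier from the distribution analysis in JMP Pro. One common way to flag such a point programmatically, shown here purely as an assumption and not as the team's method, is the 1.5 × IQR rule on the upper side. The function name `high_outliers` and the sample values are hypothetical.

```python
import statistics


def high_outliers(values, k=1.5):
    """Return the values that lie above Q3 + k * IQR (upper fence only)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    upper_fence = q3 + k * (q3 - q1)
    return [v for v in values if v > upper_fence]


# Hypothetical view counts with one unusually high value.
views = [100, 120, 130, 150, 160, 170, 5000]
flagged = high_outliers(views)
```

A flagged value is only a candidate for exclusion; as in the text above, it should be checked before it is removed.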

Missing Data
We also observed missing data for videos that were shared to SGAG’s Facebook page. As those posts were shared by a third party, information such as the number of comments, likes and shares is not logged in SGAG's data set. We therefore excluded them as well.

Duplicated Data
Our team also checked the data for duplicate values and found that two posts were duplicates. These have also been excluded from the data set.
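
The missing-data and duplicate checks above can be sketched together in Python. This is a minimal illustration under stated assumptions: the column names Comments, Likes and Shares, the choice to keep the first occurrence of a duplicated Post ID, and the sample rows are all hypothetical, and the team's own cleaning was done in JMP Pro.

```python
def clean_posts(rows, required=("Comments", "Likes", "Shares")):
    """Drop rows with missing engagement metrics, then drop repeated Post IDs.

    Assumption: when a Post ID appears more than once, the first
    occurrence is kept and later ones are discarded.
    """
    seen = set()
    cleaned = []
    for row in rows:
        if any(row.get(col) is None for col in required):
            continue  # e.g. a third-party video with no logged metrics
        if row["Post ID"] in seen:
            continue  # duplicate post
        seen.add(row["Post ID"])
        cleaned.append(row)
    return cleaned


# Hypothetical sample rows, not taken from the SGAG data.
raw = [
    {"Post ID": "p1", "Comments": 10, "Likes": 200, "Shares": 5},
    {"Post ID": "p2", "Comments": None, "Likes": None, "Shares": None},
    {"Post ID": "p1", "Comments": 10, "Likes": 200, "Shares": 5},
    {"Post ID": "p3", "Comments": 0, "Likes": 40, "Shares": 1},
]
cleaned = clean_posts(raw)
```

Note that a metric of 0 is treated as a valid value; only `None` counts as missing in this sketch.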

DATA PREPARATION

To facilitate analysis by year and then by quarter, we added a month column so that the data can easily be filtered by month. In addition, we transformed our 4 KPIs, as mentioned in our Findings, using the Johnson Su transformation so that each follows an approximately normal distribution.
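
Both preparation steps can be sketched in Python. The Johnson Su transformation has the standard form z = gamma + delta * asinh((x - xi) / lam). In practice the four parameters are fitted to each KPI (JMP Pro does this when fitting the distribution); in the sketch below they are simply passed in, and the function names and sample values are hypothetical.

```python
import math
from datetime import date


def month_and_quarter(posted):
    """Return (month, quarter) for a posting date, with quarters numbered 1 to 4."""
    return posted.month, (posted.month - 1) // 3 + 1


def johnson_su(x, gamma, delta, xi, lam):
    """Apply the Johnson Su transform with given (already fitted) parameters.

    gamma and delta are shape parameters, xi is the location and lam the
    scale (delta > 0 and lam > 0).
    """
    return gamma + delta * math.asinh((x - xi) / lam)


# Hypothetical posting date and parameter values, for illustration only.
month, quarter = month_and_quarter(date(2016, 8, 15))
z = johnson_su(2500.0, gamma=-1.0, delta=0.8, xi=500.0, lam=300.0)
```

Because asinh is strictly increasing, the transform preserves the ranking of the videos on each KPI while pulling in long tails.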

REVISED METHODOLOGY

For our more in-depth analysis, we have decided to use Principal Component Analysis, which applies an orthogonal transformation to convert our 4 highly correlated variables, as seen in our multivariate analysis, into a set of linearly uncorrelated variables. From there, we will run a Fit Model, plot the quantiles, means and standard deviations, and then compare the means using the All Pairs, Tukey HSD function. Further into the project, we would also like to use Latent Class Analysis, Latent Semantic Analysis (SVD) and Topic Analysis (Rotated SVD) to analyse the comments on particular groups of videos that are under-performing.
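
Why the orthogonal transformation yields uncorrelated variables can be shown in a few lines. This is a sketch that assumes the PCA is run on the standardized KPIs (that is, on their correlation matrix); the symbols X_s, R, V, Lambda and Z are introduced here for illustration and do not appear in the team's analysis.

```latex
% Assumption: X_s is the n x 4 matrix of standardized KPIs (n videos).
R = \frac{1}{n-1} X_s^{\top} X_s, \qquad
R = V \Lambda V^{\top}, \qquad
V^{\top} V = I, \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_4)

% The principal component scores are the data rotated by V.
Z = X_s V, \qquad
\operatorname{Cov}(Z) = \frac{1}{n-1} Z^{\top} Z = V^{\top} R V = \Lambda
```

Since Lambda is diagonal, the columns of Z (the principal components) are pairwise uncorrelated, and each eigenvalue gives the variance captured by its component.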