AY1617 T5 Team AP Methodology Mid-Term
DATA COLLECTION
Our team visited SGAG’s office to extract the video data from SGAG’s Facebook page (5 sets of data). A total of 299 video data points were extracted for 2016. However, after looking through the data set, we realised that some data needed to meet our analysis objectives were missing. We therefore logged the missing data manually by watching one year’s worth of videos. As a result, we now have 7 sets of data.
DATA INTEGRATION
The 5 data sources were integrated to form more than 30 data columns for the main analysis. Our team used JMP Pro for the data integration process. Data were matched using Post ID, an identifier assigned to each post on Facebook. As all the data sets are from Facebook, they can be joined together using the Post ID. As some of the downloaded data contain repeated columns such as Permalink (Permanent Uniform Resource Locator), we selected which columns to keep from each table during the join process.
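The join described above can be sketched in Python with pandas. This is only an illustration of the idea, not the JMP Pro steps the team actually ran; the table names, column names other than Post ID and Permalink, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for two of the exported Facebook tables.
video_stats = pd.DataFrame({
    "Post ID": ["p1", "p2", "p3"],
    "Permalink": ["url1", "url2", "url3"],
    "Unique Views": [1200, 560, 9800],
})
post_stats = pd.DataFrame({
    "Post ID": ["p1", "p2", "p3"],
    "Permalink": ["url1", "url2", "url3"],
    "Likes": [300, 45, 2100],
})

def join_on_post_id(left, right, key="Post ID"):
    """Inner-join two tables on the key, skipping columns the left table already has."""
    new_cols = [c for c in right.columns if c == key or c not in left.columns]
    # validate="one_to_one" raises an error if a Post ID repeats in either table.
    return left.merge(right[new_cols], on=key, how="inner", validate="one_to_one")

merged = join_on_post_id(video_stats, post_stats)
print(merged)
```

Selecting the columns before the merge keeps a single Permalink column instead of two suffixed copies.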
DATA CLEANING
Outlier
To clean the data, our group ran a distribution analysis on all 299 data points to check for missing or abnormal data. We observed one particularly high data point that could skew the analysis, as seen in the figure below. After checking the data point, we excluded it from our analysis.
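The outlier here was found by inspecting the distribution. As a rough programmatic counterpart, the sketch below flags unusually high values with the common 1.5 x IQR rule. The rule, the function name and the sample values are assumptions for illustration, not the criterion the team used.

```python
import statistics

def flag_high_outliers(values, k=1.5):
    """Return a True/False flag per value; True means above Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    upper_fence = q3 + k * (q3 - q1)
    return [v > upper_fence for v in values]

# Hypothetical view counts with one extreme value.
views = [120, 150, 130, 160, 140, 155, 5000]
flags = flag_high_outliers(views)
kept = [v for v, is_outlier in zip(views, flags) if not is_outlier]
print(kept)
```

A flagged point should still be checked by hand before it is excluded, as was done here.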
Missing Data
Another observation was missing data for videos that were shared to SGAG’s Facebook page. As those posts were shared onto SGAG's Facebook page, information such as the number of comments, likes and shares is not logged in SGAG's data set. Therefore, we excluded them as well.
Duplicated Data
Our team also checked the data for duplicate values and found two posts that were duplicates. These have also been excluded from the data set.
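The missing-data and duplicate checks can be sketched in pandas as follows. The column names and values are hypothetical, and keeping the first copy of a duplicated post is an assumption; passing keep=False to drop_duplicates would drop every copy instead.

```python
import pandas as pd

# Hypothetical posts: p2 appears twice, p3 is a shared post with no logged KPIs.
posts = pd.DataFrame({
    "Post ID": ["p1", "p2", "p2", "p3", "p4"],
    "Likes": [300, 45, 45, None, 80],
    "Shares": [20, 3, 3, None, 7],
    "Comments": [15, 2, 2, None, 4],
})

kpi_cols = ["Likes", "Shares", "Comments"]
cleaned = (
    posts.drop_duplicates(subset="Post ID", keep="first")  # assumption: keep one copy
         .dropna(subset=kpi_cols)                          # drop posts with missing KPIs
         .reset_index(drop=True)
)
print(cleaned)
```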
DATA PREPARATION
Month Column
To facilitate analysis by year and then by quarter, we added a month column so that we can easily filter the months of the year into the 4 quarters.
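A minimal sketch of deriving the month and quarter from a posting date, assuming calendar quarters (January to March is quarter 1). The function name and the sample date are hypothetical.

```python
from datetime import date

def month_and_quarter(posted):
    """Return (month, quarter) for a posting date, using calendar quarters."""
    month = posted.month
    quarter = (month - 1) // 3 + 1
    return month, quarter

print(month_and_quarter(date(2016, 5, 17)))
```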
Transforming the KPI columns
In addition, our client has identified 4 key performance indicators (KPI) for the videos - the number of unique views, number of likes, number of shares and number of comments.
Due to the large differences in the ranges of these KPIs, the data had to be transformed to normalize them. We adopted the Johnson Su transformation for all 4 variables so that they follow a normal distribution and fall within roughly the same range (approximately between -3 and 3).
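For reference, the Johnson Su transformation maps a value x to z = gamma + delta * asinh((x - xi) / lambda), where the four parameters are estimated from the data by the fitting software. The sketch below applies that formula with made-up parameters and made-up share counts; it does not reproduce the fitting step or the team's fitted values.

```python
import math

def johnson_su_to_z(x, gamma, delta, xi, lam):
    """Map x to a z-score using already-fitted Johnson Su parameters."""
    if delta <= 0 or lam <= 0:
        raise ValueError("delta and lam must be positive")
    return gamma + delta * math.asinh((x - xi) / lam)

# Hypothetical parameters and share counts, for illustration only.
params = dict(gamma=-1.0, delta=0.8, xi=50.0, lam=40.0)
shares = [5, 60, 400, 12000]
z_scores = [johnson_su_to_z(s, **params) for s in shares]
print([round(z, 2) for z in z_scores])
```

Because asinh grows logarithmically for large inputs, very large counts are pulled in towards the rest of the data while the ordering of the videos is preserved.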
After performing the Johnson Su transformation, we carried out a distribution analysis to check whether the 4 transformed variables follow a normal distribution before using them for further analysis, as seen in the figure below.
REVISED METHODOLOGY
Principal Component Analysis
For our more in-depth analysis, we have decided to employ Principal Component Analysis (PCA), which uses an orthogonal transformation to convert our 4 highly correlated variables, as seen in our multivariate analysis, into a set of linearly uncorrelated variables (the principal components). From there, we run a Fit Model using the principal components and the desired variables, plot the quantiles, means and standard deviations, then compare the means using the All Pairs, Tukey HSD function.
Currently, we have run the PCA on the full year of data and by quarter, to compare videos that were posted during the same time period. PCA was also run on self-produced videos and on videos starring the 4 main characters. By pairing PCA with the Tukey HSD function, we were able to see, with 95% confidence, which variables were doing better or worse than the others, and which variables do not seem to contribute to the videos’ performance (no p-value below 0.05).
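The PCA step can be sketched with NumPy as an eigen-decomposition of the correlation matrix. The data below are synthetic stand-ins for the 4 transformed KPIs, the function name is hypothetical, and the Fit Model and Tukey HSD comparison that follow in JMP are not shown.

```python
import numpy as np

def pca_on_correlations(X):
    """Return component scores, loadings (eigenvectors) and eigenvalues."""
    X = np.asarray(X, dtype=float)
    # Standardize each column so the PCA works on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(Z, rowvar=False)
    # eigh returns eigenvalues in ascending order; reverse to put PC1 first.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs
    return scores, eigvecs, eigvals

# Synthetic, highly correlated columns standing in for the 4 KPIs.
rng = np.random.default_rng(0)
shared = rng.normal(size=60)
kpis = np.column_stack([shared + 0.3 * rng.normal(size=60) for _ in range(4)])

scores, loadings, eigvals = pca_on_correlations(kpis)
explained = eigvals / eigvals.sum()
print(np.round(explained, 3))
```

The loadings are orthogonal and the component scores are uncorrelated with each other, which is what allows the first component to stand in as a single performance measure for four correlated KPIs.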
Nonparametric Analysis
In some cases, the PCA-based analysis did not give us statistically significant differences in performance because of the small data size. For example, our PCA suggests that whether content is sponsored or unsponsored does not impact the videos’ performance. However, this might not actually be the case: our small data size may simply be preventing the analysis from showing a statistical difference in video performance between sponsored and unsponsored content.
Therefore, our team would like to look into nonparametric analysis, which relies on statistical methods that do not assume the data follow a particular distribution, to identify which variables produce better video performance. As for the method of analysis, we will try all of the available tests (Wilcoxon, Median, van der Waerden and Kolmogorov-Smirnov) to see which yields the best results for us.
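To illustrate how a rank-based test works on a small sample, the sketch below computes an exact two-sided Wilcoxon rank-sum p-value by enumerating every way of splitting the pooled ranks into two groups. It is a plain-Python illustration, not JMP's implementation; the function names and the sponsored and unsponsored values are hypothetical, and full enumeration is only practical for small groups.

```python
from itertools import combinations

def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def exact_rank_sum_test(group_a, group_b):
    """Return (rank sum of group_a, exact two-sided p-value)."""
    ranks = midranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    observed = sum(ranks[:n_a])
    expected = n_a * (len(ranks) + 1) / 2
    observed_dev = abs(observed - expected)
    total = extreme = 0
    for subset in combinations(ranks, n_a):
        total += 1
        # Small tolerance guards against floating-point noise in the sums.
        if abs(sum(subset) - expected) >= observed_dev - 1e-9:
            extreme += 1
    return observed, extreme / total

# Hypothetical transformed performance scores.
sponsored = [-0.4, 0.1, 0.3, 0.9]
unsponsored = [-1.2, -0.8, -0.5, 0.2, 0.0]
rank_sum, p_value = exact_rank_sum_test(sponsored, unsponsored)
print(rank_sum, round(p_value, 3))
```

Only the ordering of the values enters the test, so it makes no normality assumption and is not driven by a few extreme videos.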
Simulation using Profiler
Besides using nonparametric tests to tackle the problems we face with a small data set, we would also like to try simulating more data using the profiler. The profiler uses our current data set (the training data set) to predict more possible data points. With more data points, we hope to get more meaningful results from our PCA.
Text Exploring
Further into the project, we would also like to use Latent Class Analysis, Latent Semantic Analysis (SVD) and Topic Analysis (Rotated SVD) to analyse the comments on particular groups of under-performing videos, especially videos that star Xiao Ming. He is currently the most featured character, and we are curious to know exactly why his videos underperform compared with those of the other characters. We believe that text mining the comments on his videos may give us more insights.
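As a rough idea of what Latent Semantic Analysis does, the sketch below builds a small document-term matrix and applies a truncated SVD with NumPy. The comments are made up for illustration (they are not taken from the SGAG data), the tokenizer is a plain whitespace split, and the rotation step of Topic Analysis is not shown.

```python
import numpy as np

# Made-up comments, for illustration only.
comments = [
    "so funny love this video",
    "love this so funny",
    "too long and boring",
    "boring video too long",
]

# Document-term count matrix: one row per comment, one column per word.
vocab = sorted({word for c in comments for word in c.split()})
counts = np.array(
    [[c.split().count(w) for w in vocab] for c in comments], dtype=float
)

# SVD of the count matrix; keep the k largest singular values as latent topics.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
doc_coords = U[:, :k] * s[:k]  # each comment placed in the k-dimensional topic space
top_terms = [
    [vocab[i] for i in np.argsort(-np.abs(Vt[t]))[:3]] for t in range(k)
]
print(top_terms)
```

Comments that use similar words end up close together in the reduced space, which is how groups of comments on under-performing videos could be summarised by a few themes.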