From Analytics Practicum
Revision as of 19:59, 16 October 2016 by Xiuming.hoe.2013 (talk | contribs)

To facilitate the data analysis, SGAG provided our team with datasets from its two main social media channels, Facebook and Twitter. Both datasets were obtained through the insights tools of the respective platforms and cover September 2015 to August 2016.

The datasets provided to us include SGAG Facebook Page Level Insights, Post Level Insights, Video Posts Insights and SGAG Twitter Activity Metrics.

However, due to time constraints, our team will focus only on Facebook.


Extracted Table
Some columns in the original dataset were extracted into a new table because the original layout does not support comparison analysis.

Group 1 Table1 1.png

The figure above is a snippet of how Lifetime Likes by Gender and Age was stored in the original Page Level dataset. This measure stores aggregated demographic data about the unique Facebook users who like SGAG's Page, based on the age and gender information they provide in their user profiles. The original format does not allow a comparison of daily changes in likes for a given gender across age groups, nor does it show the number of likes gained each day.

Hence, we extracted the data into the following table and calculated the differences in daily likes in order to achieve our objective:

ANLY482 Group1 Table1 2.png
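
The difference calculation described above can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the team's actual workflow; the bucket labels (such as `F.18-24`) and the sample figures are hypothetical stand-ins for the gender and age columns of the extracted table.

```python
from datetime import date

# Hypothetical lifetime-like counts per gender/age bucket, one row per day.
# Bucket labels and numbers are illustrative only, not SGAG data.
lifetime_likes = [
    {"date": date(2015, 9, 1), "F.18-24": 1000, "M.18-24": 900},
    {"date": date(2015, 9, 2), "F.18-24": 1012, "M.18-24": 905},
    {"date": date(2015, 9, 3), "F.18-24": 1009, "M.18-24": 920},
]


def daily_changes(rows):
    """Return the day-over-day change in lifetime likes for every bucket."""
    rows = sorted(rows, key=lambda r: r["date"])
    changes = []
    # Pair each day with the day before it and subtract bucket by bucket.
    for previous, current in zip(rows, rows[1:]):
        change = {"date": current["date"]}
        for bucket in current:
            if bucket != "date":
                change[bucket] = current[bucket] - previous[bucket]
        changes.append(change)
    return changes


changes = daily_changes(lifetime_likes)
for row in changes:
    print(row)
```

A positive value means the bucket gained likes on that day, and a negative value means it lost likes. The first date has no change row because there is no earlier day to compare it with.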

The first challenge the team encountered was manually reconciling advertiser data with post data (advertisements), because the advertiser information and the posts relevant to each advertiser were only available via a Dropbox link stored in a separate Microsoft Excel spreadsheet.

The second challenge was manually identifying the source of each video post. A video posted on SGAG’s page could be generated in-house (by SGAG) or be a shared video (from other Facebook users or the public). The source is identified from keywords in the post message and from the characters that appear in the video. For instance, a video post is considered a shared video if its post message contains words such as “credit to” or “submitted by” <name>. On the other hand, a video is considered in-house generated when any of the SGAG characters (e.g. Xiao Ming, Sue-Ann) appears in it. When a video has none of these characteristics, our team has to confirm its source with our sponsor, Mr. Karl.
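
The identification rules above could be expressed along these lines. This is only a sketch of the decision order, not a replacement for the manual review: the keyword list contains just the two phrases quoted in the text, and `characters_seen` is a hypothetical, manually filled field, since character appearances cannot be read from the exported post message.

```python
SHARED_KEYWORDS = ("credit to", "submitted by")  # the two phrases quoted in the text


def classify_video_source(post_message, characters_seen=()):
    """Label a video post as 'shared', 'in-house' or 'confirm with sponsor'.

    characters_seen is a hypothetical field listing the SGAG characters a
    reviewer spotted in the video; it is not part of the raw export.
    """
    message = (post_message or "").lower()
    if any(keyword in message for keyword in SHARED_KEYWORDS):
        return "shared"
    if characters_seen:
        return "in-house"
    # Neither rule applies, so the source has to be confirmed manually.
    return "confirm with sponsor"


# Illustrative messages only, not taken from the SGAG dataset.
print(classify_video_source("So funny! Credit to a reader"))
print(classify_video_source("Monday blues", characters_seen=["Xiao Ming"]))
print(classify_video_source("Monday blues"))
```

Checking the keywords first means a post that credits someone else is treated as shared even if a reviewer also listed a character for it; that ordering is an assumption of this sketch.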

Choice of Key Measurements
In our analysis, measurements such as Reach, Engagement, Impressions, Likes, Unlikes, Comments, Shares, Negative Feedback, and Video Views of various lengths have been chosen as the key performance indicators. Measurements such as Lifetime Post Paid Reach and Lifetime Post Total Reach will not be used as performance measurements because there are no paid posts in SGAG’s dataset: paid post reach is always 0, so organic reach is always the same as total reach. As such measurements would be redundant and meaningless in our analysis, we have excluded them from our analytical dataset.
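
A quick way to confirm this redundancy before dropping the columns is sketched below. The sample rows are made up, and the key `Lifetime Post Organic Reach` is an assumed column name; only the paid and total reach names appear in the text above.

```python
def paid_reach_is_redundant(posts):
    """True when every post has zero paid reach and organic reach equals total reach."""
    return all(
        post["Lifetime Post Paid Reach"] == 0
        and post["Lifetime Post Organic Reach"] == post["Lifetime Post Total Reach"]
        for post in posts
    )


# Hypothetical post-level rows, not SGAG data.
sample_posts = [
    {"Lifetime Post Total Reach": 5200, "Lifetime Post Organic Reach": 5200, "Lifetime Post Paid Reach": 0},
    {"Lifetime Post Total Reach": 870, "Lifetime Post Organic Reach": 870, "Lifetime Post Paid Reach": 0},
]

print(paid_reach_is_redundant(sample_posts))
```

If the check returns False for the real data, at least one post has paid reach and the columns should be kept.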


Several problems, such as duplicated data, missing values and outliers, were found in the dataset collected from SGAG. As these issues could affect the results of our analysis, suitable actions will be taken to handle them before the analysis is performed.

1. Missing Values:
After examining the datasets, we found a few missing values in the page level dataset and none in the post and video level datasets. As the columns with missing values in the page level dataset contain measures such as lifetime likes and daily demographic data that are critical to evaluating SGAG’s overall daily performance, the affected dates (26 January 2016, and 28 and 29 August 2016) will be removed from our subsequent page level analysis.
2. Duplicate Values:
No duplicates were found in the page level data. However, a handful of duplicated rows and duplicated post messages were found in both the post and video level datasets. Some of the common issues are as follows:
i. Same Post Message with Different Content
ANLY482 Group1 Figure2 1.png
Figure 2.1 shows two posts that are described by exactly the same post message. After looking into the posts, we found that they have different content, and we will therefore retain such posts in our further analysis.
ii. Identical Rows
ANLY482 Group1 Figure2 2.png
As seen in Figure 2.2, columns such as Post ID, Permalink and Post Message are the same across the two rows. Hence, in such situations we will remove one of the rows from our dataset.
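
The two cleaning rules above (dropping page-level days with missing values, and removing identical rows while keeping posts that merely share a message) might look like this in plain Python. It is a sketch under stated assumptions: the column names and sample rows are hypothetical, and "identical" is taken to mean that every column matches.

```python
def drop_incomplete_days(page_rows, required_columns):
    """Keep only page-level rows where every required column has a value."""
    return [
        row for row in page_rows
        if all(row.get(col) not in (None, "") for col in required_columns)
    ]


def drop_identical_rows(post_rows):
    """Remove rows that match an earlier row in every column, keeping the first."""
    seen = set()
    unique = []
    for row in post_rows:
        key = tuple(sorted(row.items()))  # order-independent fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows for illustration only.
page_rows = [
    {"Date": "2016-01-25", "Lifetime Total Likes": 500000},
    {"Date": "2016-01-26", "Lifetime Total Likes": None},
]
post_rows = [
    {"Post ID": "1_100", "Post Message": "Monday be like"},
    {"Post ID": "1_100", "Post Message": "Monday be like"},  # identical row
    {"Post ID": "1_200", "Post Message": "Monday be like"},  # same message, different post
]

clean_pages = drop_incomplete_days(page_rows, ["Lifetime Total Likes"])
clean_posts = drop_identical_rows(post_rows)
print(len(clean_pages), len(clean_posts))
```

Because the fingerprint covers every column, two posts with the same message but different Post IDs are both retained, which matches the treatment of the case in Figure 2.1.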
