Group04 Interim
Contents
- 1 Data Scraping
- 2 Data Preparation
- 3 Exploratory Data Analysis
- 4 Objective 1: Understanding Performance (Facebook Posts)
- 5 Objective 1: Understanding Performance (Facebook Comments)
- 6 Objective 3: Competitor Analysis (YouTube Videos)
- 7 Tools Used
- 8 Post Mid Terms Plans
Data Scraping
Facebook Posts & Comments
Using Facebook’s Graph API and a Python script, we scraped the metadata and comments of SGAG’s Facebook posts into a CSV file.
YouTube Posts
First, we used the Python package YouTube-DL, setting its parameters so that it returns only the metadata of each YouTube channel's videos without downloading the videos themselves. Next, we made the package print the JSON to the terminal, because by default it writes each video’s metadata to a separate JSON file, which would be extremely troublesome to parse. The printed data were then copied into a text file, as seen below:
We then parsed this raw, uncleaned dataset in Python with the following code to extract the features that we wanted.
Finally, the printed data were copied into a csv file for further analysis.
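A minimal sketch of this parsing step, using only the standard library. It is not the team's original code. It assumes the text file holds one JSON object per line (which is what youtube-dl's `--dump-json` option prints), and the field list is an assumed subset of youtube-dl's metadata keys, since the features the team kept are not listed here. The sample record is invented for illustration.

```python
import csv
import io
import json

# Assumed subset of youtube-dl metadata keys; the team's actual feature list is not shown.
FIELDS = ["id", "title", "uploader", "upload_date", "view_count",
          "like_count", "dislike_count", "average_rating"]


def parse_metadata_lines(lines):
    """Turn one-JSON-object-per-line text into a list of flat row dicts."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and non-JSON terminal output
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that were truncated when copied
        # Keep only the wanted fields; missing keys become None.
        rows.append({field: record.get(field) for field in FIELDS})
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample input: one metadata record and one stray terminal message.
sample = [
    '{"id": "abc123", "title": "Demo video", "uploader": "Demo channel", '
    '"upload_date": "20170101", "view_count": 1000, "like_count": 50, '
    '"dislike_count": 5, "average_rating": 4.6, "formats": []}',
    "[youtube] some progress message",
]
rows = parse_metadata_lines(sample)
csv_text = rows_to_csv(rows)
```

Writing the CSV from code rather than copying printed output avoids manual copy-paste errors; in practice `rows_to_csv` could write straight to a file instead of a string.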
Data Preparation
Facebook Posts
We applied the following steps to transform our 3,805 data points for analysis.
Removing Outliers
To check for outliers, we examined our numerical variables: num_reactions, num_comments, and num_shares. After looking into several extreme data points for each of these performance measures, we found that the top-performing Facebook posts were the result of virality. As viral posts may skew the overall results, we identified and removed outliers using Quantile Range Outliers analysis in JMP. In total, we removed 102 data points.
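For readers without JMP, the quantile range rule can be approximated in plain Python. This is a sketch under assumptions, not JMP's implementation: it uses the tail quantile of 0.1 and Q of 3 that the YouTube section reports as JMP's defaults, it treats a value as an outlier when it lies more than Q times the inter-quantile range beyond the lower or upper tail quantile, and it computes quantiles by simple linear interpolation, which may differ slightly from JMP's method. The sample values are invented.

```python
def quantile(sorted_values, p):
    """Quantile by linear interpolation between neighbouring order statistics."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def quantile_range_outliers(values, tail=0.1, q=3.0):
    """Return (outliers, (low_cut, high_cut)) under the quantile range rule."""
    ordered = sorted(values)
    low_q = quantile(ordered, tail)
    high_q = quantile(ordered, 1 - tail)
    spread = high_q - low_q           # inter-quantile range
    low_cut = low_q - q * spread      # values below this are outliers
    high_cut = high_q + q * spread    # values above this are outliers
    outliers = [v for v in values if v < low_cut or v > high_cut]
    return outliers, (low_cut, high_cut)


# Invented example: twenty ordinary posts and one viral post.
num_shares = list(range(1, 21)) + [500]
outliers, cuts = quantile_range_outliers(num_shares)
```

With these defaults only very extreme values are flagged, which matches the intent of removing viral posts rather than merely strong performers.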
Creating New Variables
Newly created variables are identified below. Publishing time is binned into 3-hour intervals (time_bin_3hr), except for 1am to 6.59am, whose posts are combined into one bin to increase the sample size for analysis. We also created the variables time_hr_8hr and day_published (weekday or weekend) to further widen the time and day bins and avoid foreseeable issues with small sample sizes.
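The binning described above could be sketched as follows. The exact bin edges are an assumption: only the 3-hour width and the merged 1am to 6.59am bin come from the text, and the remaining bins are laid out to cover the rest of the day. The label format is also illustrative.

```python
from datetime import datetime

# Assumed bin edges as (start hour inclusive, end hour exclusive).
# Hour 24 stands for 12:00am-12:59am so the last bin can wrap past midnight.
TIME_BINS_3HR = [(1, 7), (7, 10), (10, 13), (13, 16), (16, 19), (19, 22), (22, 25)]


def time_bin_3hr(published):
    """Label the publishing time with its bin, e.g. '10:00-12:59'."""
    hour = published.hour if published.hour >= 1 else 24
    for start, end in TIME_BINS_3HR:
        if start <= hour < end:
            return f"{start % 24:02d}:00-{(end - 1) % 24:02d}:59"
    raise ValueError(f"hour {hour} is not covered by any bin")


def day_published(published):
    """Classify the publishing day as 'weekday' or 'weekend'."""
    return "weekend" if published.weekday() >= 5 else "weekday"


post_time = datetime(2017, 3, 6, 10, 15)  # a Monday morning, for illustration
labels = (time_bin_3hr(post_time), day_published(post_time))
```

Merging the early-morning hours into one bin keeps every bin populated, at the cost of less detail for a period when few posts are published.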
Updated Metadata table
Below is the final metadata of the data we have transformed and cleaned.
Facebook Comments
Using Facebook’s Graph API and Python code, we scraped a total of 56 Facebook posts with a combined 21,940 comments. In this section, we analyze the number of comments and the comment messages themselves.
Removing Outliers
The number of comments is used as the key indicator for identifying an outlier, similar to the method used for Facebook posts. Among the 102 outliers identified in the Facebook posts dataset, 2 viral posts were found in the Facebook comments dataset. To ensure uniformity, these 2 viral posts, which account for 7,797 comments, were removed.
Creating New Variables
A total of 10 new variables were created using Excel (mainly the INDEX, MATCH and WEEKDAY functions) and JMP (the hours and formula functions). The metadata of the new variables can be seen below:
Data Transformation
Removing Emoji from Comment Message
As the scraped data were from social media, there were emojis present. To ensure that we are performing text analysis only, we decided to remove emojis from our dataset. After eyeballing the data and cross-referencing to the original posts and/or comments, we realized that emojis were scraped as symbols instead. An example of emojis (_����) appearing in a post can be seen below:
Emojis were removed using an iterative loop and Python’s regular expression package (re), i.e. by removing non-alphabetical and non-numerical strings. The code used can be seen below:
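A minimal sketch of this kind of clean-up with the `re` package. It is not the team's original code and assumes that "non-alphabetical and numerical strings" means every character that is not an ASCII letter, digit or whitespace. Note that under this assumption punctuation is removed along with the emoji symbols. The sample comments are invented.

```python
import re

# Any run of characters that is not an ASCII letter, digit or whitespace.
NON_ALPHANUMERIC = re.compile(r"[^A-Za-z0-9\s]+")
EXTRA_SPACES = re.compile(r"\s+")


def strip_symbols(comment):
    """Replace symbol runs (emoji, mis-encoded bytes, punctuation) with a space."""
    cleaned = NON_ALPHANUMERIC.sub(" ", comment)
    # Collapse the whitespace left behind by the removals.
    return EXTRA_SPACES.sub(" ", cleaned).strip()


# Invented comments: one with real emoji, one with replacement characters.
comments = ["So funny \U0001F602\U0001F602!!", "_\ufffd\ufffd nice one"]
cleaned_comments = [strip_symbols(comment) for comment in comments]
```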
Ensuring Consistency with Different Permutations of “Hahas”
After eyeballing the data, we realised that commenters write their “hahas” in different permutations, e.g. “HAHAHAH”, “HaHa” or “hahahahaha”. While this difference may be relevant in Sentiment Analysis (e.g. all caps being more positive), it is not relevant in other analyses such as Topic Modelling and Document Clustering. An example in which the same author used “hahahahahaha” and “Hahahaha” in different comments can be seen below:
Hence, we decided to replace all of these variants with just “haha” using an iterative loop and Python’s regular expression package. This was done because stemming did not detect these words as being the same. The code used can be seen below:
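A minimal sketch of such a normalisation, which is not the team's original code. The pattern is an assumption: it matches whole words made of two or more repetitions of "ha" with an optional trailing "h", ignoring case, so variants such as "ahaha" or "hehe" are deliberately left untouched.

```python
import re

# Whole words like "haha", "HAHAHAH" or "Hahahaha", matched case-insensitively.
HAHA = re.compile(r"\b(?:ha){2,}h?\b", re.IGNORECASE)


def normalise_haha(comment):
    """Collapse every laughing variant to the single token 'haha'."""
    return HAHA.sub("haha", comment)


# Invented comments using the variants quoted in the text above.
comments = ["HAHAHAH that is funny", "HaHa ok hahahahaha", "Hahahaha"]
normalised = [normalise_haha(comment) for comment in comments]
```

The word boundaries keep the pattern from rewriting the middle of unrelated words, and a single substitution call replaces every occurrence in a comment without an explicit inner loop.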
Methodologies Used for Facebook Comments
To further understand Facebook comments, we used Latent Dirichlet Allocation (LDA) to explore the types of topics that Facebook users talk about. First, we tried a range of topic counts, from 3 to 15. After eyeballing the results, we found that 10 topics was the most suitable number for Facebook comments. An example of the results can be seen below:
By eyeballing the results, we were unable to identify a key category for each of these topics. Therefore, we decided to look at topics containing words with a probability of over 0.1, to understand what those topics describe. The words that fit this criterion are: (i) Chicken, (ii) Think, (iii) Know and (iv) Go.
Next, to understand users’ likes and dislikes of these terms, we calculated sentiment scores using the TextBlob Python package.
YouTube Posts
To help SGAG conduct a comprehensive competitor analysis on YouTube, we scraped all metadata from the YouTube channels of Night Owl Cinematics, TheSmartLocal and SGAG into .json files using YouTube-DL. Using Python, we parsed the necessary data into CSV format before importing the data into JMP Pro.
Removing Outliers
View Counts
We then removed outliers based on view counts, as view count is the only indicator of a viral YouTube video. Note that we removed outliers channel by channel, as each channel has a different average level of performance. We did this using JMP Pro's Quantile Range Outliers analysis, with the default tail quantile of 0.1 and Q scaling factor of 3. The table below shows examples of outliers:
As mentioned in the proposal report, the Pokemon Go prank went viral: it was carried out during the Pokemon Go craze by the SGAG team, who tricked the public into thinking there was a Snorlax nearby when there was none.
Published Date
To ensure a fair comparison in terms of time frame, we filtered the data to include only videos published after 15 October 2014, because the first SGAG YouTube video was published on 15 October 2014.
The table below summarizes the new number of data points after removing such outliers:
Creating New Variable
We created the following new variable that may be used for analysis:
Updated Metadata Table
Below is the final metadata of the data we have transformed and cleaned.
Exploratory Data Analysis
The following sections outline the main findings and charts for (a) Facebook posts, (b) Facebook comments, and (c) YouTube posts. Please refer to the report for detailed insight into the methodologies used.
Objective 1: Understanding Performance (Facebook Posts)
For the interim, we focus on finding the optimal publishing time and day for Facebook posts. We identified the following numerical variables as performance measures: (a) number of reactions, (b) number of comments and (c) number of shares. For each performance measure, we look at the overall optimal time and day, and also divide the dataset by status_type (photo or video posts) and by sponsorship (sponsored or non-sponsored posts), as seen in the roadmap below.
The table below summarizes our findings, together with the confirmatory tests used to verify the EDA.
We can see from the table below that SGAG currently has a publishing time and day practice that is not aligned with the identified optimal time and day.
We recommend that SGAG change its publishing time, depending on which performance measure it prioritizes.
- If there is no preferred performance measure, we recommend the time slots that are common across all performance measures: 10am - 12pm and 1pm - 3pm.
- SGAG may choose different publishing times depending on the characteristics of the post, i.e. whether it is a photo or video post and whether it is sponsored or non-sponsored.
Publishing day does not matter: average performance does not differ significantly across the days of the week. As such, SGAG can publish more evenly across the week.
As there may be other underlying drivers of performance, SGAG should experiment by publishing at the suggested optimal times to see whether performance indeed changes over time. It is therefore important that SGAG monitors and reviews the performance of its Facebook posts and compares it with current performance to see if there is any improvement.
Objective 1: Understanding Performance (Facebook Comments)
Overview EDA of Facebook Comments
The graph below shows the various analyses that we perform and their associated objectives.
Optimal Time to Post for Facebook Comments
The table below summarizes what was identified during EDA.
Key Findings
In terms of the number of comments, SGAG yields better results when it publishes sponsored video posts. Generally, regardless of whether a post is a photo or a video, commenting dies down within the first 3 to 4.5 hours after publication.
Business Insights
For sponsored posts, SGAG should publish more video posts to obtain a high number of comments. If SGAG still needs more comments, it can use Facebook's Boost option on that particular post after the post has died down (after either 3 hours or 4.5 hours).
Appealing and Impactful Facebook Topics
The table below shows the sentiment scores of the selected words.
After exploring the sentiment scores and the individual word clouds, and eyeballing some of the comments, we realized that users are generally more excited to talk about food. Among the other three topics, ‘Go’, ‘Think’ and ‘Know’, ‘Go’ has the highest sentiment score. SGAG can therefore create content about places to visit, preferably about travelling outside of Singapore or food places in Singapore.
Objective 3: Competitor Analysis (YouTube Videos)
Overview EDA of YouTube Video Posts
The graph below shows the various analyses that we perform and their associated objectives.
Note that we look at every performance indicator (View Counts, Net Likes and Ratings), as they are not highly correlated with each other. In addition, the table below shows the key indicators of SGAG.
View Counts
View counts are used to judge the performance of each YouTube video. By analyzing view counts, we are able to understand the performance of SGAG vis-à-vis its competitors. We also look at the trends and view share of each channel to understand what SGAG can do to perform better at a macro level.
Average View Count
As seen in the image below, Night Owl Cinematics is the clear leader in terms of average view counts, whereas SGAG lags far behind. Note that the channels are all statistically different from each other at a 95% confidence level.
Overall View Count Trends
As seen in the graph, there seems to be a general declining trend in terms of average view counts. In fact, the average yearly growth rate is -34.99%.
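The text does not state how the average yearly growth rate was computed. One common reading, assumed here, is the arithmetic mean of the year-over-year percentage changes in the yearly average view count. The sketch below illustrates that reading with invented numbers; it does not reproduce the -34.99% figure.

```python
def average_yearly_growth(yearly_averages):
    """Mean of year-over-year percentage changes, taking years in ascending order."""
    years = sorted(yearly_averages)
    changes = [
        (yearly_averages[curr] - yearly_averages[prev]) / yearly_averages[prev] * 100
        for prev, curr in zip(years, years[1:])
    ]
    return sum(changes) / len(changes)


# Invented yearly average view counts, for illustration only.
example = {2014: 200_000, 2015: 150_000, 2016: 120_000, 2017: 60_000}
rate = average_yearly_growth(example)  # mean of -25%, -20% and -50%
```

A compound annual growth rate would be an alternative definition and generally gives a different number, so the method should be stated alongside any reported rate.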
However, total view counts have been increasing since 2014. Average view counts have been decreasing because the number of videos published has been increasing consistently since 2014. This can be seen in the graph below.
In other words, viewers are not watching every single video published and each channel is producing more varied content that appeals to different customer segments.
As seen in the graph above, The Smart Local's view counts have been growing at the fastest rate, relative to its competitors.
As seen in the table above, Night Owl Cinematics' average view count has been decreasing, but at a slower rate than the whole industry whereas The Smart Local's average view count has been increasing over the years. However, The Smart Local had a decreasing trend, starting in 2016, implying that SGAG needs to break away from the norm that the other channels have established over the years. In short, emulating Night Owl Cinematics and The Smart Local entirely would not lead to better outcomes in the long run. It is clear that their current video format and content would not help SGAG become the dominant player in the near future.
As the average view counts of the channels are statistically different from each other at a 95% confidence level, we are able to compare across channels. As seen in the graph below, in 2017, SGAG published 31.6% of videos but obtained only 3.6% of the share of view counts.
In short, the quantity of videos published is not the driving factor of video performance. This can be best exemplified by Night Owl Cinematics, whereby it accounts for the least number of videos published but the majority of viewshare in 2017.
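The comparison of share of videos published against share of view counts is simple arithmetic, sketched below. The channel names and totals are placeholders invented for illustration and are not the project's data.

```python
def shares(totals):
    """Convert per-channel totals into percentage shares of the overall total."""
    grand_total = sum(totals.values())
    return {channel: value / grand_total * 100 for channel, value in totals.items()}


# Placeholder totals for one year; not the project's actual figures.
videos_published = {"Channel A": 50, "Channel B": 30, "Channel C": 20}
total_views = {"Channel A": 1_000_000, "Channel B": 6_000_000, "Channel C": 3_000_000}

video_share = shares(videos_published)
view_share = shares(total_views)
# A channel whose video share far exceeds its view share is publishing
# many videos that each attract relatively few views.
gap = {channel: video_share[channel] - view_share[channel] for channel in video_share}
```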
Summary
A summary of insights derived from analysis of view counts can be seen below:
- Night Owl Cinematics has the highest average view counts and SGAG has the lowest.
- Since 2015, average view counts of all channels are declining over time.
- The number of videos published is increasing for each channel over time.
- Total view count is increasing for each channel.
- Night Owl Cinematics’ share of viewership has been decreasing since 2015, while The Smart Local’s and SGAG’s shares of viewership have been increasing since 2015. The Smart Local’s share of viewership has been increasing much more rapidly than SGAG’s.
- SGAG published 31.6% of videos in 2017 but only accounts for 3.6% of viewership. In contrast, Night Owl Cinematics published the least number of videos in 2017 but accounts for the majority of viewership.
- Night Owl Cinematics’ average view counts are declining at a slower rate than industry.
- The Smart Local’s average view counts are increasing from an overall basis but decreasing from 2016 onwards.
Collectively, it implies that:
- Viewers are not watching every single video published and that each channel is producing varied content that appeals to different customer segments.
- The quality and not quantity of videos published matters.
- Night Owl Cinematics is the clear leader in terms of view counts but The Smart Local is the leader in terms of growth.
- SGAG has to break away from the norm that the other channels have established over the years as viewer fatigue has clearly set in. SGAG needs to reinvent its content in order to break this declining trend.
- Emulating Night Owl Cinematics and The Smart Local entirely would not lead to better outcomes in the long run. It is indeed clear that their video format and content would no longer work in the near future, in that it would not guarantee that SGAG becomes the market leader.
Net Likes
Net likes are another indicator that can be used to judge the performance of each YouTube video. This is because an extremely unpopular video can also go viral and obtain a huge number of views. Hence, net likes show how desirable each video is perceived to be by its users. This would be a leading indicator, as users would only click on and view videos that they perceive as desirable.
Overall Average Net Likes
As seen in the graph, SGAG has the lowest average net likes whereas Night Owl Cinematics has the highest. This difference is statistically significant at a 95% confidence level.
Average Net Likes Trend
As seen in the graph, there seems to be a general declining trend in terms of average net likes. The average yearly growth rate is -20.71%.
To better understand the source of this trend, we split net likes into likes and dislikes. As seen in the table below, both the like count and the dislike count are positively correlated with the net like count: the like count is highly correlated (0.9988) and the dislike count is moderately correlated (0.696). These insights were derived from a correlation analysis at a 95% confidence level. This implies that the main driving force behind the decreasing trend in net like count is the decrease in like count, relative to dislike count.
A graphical representation of these trends can be seen below.
After zooming into each channel individually, we realised that The Smart Local is the only channel whose average net like count is increasing over time, whereas those of Night Owl Cinematics and SGAG are decreasing. This can be seen in the summary table below.
As seen in the pie chart below, Night Owl Cinematics had the highest share of net likes in 2017. However, its share has dropped dramatically, from 96% in 2014 to 55% in 2017, due to the increasing number of net likes from SGAG and The Smart Local. Note also that SGAG had the smallest share from 2014 to 2017. More details are explored below.
Ratio of Average Like and Dislike Count
After confirming that the average likes and dislikes of each channel are statistically different from each other, we calculated the ratio of average like count to average dislike count to obtain a fair comparison across channels. As seen in the table below, Night Owl Cinematics has the lowest ratio, implying that its videos are relatively more polarising than those of SGAG and The Smart Local. The Smart Local is the leader in terms of this ratio.
In addition, SGAG should take a look at their top and bottom 10 performing videos (in terms of this ratio), to understand what worked and what did not. The top and bottom 10 videos can be seen in the table below.
Average Net Likes per View Trends
As seen in the graph below, all channels increased in terms of average net likes per view in 2017. However, it is clear that Night Owl Cinematics has been lagging behind SGAG and The Smart Local. In addition, the average net likes per view is not statistically different between SGAG and The Smart Local in 2017, implying that viewers perceive the two channels as equally positive.
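The derived measures used in this section can be sketched as follows. The definitions are assumptions, since the text does not spell them out: net likes are taken as likes minus dislikes, the ratio as average like count divided by average dislike count, and net likes per view as a per-video value that is then averaged. The field names follow youtube-dl's metadata keys and the sample videos are invented.

```python
def net_likes(video):
    """Assumed definition: like count minus dislike count."""
    return video["like_count"] - video["dislike_count"]


def like_dislike_ratio(videos):
    """Average like count divided by average dislike count for one channel."""
    avg_likes = sum(v["like_count"] for v in videos) / len(videos)
    avg_dislikes = sum(v["dislike_count"] for v in videos) / len(videos)
    return avg_likes / avg_dislikes


def average_net_likes_per_view(videos):
    """Mean of per-video net likes per view, skipping videos with no views."""
    per_view = [net_likes(v) / v["view_count"] for v in videos if v["view_count"] > 0]
    return sum(per_view) / len(per_view)


# Invented videos for one channel, for illustration only.
channel_videos = [
    {"like_count": 90, "dislike_count": 10, "view_count": 1000},
    {"like_count": 30, "dislike_count": 10, "view_count": 200},
]
ratio = like_dislike_ratio(channel_videos)
per_view = average_net_likes_per_view(channel_videos)
```

Dividing by views separates how much viewers like a video from how many people saw it, which is why a channel with low net likes can still score well per view.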
Summary
A summary of insights derived from analysis of net likes can be seen below:
- Night Owl Cinematics has the highest average net like count while SGAG has the lowest.
- In terms of total net like count share, Night Owl Cinematics has the majority share but its percentage share has been steadily declining since 2014. The Smart Local's share has been increasing rapidly, whereas SGAG's share has increased slightly and has consistently been the lowest.
- The average net like count of all channels collectively has been decreasing since 2015.
- The total net like count of all channels collectively has been increasing.
- The Smart Local is the leader in terms of the ratio of average like and dislike count whereas Night Owl Cinematics is last.
- The Smart Local is the only channel whose average net like count is increasing with time, with an increasing average net like per view. SGAG has decreased the most.
- The Smart Local and SGAG are performing equally well in terms of average net likes per view in 2017.
Collectively, it implies that:
- Viewers are leaving fewer likes on average for each video.
- SGAG’s low average net like count stems from the low number of views and not because viewers do not like their videos as much.
- SGAG needs to investigate the top 10 and bottom 10 videos as highlighted above, learning from what worked and what did not.
- SGAG’s and The Smart Local’s videos are not as polarizing as Night Owl Cinematics’ videos.
- The Smart Local is the leader in terms of likes for videos.
- Users perceive SGAG’s and The Smart Local’s videos to be the most positive in 2017.
- The Smart Local has performed consistently well in terms of average net like count and is the leader.
Average Ratings
Another performance measure is YouTube’s own average rating, which helps SGAG to understand how it measures up vis-à-vis its competitors.
Overall Average Ratings
As seen in the table below, SGAG and The Smart Local have the highest average ratings and are not statistically different from each other at a 95% confidence level. Their ratings are much higher than Night Owl Cinematics' average ratings.
Average Ratings Trends
In terms of trend analysis of average ratings, the industry is growing on a yearly basis at an average of 0.41%. SGAG and The Smart Local are growing at a faster rate of 1.01%, and there is a slight declining trend for Night Owl Cinematics (-0.38%). This can be seen in the table below.
Summary
A summary of insights derived from analysis of average ratings can be seen below:
- The Smart Local and SGAG are performing equally well and much better than Night Owl Cinematics and around the same level as the industry.
- There is an increasing trend after 2015 in terms of average ratings collectively for all channels.
- The Smart Local & SGAG are performing equally well and well above Night Owl Cinematics, in terms of growth of average ratings.
Collectively, it implies that:
- SGAG should learn from The Smart Local instead of Night Owl Cinematics as The Smart Local is the leader in terms of average ratings.
Conclusion
The above analysis allows us to judge SGAG’s position in the industry based on various indicators. The table below summarizes how SGAG is doing at a macro level across all features as a whole. In addition, it shows us which channel to analyze further after midterms, allowing SGAG to learn best practices from the industry’s leader.
As seen in the table above, it is clear that we should take a deep dive into The Smart Local, as it is the rising star of the industry and is even the clear market leader on certain parameters.
Tools Used
We used the following tools to perform the above analysis:
- Python
- Tableau
- JMP Pro
- Excel
We used the following Python packages to scrape and perform the above analysis:
- YouTube-DL
- TextBlob
- RE
Post Mid Terms Plans
Objective 1: Understanding Performance (Facebook Posts)
Discovering other performance drivers of Facebook posts
For the interim, we focused on finding the optimal publishing day and time for Facebook posts to improve the performance of content. However, there may be other drivers that affect performance measures as well. We will be exploring how to uncover these drivers post midterm using the following analyses:
- K-means clustering
- Determine if each cluster reveals any characteristics
- Look into categorical variables in each cluster’s posts for information e.g. any cluster with a very high/low number of photo posts
- Multilinear regression
- Identify, for each performance measure, which variables are significantly related to performance
Understand performance of sponsorship
While better performance in Facebook posts in general increases exposure and builds better brand reputation, ultimately, sponsorship affects the company’s bottom line. As such, we will be looking into existing sponsored posts identified in the dataset.
- Determine the (a) sponsor, (b) content, and (c) performance to identify if there is any relationship
- Look into various reactions (e.g. angry, love) of each sponsored post to understand audience reaction
- Identify clients that most frequently engage with SGAG to generate insights specific to these clients
Create control chart
We will also prepare a control chart for SGAG to track and review performance of their Facebook posts, should they decide to post in accordance to our recommended times and days.
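The type of control chart has not been decided. As one possibility, the sketch below assumes an individuals (I) chart, whose limits are the baseline mean plus or minus 2.66 times the average moving range between consecutive posts. The function names and the sample figures are hypothetical.

```python
def individuals_chart_limits(baseline):
    """Return (lower limit, centre line, upper limit) for an individuals chart."""
    centre = sum(baseline) / len(baseline)
    # Moving range: absolute change between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    # 2.66 is approximately 3 / 1.128, the standard constant for moving ranges of two points.
    spread = 2.66 * mr_bar
    return centre - spread, centre, centre + spread


def out_of_control(values, limits):
    """Indices of observations that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, value in enumerate(values) if value < lower or value > upper]


# Hypothetical reaction counts: a baseline period, then newly published posts.
baseline_reactions = [100, 110, 90, 105, 95]
limits = individuals_chart_limits(baseline_reactions)
flagged = out_of_control([120, 140, 60], limits)
```

Posts falling outside the limits after SGAG switches to the recommended times would suggest a real change in performance rather than ordinary post-to-post variation. For count data the lower limit may be clamped at zero.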
Objective 1: Understanding Performance (Facebook Comments)
- Look for negative topics based on number of negative reactions
- To understand what SGAG is doing wrong
- Look for positive topics based on number of positive reactions
- To understand what SGAG is doing right
- Explore whether comment behaviour changes across the days
Objective 3: Competitor Analysis (YouTube Videos)
As we now understand which channel is the market leader and the one to emulate, we can now attempt to learn the best practices from this channel. We would be focusing on the following post midterms:
- Obtaining the best time and day to post a video
- Understanding what makes The Smart Local the leader among these competitors and how SGAG can emulate it.
In addition, we would be performing text analysis on a separate dataset, namely YouTube comments.