Group04 Final




Overview

In this section, we will use nonparametric statistical tests and text analysis to understand the factors that affect content performance. A clear understanding of these factors will enable the company to shape its future strategy and continuously strive for better performance.

We will explore posting times and content as factors of performance and identify appropriate methodologies to analyze their effects. To reach a wide range of audiences, the company is currently active on Facebook and YouTube, so we will be looking at data scraped from both platforms.

For the Facebook Post dataset, the performance of posts will be compared across posting times to determine whether specific posting times affect performance, while text analysis will be performed on consumers’ comments from the Facebook Comment dataset and the YouTube dataset to identify whether different topics surface differing sentiments. After our literature review, we have chosen Topic Modeling and Sentiment Analysis as the preferred methodologies for text analysis, and the Median Test will be used to compare performance across different posting times.

Facebook Posts

The company is concerned that publishing content on Facebook on different days and at different times will affect its content’s engagement performance. However, it has yet to establish a methodology to study the impact of publishing day and time on performance.

Studies have shown that identifying the optimal time to reach an audience drives social media engagement and traffic. Because of Facebook’s algorithm-based feed, having a large audience does not necessarily translate into high viewership. Instead of viewership, reach is the metric Facebook uses to measure the number of people who have seen a particular piece of content.

The main objective is to identify the optimal time to publish a post that will result in the highest reach, as reach drives engagement. However, we were unable to scrape this metric as it is not publicly available. We considered combining the three performance metrics that were scraped (i.e. number of reactions, shares, and comments) into a single metric as a proxy for reach. However, this approach is infeasible because each metric accumulates data over a different length of time. While we have determined that comments are no longer made on posts over six days old, we do not have time-series data on reactions and shares to perform a similar analysis and determine when the last reaction or share occurs after a post is published.

Due to the limited scope of the scraped data, we use the number of comments as a proxy for reach when measuring performance. However, we acknowledge this to be a limitation, as comments are not a true representation of viewership.

Facebook Comments

For the Facebook comments, we seek to understand how consumers perceive the respective Facebook posts. Popular topics identified within the comments and their sentiment scores will be explored using Latent Dirichlet Allocation (LDA) and Sentiment Analysis. According to the literature, sentiment analysis allows us to identify positive and negative opinions and emotions, and it will be performed using the TextBlob Python package. TextBlob was chosen because prior research has used this package to perform sentiment analysis on social media. The objective is to identify possible insights and define actionable plans based on the Facebook Comment dataset.
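As a minimal sketch of this step, the snippet below scores each comment with TextBlob's polarity and subjectivity. The file name and the comment_text column are placeholders for the scraped Facebook Comment dataset, not its actual schema.

import pandas as pd
from textblob import TextBlob

# Placeholder file and column names for the scraped Facebook Comment dataset
comments = pd.read_csv("facebook_comments.csv")

# TextBlob polarity ranges from -1 (negative) to 1 (positive);
# subjectivity ranges from 0 (objective) to 1 (subjective)
comments["polarity"] = comments["comment_text"].astype(str).apply(
    lambda text: TextBlob(text).sentiment.polarity)
comments["subjectivity"] = comments["comment_text"].astype(str).apply(
    lambda text: TextBlob(text).sentiment.subjectivity)

print(comments[["polarity", "subjectivity"]].describe())

If the sentiment scores reported later on a 0-to-1 scale are rescaled polarities, a transformation such as (polarity + 1) / 2 would map TextBlob's output onto that range; this rescaling is an assumption rather than something stated in the report.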

Youtube

We will also analyze comments made on the YouTube videos through sentiment analysis, as this technique has been used for “analysis of user comments” from YouTube videos by other researchers. Performing sentiment analysis on YouTube comments is ideal as “mining the YouTube data makes more sense than any other social media websites as the contents here are closely related to the concerned topic [the video]”. We will perform sentiment analysis, using TextBlob, on scraped comments to understand the consumers’ sentiments, i.e. level of positivity, vis-a-vis the content of the published videos. These comments would be from videos published from 2017 onwards, ensuring that the analysis is still relevant given the fast-paced dynamic nature of YouTube channels.
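A possible sketch of the filtering and scoring step is shown below, assuming the scraped YouTube dataset carries a video publication date and a comment text column (all names below are placeholders).

import pandas as pd
from textblob import TextBlob

yt = pd.read_csv("youtube_comments.csv")                 # placeholder file name
yt["video_published_at"] = pd.to_datetime(yt["video_published_at"])

# Keep only comments on videos published from 2017 onwards
recent = yt[yt["video_published_at"] >= "2017-01-01"].copy()

recent["polarity"] = recent["comment_text"].astype(str).apply(
    lambda text: TextBlob(text).sentiment.polarity)

# Median polarity per video gives a robust per-video sentiment summary
print(recent.groupby("video_id")["polarity"].median().sort_values())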

Methods and Analysis

Facebook Posts

Using nonparametric statistical tests, comments, our proxy metric for reach, will be compared across publishing days and times to determine which timings yield the highest number of comments.

Methodology

As the chosen performance indicator has a right-skewed distribution (Display 1), the nonparametric Median Test will be used to test the hypothesis that population medians are equal across categorical groups. This determines whether there are statistically significant differences between publishing days and times, or whether the differing levels of performance are merely due to random variation in the selected samples. The median has been chosen as the appropriate measure of central tendency because the distribution is right-skewed and the outliers within the dataset would significantly distort the mean.

Group04 commentsdistribution.png
Display 1: Distribution of Comments
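The Median Tests in this report were run in JMP Pro; the sketch below shows a rough open-source analogue using Mood's median test in SciPy. The file and column names (facebook_posts.csv, publish_day, comments) are placeholders rather than the actual dataset schema.

import pandas as pd
from scipy.stats import median_test

posts = pd.read_csv("facebook_posts.csv")                # placeholder file name

# One array of comment counts per publishing-day group
groups = [g["comments"].values for _, g in posts.groupby("publish_day")]

# Mood's median test: are the group medians plausibly equal?
stat, p_value, grand_median, table = median_test(*groups)
print(f"grand median = {grand_median}, chi-square = {stat:.2f}, p-value = {p_value:.4f}")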

The Median Test was chosen over the Kruskal-Wallis test as it is more robust against outliers; for datasets with extreme outliers, the Median Test should be used. Furthermore, the Kruskal-Wallis test requires several assumptions to hold. In particular, the assumption that variances are approximately equal across groups does not hold for several of the time groups.

Specifically, the Median Test will test the following hypotheses to compare performance across publishing days:

H0: The median numbers of comments across all publishing days are not significantly different from each other.
H1: The median number of comments of at least one publishing day is significantly different from the others.

To compare performance across publishing time bins, the Median Test will test the following hypotheses:

H0: The median numbers of comments across all publishing time bins are not significantly different from each other.
H1: The median number of comments of at least one publishing time bin is significantly different from the others.

To determine which publishing day or time groups are significantly different from the others, pairwise Median Tests are carried out as post-hoc tests. Finally, the top-performing day or time group(s) with the highest medians will be identified as the optimal publishing day(s) or time(s).
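Because the post-hoc step involves many pairwise comparisons, a correction for multiple testing is needed. The sketch below uses a Bonferroni correction, which is an assumption about the adjustment rather than something stated in the report.

from itertools import combinations
from scipy.stats import median_test

def pairwise_median_tests(samples_by_group, alpha=0.05):
    """samples_by_group maps a group label to an array of comment counts."""
    pairs = list(combinations(samples_by_group, 2))
    adjusted_alpha = alpha / len(pairs)       # Bonferroni-adjusted threshold
    results = []
    for a, b in pairs:
        stat, p, _, _ = median_test(samples_by_group[a], samples_by_group[b])
        results.append((a, b, p, p < adjusted_alpha))
    return results

# e.g. pairwise_median_tests({day: g["comments"].values
#                             for day, g in posts.groupby("publish_day")})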

The dataset will be segregated into a non-viral post dataset and a viral post dataset, as viral posts share several common characteristics that non-viral posts do not have. Due to the different nature of the two types of posts, they should not be analyzed together, since posting time may be a determining factor in whether a post eventually goes viral. Viral posts are defined as posts whose performance is an outlier. Using the Quantile Range Outliers analysis in JMP Pro to identify outlier observations, 71 Facebook posts were identified as viral.
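The outlier screening itself was done in JMP Pro; a rough Python analogue of a quantile-range outlier rule is sketched below. The tail quantile and multiplier are assumptions and would need to match the settings actually used in JMP.

def flag_viral(posts, metric="comments", tail=0.1, multiplier=3.0):
    """Flag posts whose comment count lies far above the upper quantile fence.

    `posts` is expected to be a pandas DataFrame of scraped Facebook posts.
    """
    lower_q = posts[metric].quantile(tail)
    upper_q = posts[metric].quantile(1 - tail)
    spread = upper_q - lower_q
    upper_fence = upper_q + multiplier * spread
    return posts[metric] > upper_fence        # viral = extreme high performers

# posts["is_viral"] = flag_viral(posts)
# viral, non_viral = posts[posts["is_viral"]], posts[~posts["is_viral"]]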

For each of the two datasets, the performance of all posts will be analyzed across publishing days and times. Next, each dataset will be further broken down into different characteristic groups, such as photo posts, video posts, non-sponsored posts, and sponsored posts. Performance for each of these characteristic groups will then be analyzed across publishing days and times. A visual representation of the process is illustrated in Display 2.

Group04 commentsanalysisprocess.png
Display 2: Overview of Analysis Process
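The segment-wise analysis can be expressed compactly in code. The sketch below loops over an assumed characteristic column (post_type) and runs the median test across an assumed grouping column (time_bin or publish_day) within each segment; both column names are placeholders.

from scipy.stats import median_test

def median_test_by_segment(posts, segment_col, group_col="time_bin", metric="comments"):
    results = {}
    for segment, seg_df in posts.groupby(segment_col):
        groups = [g[metric].values for _, g in seg_df.groupby(group_col)]
        if len(groups) > 1:
            stat, p, grand_median, _ = median_test(*groups)
            results[segment] = {"p_value": p, "grand_median": grand_median}
    return results

# e.g. median_test_by_segment(non_viral, segment_col="post_type")
# and  median_test_by_segment(viral, segment_col="post_type", group_col="publish_day")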

Results

Group04 summarymedianresults.png
Table X: Summary of Median Test Results

For non-viral posts, the median number of comments differs significantly across time bins only for Video posts and Sponsored posts. After performing the post-hoc tests, we have identified that for Video posts the optimal publishing time is from 1PM to 3:59PM, as it has the highest median of 130 comments. For Sponsored posts, the optimal publishing time is from 5PM to 7:59PM, as it has the highest median of 71 comments.

As for viral posts, there are no significant differences in their median comments across publishing days and times, so there is no optimal time to publish that would generate a viral post as defined by the number of comments. This result is not unexpected, as the main drivers of virality typically revolve around the content of the posts: content that generates buzz or is non-traditional is a common factor across viral content.

Business Insights

Further studies should be done to better understand the impact that publishing day and time may have on reach, instead of relying on comments as a proxy for reach. The company may wish to consider regularly collecting internal metrics data on Facebook to reanalyze its performance across publishing days and times. Facebook Insights collects, and enables the downloading of, time-series and cross-sectional data on multiple metrics. With these data available, the company may gain a better understanding of multiple performance metrics (e.g. reach and engagement), as well as of audience behavior over time using the time-series data.

Facebook Comments

Facebook comments were used to identify popular topics. We mimicked the methodology of prior research on Twitter, another social media platform, by performing sentiment analysis on topics derived from LDA, since “topic sentiment analysis provides a more precise snapshot of the sentiment distribution”.
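A minimal sketch of this topic-level sentiment step is shown below: fit an LDA model on the comments, assign each comment its dominant topic, and then aggregate TextBlob polarity per topic. The use of four topics mirrors the four topics discussed in the results, but the file name, column names, and vectorizer settings are assumptions.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from textblob import TextBlob

# Placeholder file and column names, as in the earlier sentiment sketch
comments = pd.read_csv("facebook_comments.csv")
texts = comments["comment_text"].astype(str).tolist()

# Bag-of-words representation of the comments
vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=5)
doc_term = vectorizer.fit_transform(texts)

# LDA with four topics (matching the four topics reported below)
lda = LatentDirichletAllocation(n_components=4, random_state=42)
doc_topics = lda.fit_transform(doc_term)

comments["topic"] = doc_topics.argmax(axis=1)              # dominant topic per comment
comments["polarity"] = [TextBlob(t).sentiment.polarity for t in texts]

# Median sentiment per topic
print(comments.groupby("topic")["polarity"].median())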

Methodology

aaaa

Results

The table below summarises the median sentiment scores for each topic. As we have defined sentiment scores of 0.5 and above as positive and scores below 0.5 as negative, commentators show positive sentiment on all four topics. However, topics such as the transport system in Singapore and first-world problems are only borderline positive, which suggests that commentators either do not like the topic or the way the content was communicated (i.e. a humorous versus serious tone). Topics 3 and 4 generate clearly positive sentiment scores, which shows that commentators are generally positive about them.

Business Insights

aaaa

YouTube

aaaa

Methodology

aaaa

Results

aaaa

Business Insights

aaaa