Group04 Project Findings

From Analytics Practicum



Data Scraping

To fulfil the objectives mentioned, we scraped data from the platforms on which SGAG has a strong presence, namely Facebook and YouTube.

Facebook Posts

We used Facebook's Graph API to scrape 3,806 SGAG Facebook posts. A sample of the content can be seen below:

status_id: 378167172198277_1975245405823771
status_message: We all know someone who takes a lot of sick leaves _Ù÷â
status_type: photo
status_link: https://www.facebook.com/sgag.sg/photos/a.378177495530578.106131.378167172198277/1975244575823854/?type=3
status_published: 29/11/17 6:00
num_reactions: 279
num_comments: 16
num_shares: 73
num_likes: 197
num_loves: 0
num_wows: 1
num_hahas: 80
num_sads: 1
num_angrys: 0

In addition, we will create a new feature, the number of positive reactions, defined as the sum of the numbers of ‘likes’, ‘loves’, ‘wows’ and ‘hahas’.
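As a minimal sketch of this feature, assuming each post is held as a dictionary keyed by the scraped column names (the key num_positive_reactions is a hypothetical name, not one from the scraped data):

```python
def positive_reactions(post):
    """Sum the reaction counts treated as positive: likes, loves, wows, hahas."""
    return (post["num_likes"] + post["num_loves"]
            + post["num_wows"] + post["num_hahas"])

# Reaction counts of the sample post shown above.
sample_post = {
    "num_reactions": 279, "num_likes": 197, "num_loves": 0, "num_wows": 1,
    "num_hahas": 80, "num_sads": 1, "num_angrys": 0,
}

# Hypothetical column name for the derived feature.
sample_post["num_positive_reactions"] = positive_reactions(sample_post)
print(sample_post["num_positive_reactions"])  # 278
```

For the sample post, the one remaining reaction (279 - 278) is the single ‘sad’.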


Facebook Comments

Next, we scraped 21,940 SGAG Facebook comments and a sample of the content can be seen below:

comment_id: 1975245405823771_1980256198656025
status_id: 378167172198277_1975245405823771
parent_id: (empty in this sample)
comment_message: "Boss, I just got into an accident and broke my arm, fractured a rib, and I might have internal bleeding" Boss: "Ok la. So what time you coming into the office later?"
comment_author: Leorenzo Joseph
comment_published: 29/11/17 6:06
comment_likes: 5


YouTube Comments

Using an online web scraper, we scraped the first 10,000 comments for Night Owl Cinematics' and TheSmartLocal's YouTube videos. In addition, we scraped the first 1,000 comments for SGAG's YouTube videos. A sample of the content can be seen below:

id: UgzkUyiEd5tMlCq4Nwh4AaABAg
user: Frentzen
date: 29 minutes ago
commentText: Single better
likes: 0
hasReplies: FALSE
numberOfReplies: 0


Data Cleaning

Before carrying out further analyses, we first examine the data we have and determine how to prepare it for meaningful interpretation. We have the following datasets:

  • Facebook posts

  • Facebook comments on posts

  • YouTube video comments (inclusive of SGAG’s and its competitors’)


For each dataset, we will conduct several steps to clean and prepare the data. Specifically, we will be looking into missing values, outliers, and data transformation.

Missing Values

Although we used reliable scraping methods and it is unlikely that any values were missed due to errors, we will still check for missing values. The values in our datasets can be obtained manually by looking through the Facebook/YouTube comments and posts, so should any be missing, we can look up the specific comment or post and input the missing values ourselves.

Outliers

Next, we will identify outlier data points for the numerical variables of each dataset, and determine if they are a result of errors. If they are, similar to missing values, we will look for the actual values and manually replace them.
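The text above does not name a specific outlier rule. As one possible sketch, the interquartile-range (IQR) rule below flags values far outside the middle half of the data. The rule, the multiplier k and the sample counts are all assumptions made for illustration.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical comment counts; 500 stands in for a viral post or a scraping error.
num_comments = [10, 12, 11, 13, 12, 11, 500]
print(flag_outliers(num_comments))  # [500]
```

Flagged values are only candidates for manual checking against the original post or comment. A genuinely viral post would be flagged too and should be kept.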

Emojis

As we are scraping content from social media, emojis are present in the text. However, the scraped emojis appear as mis-encoded strings of non-alphanumeric characters, e.g. "_Ù÷â", so we will remove them from the dataset to ensure that we can conduct meaningful text analysis.
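A minimal sketch of this removal step, assuming that emojis survive scraping only as runs of non-ASCII characters, sometimes preceded by an underscore as in the sample above (the pattern and function name are hypothetical):

```python
import re

# Assumption: mis-encoded emojis show up as runs of non-ASCII characters,
# optionally preceded by an underscore (as in the sample "_Ù÷â").
EMOJI_DEBRIS = re.compile(r"_?[^\x00-\x7F]+")

def strip_emoji_debris(text):
    """Remove mis-encoded emoji runs and collapse any leftover whitespace."""
    cleaned = EMOJI_DEBRIS.sub("", text)
    return " ".join(cleaned.split())

message = "We all know someone who takes a lot of sick leaves _Ù÷â"
print(strip_emoji_debris(message))
# We all know someone who takes a lot of sick leaves
```

Note that this pattern also removes legitimate non-ASCII text (for example, comments written in Chinese or curly quotation marks), so it would need refining if such text matters for the analysis.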

Data Transformation

To tailor the datasets to our analysis needs, we will carry out the necessary data transformation, such as creation of dummy variables, aggregation, discretising variables, and removing unnecessary characters.


Understanding Data

SGAG Facebook Posts

The table below shows the summary statistics of the scraped Facebook posts:

Summary stats table.png
  • As seen in the table above, the coefficient of variation is fairly large for all of the factors, indicating that each factor varies widely from post to post.
  • The number of comments is far lower than the number of reactions, at 7.3% of the number of reactions, which suggests that consumers rarely engage with the content beyond reacting to it.
  • Consumers generate a small number of negative reactions, which is consistent with the company's mission - to generate positive content.
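For reference, the coefficient of variation is the standard deviation divided by the mean. The sketch below uses the sample standard deviation and made-up reaction counts. Whether the summary table used the sample or the population standard deviation is not stated, so that choice is an assumption.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical reaction counts for eight posts, not the scraped data.
reactions = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(coefficient_of_variation(reactions), 3))  # 0.428
```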


The diagram below shows the distribution of the number of reactions for each scraped Facebook post:

Distribution facebookposts.png
  • As seen in the diagram, the number of reactions follows a power-law distribution. The same pattern is seen across the other factors.
  • This confirms the gut feeling of SGAG’s project sponsor that a few posts go extremely viral whereas the majority of posts receive little to no reaction.
  • Content generates more reactions when it rides on a current hype, e.g. Pokemon Go.


The diagram below shows the average number of comments, shares and positive reactions for memes and videos:

Memesvsvids.png
  • User engagement is higher for videos than for memes, probably because videos carry more information and content.



SGAG Facebook Comments

The figure below shows the average number of comment likes that each author receives.

Avg comments.png
  • Authors whose comments receive a high number of likes are generally social influencers with high degree centrality.


The figure below shows only comments that have replies.

Avg comments flitered.png
  • These make up 47.8% of all comments posted.
  • Generally, the commentators who reply to comments on SGAG's Facebook posts are quite dispersed. However, a few notable commentators, e.g. Crystal Lee, reply more often than the rest, and identifying them is crucial for SGAG to understand the nature of its network.



YouTube Comments

The tables below show the summary statistics for Night Owl Cinematics, TheSmartLocal and SGAG.

Night Owl Cinematics:

NOC.png

TheSmartLocal:

TheSmartLocal.png

SGAG:

SGAG.png

  • SGAG's video performance in terms of the number of replies is comparable to that of TheSmartLocal and Night Owl Cinematics.
  • However, SGAG falls short in terms of the number of likes its comments receive. TheSmartLocal has the highest number of likes, followed by Night Owl Cinematics.
  • SGAG is relatively less active in commenting on and replying to comments on its YouTube videos.
  • The distributions for all 3 companies follow a power-law distribution.


Methodology

We categorise the proposed methodology according to the business objectives/problems. Detailed explanations of the methodology can be found in the submitted project proposal.

Objective 1: Predicting performance of historical content

a. Dashboarding

To allow SGAG to better predict the performance of their content, we first need to help SGAG understand their current performance. To do so, we would create a summary page/dashboard that clearly summarizes key performance indicators, as follows:

  • We would perform Sentiment Analysis on the first 1,000 SGAG Facebook comments and report summary statistics of these sentiment scores.
  • Using SGAG’s Facebook comments, we would analyse SGAG’s network and provide degree centrality measures. This would be done via a 2-degree egocentric directed network, with the number of “likes” each comment receives as the weight of each edge. An example of such a network can be seen below:
Samplenetwork.png
  • Finally, we would provide a list of summary statistics for central tendency measures. The features that we would be summarizing are the number of likes each comment receives, the timing of comment posts (as a categorical variable), and the number of comments, shares, reactions and positive reactions for SGAG’s Facebook posts. We would also bin the timing of content posts as a categorical variable to understand its corresponding performance.
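The network measures in the list above can be sketched as follows. This is only an illustration under two assumptions: that parent_id holds the comment_id of the comment being replied to, and that an edge runs from the replying author to the author of the parent comment. The comments and author names are made up.

```python
from collections import defaultdict

def reply_network(comments):
    """Build directed edges (replier -> parent author), weighted by the reply's likes."""
    author_of = {c["comment_id"]: c["comment_author"] for c in comments}
    edges = defaultdict(int)
    for c in comments:
        parent = c.get("parent_id")
        if parent and parent in author_of:
            edges[(c["comment_author"], author_of[parent])] += c["comment_likes"]
    return dict(edges)

def degree_centrality(edges):
    """Return (in_degree, out_degree), each normalised by n - 1; weights are ignored."""
    nodes = {node for edge in edges for node in edge}
    scale = max(len(nodes) - 1, 1)
    in_deg = {node: 0 for node in nodes}
    out_deg = {node: 0 for node in nodes}
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return ({n: d / scale for n, d in in_deg.items()},
            {n: d / scale for n, d in out_deg.items()})

# Hypothetical comments; ids and authors are invented for illustration.
comments = [
    {"comment_id": "c1", "parent_id": "", "comment_author": "A", "comment_likes": 5},
    {"comment_id": "c2", "parent_id": "c1", "comment_author": "B", "comment_likes": 2},
    {"comment_id": "c3", "parent_id": "c1", "comment_author": "C", "comment_likes": 1},
    {"comment_id": "c4", "parent_id": "c2", "comment_author": "A", "comment_likes": 0},
]
edges = reply_network(comments)
in_deg, out_deg = degree_centrality(edges)
print(edges)
print(in_deg["A"], out_deg["A"])
```

Here author A receives replies from both B and C, so A has the highest in-degree centrality. The edge weights are kept in the edge dictionary for use in a weighted measure or in the network plot.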


b. Document clustering

We would cluster the scraped comments using document clustering via k-means clustering. We would then perform topic modelling within each cluster to better understand the different clusters. Next, we would note the distribution of the number of positive reactions in each cluster. Finally, we would use ANOVA or a z-test to determine whether the clusters differ in terms of the number of positive reactions.
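To illustrate the final step, the sketch below computes the one-way ANOVA F statistic for the positive reactions of each cluster. The cluster values are made up, and the p-value is left out because it requires the F distribution, which a statistics package (for example scipy.stats.f_oneway) can supply.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical positive-reaction counts for three clusters.
clusters = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(anova_f(clusters))  # approximately 13
```

A large F value relative to the F distribution with (df_between, df_within) degrees of freedom would suggest that at least one cluster differs in its mean number of positive reactions.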

Through this, SGAG would be able to know what kind of content generates the most positive reactions. This would also allow SGAG to understand whether the generated content is having the desired effect on their consumers.


c. Overall topic modelling and understanding performance of specific topics

Next, we would perform topic modelling on the scraped comments via Latent Dirichlet Allocation (LDA). We would then pick prominent and relevant topics from the LDA models and zoom into comments that talk about such topics. Within each of these topics, we would perform sentiment scoring and obtain the summary statistics of these sentiment scores. We would repeat the above analysis for the Facebook posts that make up the top 10% of user engagement to understand what drives the performance of top performing Facebook posts.
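Selecting the top 10% of posts could look like the sketch below. How user engagement is measured is not defined above, so the sum of reactions, comments and shares used here is an assumption, as are the function names and the sample posts.

```python
def engagement(post):
    # Assumption: engagement = reactions + comments + shares.
    return post["num_reactions"] + post["num_comments"] + post["num_shares"]

def top_decile(posts, key):
    """Return the top 10% of posts (at least one) ranked by key, highest first."""
    ranked = sorted(posts, key=key, reverse=True)
    n_top = max(1, (len(ranked) + 9) // 10)  # ceiling of len / 10
    return ranked[:n_top]

# Hypothetical posts p1..p20 with increasing engagement.
posts = [
    {"status_id": f"p{i}", "num_reactions": 10 * i, "num_comments": i, "num_shares": 0}
    for i in range(1, 21)
]
top_posts = top_decile(posts, engagement)
print([p["status_id"] for p in top_posts])  # ['p20', 'p19']
```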

Such an analysis would allow SGAG to better understand which aspects of their content are doing well and which are not. We would then be able to recommend to SGAG which topics their content team should focus on and create high-level guidelines for SGAG to drive the performance of their content.


d. Multi-linear regression analysis

To understand what drives the performance of SGAG’s Facebook posts, we will be performing multilinear regression analysis to identify important performance drivers.

We determined five performance measures, which we will use as dependent variables in our analysis: (1) number of reactions, (2) number of positive reactions, (3) number of negative reactions, (4) number of comments, and (5) number of shares.

Based on our scraped data, we were able to derive four independent variables that may possibly drive performance: (1) status type (i.e. whether the post contains photo or video), (2) day of publish, (3) time of publish, (4) length of tagged message in the post.

We will construct five different regression models, each with a performance measure as the dependent variable, and all four performance drivers as independent variables. The insights from the analysis will allow us to identify which performance drivers will contribute most to each performance measure, and allow SGAG to understand which aspects of their Facebook posts they would have to focus on.
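As a minimal sketch of how the four independent variables might be derived from one scraped post, the code below assumes that status_published is in day/month/year format, that message length is counted in characters, and that time of publish is binned with cut-offs chosen purely for illustration.

```python
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def time_bin(hour):
    """Map an hour (0-23) to a time-of-day category (hypothetical cut-offs)."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"

def post_features(post):
    """Derive the four candidate performance drivers from one scraped post."""
    published = datetime.strptime(post["status_published"], "%d/%m/%y %H:%M")
    return {
        "is_video": int(post["status_type"] == "video"),  # dummy variable
        "day_of_publish": DAYS[published.weekday()],
        "time_of_publish": time_bin(published.hour),
        "message_length": len(post["status_message"]),    # in characters
    }

# Fields of the sample post, with the emoji debris already removed.
sample_post = {
    "status_type": "photo",
    "status_published": "29/11/17 6:00",
    "status_message": "We all know someone who takes a lot of sick leaves",
}
features = post_features(sample_post)
print(features)
```

The categorical features (day and time of publish) would still need to be converted to dummy variables before being passed to a regression routine.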


e. K-means clustering analysis

We would cluster the Facebook posts by performance via k-means clustering. The variables considered for the clustering process would be as follows: Status_type, Status_published, Num_reactions, Num_likes, Num_loves, Num_wows, Num_hahas, Num_sads, Num_angrys, Number of positive reactions, Num_comments and Num_shares. Next, we would conduct z-score profiling on the various clusters to interpret them meaningfully and come up with business recommendations for SGAG.
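The z-score profiling step can be sketched as below for a single variable: each cluster's mean is expressed in standard deviations from the overall mean. The cluster labels are assumed to come from a k-means run that is not shown, and the share counts are made up.

```python
import statistics

def zscore_profile(values, labels):
    """Z-score of each cluster's mean for one variable, relative to all posts."""
    overall_mean = statistics.mean(values)
    overall_sd = statistics.pstdev(values)
    profile = {}
    for cluster in sorted(set(labels)):
        members = [v for v, label in zip(values, labels) if label == cluster]
        profile[cluster] = (statistics.mean(members) - overall_mean) / overall_sd
    return profile

# Hypothetical share counts and cluster labels, as if produced by k-means.
num_shares = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
profile = zscore_profile(num_shares, labels)
print(profile)
```

In this toy example cluster 1 sits roughly one standard deviation above the overall mean number of shares and cluster 0 roughly one below. Repeating this for every variable gives a profile that characterises each cluster.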

The aim of this clustering exercise is twofold, namely to allow SGAG to better understand their customers and posts, and to create relevant business recommendations to drive performance of future SGAG content.

Objective 2: Understanding what content to publish for new clients and/or projects

Firstly, we would scrape Tweets that are relevant to SGAG’s clients and that are from Singapore. Next, we would create WordClouds to better visualize these Tweets. Finally, we would then perform Topic Modelling and Sentiment Analysis on them. This analysis is similar to the analysis performed in the section on “Overall topic modelling and understanding performance of specific topics”, however, it would be in the context of these scraped Tweets instead.
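A WordCloud is essentially a picture of word frequencies, so the counting step behind it can be sketched as follows. The tweets and the short stop-word list are made up for illustration, and the scraping step itself is not shown.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "on"}

def word_frequencies(texts):
    """Count lower-cased word tokens across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

# Hypothetical tweets, invented for illustration.
tweets = [
    "The haze is back in Singapore",
    "Haze again, stay indoors",
    "Singapore weather is crazy",
]
frequencies = word_frequencies(tweets)
print(frequencies.most_common(2))  # [('haze', 2), ('singapore', 2)]
```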

Objective 3: Competitor Analysis

a. Dashboarding

Here, we will delve into how well SGAG’s competitors are performing in comparison to SGAG, specifically for video content distributed via the YouTube platform. The dashboard would comprise the following:

  • WordClouds of YouTube comments.
  • Sentiment Analysis of YouTube comments.


b. Overall topic modelling and understanding performance of specific topics

We would perform sentiment analysis and topic modelling on the YouTube comments. The steps would be the same as those outlined in the section on “Understanding what content to publish for new clients and/or projects”, but in the context of the scraped YouTube comments. Through such analysis, SGAG can better understand how consumers perceive their competitors and learn from their competitors’ mistakes and strengths.


c. Hypothesis Testing

To determine if SGAG’s mean sentiment score is significantly different from its competitors’, we will perform an independent-samples t-test. The null hypothesis is that the means of the brands’ sentiment scores are the same. Should SGAG's mean sentiment score be the same as or lower than its competitors', it will prompt SGAG to take action to improve its sentiment scores and outperform its competitors in the future.
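The test statistic can be sketched as below. This uses Welch's variant of the independent-samples t-test, which does not assume equal variances. That choice is an assumption, as the text does not say which variant is intended, and the sentiment scores are made up.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard errors of the two sample means.
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Hypothetical sentiment scores, not taken from the scraped comments.
sgag_scores = [1, 2, 3, 4, 5]
competitor_scores = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(sgag_scores, competitor_scores)
print(t_stat, dof)
```

The p-value then comes from the t distribution with dof degrees of freedom. A statistics package such as SciPy (scipy.stats.ttest_ind with equal_var=False) computes both in one call.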


Proposed Deliverables

1. Predicting performance of content

  • Dashboard report of proposed analysis and source code for easy replication with updated information (for components on network analysis)
  • Insights on document clustering.
  • Insights on topic modelling and performance of specific topics.
  • Insights on multi-linear regression analysis.
  • Insights on k-means clustering.


2. Understanding what content to publish for new clients and/or projects

  • Sample report for proposed analysis.
  • Source code to scrape Twitter and perform topic modelling via LDA.


3. Competitor analysis

  • Dashboard report of proposed analysis.
  • Insights on topic modelling and performance of specific topics.
  • Insights on hypothesis testing.