Difference between revisions of "Group04 Project Findings"

From Analytics Practicum

Revision as of 11:12, 10 January 2018




Data

Data Scraping

To fulfil the objectives mentioned, we scraped data from the platforms on which SGAG has a strong presence, namely Facebook and YouTube.

Facebook Posts

We used Facebook's Graph API to scrape 3,806 SGAG Facebook posts. A sample of the content can be seen below:

status_id: 378167172198277_1975245405823771
status_message: We all know someone who takes a lot of sick leaves
status_type: photo
status_link: https://www.facebook.com/sgag.sg/photos/a.378177495530578.106131.378167172198277/1975244575823854/?type=3
status_published: 29/11/17 6:00
num_reactions: 279
num_comments: 16
num_shares: 73
num_likes: 197
num_loves: 0
num_wows: 1
num_hahas: 80
num_sads: 1
num_angrys: 0

In addition, we would create a new feature, the number of positive reactions. This is defined as the sum of the numbers of ‘likes’, ‘loves’, ‘wows’ and ‘hahas’.
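
As a minimal sketch of this definition, assuming each post is held as a mapping keyed by the scraped column names (the function name is hypothetical and not part of the project code), the feature could be computed as follows:

```python
# Columns counted as positive reactions, per the definition in the text.
POSITIVE_COLUMNS = ("num_likes", "num_loves", "num_wows", "num_hahas")


def positive_reactions(post):
    """Return the number of positive reactions for one scraped post."""
    # int() tolerates counts that were read from a CSV as strings.
    return sum(int(post[column]) for column in POSITIVE_COLUMNS)


# Reaction counts taken from the sample Facebook post shown earlier.
sample_post = {"num_likes": 197, "num_loves": 0, "num_wows": 1,
               "num_hahas": 80, "num_sads": 1, "num_angrys": 0}
print(positive_reactions(sample_post))  # 278
```

For the sample post this gives 278 of its 279 reactions, with the remaining reaction being a ‘sad’.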

Facebook Comments

Next, we scraped 21,940 SGAG Facebook comments and a sample of the content can be seen below:

comment_id: 1975245405823771_1980256198656025
status_id: 378167172198277_1975245405823771
parent_id: (empty)
comment_message: "Boss, I just got into an accident and broke my arm, fractured a rib, and I might have internal bleeding" Boss: "Ok la. So what time you coming into the office later?"
comment_author: Leorenzo Joseph
comment_published: 29/11/17 6:06
comment_likes: 5

YouTube

Using an online web scraper, we scraped the first 10,000 comments for Night Owl Cinematics' and TheSmartLocal's YouTube videos. In addition, we scraped the first 1,000 comments for SGAG's YouTube videos. A sample of the content can be seen below:

id user date commentText likes hasReplies numberOfReplies
UgzkUyiEd5tMlCq4Nwh4AaABAg Frentzen 29 minutes ago Single better 0 FALSE 0

Data Cleaning

In progress


Understanding Data

SGAG Facebook Posts

The table below shows the summary statistics of the scraped Facebook posts:

Summary stats table.png

  • As seen in the table above, the coefficient of variation is fairly large for all of the factors, suggesting that the quantity of each factor varies widely across posts.
  • The number of comments is far lower than the number of reactions, at 7.3% of the number of reactions, which suggests that consumers rarely engage with the content beyond reacting to it.
  • Consumers generate a small number of negative reactions, which is consistent with the company's mission to generate positive content.
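
The two statistics discussed above can be sketched as follows. This assumes the coefficient of variation is taken as the population standard deviation divided by the mean, and that the 7.3% figure is total comments over total reactions; the page does not state either convention, and the posts below are hypothetical values for illustration only.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    values = list(values)
    average = mean(values)
    if average == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(values) / average


def comments_to_reactions_ratio(posts):
    """Total comments as a fraction of total reactions across all posts."""
    total_reactions = sum(post["num_reactions"] for post in posts)
    total_comments = sum(post["num_comments"] for post in posts)
    return total_comments / total_reactions


# Hypothetical posts, not the scraped data.
posts = [
    {"num_reactions": 200, "num_comments": 10},
    {"num_reactions": 400, "num_comments": 40},
    {"num_reactions": 400, "num_comments": 25},
]
print(coefficient_of_variation(p["num_reactions"] for p in posts))
print(comments_to_reactions_ratio(posts))
```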

The diagram below shows the distribution of the number of reactions for each scraped Facebook post:

Distribution facebookposts.png

  • As seen in the diagram, the distribution appears to follow a power law. This pattern is also seen across the other factors.
  • The gut feeling of SGAG’s project sponsor was correct: a few posts went extremely viral, whereas the majority of posts received little to no reaction.
  • Content generates more reactions when it rides on current hype, e.g. Pokemon Go.
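
One rough way to quantify how concentrated reactions are in a few viral posts is the share of all reactions captured by the top 10% of posts. The sketch below is not a formal power-law fit, only a quick heavy-tail check; the function name and the reaction counts are hypothetical.

```python
import math


def top_share(values, fraction=0.1):
    """Share of the total held by the top `fraction` of items."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    if not ordered or total == 0:
        raise ValueError("need at least one value and a non-zero total")
    # Always include at least one item, rounding the cut-off upwards.
    top_count = max(1, math.ceil(len(ordered) * fraction))
    return sum(ordered[:top_count]) / total


# Hypothetical reaction counts: one viral post among nine ordinary ones.
reactions = [1000, 5, 5, 5, 5, 5, 5, 5, 5, 5]
print(top_share(reactions))
```

A value close to 1 indicates that a handful of posts account for most reactions, while a value close to the chosen fraction indicates an even spread.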

The diagram below shows the average number of comments, shares and positive reactions for memes and videos:

Memesvsvids.png

  • User engagement is higher for videos than for memes, probably because videos carry more information and content.

SGAG Facebook Comments

  • Authors of well-liked comments (comments with a high number of likes) are generally social influencers with high degree centrality.
  • Generally, the commentators who reply most often to comments on SGAG's Facebook posts are quite dispersed. However, a few notable commentators have replied more than the rest, and identifying them would be crucial for SGAG in determining its network strength.

YouTube Comments

  • SGAG's video performance in terms of the number of replies is comparable to that of TheSmartLocal and Night Owl Cinematics.
  • However, SGAG falls short in terms of the number of likes that comments receive. TheSmartLocal has the highest number of likes, followed by Night Owl Cinematics.
  • SGAG is relatively less active in commenting on and replying to comments on its YouTube videos.


Methodologies

We would categorise the proposed methodology according to the business objectives.

1. Predicting performance of historical content

Dashboarding

To allow SGAG to better predict the performance of their content, we would first need to help SGAG understand their current performance. To do so, we would create a summary page/dashboard that clearly summarises key performance indicators. They are as follows:

  • We would perform Sentiment Analysis on the first 1,000 SGAG Facebook comments and report summary statistics of these sentiment scores.
  • Using SGAG’s Facebook comments, we would analyse SGAG’s network and provide degree centrality measures. This would be done via a 2-degree egocentric directed network, with the number of “likes” each comment receives as the weight of each edge.
  • Finally, we would provide summary statistics (measures of central tendency) for the following features: the number of likes each comment receives, the timing of comments as a categorical variable, and the number of comments, shares, reactions and positive reactions for SGAG’s Facebook posts. We would also bin the timing of content posts as a categorical variable and examine the corresponding performance.
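
The degree centrality measure mentioned above can be sketched in plain Python. This sketch assumes that each edge points from the author of a reply to the author of the comment being replied to, that the edge weight is the number of likes on the reply, and that centrality is normalised by n - 1 (the usual convention). The edge direction, the function name and the sample edges are assumptions, and restricting the graph to a 2-degree ego network is not shown.

```python
from collections import defaultdict


def in_degree_centrality(edges):
    """Compute in-degree centrality and weighted in-strength per node.

    `edges` is an iterable of (source, target, weight) tuples.
    """
    nodes = set()
    in_neighbours = defaultdict(set)
    in_strength = defaultdict(float)
    for source, target, weight in edges:
        nodes.update((source, target))
        in_neighbours[target].add(source)
        in_strength[target] += weight
    # Normalise by the maximum possible number of distinct in-neighbours.
    scale = 1 / (len(nodes) - 1) if len(nodes) > 1 else 0.0
    centrality = {node: len(in_neighbours[node]) * scale for node in nodes}
    strength = {node: in_strength[node] for node in nodes}
    return centrality, strength


# Hypothetical reply edges: (replier, original commenter, likes on the reply).
edges = [("a", "c", 5), ("b", "c", 2), ("c", "a", 1)]
centrality, strength = in_degree_centrality(edges)
print(centrality["c"], strength["c"])  # 1.0 7.0
```

Here author "c" receives replies from every other author, so its in-degree centrality is 1.0, and the likes on those replies sum to 7.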


Document clustering

Using the scraped comments, we would perform document clustering via k-means clustering. We would then perform topic modelling within each cluster to better understand the different clusters. Next, we would note the distribution of the number of positive reactions in each cluster. Finally, we would use ANOVA or a z-test to determine whether the clusters differ in terms of the number of positive reactions.

Through this, SGAG would be able to know what kind of content generates the most positive reactions. This would also allow SGAG to understand whether the generated content is having the desired effect on their consumers.
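
The final step, testing whether clusters differ in positive reactions, could look like the one-way ANOVA sketch below. It computes only the F statistic and assumes cluster membership is already known; the function name and sample values are hypothetical. Since ANOVA assumes roughly normal residuals, heavily skewed reaction counts may need a transformation (for example a log) before the test is meaningful.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    groups = [list(group) for group in groups]
    k = len(groups)
    n = sum(len(group) for group in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more values than groups")
    grand_mean = sum(sum(group) for group in groups) / n
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical positive-reaction counts for three clusters.
clusters = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(one_way_anova_f(clusters))  # 27.0
```

The F statistic would then be compared against the F distribution with (k - 1, n - k) degrees of freedom to obtain a p-value.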


Overall topic modelling and understanding performance of specific topics

Next, we would perform topic modelling on the scraped comments via Latent Dirichlet Allocation (LDA). We would then pick prominent and relevant topics from the LDA models and zoom into the comments that discuss those topics. Within each of these topics, we would perform sentiment scoring and obtain summary statistics of the sentiment scores. We would repeat the above analysis for the Facebook posts that make up the top 10% of user engagement, to understand what drives the performance of top-performing Facebook posts.

Such an analysis would allow SGAG to better understand which aspects of their content are doing well and which are not. We would then be able to recommend which topics SGAG's content team should focus on, and create high-level guidelines for SGAG to drive the performance of their content.
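
As a sketch of the per-topic summary step only, the snippet below assumes each comment has already been assigned a topic (for example its dominant LDA topic) and a numeric sentiment score by some earlier step. The field names `topic` and `sentiment`, the function name and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def sentiment_by_topic(comments):
    """Summarise sentiment scores for each topic."""
    scores = defaultdict(list)
    for comment in comments:
        scores[comment["topic"]].append(comment["sentiment"])
    return {
        topic: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for topic, values in scores.items()
    }


# Hypothetical comments with precomputed topic labels and sentiment scores.
comments = [
    {"topic": "food", "sentiment": 0.5},
    {"topic": "food", "sentiment": 0.1},
    {"topic": "transport", "sentiment": -0.4},
]
summary = sentiment_by_topic(comments)
print(summary["food"]["count"], summary["transport"]["min"])  # 2 -0.4
```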


Multi-linear regression analysis

In progress.


K-means clustering analysis

We would cluster Facebook posts by their performance via K-means clustering. The variables considered for the clustering process would be as follows: Status_type, Status_published, Num_reactions, Num_likes, Num_loves, Num_wows, Num_hahas, Num_sads, Num_angrys, Number of positive reactions, Num_comments and Num_shares. Next, we would conduct z-score profiling on the resulting clusters to interpret them meaningfully and arrive at business recommendations for SGAG.

The aim of this clustering exercise is twofold, namely to allow SGAG to better understand their customers and posts, and to create relevant business recommendations to drive performance of future SGAG content.
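
The z-score profiling step can be sketched as follows, assuming cluster labels have already been produced by K-means. For each cluster and feature it reports how many overall standard deviations the cluster mean lies from the overall mean. The function name and sample rows are hypothetical. Note that non-numeric variables such as Status_type and Status_published would need to be encoded numerically before they could be used in K-means or profiled this way.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_profile(rows, labels, features):
    """Return {cluster: {feature: z-score of the cluster mean}}."""
    profile = defaultdict(dict)
    for feature in features:
        values = [row[feature] for row in rows]
        overall_mean = mean(values)
        overall_std = pstdev(values)
        # Group this feature's values by cluster label.
        by_cluster = defaultdict(list)
        for row, label in zip(rows, labels):
            by_cluster[label].append(row[feature])
        for label, cluster_values in by_cluster.items():
            if overall_std == 0:
                profile[label][feature] = 0.0  # constant feature, no spread
            else:
                gap = mean(cluster_values) - overall_mean
                profile[label][feature] = gap / overall_std
    return dict(profile)


# Hypothetical posts and cluster labels.
rows = [{"num_shares": 1}, {"num_shares": 3}]
labels = [0, 1]
profile = zscore_profile(rows, labels, ["num_shares"])
print(profile[0]["num_shares"], profile[1]["num_shares"])  # -1.0 1.0
```

Clusters with large positive z-scores on a feature are well above average on it, which is what makes the profile useful for naming and interpreting clusters.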


2. Understanding what content to publish for new clients and/or projects

Firstly, we would scrape Tweets that are relevant to SGAG’s clients and that originate from Singapore. Next, we would create WordClouds to better visualise these Tweets. Finally, we would perform Topic Modelling and Sentiment Analysis on them. This analysis is similar to the one described in the section on “Overall topic modelling and understanding performance of specific topics”, but applied to the scraped Tweets instead.


3. Competitor Analysis

Dashboarding

Here, we will examine how well SGAG’s competitors are performing in comparison with SGAG, specifically for video content distributed via YouTube. The dashboard components are as follows:

  • WordClouds of YouTube comments.
  • Sentiment Analysis of YouTube comments.

Overall topic modelling and understanding performance of specific topics

We would perform sentiment analysis and topic modelling on the YouTube comments. The steps would duplicate those outlined in the section on “Understanding what content to publish for new clients and/or projects”, but in the context of the scraped YouTube comments. Through such analysis, SGAG can better understand how consumers perceive their competitors and learn from their competitors’ mistakes and strengths.


Hypothesis Testing

In progress



Proposed Deliverables

1. Predicting performance of content

  • Dashboard report of proposed analysis and source code for easy replication with updated information (for components on network analysis)
  • Insights on document clustering.
  • Insights on topic modelling and performance of specific topics.
  • Insights on multi-linear regression analysis.
  • Insights on k-means clustering.

2. Understanding what content to publish for new clients and/or projects

  • Sample report for proposed analysis.
  • Source code to scrape Twitter and perform topic modelling via LDA.

3. Competitor analysis

  • Dashboard report of proposed analysis.
  • Insights on topic modelling and performance of specific topics.
  • Insights on hypothesis testing.