AY1516 T2 Team AP Analysis PostInterimFindings

From Analytics Practicum


Data Retrieval

Constructing the graph from scratch involved using Python code to retrieve SGAG's Facebook posts from the preceding 10 months. The code connects to the Facebook Graph API programmatically and produces a CSV file with the following structure:

Each user ID in List of Likers and List of Commenters is separated by a semicolon, and tagged to each post.
Post ID | Message | Type | List of Likers | List of Commenters
378167172198277_1187053787976274 | Got take plane go holiday before, sure got kena one of these! | Link/Photo | 10206930900524483;1042647259126948; ... | 10153979571077290;955321504523847; ...
... | ... | ... | ... | ...

After crawling the Facebook API for ~4.5 hours, we obtained 1,600+ posts covering the past 10 months, in a CSV file of ~38MB. The entire code can be viewed here.
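A minimal sketch of how rows in this structure might be written, assuming the likers and commenters of each post have already been fetched into Python lists. The function name `write_posts_csv` and the dictionary keys are illustrative assumptions, not the team's actual crawler, and no Facebook API call is shown.

```python
import csv
import io

def write_posts_csv(posts, out):
    """Write one CSV row per post, joining user IDs with semicolons."""
    writer = csv.writer(out)
    writer.writerow(["Post ID", "Message", "Type",
                     "List of Likers", "List of Commenters"])
    for post in posts:
        writer.writerow([
            post["id"],
            post["message"],
            post["type"],
            ";".join(post["likers"]),
            ";".join(post["commenters"]),
        ])

# Hypothetical post dictionary, shaped like the example row in the table.
sample = [{
    "id": "378167172198277_1187053787976274",
    "message": "Got take plane go holiday before, sure got kena one of these!",
    "type": "Photo",
    "likers": ["10206930900524483", "1042647259126948"],
    "commenters": ["10153979571077290", "955321504523847"],
}]

buffer = io.StringIO()
write_posts_csv(sample, buffer)

# Read the text back to confirm the row round-trips through the csv module.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

The `csv` module quotes the message automatically because it contains a comma, so the semicolon-separated ID lists stay in their own columns.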

Code snippet of likers & commenters retrieval
Code snippet of conversion of CSV into GraphML format
Initial import report in GraphML

Subsequently, we wanted to visualize the data using Gephi. Additional Python code reads the CSV file row by row and attaches each post ID to its likers and commenters, so that we can construct a .graphml file, a graph format that Gephi is able to read. The entire code can be viewed here.
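The row-by-row reading step could look roughly like the sketch below, which turns each CSV row into (user, post, interaction) tuples. The function name `rows_to_edges` and the tiny inline CSV are assumptions made for illustration; the column names follow the CSV structure described earlier.

```python
import csv
import io

def rows_to_edges(csv_text):
    """Turn each CSV row into (user_id, post_id, interaction) tuples."""
    edges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, kind in (("List of Likers", "like"),
                             ("List of Commenters", "comment")):
            # filter(None, ...) drops empty strings left by blank cells
            # or by a trailing semicolon.
            for user_id in filter(None, row[column].split(";")):
                edges.append((user_id, row["Post ID"], kind))
    return edges

# Hypothetical two-post CSV with made-up IDs.
csv_text = (
    "Post ID,Message,Type,List of Likers,List of Commenters\n"
    "p1,hello,Photo,u1;u2;,u2\n"
    "p2,world,Link,u1,\n"
)
edges = rows_to_edges(csv_text)
```

For posts with very many likers, a single cell can exceed the `csv` module's default field size limit, in which case `csv.field_size_limit` would need to be raised before reading.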

The resultant file (~211MB) is uploaded here for reference.

SGAG's Facebook High Level Analysis

We defined the popularity of a post as the sum of Likes and Comments it received, also known as Engagement. Before diving deeper into the Facebook posts, we first performed a high level analysis of the Engagement of SGAG's Facebook posts in 2015 to see whether there is a pattern that explains the popularity of certain posts. Looking across SGAG's Facebook posts, we found that they post on international and local issues, with the latter being their focus.

Total Engagement (Number of Likes and Comments per Post)

Engagments for SGAG.png


Looking at the graph, we zoomed in on the 3 peaks circled in blue. We wanted to identify the potential reasons for their popularity. First, we checked whether each spike was simply due to a high number of posts, since posting more frequently gives followers more opportunities to engage, which misrepresents the actual popularity of posts on a certain date.

Total Engagement per Post

Engagement to record.png

We found that the August posts received less engagement per post than the ones in March and September, largely because the engagement numbers were diluted by the large number of posts. This narrowed our interest down to the posts in March and September. Zooming into the data, we discovered that these posts shared 2 main common themes:

1. The tribute to Lee Kuan Yew, who passed away in March.

Lky.png

2. Singapore General Elections 2015

Poli.png

This explains why the engagement numbers for these posts were so different from those of posts in other months: SGAG was posting on topics that were important to its target audience, and it was posting close to the dates of announcement/occurrence of these events. For Lee Kuan Yew's death on the 23rd of March 2015, SGAG posted 5 posts on the day itself, increasing its post frequency to 10 in the days after.

It also shows that a large portion of its follower base resonates with Singapore politics and societal issues, which leads our team to look at competitors who post humorous and satirical content covering similar topics. This allows us to identify SGAG's share of the humor content market, and gain a better understanding of its current standing.

Competitor Comparison & Analysis: SMRT Feedback & Mr Brown

Comp.png

We identified 2 competitors, SMRT Feedback and Mr Brown, and crawled data from their respective Facebook pages. This lets us understand SGAG's performance when benchmarked against other players in a similar market. The graph below illustrates their relative performance in terms of the number of user engagements/interactions for the period of Jan 2012 to Mar 2016. We hope that by doing this, we can identify possible user interaction/engagement trends for given periods of time, as well as understand the reasons for specific user engagement peaks. This initial understanding will help SGAG stay competitive in the industry by understanding and adapting the content strategies adopted by its competitors.


Number of Engagements
SGAG and competitors high level insights - engagements.png

From the graph above, we can tell that SGAG has been garnering more engagements per month than its competitors SMRT Feedback and Mr Brown. Its dominance, however, has been faltering from Oct 2014 onwards, when SMRT Feedback started to rally high numbers of engagements amongst users, while SGAG's engagements per month remained consistent at around 400k. After Oct 2014, there are periods where SMRT Feedback's engagements per post actually exceed SGAG's (Jan 2015, Feb 2015, Oct 2015, Nov 2015). It seems that SMRT Feedback is gaining more traction than SGAG in terms of engagements.

Effectiveness/Impact of posts
Next, we wanted to illustrate the effectiveness of posts generated. To do this, we created a derived variable: the number of engagements per post, plotted across the same time period.
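As a rough sketch of this derived variable, the snippet below sums likes and comments per month and divides by the number of posts in that month. The function name `monthly_engagement` and all figures are hypothetical and are not taken from the crawled data.

```python
from collections import defaultdict

def monthly_engagement(posts):
    """Return {month: (total_engagement, post_count, engagement_per_post)}."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for month, likes, comments in posts:
        totals[month] += likes + comments   # engagement = likes + comments
        counts[month] += 1
    return {m: (totals[m], counts[m], totals[m] / counts[m]) for m in totals}

# Hypothetical (month, likes, comments) records, one per post.
posts = [
    ("2015-03", 900, 100),
    ("2015-03", 1500, 500),
    ("2015-08", 300, 100),
    ("2015-08", 200, 0),
    ("2015-08", 350, 50),
    ("2015-08", 150, 50),
]
summary = monthly_engagement(posts)
```

Dividing by the post count is what separates a month that looks busy because of many posts from a month whose individual posts performed well.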

SGAG and competitors high level insights V2.png

From this graph, we can make an interesting observation. Even though SGAG's number of engagements per month is higher than its competitors' (as illustrated by the previous graph), its posts might not be the most competitive in terms of impact or effectiveness. SGAG's engagements per post are on an upward trend from Jan 2012 to Mar 2016, signalling a possible increase in viewership brought about by better content, but this increase seems trivial when compared to the impact of the posts generated by its competitor SMRT Feedback. We can see 3 major peaks in engagements per post for SMRT Feedback: Dec 2012, Nov 2014 - Jan 2015, and Oct 2015 - Nov 2015. In the same time periods, the engagements of SGAG's posts trailed. We wanted to explore the reasons for this phenomenon, and hence went deeper by analysing the content of the posts generated by SGAG and its competitors, hoping to find differences in content publishing strategies that might explain the gap.


Gephi Analysis

The initial import into Gephi constantly crashed on computers with less than 8GB of RAM. We eventually used a computer with 16GB of RAM to import the huge .graphml file. The initial report indicated that we have 296,063 nodes and 2,142,208 edges.

Upon successful load of the file and running the Force Atlas 2 Algorithm, we achieved the below graph.

Screen Shot 2016-04-09 at 6.15.07 pm.png

As expected, the graph is extremely unreadable, and furthermore, even with a high performance machine, the Gephi application is extremely slow and painful to work with under these circumstances.

Hence, we filtered the graph by node degree to obtain the condensed version below.

Screen Shot 2016-04-09 at 6.24.36 pm.png
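Outside Gephi, a comparable degree filter can be sketched on a plain edge list as follows. This is an illustrative single-pass filter under assumed names (`filter_by_degree`, `min_degree`), not a reproduction of Gephi's own filter.

```python
from collections import Counter

def filter_by_degree(edges, min_degree):
    """Keep only edges whose two endpoints both have degree >= min_degree."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    keep = {node for node, d in degree.items() if d >= min_degree}
    return [(s, t) for s, t in edges if s in keep and t in keep]

# Hypothetical (user, post) edges.
edges = [("u1", "p1"), ("u2", "p1"), ("u1", "p2"), ("u3", "p2"), ("u1", "p3")]
condensed = filter_by_degree(edges, min_degree=2)
```

Because the filter runs once, some surviving nodes can fall below the threshold after their neighbours are removed; repeating the pass until nothing changes would give a stricter result.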

From a high level view of the graph, we noticed that some clusters (located at the side of the graph) have a mixture of both commenters (green) and likers (red), with a large proportion of the interactions being "likes". Zooming into the graph, we noticed posts that were forced into isolation by the ForceAtlas algorithm:

Screen Shot 2016-04-09 at 6.17.03 pm.png

However, even with the filter applied, Gephi remained too laggy to yield concrete findings, so we decided to look for alternatives in order to obtain additional insights.

NetworkX approach

To resolve this issue, we looked to NetworkX to see how we could construct the graph programmatically and study the network more effectively.

After studying the code, we noticed an error in the initial CSV that we generated for all the posts (with their respective likers and commenters). When we saved the data in Excel, each cell was subject to Excel's character limit (32,767 characters), so the liker/commenter lists of many posts were truncated: each user ID is about 20 characters long, and those particular posts have more than 1,000 Likes or Comments.
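A quick back-of-the-envelope check of this truncation, assuming Excel's documented limit of 32,767 characters per cell and hypothetical 20-character IDs. The helper names are made up for this sketch.

```python
EXCEL_CELL_LIMIT = 32767  # assumed: maximum characters Excel keeps in one cell

def cell_length(user_ids):
    """Length of the semicolon-joined cell for a list of user IDs."""
    return len(";".join(user_ids))

def would_truncate(user_ids, limit=EXCEL_CELL_LIMIT):
    """True if the joined ID list no longer fits in a single cell."""
    return cell_length(user_ids) > limit

# Hypothetical 20-character IDs; real IDs vary in length.
few = ["1" * 20] * 500     # 500 IDs
many = ["1" * 20] * 2000   # 2,000 IDs
```

With these assumptions, 500 IDs occupy 10,499 characters and fit, while 2,000 IDs occupy 41,999 characters and would be cut off, which is why avoiding the Excel round trip matters for popular posts.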

Therefore we revised the crawling code that polls the Facebook API. It can be found here.

Similarly, after ~4 hours of crawling, the data is presented here.

In addition to Excel's cell size limitation, we also noticed an error in how we were constructing the GraphML file. We had used a unimodal approach, where users and posts were modelled as the same type of node, which does not represent our dataset well.

Hence, we now modelled the network as a bipartite graph, with the expectation that it should give us a good representation of the network. We revised the code, and it can be found here. To have a distinction between likes and comments, we assigned likes to a weight of 1 and comments to a weight of 2, indicating that a comment represents a "heavier" interaction with the post than simply liking it.
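A minimal sketch of such a bipartite GraphML file, built with the standard library only. It assumes edges arrive as (user, post, interaction) tuples; the attribute names `kind` and `weight` and the function `build_graphml` are illustrative choices, not the team's revised code.

```python
import xml.etree.ElementTree as ET

WEIGHTS = {"like": 1, "comment": 2}  # weights as described in the text

def build_graphml(edges):
    """Build a GraphML tree with typed nodes (post/user) and weighted edges."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    ET.SubElement(root, "key", {"id": "kind", "for": "node",
                                "attr.name": "kind", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    seen = set()
    for user_id, post_id, interaction in edges:
        # Add each node once, tagged with which side of the bipartition it is on.
        for node_id, kind in ((user_id, "user"), (post_id, "post")):
            if node_id not in seen:
                seen.add(node_id)
                node = ET.SubElement(graph, "node", id=node_id)
                ET.SubElement(node, "data", key="kind").text = kind
        edge = ET.SubElement(graph, "edge", source=user_id, target=post_id)
        ET.SubElement(edge, "data", key="weight").text = str(WEIGHTS[interaction])
    return root

# Hypothetical interactions with made-up IDs.
edges = [("u1", "p1", "like"), ("u2", "p1", "comment"), ("u1", "p2", "like")]
root = build_graphml(edges)
xml_text = ET.tostring(root, encoding="unicode")
```

In this sketch a user who both likes and comments on the same post produces two parallel edges; summing them into one edge of weight 3 would be an equally reasonable design.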

Before running the code on the full dataset, we generated a small sample of 5 posts, because we expected the full network to take more than 8 hours to construct. The 5-post bipartite graphml file can be found here.

Loading it into Gephi and running the Yifan Hu algorithm, we get this result:

Screen Shot 2016-04-12 at 9.59.17 pm.png

As shown, users (blue nodes) that are active in multiple posts (red nodes) are clustered between the large clusters. We also notice that each post's popularity varies, as seen from the size of the cluster around each red node. One really interesting thing to note is that there are users who interact with all 5 posts, indicating that these users might be SGAG's biggest fans.
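Users who interact with every post can also be found directly from the edge list with a set intersection, as in this sketch. The function name and sample IDs are hypothetical.

```python
from collections import defaultdict

def users_in_all_posts(edges):
    """Return the set of users connected to every post in the edge list."""
    users_by_post = defaultdict(set)
    for user_id, post_id in edges:
        users_by_post[post_id].add(user_id)
    if not users_by_post:
        return set()
    return set.intersection(*users_by_post.values())

# Hypothetical (user, post) edges across three posts.
edges = [
    ("u1", "p1"), ("u2", "p1"),
    ("u1", "p2"), ("u3", "p2"),
    ("u1", "p3"), ("u2", "p3"),
]
biggest_fans = users_in_all_posts(edges)
```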

We aim to conduct more in-depth analysis once the full network is constructed (8 hours wait time).

Gephi Analysis Part 2

The graphml file(~320MB) for the entire network that was generated with the new code can be found here.

Upon load, we have 295,276 nodes and 2,135,622 edges. The nodes are broken down into 1825 Posts and 293,451 users, while the edges represent the interactions (Likes and Comments) between users and the posts.

Screen Shot 2016-04-13 at 2.04.40 pm.png


GraphML Data Format
ID | Label | Message | Type | Interactions | Degree
Post ID | 0 = Post, 1 = User | Post Text | Status/Link/Photo/Video | Number of Interactions with all Posts by User | Number of Direct Edges to Posts by User


Screenshot of Facebook Gephi Data Laboratory

Because SGAG's Facebook social network is so large, the Gephi mapping process was very slow, despite changes to our code to optimise the graph. Gephi did not even respond when we tried to apply filters, so our team decided to edit the data rows individually in the Data Laboratory to shrink the dataset.

In this process, we filtered the dataset to include the top 10 posts with the largest number of degrees, and also the bottom 10 posts with the least number of degrees.
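The same selection could be done programmatically rather than by hand. The sketch below ranks posts by degree in a (user, post) edge list and returns the top and bottom `n`; the function name and sample data are assumptions for illustration.

```python
from collections import Counter

def top_and_bottom_posts(edges, n=10):
    """Return the n highest-degree and n lowest-degree post IDs."""
    degree = Counter(post_id for _user_id, post_id in edges)
    ranked = [post for post, _count in degree.most_common()]
    return ranked[:n], ranked[-n:]

# Hypothetical edges: p1 has degree 3, p2 has degree 2, p3 has degree 1.
edges = [
    ("u1", "p1"), ("u2", "p1"), ("u3", "p1"),
    ("u1", "p2"), ("u2", "p2"),
    ("u1", "p3"),
]
top, bottom = top_and_bottom_posts(edges, n=1)
```

Posts with no interactions never appear in an edge list, so they would have to be added separately if they should count towards the bottom group.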

On the reduced dataset, we ran the ForceAtlas algorithm to spread out the nodes, and obtained the graph below:

OverviewChart.png Legend.png

From this dataset, we found out that each follower engages with an average of 4.232 posts. To focus on the popular categories of posts, we decided to zoom into "National Event" and "Goodwill/Feel Good".

Goodwill.png

The longest path from one follower to another is 2 users away. This is the same as in the Twitter social network graph, and shows that the Goodwill/ Feel Good segment is relatively tight.

We hope to derive a methodology that SGAG can follow in order to identify traits of high-performing posts in each segment.

In the Goodwill/ Feel Good segment, these are the 4 posts:

Post 1: Till death do us part... Read full story: http://mypaper.sg/top-stories/man-95-visited-stricken-wife-daily-40-days-20150618

Engagement: 19,478

Number of Followers that Engaged: 3,925

Picture1.png

Post 2: This is the true Singapore spirit we're talking about! Thank you Dr Chew and Dr Tan! Hope Megan recovers soon! Read: http://www.straitstimes.com/singapore/docs-acted-fast-to-get-victim-home-for-treatment#xtor=CS1-10

Engagement: 14,242

Number of Followers that Engaged: 4,726

Picture2.png

Post 3: Look who came to join the RSAF Black Knights to give LKY a final salute? #RememberingLeeKuanYew Credits to Malcolm Koh

Engagement: 10,015

Number of Followers that Engaged: 3,665

LKYSGAG.png

The Followers that engage exclusively with Posts 1 and 2 (3,925 + 4,726 = 8,651) constitute 70.24% of all the Followers that engage with Goodwill/ Feel Good posts.

Post 4: #StartOfLKYQueue as of 11pm is at: - Hong Lim Park - Opp Peninsula Plaza Share your location+time so that people who wants to queue knows where to go!

LKYQUEUE.png

Total number of Followers in this category: 12,316

Among the 3 Top posts for Goodwill/ Feel Good, we found that there is an exceptionally large number of likers/commenters engaging with 2 of the 3.

We postulate that the first 2 posts are less abstract and more to the point than the third Goodwill/ Feel Good post from the Top 10, which focuses on the symbolism of crows flying together with the RSAF Black Knights in commemoration of Lee Kuan Yew’s death. The fourth post was informational rather than a celebration or remembrance, which possibly explains its poor performance. The common themes we found between the first 2 posts are:

  • Going beyond the call of duty
  • Love and care for one another
  • Personified by actual people doing acts of goodwill

Based on this sample, these are the characteristics that SGAG can look to emulate in their Goodwill/ Feel Good posts to generate high-performing posts.

Recommendations - Market Basket Analysis

For further analysis and recommendations, we propose the use of Market Basket Analysis (MBA) based on the content of SGAG's Facebook posts. We ran an MBA using an Excel plugin called XLStat on a dataset of 19 transactions with 40 different items, generating a total of 1148 rules. We feel that SGAG will be able to use a similar method to identify the good elements of a post and find the co-occurrence of these keywords, which can serve as themes for SGAG's future posts. The results of the MBA can then be combined with past data on these posts (e.g. engagements per post) for more sophisticated analysis. For our MBA, we defined a transaction to be a single SGAG post, with the items of the transaction defined as the elements that make up the post. For example, the 'National Event' post in the image below was taken as a transaction, and further categorised into items based on the elements it contains: "Funny face", "Trolling", "Weird public behaviour".
Mba post.png

Hence a typical transaction will look like this:

Transaction ID | Items
12345 | Trolling, Weird Behaviour, Funny Face, SG50, Exaggeration, NDP


Association rules
Mba association rules.png
From the results of the analysis, we are able to quantify the chance of items occurring together. For example, based on the table, if "NDP" is an element of the post, there is a 66.7% chance that "National Pride" is also present (the confidence of the rule). This rule is found in 10.5% of all transactions (its support). The rule also has a lift of 2.111, which means that a post containing "NDP" is 2.111 times as likely to contain "National Pride" as a post picked at random. This methodology can be used to identify rules with high confidence and high support.
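The three rule measures can be computed by simple counting, as in this sketch. The 19 transactions are hypothetical and were constructed only so that the counts are consistent with the figures quoted for the "NDP" and "National Pride" rule; they are not the team's actual dataset, and XLStat is not used here.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    cons = sum(1 for t in transactions if consequent in t)
    support = both / n              # share of posts containing both items
    confidence = both / ante        # share of antecedent posts with consequent
    lift = confidence / (cons / n)  # confidence relative to the base rate
    return support, confidence, lift

# Hypothetical data: 19 posts, NOT the real transactions.
transactions = (
    [{"NDP", "National Pride"}] * 2      # both items
    + [{"NDP", "Trolling"}] * 1          # NDP without National Pride
    + [{"National Pride", "SG50"}] * 4   # National Pride without NDP
    + [{"Funny Face"}] * 12              # neither item
)
support, confidence, lift = rule_metrics(transactions, "NDP", "National Pride")
```

A lift above 1 indicates that the two items appear together more often than they would if they were independent.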
The heatmap below shows the data in visualisation form:
Mba heatmap2.png

Visualisation of itemsets
Mba item charts.png
Factors for data input for MBA
The accuracy of the results generated is subject to many factors. To ensure a good MBA model, the input data has to be properly selected. Below are our recommendations:

Factor | Justification
Relevance of Information | A 6/12 month expiry date can be set for the inputs (posts) used in the MBA, as very old posts are likely to contain content that is no longer applicable to current contexts.
Sentiment value | Sentiment analysis can generate further insights on how well a post is received (positive/negative sentiment). For example, a post might have a high number of engagements but have a negative sentiment attached. SGAG will want to limit these.
Virality of post | The amplification rate and what causes it. A measure that can be used is the number of shares against impressions.
Hashtags used | Hashtags can provide insights into the associations a post has with other related topics. These insights can lead to enhanced engagement with audiences that are not currently users of SGAG's platforms, by using hashtags that appeal to them.
Time period | Analyse certain time periods to find correlations between trends and specific periods of the year. For example, National Day or the month of August generally generates higher engagement figures.
Number of days to and from festivities | Finding the right time to engage audiences based on festivities.
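As one possible way to derive the "number of days to and from festivities" factor, the sketch below returns the signed distance from a post date to the nearest festivity. The function name and the short festivity list are assumptions chosen for illustration.

```python
from datetime import date

def days_to_nearest(post_date, festivities):
    """Signed days to the closest festivity: negative before, positive after."""
    return min(((post_date - f).days for f in festivities), key=abs)

# Illustrative festivity dates for 2015 (National Day falls on 9 August).
festivities = [date(2015, 8, 9), date(2015, 12, 25)]

before = days_to_nearest(date(2015, 8, 1), festivities)
after = days_to_nearest(date(2015, 8, 12), festivities)
december = days_to_nearest(date(2015, 12, 20), festivities)
```

Keeping the sign lets a later analysis distinguish posts that build up to an event from posts that follow it.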

Limitations and Future Work

While APIs such as the Facebook and Twitter APIs help us get specific data about each post, like its content and attached 'tags', the resulting volume of data can prove too much for regular computers to handle. This is especially true when analysing posts from multiple categories, where posts from all the categories need to be analysed together. For future work, we can use algorithms such as XGBoost or logistic regression to predict the number of unique visitors per post, based on training data for existing posts and their respective unique visitors.