AY1516 T2 Team AP Analysis PostInterimFindings
Data Retrieval
Constructing the graph from scratch involved using Python code to retrieve posts from SGAG's Facebook account dating back 10 months. The code connects to the Facebook Graph API programmatically and produces a CSV file with the following structure:
The user IDs in List of Likers and List of Commenters are separated by semicolons and tagged to each post.
Post ID | Message | Type | List of Likers | List of Commenters |
378167172198277_1187053787976274 | Got take plane go holiday before, sure got kena one of these! | Link/Photo | 10206930900524483;1042647259126948; ... | 10153979571077290;955321504523847; ... |
... | ... | ... | ... | ... |
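As a minimal sketch of how a file in this structure could be read back, the snippet below parses each row and splits the two semicolon-separated columns into lists. The comma delimiter, the shortened sample IDs and the helper names `split_ids` and `read_posts` are assumptions made for illustration; they are not taken from the team's code.

```python
import csv
import io

# Hypothetical two-row sample in the structure described above (IDs shortened).
SAMPLE_CSV = (
    "Post ID,Message,Type,List of Likers,List of Commenters\n"
    "378167172198277_1,First post,Photo,101;102;103,201;102\n"
    "378167172198277_2,Second post,Link,101,\n"
)


def split_ids(cell):
    """Split a semicolon-separated cell into a list of non-empty user IDs."""
    return [uid for uid in cell.split(";") if uid]


def read_posts(handle):
    """Yield one dict per post, with likers and commenters as lists."""
    for row in csv.DictReader(handle):
        yield {
            "post_id": row["Post ID"],
            "message": row["Message"],
            "type": row["Type"],
            "likers": split_ids(row["List of Likers"]),
            "commenters": split_ids(row["List of Commenters"]),
        }


posts = list(read_posts(io.StringIO(SAMPLE_CSV)))
for post in posts:
    print(post["post_id"], len(post["likers"]), len(post["commenters"]))
```

Reading the file with the `csv` module rather than a spreadsheet keeps long liker and commenter lists intact.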
After crawling the Facebook API for ~4.5 hours, we obtained more than 1,600 posts dating back 10 months, in a CSV file of ~38MB. Entire code can be viewed here.
Subsequently, we wanted to visualize the data using Gephi. Additional Python code reads the CSV file row by row and attaches each post ID to its likers and commenters, so that we can construct a .graphml file, which Gephi is able to read. Entire code can be viewed here.
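The conversion step can be sketched roughly as follows, assuming the NetworkX library is used to write the GraphML file. The sample rows, the `interaction` edge attribute, the rule for a user who both likes and comments, and the output file name are hypothetical choices for illustration, not the team's implementation.

```python
import os
import tempfile

import networkx as nx

# Hypothetical parsed rows: (post ID, likers, commenters).
rows = [
    ("post_1", ["u1", "u2"], ["u2", "u3"]),
    ("post_2", ["u1"], []),
]


def build_graph(rows):
    """Attach every liker and commenter to the post they interacted with."""
    graph = nx.Graph()
    for post_id, likers, commenters in rows:
        graph.add_node(post_id, kind="post")
        for kind, users in (("like", likers), ("comment", commenters)):
            for user_id in users:
                graph.add_node(user_id, kind="user")
                if graph.has_edge(post_id, user_id):
                    # Assumption: one edge per (post, user) pair, relabelled
                    # when the user both liked and commented.
                    graph[post_id][user_id]["interaction"] = "like+comment"
                else:
                    graph.add_edge(post_id, user_id, interaction=kind)
    return graph


graph = build_graph(rows)
out_path = os.path.join(tempfile.gettempdir(), "sgag_sample.graphml")
nx.write_graphml(graph, out_path)  # Gephi can open the resulting file
print(graph.number_of_nodes(), graph.number_of_edges())
```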
The resultant file (~211MB) is uploaded here for reference.
High Level Analysis
Before diving into specific data about the users, we performed a high-level analysis of the number of unique users reached across all posts dating back 10 months.
As illustrated,
Following that, we also wanted to understand SGAG's performance when benchmarked against its competitors. We identified 2 of them, SMRT Feedback and Mr Brown, and crawled data from their respective Facebook pages. The graph below illustrates their relative performance in terms of user engagements (the sum of the number of likes and comments on posts) for the period of April to August in each year from 2012 through 2015. We focused on April to August because user engagement is relatively higher in this period than in the rest of the year. We attribute this to the period coinciding with the school holidays, when people are freer to engage in social media activities.
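The engagement measure described here (likes plus comments, restricted to April through August of each year) could be computed along these lines. The records and the function name `engagements_by_page_and_year` are invented for illustration and do not reflect the crawled data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (page, post date, likes, comments).
posts = [
    ("SGAG", date(2015, 4, 10), 1200, 85),
    ("SGAG", date(2015, 9, 1), 900, 40),  # outside April to August
    ("SMRT Feedback", date(2015, 6, 2), 700, 120),
    ("Mr Brown", date(2014, 8, 31), 300, 25),
]


def engagements_by_page_and_year(records, first_month=4, last_month=8):
    """Sum likes + comments per (page, year) for posts inside the month window."""
    totals = defaultdict(int)
    for page, posted, likes, comments in records:
        if first_month <= posted.month <= last_month:
            totals[(page, posted.year)] += likes + comments
    return dict(totals)


totals = engagements_by_page_and_year(posts)
for (page, year), engagement in sorted(totals.items()):
    print(page, year, engagement)
```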
Gephi Analysis
The initial import into Gephi constantly crashed on computers with less than 8GB of RAM. We eventually used a computer with 16GB of RAM to import the huge .graphml file. The import report indicated 296,063 nodes and 2,142,208 edges.
After loading the file and running the ForceAtlas 2 algorithm, we obtained the graph below.
As expected, the graph is extremely unreadable, and furthermore, even with a high performance machine, the Gephi application is extremely slow and painful to work with under these circumstances.
Hence, we filtered the graph by node degree to obtain the condensed version below.
From a high-level view of the graph, we noticed that some clusters (located at the side of the graph) have a mixture of both commenters (green) and likers (red), with a large proportion of the interactions being "likes". Zooming into the graph, we notice that if the post was forced to isolation by the ForceAtlas algorithm:
However, even with the filter applied, Gephi remained too laggy to yield concrete findings, so we decided to look for alternatives in order to obtain additional insights.
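For reference, a degree filter of this kind can also be expressed outside Gephi. The sketch below assumes NetworkX; the threshold `min_degree=2` is purely illustrative, since the threshold actually used in Gephi is not stated.

```python
import networkx as nx


def filter_by_degree(graph, min_degree):
    """Return the subgraph induced by nodes whose degree is at least min_degree."""
    keep = [node for node, degree in graph.degree() if degree >= min_degree]
    return graph.subgraph(keep).copy()


# Hypothetical toy network of two posts and three users.
sample = nx.Graph()
sample.add_edges_from([
    ("post_1", "u1"), ("post_1", "u2"), ("post_1", "u3"),
    ("post_2", "u1"), ("post_2", "u2"),
])

condensed = filter_by_degree(sample, min_degree=2)
print(sorted(condensed.nodes()), condensed.number_of_edges())
```

Degrees are measured on the original graph, so a node that survives the filter may end up with fewer neighbours in the condensed graph.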
NetworkX approach
To resolve this issue, we turned to NetworkX to see how we could build the graph programmatically and study the network better.
After studying the code, we noticed there was an error in the initial CSV that we generated for all the posts (with their respective likers and commenters). This happened because when we saved the data in Excel, each cell was limited to roughly 35,000 characters (Excel's documented limit is 32,767 characters per cell). As a result, the liker/commenter lists of many posts were truncated, because each user's ID is about 20 characters long and those particular posts have more than 1,000 likes or comments.
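A quick check such as the one below can flag rows whose ID lists would not fit in a single spreadsheet cell. It is a sketch that uses Excel's documented per-cell limit of 32,767 characters; the 17-digit IDs and the helper names are made up for illustration.

```python
EXCEL_CELL_LIMIT = 32_767  # documented maximum number of characters per Excel cell


def cell_length(user_ids):
    """Length of the semicolon-joined cell that would hold these IDs."""
    return len(";".join(user_ids))


def would_truncate(user_ids, limit=EXCEL_CELL_LIMIT):
    """True when the joined ID list is longer than one cell can store."""
    return cell_length(user_ids) > limit


# Hypothetical 17-digit user IDs.
small_post = [str(10_000_000_000_000_000 + i) for i in range(1000)]
large_post = [str(10_000_000_000_000_000 + i) for i in range(2000)]

print(cell_length(small_post), would_truncate(small_post))
print(cell_length(large_post), would_truncate(large_post))
```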
Therefore we revised the crawling code that polls the Facebook API. It can be found here.
Similarly, after ~4 hours of crawling, the data is presented here.
In addition to the Excel cell-size limitation, we also noticed an error in how we were constructing the GraphML file. We had used a unimodal (one-mode) approach to plot the network, in which users and posts were modelled as the same type of node, which did not represent our dataset well.
Hence, we now modelled the network as a bipartite graph, with the expectation that it should give us a good representation of the network. We revised the code, and it can be found here. To distinguish between likes and comments, we assigned likes a weight of 1 and comments a weight of 2, indicating that a comment represents a "heavier" interaction with the post than simply liking it.
Before running the code on the full dataset, we built a small sample of 5 posts, because we expected the full network to take more than 8 hours to construct. The 5-post bipartite graphml file can be found here.
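A minimal sketch of this bipartite model in NetworkX is shown below. The weights of 1 and 2 come from the text; the sample rows, the `bipartite` node attribute values, and the rule of keeping the heavier weight when a user both likes and comments on a post are assumptions, as the revised code is not reproduced here.

```python
import networkx as nx
from networkx.algorithms import bipartite

LIKE_WEIGHT = 1
COMMENT_WEIGHT = 2

# Hypothetical parsed rows: (post ID, likers, commenters).
rows = [
    ("post_1", ["u1", "u2"], ["u2", "u3"]),
    ("post_2", ["u1"], ["u3"]),
]


def build_bipartite(rows):
    """Posts form one node set (bipartite=0) and users the other (bipartite=1)."""
    graph = nx.Graph()
    for post_id, likers, commenters in rows:
        graph.add_node(post_id, bipartite=0)
        for weight, users in ((LIKE_WEIGHT, likers), (COMMENT_WEIGHT, commenters)):
            for user_id in users:
                graph.add_node(user_id, bipartite=1)
                previous = graph.get_edge_data(post_id, user_id, default={"weight": 0})
                # Assumption: keep the heavier interaction for a (post, user) pair.
                graph.add_edge(post_id, user_id, weight=max(previous["weight"], weight))
    return graph


graph = build_bipartite(rows)
print(bipartite.is_bipartite(graph))
print(sorted(graph.edges(data="weight")))
```

Because every edge joins a post to a user, the two node sets never mix, which is the property the unimodal version lacked.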
Loading it into Gephi and running the Yifan Hu algorithm, we get this result:
As shown, users (blue nodes) that are active in multiple posts (red nodes) are clustered between the large clusters. We also notice that each post's popularity varies, as seen from the size of the clusters around each red node. One really interesting thing to note is that there are users who interact with all 5 posts, indicating that these users might be SGAG's biggest fans.
We aim to conduct a more in-depth analysis when the full network is constructed (8 hours wait time).
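Users who interact with every post in a sample can be found with a set intersection, as in this sketch. The post and user IDs are invented, and `users_on_every_post` is a hypothetical helper rather than part of the team's code.

```python
def users_on_every_post(post_to_users):
    """Return the users that appear among the likers/commenters of every post."""
    user_sets = [set(users) for users in post_to_users.values()]
    return set.intersection(*user_sets) if user_sets else set()


# Hypothetical 5-post sample: each post maps to the users who liked or commented.
sample = {
    "post_1": {"u1", "u2", "u3"},
    "post_2": {"u1", "u3"},
    "post_3": {"u1", "u4"},
    "post_4": {"u1", "u2"},
    "post_5": {"u1", "u5"},
}

biggest_fans = users_on_every_post(sample)
print(sorted(biggest_fans))
```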
Gephi Analysis Part 2
The graphml file (~320MB) for the entire network that was generated with the new code can be found here.
Upon load, we have 295,276 nodes and 2,135,622 edges.
Due to the huge size of SGAG's Facebook social network, the Gephi analysis process was still very laggy, despite the changes to our code to optimise the graph. Gephi did not even respond when we tried to apply filters, so we decided to edit the data rows individually in the Data Laboratory to shrink the dataset.
In this process, we filtered the dataset to include the top 10 posts with the largest number of degrees, and also the bottom 10 posts with the least number of degrees.
From the reduced dataset, we ran the ForceAtlas algorithm to spread out the nodes, and obtained the graph below:
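The same selection can be sketched programmatically, assuming the degree of each post node is already known. The degrees below are placeholders and `top_and_bottom_posts` is a hypothetical helper; the team performed this step by hand in Gephi.

```python
def top_and_bottom_posts(post_degrees, n=10):
    """Return the n highest-degree and n lowest-degree post IDs."""
    ranked = sorted(post_degrees, key=post_degrees.get, reverse=True)
    return ranked[:n], ranked[-n:]


# Hypothetical degrees for 25 posts: post_1 has degree 1, post_25 has degree 25.
post_degrees = {f"post_{i}": i for i in range(1, 26)}

top, bottom = top_and_bottom_posts(post_degrees)
print(top[0], bottom[-1])
```

With fewer than 2n posts the two lists would overlap, so a real script would need to handle that case.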
From this dataset, we found out that each follower engages with an average of 4.232 posts. To focus on the popular categories of posts, we decided to zoom into "National Event" and "Goodwill/Feel Good".
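An average of this kind is the total number of follower-to-post links divided by the number of distinct followers. The sketch below illustrates the calculation on made-up data and does not reproduce the 4.232 figure.

```python
from collections import Counter


def average_posts_per_follower(post_to_users):
    """Mean number of distinct posts each follower engaged with."""
    counts = Counter(
        user for users in post_to_users.values() for user in set(users)
    )
    return sum(counts.values()) / len(counts)


# Hypothetical sample: follower "a" engages with 3 posts, "b" and "c" with 1 each.
sample = {
    "post_1": {"a", "b"},
    "post_2": {"a"},
    "post_3": {"a", "c"},
}

average = average_posts_per_follower(sample)
print(round(average, 3))
```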
The longest path from one follower to another follower is 2 users away. This is the same as in the Twitter social network graph, which shows that the Goodwill/ Feel Good segment is relatively tight.
We hope to derive a methodology that SGAG can follow in order to identify traits of high-performing posts in each segment.
In the Goodwill/ Feel Good segment, these are the 4 posts:
Post 1: Till death do us part... Read full story: http://mypaper.sg/top-stories/man-95-visited-stricken-wife-daily-40-days-20150618
Engagement: 19,478
Number of Followers that Engaged: 3,925
Post 2: This is the true Singapore spirit we're talking about! Thank you Dr Chew and Dr Tan! Hope Megan recovers soon! Read: http://www.straitstimes.com/singapore/docs-acted-fast-to-get-victim-home-for-treatment#xtor=CS1-10
Engagement: 14,242
Number of Followers that Engaged: 4,726
Post 3: Look who came to join the RSAF Black Knights to give LKY a final salute? #RememberingLeeKuanYew Credits to Malcolm Koh
Engagement: 10,015
Number of Followers that Engaged: 3,665
The Followers that engage with Posts 1 and 2 constitute 70.24% of all the Followers that engage with Goodwill/ Feel Good posts.
Post 4: #StartOfLKYQueue as of 11pm is at: - Hong Lim Park - Opp Peninsula Plaza Share your location+time so that people who wants to queue knows where to go!
Total number of Followers in this category: 12,316
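The 70.24% figure is consistent with adding the follower counts of Posts 1 and 2 and dividing by the category total, as the short check below shows. This reading assumes that no follower is counted under both posts; the variable names are illustrative.

```python
followers_post_1 = 3_925
followers_post_2 = 4_726
followers_in_category = 12_316

# Assumption: the two follower counts do not overlap, so they can be added.
combined = followers_post_1 + followers_post_2
share = combined / followers_in_category

print(combined, f"{share:.2%}")
```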
Among the 3 Top posts for Goodwill/ Feel Good, we found that there is an exceptionally large number of likers/commenters engaging with 2 of the 3.
We postulate that, compared to the third Goodwill/ Feel Good post from the Top 10, which focuses on the symbolism of crows flying together with the RSAF Black Knights in commemoration of Lee Kuan Yew’s death, the other 2 posts are less abstract and more to the point. The fourth post was informational rather than a celebration or remembrance, which possibly explains its poor performance. The commonalities in themes we found between the first 2 posts are:
- Going beyond the call of duty
- Love and care for one another
- Personified by actual people doing acts of goodwill
Based on this sample, these are the characteristics that SGAG can look to emulate in their Goodwill/ Feel Good posts to generate high-performing posts.
Limitations and Future Work
While using specific APIs like the Facebook and Twitter APIs helps us to get specific data about each post, such as its content and attached 'tags', the resulting dataset can prove to be too much for regular computers to handle. This is especially true when analysing posts from multiple categories, where posts from all the categories need to be analysed together. For future work, we can use algorithms such as XGBoost or logistic regression to predict the number of unique visitors per post, based on training data for existing posts and their respective unique visitors.