AY1516 T2 Team AP Analysis PostInterimFindings

From Analytics Practicum
Jump to navigation Jump to search

Team ap home white.png HOME

Team ap overview white.png OVERVIEW

Team ap analysis white.png ANALYSIS

Team ap project management white.png PROJECT MANAGEMENT

Team ap documentation white.png DOCUMENTATION


Data Retrieval & Manipulation (Pre Interim) Pre interim findings Post interim twitter findings Post interim plan Post interim findings

Data Retrieval

Constructing the graph from scratch involved the usage of python code to retrieve posts from SGAG's Facebook account for posts dating back 10 months. This involved connecting to the Facebook graph API programatically to formulate a csv file that resembles this structure:


Each user ID in List of Likers and List of Commenters are separated by a semicolon, and tagged to each post.

Post IDList of LikersList of Commenters
378167172198277_1187053787976274 10206930900524483;1042647259126948;10204920589409318; ... 10153979571077290;955321504523847;1701864973403904; ...
... ... ...

After crawling the Facebook API for ~4.5 Hours, the result is 1600++ posts dating 10 Months ago, with a CSV file size of ~38MB. Entire code can be viewed here.

Code snippet of likers & commenters retrieval
Code snippet of conversion of CSV into GraphML format
Initial import report in GraphML

Subsequently, we wanted to visualize the data using the Gephi tool. Hence, additional python code was used to read the CSV file, programmatically reading each row of the CSV, and attaching each post ID to likers and commenters respectively. This is done so that we can construct the .graphml graph formatted file, which gephi is able to read. Entire code can be viewed here.

The resultant file (~211MB) is uploaded here for reference.

Gephi Analysis

The initial import into gephi constant crashes when using computers with < 8 GBs of RAM. We eventually managed to leverage a computer with 16GB of RAM to begin the initial import of the huge .graphml file. The initial report indicated that we have 296,063 Nodes and 2,142,208 edges.

Upon successful load of the file and running the Force Atlas 2 Algorithm, we achieved the below graph.

Screen Shot 2016-04-09 at 6.15.07 pm.png

As expected, the graph is extremely unreadable, and furthermore, even with a high performance machine, the Gephi application is extremely slow and painful to work with under these circumstances.

Hence, we performed a filter of the the number of degrees on the graph to obtain a condensed version below.

Screen Shot 2016-04-09 at 6.24.36 pm.png

From a high level view of the graph, we noticed that for some clusters(located at the side of the graph) have a mixture of both commenters(Green) and likers(Red), with a large proportion of interaction being "likes". Zooming into the graph, we notice that if the post was forced to isolation by the ForceAtlas algorithm:

Screen Shot 2016-04-09 at 6.17.03 pm.png

However, even though with the filter applied, working with Gephi is extremely laggy and unusable to get concrete findings, and thus we decided to look for alternatives in order to obtain additional insights.

NetworkX approach

To resolve this issue, we looked to Networkx to try and see how we can adapt the graph formation programmatically for us to better study the network.

After studying the code, we noticed there was an error in the initial CSV that we generated for all the posts( with respective likers and commenters). The main reason why this happened was because when we saved the data into Excel, each individual row has the character limit of 35,000, and thus, many of the posts likers/commenters were truncated because each user's ID is about 20characters long, and those particular posts have >1000 Likes or Comments.

Therefore we revised the crawling code that polls the Facebook API. It can be found here.

Similarly, after ~4 hours of crawling, the data is presented here.

In addition to the limitation of Excel in terms of each cell's size, we also notice there was an error as to how we were constructing the GraphML file. We were constructing the graphML file using a Unimodal approach to plot the network, where the users and posts were both modelled as the same nodes and thus not representing our dataset well.

Hence, we now modelled the network as a bipartite graph, with the expectation that it should give us a good representation of the network. We revised the code, and it can be found here.

Before we ran the code, we tried to obtain a small representation of the posts(5) because we expected the network plot to take more than 8 hours to complete. The 5-post bipartite graphml file can be found here.

Loading it into Gephi and running the Yifan Hu algorithm, we get this result:

Screen Shot 2016-04-12 at 9.59.17 pm.png