AY1516 T2 Team AP Data

Dataset provided by SGAG

Currently, SGAG only uses the insights provided by Facebook Page Insights and SocialBakers to gauge the reception of its posts, and much of the data it has access to has not been analysed at a deeper level.

SGAG has provided us with social media metric data extracted from its platforms, namely Facebook, Twitter and YouTube. This gives us the following datasets, which present a generic, aggregated view of SGAG's followers:

  • Unique visitors, by day and month
  • Post level insights: Total Impressions, Reach, Feedback
  • Engagement Insights: Likes, Viewed, Commented

This does not directly help us map out SGAG's social network; to do that, we would have to crawl additional data using each social media platform's API.

Crawling

Initial exploration with NodeXL
Initially, we decided to map out the social networks for SGAG's main platforms, Facebook and Twitter, in that order. In terms of tools, we first explored NodeXL because of its ease of data retrieval. However, due to the restrictions imposed on the free version of the tool, we decided to explore other options. Our initial exploratory plans with NodeXL are documented below.

Using NodeXL, we are able to extract SGAG's Twitter social network data.

This gives us the following information:

  • Followed/following relationships, represented by edges
  • Names of Twitter accounts associated with SGAG and their followers
  • Interactions with SGAG's posts (Favourites, Retweets and Replies)

Due to the Twitter API's rate limits, we will have to spend some time requesting the data. We have planned to complete this within one week.

After successfully crawling the data, we will load it into Gephi and begin our visualisation.

Here is an example of an expected network visualisation for a social media platform.

[Image: Expectedvis.png (expected network visualisation)]


Settling on the Facebook and Twitter APIs using Python
Following the restrictions we identified with NodeXL, we decided to write our own data crawling scripts in Python instead. Both Twitter and Facebook provide well-documented APIs for this purpose, so we went ahead with the plan. However, as we began to crawl the data, first from Twitter and then from Facebook, we ran into limitations imposed by each API. A summary of the problems we faced, and how we mitigated them, is given below:

Data crawling problems with the Twitter and Facebook APIs

Problem: Twitter API limits. A maximum of 15 tweets can be retrieved per 15-minute window.

Mitigation: We implemented a 'thread-pause' block in the code so that the crawler waits 15 minutes each time the limit is hit before continuing the crawl.
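As an illustration, here is a minimal sketch of this kind of pause-and-retry logic, assuming the crawler calls the Twitter REST API directly through the requests library; the function and variable names are illustrative rather than our exact implementation.

    import time
    import requests

    RATE_LIMIT_WAIT = 15 * 60  # Twitter's rate-limit window is 15 minutes

    def get_with_rate_limit(url, params, headers):
        """Issue a GET request against the Twitter REST API; if the rate
        limit is hit (HTTP 429), pause the thread for the full 15-minute
        window before retrying -- the 'thread-pause' block described above."""
        while True:
            resp = requests.get(url, params=params, headers=headers)
            if resp.status_code == 429:  # Too Many Requests: rate limit reached
                time.sleep(RATE_LIMIT_WAIT)
                continue
            resp.raise_for_status()
            return resp.json()

Libraries such as tweepy offer similar behaviour out of the box via their wait_on_rate_limit option, which would be an alternative to hand-rolling the pause.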

Problem: Facebook API limits. The access token used for data crawling expires after 1 to 2 hours, killing the crawling process each time that period lapses and effectively preventing us from crawling the entire dataset.

Mitigation: We implemented code to extend the lifetime of the access token to 60 days.
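The Graph API allows a short-lived user access token to be exchanged for a long-lived one, valid for roughly 60 days, through its fb_exchange_token flow. Below is a minimal sketch of that exchange; the app id and secret are placeholders, and the sketch assumes the endpoint returns JSON (older Graph API versions returned a URL-encoded query string instead).

    import requests

    def extend_access_token(app_id, app_secret, short_lived_token):
        """Exchange a short-lived Facebook access token for a long-lived one
        (valid for about 60 days) using the Graph API fb_exchange_token flow."""
        resp = requests.get(
            "https://graph.facebook.com/oauth/access_token",
            params={
                "grant_type": "fb_exchange_token",
                "client_id": app_id,          # placeholder: the Facebook app id
                "client_secret": app_secret,  # placeholder: the Facebook app secret
                "fb_exchange_token": short_lived_token,
            },
        )
        resp.raise_for_status()
        # Assumes a JSON response body containing the new token
        return resp.json()["access_token"]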

Merging data

The data retrieved from Twitter and Facebook is relatively raw and hard to work with, so we took the following steps to prepare it for our analysis.

Data preparation for Twitter

  • 1. Crawl JSON data via the Twitter API
  • 2. Extract the tweet ids, tweet content and number of retweets for each tweet
  • 3. Use the extracted tweet ids to retrieve the list of retweeters for each tweet; repeat for all tweets
  • 4. Merge each tweet's list of retweeters by concatenating the retweeter ids with a ';' delimiter
  • 5. Save the data in CSV format (a code sketch of steps 2-5 follows this list)
  • 6. Manually look through each retrieved tweet's content and categorise it (e.g. Army, Education)
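A minimal sketch of steps 2 to 5 is shown below. It assumes the crawled tweets are already available as a list of tweet dicts from the Twitter API, and that get_retweeter_ids() is a helper wrapping the statuses/retweeters/ids endpoint; the helper and the output filename are illustrative, not our exact implementation.

    import csv

    def prepare_twitter_data(tweets, get_retweeter_ids, out_path="sgag_tweets.csv"):
        """Flatten crawled tweet JSON into one row per tweet and save as CSV.
        `tweets` is a list of tweet dicts from the Twitter API;
        `get_retweeter_ids(tweet_id)` is an assumed helper that wraps the
        statuses/retweeters/ids endpoint and returns a list of user ids."""
        rows = []
        for tweet in tweets:
            retweeters = get_retweeter_ids(tweet["id_str"])
            rows.append({
                "tweet_id": tweet["id_str"],
                "content": tweet["text"],
                "retweet_count": tweet["retweet_count"],
                # merge the retweeter list into a single ';'-delimited field
                "retweeters": ";".join(str(uid) for uid in retweeters),
            })
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(
                f, fieldnames=["tweet_id", "content", "retweet_count", "retweeters"])
            writer.writeheader()
            writer.writerows(rows)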


Data preparation for Facebook

  • 1. Crawl JSON data via the Facebook Graph API
  • 2. Extract the post ids, post content, number of likers and number of commenters for each post
  • 3. Use the extracted post ids to retrieve the list of likers and list of commenters for each post; repeat for all posts
  • 4. Merge each post's list of likers by concatenating the liker ids with a ';' delimiter; do the same for the list of commenters
  • 5. Save the data in CSV format (a code sketch of steps 2-5 follows this list)
  • 6. Manually look through each retrieved post's content and categorise it (e.g. Army, Education)
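Analogously, here is a sketch of steps 2 to 5 for Facebook, assuming the crawled posts are available as a list of post dicts from the Graph API and that get_likers() and get_commenters() are helpers that page through a post's likes and comments edges; again, the helper names and output filename are illustrative.

    import csv

    def prepare_facebook_data(posts, get_likers, get_commenters,
                              out_path="sgag_posts.csv"):
        """Flatten crawled Facebook post JSON into one row per post and save
        as CSV. `posts` is a list of post dicts from the Graph API;
        `get_likers(post_id)` and `get_commenters(post_id)` are assumed
        helpers returning lists of user ids for a post's likes and comments."""
        rows = []
        for post in posts:
            likers = get_likers(post["id"])
            commenters = get_commenters(post["id"])
            rows.append({
                "post_id": post["id"],
                "content": post.get("message", ""),
                "num_likers": len(likers),
                "num_commenters": len(commenters),
                # merge each user list into a single ';'-delimited field
                "likers": ";".join(likers),
                "commenters": ";".join(commenters),
            })
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=[
                "post_id", "content", "num_likers", "num_commenters",
                "likers", "commenters"])
            writer.writeheader()
            writer.writerows(rows)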


Data visualisations via Gephi
The processed outputs from the steps above will be used to plot bipartite social networks in Gephi, with two types of nodes: (1) tweet or post nodes, connected by edges to (2) user nodes. An edge represents an interaction or engagement by that user with a particular post. In the Twitter network, each edge between (1) and (2) represents a retweet. In the Facebook network, each edge between (1) and (2) aggregates that user's likes and comments on that post, with a like carrying a weight of 1 and a comment a weight of 2. We chose these weights because we consider a like a relatively weaker interaction than a comment, and the weighting reflects this in the network.
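As an illustration of this weighting scheme, the sketch below turns the processed Facebook rows into a Gephi-ready edge list: a like contributes weight 1, a comment weight 2, and a user who both liked and commented on the same post ends up with a single edge of weight 3. The Source/Target/Weight columns follow Gephi's standard edge import format; the function and file names are illustrative.

    import csv

    def build_facebook_edge_list(rows, out_path="fb_edges.csv"):
        """Turn processed Facebook rows (as produced above) into a Gephi edge
        list. Each edge links a post node to a user node, with weight 1 per
        like and weight 2 per comment, accumulated per (post, user) pair."""
        weights = {}  # (post_id, user_id) -> accumulated edge weight
        for row in rows:
            post_id = row["post_id"]
            for uid in filter(None, row["likers"].split(";")):
                weights[(post_id, uid)] = weights.get((post_id, uid), 0) + 1
            for uid in filter(None, row["commenters"].split(";")):
                weights[(post_id, uid)] = weights.get((post_id, uid), 0) + 2
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Source", "Target", "Weight"])  # Gephi edge columns
            for (post_id, uid), weight in weights.items():
                writer.writerow([post_id, uid, weight])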

Storing data

Our data comes from multiple sources (Twitter and Facebook), and one consideration is the ease of retrieving the extracted SGAG network data after it has been stored. A relational database such as MySQL was initially preferred because it supports exports to various file formats, and data stored this way can be easily manipulated and accessed for visualisation and further analysis by external software.

However, as we progressed with the project, we realised that we were over-complicating the way the data is stored. We decided to simply persist the retrieved data in CSV files, firstly because CSV is a common input format for analysis software (e.g. Excel, Tableau), and also because of the ease of transferring data from one machine to another. As our team members work on the data, the retrieved and processed datasets inevitably have to be passed from one member to another, which would be harder if we used a database like MySQL.