AY1516 T2 Team AP Analysis
Initial Dataset provided by SGAG
During our initial meeting with SGAG, they provided several data files containing information about their social media accounts. On closer inspection, we realised that the data were largely aggregated, so loading them into tools such as Gephi or Graphviz would not give a correct representation of SGAG's social network. In addition, the Tweet Activity Metrics did not show how popular each post was with specific users, which made the data of little use for our analysis.
Hence, we decided to retrieve the data ourselves from Twitter, in an attempt to visualise SGAG's social network at the level of specific users rather than aggregates. We did this using the Twitter public API, which let us tailor the retrieval to our data collection needs.
Twitter API Exploration
As explained in the previous section, our initial dataset was insufficient for our analysis. We therefore decided to retrieve more substantial data directly from Twitter through its publicly exposed APIs. We looked into Python libraries suitable for data crawling, including wrapper libraries such as Tweepy and python-twitter, and settled on Tweepy for its strong community support and ease of use.
Introduction
Our exploratory analysis aims to analyse the behaviour of Twitter users at an individual post level: to find out which types of posts tend to be retweeted, and to profile the kind of Twitter users who are more likely to retweet them.
Data Crawling
We carried out data collection in two steps:
1) Using Tweepy, we retrieved the data in JSON format.
2) We parsed the JSON and wrote the data to a CSV file.
In Step 1, we used the user_timeline() method to retrieve all SGAG posts. For each SGAG post, we called the retweeters() method to retrieve the list of retweeters of that post. In Step 2, the posts were converted to CSV format before any further data processing.
The two methods are used as follows:
The screen name "sgag" is passed as a parameter to the user_timeline() method, which returns a list of tweets/posts by SGAG.
The "id" of every SGAG post is passed as a parameter to the retweeters() method, which returns all the pages (lists) of retweeter ids. These ids are then concatenated into a single column value, delimited by ';', and written to the CSV file.
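The two calls described above can be sketched roughly as follows. This is a minimal illustration rather than our exact script: it assumes `api` is an authenticated `tweepy.API` object from Tweepy 3.x (newer Tweepy releases renamed `retweeters()` to `get_retweeter_ids()`), and the helper names `fetch_timeline` and `write_rows`, as well as the column order, are made up for this example.

```python
import csv
import io


def fetch_timeline(api, screen_name, page_size=200):
    """Collect a user's tweets, walking backwards through the timeline."""
    tweets, max_id = [], None
    while True:
        kwargs = {"screen_name": screen_name, "count": page_size}
        if max_id is not None:
            kwargs["max_id"] = max_id
        page = api.user_timeline(**kwargs)
        if not page:
            break
        tweets.extend(page)
        # Next request: only tweets strictly older than the oldest one seen.
        max_id = min(t.id for t in page) - 1
    return tweets


def write_rows(api, tweets, out):
    """Write one CSV row per tweet, with retweeter ids joined by ';'."""
    writer = csv.writer(out)
    writer.writerow(["id", "text", "favorite_count", "retweet_count",
                     "created_at", "list_of_retweeter_ids",
                     "list_of_retweeter_ids_count"])
    for t in tweets:
        # Simplification: only the first page of retweeter ids is requested;
        # paging through further results is omitted for brevity.
        ids = list(api.retweeters(t.id))
        writer.writerow([t.id, t.text, t.favorite_count, t.retweet_count,
                         str(t.created_at),
                         ";".join(str(i) for i in ids), len(ids)])
```

With a real connection, `fetch_timeline(api, "sgag")` would be followed by `write_rows(api, tweets, open_file)`, where `open_file` is a file opened for writing with `newline=""`.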
Data Details
From the data retrieved via the Tweepy calls, we identified the following variables of interest, which allowed us to better understand how each SGAG post fared:
Variable Name | Description |
---|---|
text | Content of the SGAG tweet |
id | Unique identifier of the SGAG tweet |
favorite_count | Number of times this tweet was favourited |
retweet_count | Number of retweets of this tweet, as reported by Twitter |
created_at | Timestamp at which this tweet was posted |
list_of_retweeter_ids | List of ids of the retweeters of this tweet, as retrieved via crawling |
list_of_retweeter_ids_count | Number of retweeter ids retrieved for this tweet |
On closer inspection of the retrieved dataset, we found that the number of retweets of a post as reported by Twitter (the "retweet_count" variable) does not match the "list_of_retweeter_ids_count" value, which is the number of retweeter ids we retrieved via data crawling. This is due to Twitter's privacy settings: users can mark their account as protected, which prevents their data from being collected. As a result, the "list_of_retweeter_ids_count" value is significantly lower than the corresponding "retweet_count" value for each tweet. For this project, we focus on the "list_of_retweeter_ids_count" value, as it reflects the users whose data we are able to obtain via the Twitter/Tweepy APIs.
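To make the discrepancy concrete, the gap between the two counts can be computed per CSV row along these lines. This is only a sketch: the function name `retweeter_coverage` and the sample values are invented for illustration and are not figures from our dataset.

```python
def retweeter_coverage(row):
    """Compare the crawled retweeter ids against Twitter's reported count.

    `row` is a dict with the CSV columns "list_of_retweeter_ids"
    (ids joined by ';') and "retweet_count".
    """
    ids = [x for x in row["list_of_retweeter_ids"].split(";") if x]
    reported = int(row["retweet_count"])
    return {
        "retrieved": len(ids),
        "reported": reported,
        "missing": reported - len(ids),
        # Share of reported retweets for which we hold a user id.
        "coverage": len(ids) / reported if reported else None,
    }


# Illustrative row only, not real SGAG data.
example = {"list_of_retweeter_ids": "101;102;103", "retweet_count": "10"}
summary = retweeter_coverage(example)
```

In this made-up row, three ids were retrieved against ten reported retweets, so seven retweeters are unaccounted for.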
Data Preparation
Following that, we added a logical categorisation of each tweet based on its "text" content. We noticed that every tweet had a particular theme: for instance, some tweets parody current political issues in Singapore, while others discuss school life as we know it. We therefore came up with a list of 15 possible categories, inspected each tweet individually and assigned it to one of them:
Categories
- Politics
- Holiday
- Commuting
- Interpersonal messages
- Army
- Sports
- School
- News satire
- Food
- Goodwill/Feel good
- Weird public behaviour
- Weather
- Retail experience
- National event
- Relationships
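Once each tweet carries a category label, a per-category summary becomes straightforward. The sketch below assumes the labels are available as (category, retweeter count) pairs; the function name `summarise_by_category` and the sample pairs are hypothetical and do not reflect our actual labelling results.

```python
from collections import defaultdict

CATEGORIES = [
    "Politics", "Holiday", "Commuting", "Interpersonal messages", "Army",
    "Sports", "School", "News satire", "Food", "Goodwill/Feel good",
    "Weird public behaviour", "Weather", "Retail experience",
    "National event", "Relationships",
]


def summarise_by_category(labelled):
    """Return tweet count and mean retweeter count for each category seen."""
    totals = defaultdict(lambda: [0, 0])  # category -> [tweets, retweeters]
    for category, retweeters in labelled:
        if category not in CATEGORIES:
            raise ValueError("unknown category: %s" % category)
        totals[category][0] += 1
        totals[category][1] += retweeters
    return {c: {"tweets": n, "mean_retweeters": total / n}
            for c, (n, total) in totals.items()}


# Illustrative pairs only, not real SGAG data.
result = summarise_by_category([("Food", 4), ("Food", 6), ("Army", 3)])
```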
Limitations
During our data collection phase, we faced issues due to Twitter's REST API rate limits. Rate limits in Twitter are applied over 15-minute windows, and each method has its own limit, which restricts the amount of data we can gather within each window.
In addition, the user_timeline() method can only return up to 3,200 of SGAG's most recent tweets, which form the dataset that we work with. Retweets of other accounts' statuses by SGAG are also included in this dataset.
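A rough way to see the impact of these limits is to estimate how long a crawl must wait in total. The sketch below assumes a hypothetical per-window allowance passed in as `calls_per_window`; the actual allowance differs per method and should be taken from Twitter's documentation. Tweepy can also pause automatically when `tweepy.API` is created with `wait_on_rate_limit=True`.

```python
import math


def waiting_minutes(total_calls, calls_per_window, window_minutes=15):
    """Estimate minutes spent waiting for rate-limit windows to reset.

    The first window needs no waiting; every further window costs one
    full reset period.
    """
    if total_calls <= 0:
        return 0
    windows = math.ceil(total_calls / calls_per_window)
    return (windows - 1) * window_minutes


# Illustrative numbers only: 100 calls at a hypothetical 15 calls per window.
estimate = waiting_minutes(100, 15)
```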
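Since SGAG's retweets of other accounts appear alongside its own posts, it can be useful to tell the two apart. In the Twitter REST API, a retweet carries a `retweeted_status` field, which Tweepy exposes as an attribute on the status object. The helper below is a sketch under that assumption; its name is hypothetical and it is not part of our pipeline.

```python
def split_originals_and_retweets(statuses):
    """Separate a timeline into original posts and retweets of others."""
    originals, retweets = [], []
    for status in statuses:
        if hasattr(status, "retweeted_status"):
            retweets.append(status)
        else:
            originals.append(status)
    return originals, retweets
```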