AY1516 T2 Team AP Analysis


Initial Dataset provided by SGAG

During our initial meeting with SGAG, they provided several data files with information about their social media accounts. Upon further inspection, we realised that the data provided were largely aggregated, so even if we loaded them into network analysis tools such as Gephi or Graphviz to analyse SGAG's social network, the result would not be a correct representation. In addition, the Tweet Activity Metrics could not show how popular each post was with specific users, rendering the data largely unusable.

Hence, we decided to retrieve the data ourselves from Twitter, in an attempt to visualise SGAG's social network with specific users rather than aggregated data. We did this by leveraging the Twitter public API, tailoring the data collection to our needs.

Twitter API Exploration

As explained in the previous section, our initial dataset was insufficient for our analysis. We therefore decided to retrieve more substantial data directly from Twitter through its publicly exposed APIs. We researched various Python libraries suitable for data crawling, exploring wrapper libraries such as Tweepy and python-twitter, and settled on Tweepy, given its strong community support and ease of use.
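As an illustration, a minimal Tweepy set-up looks like the sketch below. The key and token values are placeholders; the actual credentials are issued by Twitter when registering an application.

  import tweepy

  # Placeholder credentials obtained from Twitter's application management console
  CONSUMER_KEY = "xxx"
  CONSUMER_SECRET = "xxx"
  ACCESS_TOKEN = "xxx"
  ACCESS_TOKEN_SECRET = "xxx"

  # Authenticate with Twitter via OAuth and create the API client
  auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
  auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
  api = tweepy.API(auth)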

Introduction
Our exploratory analysis aims to analyse the behaviour of Twitter users at the individual post level: to find out which types of posts tend to be retweeted, and to profile the kinds of Twitter users who are more likely to retweet them.

Data Crawling
We carried out the data collection process in two steps:
1) Using Tweepy, we retrieved the data in JSON format.
2) We parsed the JSON and wrote the data to a CSV file.

In Step 1, we used the user_timeline() method to retrieve all SGAG posts. For each SGAG post, the retweeters() method was called to retrieve the list of retweeters of that particular post. These posts were then converted to CSV format before further data processing was done.

Below are examples of the method usage:

[Figure: user_timeline() usage]
The screen name "sgag" is passed as a parameter to the user_timeline() method, which returns a list of tweets/posts by SGAG.

[Figure: retweeters() usage]
The "id" of every SGAG post is passed as a parameter to the retweeters() method, which then returns the pages (lists) of retweeter ids. These ids are then concatenated into a single column value, delimited by ';', and written to the CSV file.

Data Details
From the data retrieved via the Tweepy calls, we identified various variables of interest for further analysis. We were able to retrieve the following variables, which allowed us to better understand how each SGAG post fared:

Variable Name: Description

text: Content of the SGAG tweet
id: Unique identifier of the SGAG tweet
favorite_count: Number of times this particular tweet was favourited (liked)
retweet_count: Number of retweeters of this particular tweet (as reported by Twitter)
created_at: Timestamp at which this tweet was made
list_of_retweeter_ids: List of ids of the retweeters of this particular tweet
list_of_retweeter_ids_count: Number of retweeter ids retrieved for this particular tweet

Upon deep-diving into the dataset retrieved, we found that the number of retweeters of a particular post as reported by Twitter (the "retweet_count" variable) does not correspond to the "list_of_retweeter_ids_count" value, which is the number we retrieved via data crawling. This is because of Twitter's privacy settings: users can mark their account as "protected", which prevents their data from being collected. Hence, there is a discrepancy between the two numbers, with the "list_of_retweeter_ids_count" value being significantly lower than the corresponding "retweet_count" value for each tweet. For the purposes of this project, we will focus on the "list_of_retweeter_ids_count" value, as it reflects the data of users that we are able to obtain via the Twitter/Tweepy APIs.
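This discrepancy can be checked quickly against the CSV produced earlier; the sketch below assumes the column names above and the pandas library:

  import pandas as pd

  df = pd.read_csv("sgag_tweets.csv")
  # Difference between Twitter's reported retweet count and the ids we could actually crawl
  df["hidden_retweeters"] = df["retweet_count"] - df["list_of_retweeter_ids_count"]
  print(df["hidden_retweeters"].describe())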

Data Preparation
Following that, we added a further logical categorisation of each tweet based on its "text" content. We noticed that every tweet had a particular theme; for instance, some tweets are a parody of current political issues in Singapore, while others discuss school life as we know it. Therefore, we came up with a list of 15 possible categories, inspected each individual tweet, and slotted it into one of the categories below (see the sketch after the list for one way such manual labels can be merged back into the dataset):

Categories

  1. Politics
  2. Holiday
  3. Commuting
  4. Interpersonal messages
  5. Army
  6. Sports
  7. School
  8. News satire
  9. Food
  10. Goodwill/Feel good
  11. Weird public behaviour
  12. Weather
  13. Retail experience
  14. National event
  15. Relationships
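Since the categorisation was done by manual inspection, one simple way to attach the labels to the crawled data is to record them in a separate file keyed by tweet id and merge them in. The sketch below assumes a hypothetical tweet_categories.csv with columns id and category:

  import pandas as pd

  tweets = pd.read_csv("sgag_tweets.csv")
  # Hypothetical file of hand-assigned labels, one row per tweet id
  categories = pd.read_csv("tweet_categories.csv")  # columns: id, category
  labelled = tweets.merge(categories, on="id", how="left")
  labelled.to_csv("sgag_tweets_labelled.csv", index=False)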

Limitations

During our data collection phase, we faced issues due to Twitter's REST API rate-limit policies. Rate limits on Twitter are split into 15-minute intervals, with each method call having its own set of limits that restricts the amount of data we can gather within that window.
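One way to work within these windows is to ask Tweepy to pause automatically when a rate limit is hit; a sketch, assuming the same OAuth handler as before and the Tweepy 3.x constructor options:

  # Ask Tweepy to sleep until the 15-minute window resets instead of raising an error
  api = tweepy.API(auth,
                   wait_on_rate_limit=True,
                   wait_on_rate_limit_notify=True)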

In addition, the user_timeline() method can only return up to 3,200 of SGAG's most recent tweets, which form the dataset of interest that we work with. Retweets by SGAG of other statuses are also included in this count.