AY1516 T2 Team AP Analysis

From Analytics Practicum


Initial Dataset provided by SGAG

During our initial meeting with SGAG, they provided several data files containing information about their social media accounts. On closer inspection, we realised that the data were largely aggregated. Even if we loaded them into analysis tools such as Gephi or Graphviz to analyse SGAG's social network, the result would not be an accurate representation. In addition, the Tweet Activity Metrics could not show how popular each post was with specific users, rendering the data fairly unusable for our purposes.

Hence, we decided to retrieve the data ourselves from Twitter, in an attempt to visualise SGAG's social network at the level of specific users rather than aggregates. We did this using the Twitter public API, which let us tailor the collection to our needs.

Twitter API Exploration

As explained in the previous section, our initial dataset was insufficient for our analysis. We therefore decided to retrieve more substantial data directly from Twitter through its public APIs. We researched Python libraries suitable for data crawling, exploring wrapper libraries such as Tweepy and python-twitter, and settled on Tweepy for its large community support and ease of use.

Methodology
Our exploratory analysis aims to analyse the behaviour of Twitter users at the level of individual posts: to find out which types of posts tend to be retweeted, and to profile the kinds of Twitter users who are more prone to retweet them.

We carried out data collection in two steps:
1) Using Tweepy, we retrieved the data in JSON format.
2) We parsed the JSON and wrote the data to a CSV file.
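
As a rough illustration of Step 2, the sketch below parses a JSON array of tweet objects and writes one CSV row per tweet using only the Python standard library. The function name tweets_json_to_csv, the column list in FIELDS and the sample records are assumptions made for this example; they are not the team's actual script or schema.

```python
import csv
import io
import json

# Illustrative column choice (an assumption, not the project's real schema).
FIELDS = ["id", "created_at", "text", "retweet_count"]


def tweets_json_to_csv(json_text, out_file):
    """Parse a JSON array of tweet objects and write one CSV row per tweet."""
    tweets = json.loads(json_text)
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    for tweet in tweets:
        # Keep only the chosen columns; missing keys become empty cells.
        writer.writerow({name: tweet.get(name, "") for name in FIELDS})
    return len(tweets)


# Made-up sample data standing in for the JSON retrieved in Step 1.
sample = json.dumps([
    {"id": 1, "created_at": "Mon Jan 04 10:00:00 +0000 2016",
     "text": "hello, world", "retweet_count": 3, "lang": "en"},
    {"id": 2, "text": "second post"},
])
buffer = io.StringIO()
rows_written = tweets_json_to_csv(sample, buffer)
```

Writing to an in-memory buffer keeps the example self-contained; passing a file opened with open(path, "w", newline="") would write to disk instead.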

In Step 1, we used the user_timeline() method to retrieve all SGAG posts. For each SGAG post, we then called the retweeters() method to retrieve the list of users who retweeted that particular post. The posts were then converted to CSV format before further data processing.

Below are examples of the method usage:

User timeline usage.PNG
The screen name "sgag" is passed as a parameter to the user_timeline() method, which returns a list of tweets (posts) by SGAG.
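
Since the screenshot is not reproduced here, the following is a minimal sketch of how such a call might be wrapped to page through a timeline. It assumes api is an authenticated tweepy.API object whose user_timeline() accepts the screen_name, count and max_id keyword arguments; the helper name fetch_all_posts and the page size are hypothetical and not taken from the team's code.

```python
def fetch_all_posts(api, screen_name, page_size=200):
    """Collect a user's posts by paging backwards through the timeline.

    `api` is assumed to behave like an authenticated tweepy.API object.
    """
    posts = []
    max_id = None
    while True:
        kwargs = {"screen_name": screen_name, "count": page_size}
        if max_id is not None:
            kwargs["max_id"] = max_id
        page = api.user_timeline(**kwargs)
        if not page:
            break  # an empty page means there is nothing older to fetch
        posts.extend(page)
        # Ask next for posts strictly older than the last one received.
        max_id = page[-1].id - 1
    return posts


# Hypothetical usage, given an authenticated client:
# api = tweepy.API(auth)
# sgag_posts = fetch_all_posts(api, "sgag")
```

Note that the Twitter API limits how far back a user timeline can be read (about the 3,200 most recent tweets), so a loop like this may not reach every post of a long-running account.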

Retweeters usage.PNG
The "id" of every SGAG post is passed as a parameter to the retweeters() method, which returns the pages (lists) of retweeter ids. These ids are then concatenated into a single column value, delimited by ';', and written to the CSV file.
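
The concatenation described above can be sketched as follows. The pages of ids are assumed to have already been fetched with retweeters(); the helper names join_retweeter_ids and write_retweeter_rows, the column headers and the sample ids are all hypothetical and only illustrate the ';'-delimited layout.

```python
import csv
import io


def join_retweeter_ids(pages, delimiter=";"):
    """Flatten pages (lists) of retweeter ids into one delimited string."""
    return delimiter.join(str(uid) for page in pages for uid in page)


def write_retweeter_rows(rows, out_file):
    """Write one CSV row per post: the post id and its joined retweeter ids."""
    writer = csv.writer(out_file)
    writer.writerow(["post_id", "retweeter_ids"])  # assumed column names
    for post_id, pages in rows:
        writer.writerow([post_id, join_retweeter_ids(pages)])


# Made-up ids: post 42 has two pages of retweeters, post 43 has none.
example_pages = [[111, 222], [333]]
buffer = io.StringIO()
write_retweeter_rows([(42, example_pages), (43, [])], buffer)
```

Keeping all retweeters in one cell gives one row per post, at the cost of having to split the cell on ';' again before per-user analysis.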