Difference between revisions of "AY1516 T2 Team AP Data"

From Analytics Practicum

Revision as of 20:49, 17 April 2016


Dataset provided by SGAG

Currently, SGAG relies only on the insights provided by Facebook Page Insights and SocialBakers to gauge the reception of its posts, and much of the data it has access to has not been analysed at a deeper level.

SGAG has provided us with social media metric data extracted from its platforms, namely Facebook, Twitter and YouTube. This gives us the following datasets, which present a generic, aggregated view of SGAG's followers:

  • Unique visitors, by day and month
  • Post level insights: Total Impressions, Reach, Feedback
  • Engagement Insights: Likes, Viewed, Commented

These datasets do not directly help us map out SGAG's social network, so we will have to crawl for more network data using each social media platform's API.

Crawling

Initial exploration with NodeXL

Initially, we planned to map out the social networks for SGAG's main platforms, Facebook and Twitter, in that order. The first tool we considered was NodeXL, because it makes data retrieval easy. However, the free version of the tool restricts usage, so we decided to explore other options. Our initial exploratory plans with NodeXL are documented below.

We will have to crawl the data through the Twitter API. Using NodeXL, we are able to extract SGAG's Twitter social network data.

This gives us the following information:

  • Follower/following relationships, represented as edges
  • Names of Twitter accounts associated with SGAG and their followers
  • Interactions with SGAG's posts (Favourites, Retweets and Replies)

Because of the Twitter API's query rate limit, requesting the data will take some time. We have arranged to complete this within 1 week.
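
As a rough illustration of why the rate limit stretches the crawl over several days, the sketch below paces requests in fixed windows. It is not the team's crawler: `crawl_in_windows`, the `fetch` callback and the default budget of 15 calls per 900 seconds are hypothetical placeholders, and the real limits depend on the endpoint being queried.

```python
import time


def crawl_in_windows(user_ids, fetch, max_calls_per_window=15,
                     window_seconds=900, sleep=time.sleep):
    """Call fetch(user_id) for every ID, pausing when the window budget is spent.

    The default budget values are illustrative assumptions, not documented limits.
    """
    results = {}
    calls_in_window = 0
    for user_id in user_ids:
        if calls_in_window == max_calls_per_window:
            sleep(window_seconds)  # wait for the next rate-limit window
            calls_in_window = 0
        results[user_id] = fetch(user_id)
        calls_in_window += 1
    return results
```

Passing `sleep` as a parameter keeps the pacing logic testable without actually waiting; a real crawl would leave it at the default `time.sleep`.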

After successfully crawling the data, we will load it into Gephi and begin our visualisation.
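
One simple way to hand crawled relationships to Gephi is a CSV edge list with `Source` and `Target` columns, which Gephi's spreadsheet import can read. The sketch below assumes the crawl produced (follower, followed) pairs; the function name `write_edge_csv` and the sample account names are hypothetical.

```python
import csv
import io


def write_edge_csv(edges, out):
    """Write (follower, followed) pairs as a Source/Target edge list."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    for follower, followed in edges:
        writer.writerow([follower, followed])


# Hypothetical sample data; a real run would write to a file instead.
buffer = io.StringIO()
write_edge_csv([("alice", "SGAG"), ("bob", "SGAG")], buffer)
edge_csv_text = buffer.getvalue()
```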

Here is an example of an expected network visualisation for a social media platform.

[Figure: Expectedvis.png]


Settling on Facebook and Twitter API using Python

Merging data

The Tweet ID that SGAG provides for each tweet will be matched against the crawled data above and used to plot networks that link each tweet to its retweets, replies, likes and other interactions.
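
The matching step can be sketched as a join on the Tweet ID. The record layout here is an assumption made for illustration: the field names `tweet_id`, `user` and `kind`, and the function `attach_interactions`, do not come from SGAG's dataset or from the crawler.

```python
def attach_interactions(sgag_tweets, crawled_interactions):
    """Keep only crawled interactions whose tweet_id appears in SGAG's data.

    Returns (user, tweet_id, kind) triples that can serve as network edges.
    """
    known_ids = {tweet["tweet_id"] for tweet in sgag_tweets}
    edges = []
    for item in crawled_interactions:
        if item["tweet_id"] in known_ids:
            edges.append((item["user"], item["tweet_id"], item["kind"]))
    return edges
```

Interactions that refer to tweets outside SGAG's dataset are dropped, so every edge in the result links a user to a tweet that SGAG's own metrics also cover.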

NodeXL provides easy importing of Twitter network data. The imported data will then be prepared and cleaned in two ways: duplicate edges will be merged to reduce noise, and nodes will be grouped using a clustering algorithm. Metrics and graphs of the network will also be generated.
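
Merging duplicate edges usually means collapsing repeated (source, target) pairs into a single edge that carries a weight. The sketch below shows that idea in plain Python; it is an illustration of the concept under that assumption, not NodeXL's own implementation, and `merge_duplicate_edges` is a hypothetical name.

```python
from collections import Counter


def merge_duplicate_edges(edges):
    """Collapse repeated (source, target) pairs into weighted edges."""
    weights = Counter(edges)
    return [(source, target, weight)
            for (source, target), weight in sorted(weights.items())]
```

For example, two retweets from the same user to the same account become one edge with weight 2, which keeps the graph readable while preserving how strong the tie is.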

Storing data

Our data comes from multiple sources (Twitter and Facebook), so one consideration is how easily the extracted SGAG network data can be retrieved after it is stored. A relational database such as MySQL is therefore preferred, since it supports exports to various file formats. Data stored this way can also be easily accessed and manipulated by external software for visualisation and further analysis.
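
A minimal sketch of what a single relational table for both platforms might look like is shown below. It uses Python's built-in `sqlite3` module as a stand-in so the example is self-contained; the project names MySQL, and the table layout, column names and helper functions here are assumptions made for illustration.

```python
import sqlite3


def store_edges(conn, edges):
    """Insert (platform, source, target, kind) rows into an edges table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edges ("
        "platform TEXT NOT NULL, source TEXT NOT NULL, "
        "target TEXT NOT NULL, kind TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?)", edges)
    conn.commit()


def count_by_platform(conn):
    """Return the number of stored edges per platform."""
    rows = conn.execute(
        "SELECT platform, COUNT(*) FROM edges "
        "GROUP BY platform ORDER BY platform"
    )
    return dict(rows.fetchall())
```

Keeping a `platform` column lets Twitter and Facebook edges share one table, so a later export or visualisation step can filter by source with a plain SQL query.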