Difference between revisions of "AY1516 T2 Team AP Data"

From Analytics Practicum
Jump to navigation Jump to search
Line 48: Line 48:
 
==<div style="background: #232AE8; line-height: 0.3em; font-family:helvetica;  border-left: #6C7A89 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#ffffff"><strong>Crawling</strong></font></div></div>==
 
==<div style="background: #232AE8; line-height: 0.3em; font-family:helvetica;  border-left: #6C7A89 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#ffffff"><strong>Crawling</strong></font></div></div>==
  
<p>The data provided by Skyscanner is what they currently rely on to determine their content planning. In addition to the parameters mentioned above, our group also want to track various other characteristics of the articles, namely:<br>
+
<p> As the data provided by SGAG doesn't allow us to map their social network, we need to crawl the data through Twitter and Instagram API.</p>
* Number of words (stop words removed)
 
* Number of outbound links references
 
* Number of images
 
* Number of videos
 
* Number of article shares
 
</p>
 
<p>The rationale for choosing these attributes will be mentioned in the methodology section of our proposal.</p>
 
<p>These attributes are not available for us. Thus we need to manually crawl the data from the website and merge it with the provided data set.
 
We noticed that Skyscanner website use Javascript to add content to the DOM when the HTML finishes loading. Because of that, normal crawlers without ability to execute Javascript code will not be able to crawl for data within the page after the DOM is modified.
 
</p>
 
<p>After some research, our group decided to employ the use of a headless browser, namely PhantomJS. It allows for Javascript code execution, DOM access, and programmatically interaction with websites without opening any real web browser. Another option is Selenium, but it needs to proxy through a browser installed on our machine.<br>
 
Looking through the DOM structure of  Skyscanner article, we found that the information we need is easily accessible. For example, the main article content is nested in a div block with CSS class “main-content”, and links to recommended other articles are provided in another div block with CSS class “addthis_recommended_vertical”. The repeating structure makes it easy for us to write code and scrap data from the website without too much trouble.<br>
 
<!-- INSERT DOM STRUCTURE HERE -->
 
After successfully scraping the DOM data, we can clear out HTML tags easily using a regular expression <strong>/<(\/|).+?>/g</strong>, then proceed to compute the necessary attributes that we want to collect.
 
</p>
 
  
 
==<div style="background: #232AE8; line-height: 0.3em; font-family:helvetica;  border-left: #6C7A89 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#ffffff"><strong>Merging data</strong></font></div></div>==
 
==<div style="background: #232AE8; line-height: 0.3em; font-family:helvetica;  border-left: #6C7A89 solid 15px;"><div style="border-left: #FFFFFF solid 5px; padding:15px;font-size:15px;"><font color= "#ffffff"><strong>Merging data</strong></font></div></div>==

Revision as of 01:44, 17 January 2016

HOME

OVERVIEW

ANALYSIS

PROJECT MANAGEMENT

DOCUMENTATION

Project Description Data Methodology

Dataset provided by Skyscanner

Currently, SGAG only uses the insights provided on Facebook Page Insights and SocialBakers to gauge the reception of its posts, and much of the data that they have access to has not been analysed on a deeper level.

They have provided us with social media metric data extracted from its social media platforms, namely Facebook, Twitter and Youtube. This gives us the following datasets that present a generic aggregated representation SGAG's followers:

  • Unique visitors, by day and month
  • Post level insights: Total Impressions, Reach, Feedback
  • Engagement Insights: Likes, Viewed, Commented

This does not assist us directly in mapping out SGAG's social network, and we would have to crawl for more data using the API for each social media platform pertaining to the social network.

Crawling

As the data provided by SGAG doesn't allow us to map their social network, we need to crawl the data through Twitter and Instagram API.

Merging data

Since the data is provided for each URL, we can easily match the URL between the data given by Skyscanner and the characteristics crawled by us. Thus, we will have a list of attributes mapped to URLs of each specific article.

Storing data

Our data needs to be saved in a convenient format so that we can use it as input for other analytic programs.

An option for fast querying is storing the data in a database. This approach provides easy export to other formats that can work with analytic software, and access from both a GUI and code. Another option is to store data in flat files for easy transport between systems. However, it will reduce accessibility since our code and program need to parse the information again.

With pros and cons in mind, we will proceed with the database approach initially, and make changes as the the project continues.