AY1516 T2 Team SkyTrek Methodology

From Analytics Practicum



Overview

The following summary pairs each objective of this practicum with the analytical method(s) we propose to use to achieve it.

Objective:
  • To validate existing Content Themes (CT) and possibly identify new ones
  • To validate the current allocation of resources to the various CT based on evaluated performance
Analytical Method(s): Content Theme Performance Analysis via Clustering

Objective: To increase organic growth by identifying key article attributes that draw high levels of traffic and interest
Analytical Method(s): Understanding the performance of news article attributes, based on organic viewership, via Logistic Regression

Objective: To facilitate the content planning process by way of an interactive dashboard
  • Explore the effectiveness of advertising efforts (paid articles)
  • Investigate the organic growth of articles (non-paid articles)
  • Investigate the performance of Content Theme Clustering
Analytical Method(s): Data Visualization


Content Themes Analysis

Skyscanner has classified its collection of 399 news articles into 7 content themes (CT): Skyscanner Product, Practical/Tips, Deals - Prices, Trending, Domestic/Local, City Guides and Inspirational. It would be useful to know which CT are best received by readers, as this would allow a more targeted allocation of article-writing resources. The converse also holds true. This analysis is predicated on the validity of the CT provided, so it is prudent to first validate these pre-identified CT. To do so, we employed clustering analysis.

When clustering articles, one would intuitively base the classification on the content of the article body. However, since our primary objective was to evaluate the performance of the CT discovered through clustering, we found it more useful to cluster on the article titles. After all, given the flood of information on the internet and social media, the decision to view an article depends largely on the appeal of its title.

This raises the question: shouldn’t the clusters generated from the article body be similar to those generated from the article title? While the CT identified were similar, the articles making up the clusters differed, leading to different conclusions in the CT performance analysis. There are a few reasons for this. First, the clustering algorithm does not directly assign articles to CT; the clusters must then be profiled and assigned to new or existing CT, a step that is inherently subjective. Second, Skyscanner is known to use highly searched keywords in its article titles in a bid to drive article performance, which can force in keywords that differ from the article content. Our analysis will demonstrate this disparity in CT performance between article title and body. Considering the order of events in a typical decision on whether to read an article, we find that the article title takes precedence. Hence, we base our recommendations on the article title analysis.
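The title clustering described above can be sketched as follows. The source does not name the clustering algorithm or the text representation, so the TF-IDF weighting, the spherical k-means loop, the function names and the example titles are all assumptions made for illustration.

```python
import math
from collections import Counter

def unit(vec):
    """Scale a sparse vector (dict: term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {term: w / norm for term, w in vec.items()}

def tfidf_vectors(titles):
    """Represent each title as a unit-length TF-IDF vector."""
    docs = [Counter(title.lower().split()) for title in titles]
    n_docs = len(docs)
    # Document frequency: number of titles in which each term appears.
    doc_freq = Counter(term for doc in docs for term in doc)
    return [
        unit({t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
              for t, tf in doc.items()})
        for doc in docs
    ]

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def cluster_titles(titles, k, iterations=20):
    """Spherical k-means on TF-IDF title vectors; returns one label per title."""
    vectors = tfidf_vectors(titles)
    # Deterministic farthest-point start: each new centroid is the title
    # least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assign every title to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        # Recompute each centroid as the normalised sum of its members.
        for j in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == j:
                    total.update(v)
            if total:  # keep the old centroid if a cluster ends up empty
                centroids[j] = unit(total)
    return labels

# Made-up titles, not taken from the Skyscanner data set.
titles = [
    "cheap flight deals to tokyo",
    "best cheap flight deals this week",
    "paris city guide for food lovers",
    "rome city guide for art lovers",
]
labels = cluster_titles(titles, k=2)
```

The resulting labels are only cluster numbers. Profiling each cluster and mapping it to a new or existing CT remains a separate, manual step, which is where the subjectivity discussed above enters.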

Logistic Regression on Article Characteristics

Unlike linear regression, logistic regression requires a binary dependent variable, with the outcomes ‘Success’ and ‘Failure’ modeled as 1 and 0 respectively. The model estimates the probability of success of the dependent variable from the values of the independent variables supplied to it. In our business case, the goal is to understand the factors that affect the ‘Success’ or ‘Failure’ of a given news article.
After discussions with our client at Skyscanner, we concluded that the ‘Success’ of an article depends on how many unique page views it attracts. If an article ‘Fails’, we can conclude that it is not worth investing time and effort in creating it. Based on this business rationale, we frame our analytical problem as Success or Failure in Unique Page Views, explained by other attribute values such as No of Images, No of Links, Article Length etc.
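For reference, the standard form of the logistic regression model is given below. The notation is ours rather than the source's: Y is the Success/Failure outcome, x_1 to x_p are the article attributes, and the beta terms are the coefficients to be estimated.

```latex
P(Y = 1 \mid x_1, \dots, x_p)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}},
\qquad
\log \frac{P(Y = 1 \mid x_1, \dots, x_p)}{1 - P(Y = 1 \mid x_1, \dots, x_p)}
  = \beta_0 + \sum_{i=1}^{p} \beta_i x_i
```

The second expression shows that each coefficient acts additively on the log-odds of Success, which is what makes the fitted coefficients interpretable as attribute effects.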

Dependent Variable: Unique Page Views
The client has categorized our dependent variable to take two values:

  • Success: Over 400 Unique Views
  • Failure: Under 400 Unique Views
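A minimal sketch of this labelling rule follows. The list above does not say how an article with exactly 400 views is treated; the sketch assumes it counts as a Success, in line with the phrase "a minimum of 400 page views" used in this section. The function and constant names are hypothetical.

```python
SUCCESS_THRESHOLD = 400  # unique page views, as set by the client

def label_article(unique_page_views, threshold=SUCCESS_THRESHOLD):
    """Return 1 for 'Success' and 0 for 'Failure'."""
    # Assumption: an article with exactly `threshold` views is a Success.
    return 1 if unique_page_views >= threshold else 0

# Made-up view counts for illustration.
example_labels = [label_article(v) for v in (120, 399, 400, 2500)]
```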

This implies that if the model predicts that an article idea cannot attain a minimum of 400 unique page views, it may not be in the interest of the business to pursue it. After examining the distributions of the independent variables, we removed outliers and tested for collinearity in order to pick the most appropriate attributes for the logistic regression. To understand the factors that lead to high article performance, we used the following attributes as the independent variables:

Predictors:

  1. Average Time on Page
  2. No of Images
  3. No of Links
  4. No of Words
  5. Bounce Rate
  6. Exit Rate
  7. News Content Title Theme (categorical, from clustering)
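The collinearity test mentioned before the predictor list can be sketched as a pairwise Pearson correlation check. The 0.8 cutoff, the function names and the toy values are assumptions; the source does not state which test or threshold was used, nor why any particular predictor was dropped.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / (spread_x * spread_y)

def collinear_pairs(columns, cutoff=0.8):
    """List predictor pairs whose absolute correlation reaches the cutoff."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= cutoff:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Made-up values for five articles, for illustration only.
columns = {
    "No of Words": [300, 500, 800, 1200, 1500],
    "No of Images": [2, 3, 5, 8, 9],
    "No of Links": [4, 1, 6, 2, 5],
}
flagged = collinear_pairs(columns)
```

In this toy data only the words/images pair is flagged, so one of the two would be a candidate for removal before fitting the model.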

Removed Predictors:

  1. No. of Shares
  2. Sessions
  3. Pageviews
  4. Organic Searches
  5. Published Date
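To make the modelling step concrete, here is a small sketch of fitting a logistic regression by batch gradient descent. It is not the fitting procedure used in the project, which the source does not describe. The two features, their standardised values, the learning rate and the function names are assumptions; a categorical predictor such as the title theme would first need to be one-hot encoded.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(weights, x):
    """Probability of Success; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Minimise the average log-loss by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict_probability(weights, x) - y
            grad[0] += error
            for i, value in enumerate(x):
                grad[i + 1] += error * value
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad)]
    return weights

# Made-up, standardised features: [No of Images, No of Words].
rows = [[-1.0, -0.5], [-0.8, -1.0], [-0.2, -0.3],
        [0.3, 0.4], [0.9, 0.7], [1.2, 1.1]]
labels = [0, 0, 0, 1, 1, 1]  # 1 = Success, 0 = Failure
weights = fit_logistic(rows, labels)
p_high = predict_probability(weights, [1.0, 1.0])
p_low = predict_probability(weights, [-1.0, -1.0])
```

A positive fitted coefficient means that higher values of that attribute raise the estimated probability of Success, and a negative one lowers it.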

Data Visualization

Source/Medium Dashboard

From discussions with our sponsor, we learned that the Skyscanner Content Team would like to perform exploratory analysis on the effectiveness of paid and unpaid articles across the multiple platforms on which Skyscanner promotes its content. We therefore created an interactive dashboard in Tableau to facilitate this exploration, based on the data we have gathered.

SkyTrek source dashboard.png

Since the attributes of interest for Skyscanner are unique page views (UPV) and average time on page (ATOP), and since Skyscanner also wants to understand which platforms or mediums bring the most satisfactory results, we chose to represent the information as a treemap. Each tile in the map represents a platform/medium. The tile’s size shows the relative UPV: a bigger tile means that source has a higher UPV. Similarly, the tile’s color shows the relative ATOP: a darker tile means a higher ATOP, and a lighter tile a lower one.
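The aggregation behind such a treemap can be sketched as below. The dashboard itself was built in Tableau; this Python version only illustrates the two quantities each tile encodes. The field names, the sample rows and the choice of a plain (unweighted) mean for ATOP are assumptions.

```python
from collections import defaultdict

def summarise_by_source(rows):
    """Per source/medium: total UPV (tile size) and mean ATOP (tile color)."""
    upv_total = defaultdict(int)
    atop_values = defaultdict(list)
    for row in rows:
        upv_total[row["source"]] += row["upv"]
        atop_values[row["source"]].append(row["atop"])
    return {
        source: {
            "upv": upv_total[source],
            # Assumption: plain mean across articles, not weighted by views.
            "atop": sum(values) / len(values),
        }
        for source, values in atop_values.items()
    }

# Made-up rows; ATOP is in seconds.
rows = [
    {"source": "google / organic", "upv": 1200, "atop": 95.0},
    {"source": "google / organic", "upv": 800, "atop": 105.0},
    {"source": "facebook / paid", "upv": 500, "atop": 40.0},
]
tiles = summarise_by_source(rows)
```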

Skyscanner also wants to find out more about the organic growth of its non-paid articles and, in terms of paid content, which platforms yield better UPV or ATOP, so that it can focus its resources. We therefore added a filter to separate organic from paid articles.

SkyTrek source dashboard filter.png

To provide a drill-down view of the performance of each article within a particular platform, the right-hand side of the dashboard shows the readings for the articles on whichever platform is selected in the treemap. The results can be sorted to view, for example, the top 10 articles with the highest UPV found through Google search, or the top 10 articles with the most engagement from Facebook. Users can also switch between measures (UPV, ATOP, bounce rate, exit rate, etc.) to evaluate performance or explore different perspectives.
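The drill-down logic, including the paid/organic filter, amounts to a filter followed by a sort. A rough Python equivalent is shown here; the function name, the field names and the sample rows are hypothetical, and the real interaction is handled by Tableau.

```python
def top_articles(rows, source, measure="upv", n=10, paid=None, descending=True):
    """Top-n articles for one source, ranked by the chosen measure.

    paid=None keeps all articles; True or False keeps paid or organic only.
    Set descending=False for measures where lower is better (e.g. bounce rate).
    """
    selected = [
        row for row in rows
        if row["source"] == source and (paid is None or row["paid"] == paid)
    ]
    return sorted(selected, key=lambda row: row[measure], reverse=descending)[:n]

# Made-up rows; ATOP is in seconds.
rows = [
    {"title": "A", "source": "google", "paid": False, "upv": 900, "atop": 80.0},
    {"title": "B", "source": "google", "paid": False, "upv": 1500, "atop": 60.0},
    {"title": "C", "source": "facebook", "paid": True, "upv": 700, "atop": 35.0},
    {"title": "D", "source": "google", "paid": True, "upv": 400, "atop": 120.0},
]
top_by_upv = top_articles(rows, "google", measure="upv", n=2)
organic_by_atop = top_articles(rows, "google", measure="atop", paid=False)
```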

SkyTrek source dashboard measure.png

Content Theme Visualization

Beyond discovering which cluster an article belongs to, it is interesting to investigate how the clusters perform against one another across multiple measures, so that Skyscanner can improve its planning process and allocate sufficient resources to the most appropriate content themes.

SkyTrek ct dashboard.png


As mentioned above, we have identified 9 distinct content themes and one unclassifiable theme. However, some articles could not be classified under a single theme and were therefore assigned two, or even three, themes. Using the filter on the right-hand side, the Skyscanner team can easily select a subset of CT classifications and compare a given measure between them, a comparison that may be hard to make when all classifications are selected.

SkyTrek ct dashboard filter.png

We also included the ability to switch between measures, such as UPV, ATOP, bounce rate, and exit rate, so that the Skyscanner team can dive deeper into the characteristics of each cluster. Certain clusters may drive high UPV but low ATOP, and vice versa.

SkyTrek ct dashboard measure.png