AY1516 T2 Team SkyTrek Analysis

From Analytics Practicum

Latest revision as of 22:19, 17 April 2016


Content Theme Analysis

We previously highlighted the 7 Content Themes (CT) that Skyscanner believes its articles belong to. The aim of this analysis is threefold: to validate whether these 7 CT are representative of the article content being written, to identify the top 3 CT with the greatest yield, and to understand the performance of each CT. As mentioned earlier, we measure yield and performance by the metrics UPV and ATOP.

It was not feasible to read every single article in order to identify the various CT and thereby verify our client's list of CT. Hence, we employed the K-means clustering algorithm to identify the latent groups of CT within our dataset.

Preparing the Dataset

Our database contains the HTML for each of the 399 articles hosted on Skyscanner Singapore’s travel news site. RapidMiner was used to clean this data. HTML tags were removed, leaving only the article content. The content was then tokenized, transformed to lowercase, filtered for English stop words, and then filtered to keep only tokens with a character length between 3 and 41. Following this, a tf-idf matrix was generated with a weight for every token in each article. Tf-idf was used because it accentuates the value of rare words in distinguishing one article from another, thereby supporting our goal of discovering the latent CT.
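
The cleaning steps above can be sketched in plain Python. This is only an illustration, not the RapidMiner process we ran: the stop-word list is a tiny hypothetical stand-in for RapidMiner's English dictionary, the three sample articles are made up, and the weighting shown (term frequency times log of inverse document frequency) is one common tf-idf variant that may differ from RapidMiner's exact normalization.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "and", "for", "with", "are", "this", "that"}

def tokenize(html):
    """Strip HTML tags, lowercase, and keep tokens of 3 to 41 characters."""
    text = re.sub(r"<[^>]+>", " ", html)           # drop HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())   # split into letter runs
    return [t for t in tokens
            if t not in STOP_WORDS and 3 <= len(t) <= 41]

def tf_idf(documents):
    """Return one {token: weight} dict per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        rows.append({t: (c / total) * math.log(n_docs / doc_freq[t])
                     for t, c in counts.items()})
    return rows

# Made-up sample articles, not taken from the project database.
articles = [
    "<p>The best street food in Bangkok</p>",
    "<p>The best beaches in Bali</p>",
    "<p>Street food and beaches in Penang</p>",
]
weights = tf_idf(articles)
```

In this toy input, "bangkok" appears in only one article and so receives a higher weight in the first article than "best", which appears in two.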

Applying K-Means Clustering and Discovering more CT

In clustering, we seek to reduce the intra-cluster distance while maximizing the inter-cluster distance. The Davies-Bouldin Index (DB) captures this information, with lower values being better. We see a general improvement in DB as the number of clusters (K) increases. From 70 clusters onward, this improvement tapers off significantly. This analysis is based on the article text content. A similar analysis was done on the article titles, and K = 20 was found to be a good value.
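
To make the index concrete, the following is a minimal sketch of the standard Davies-Bouldin formula in plain Python, assuming Euclidean distance and clusters that are already assigned. The two toy clusterings are made up for illustration, and the function names are our own rather than RapidMiner operators.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of vectors).

    Lower is better: tight clusters (small scatter) that sit far apart.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter: average distance from each member to its cluster centroid.
    scatter = [sum(math.dist(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    k = len(clusters)
    worst = []
    for i in range(k):
        # For cluster i, find the most similar (worst separated) other cluster.
        ratios = [(scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                  for j in range(k) if j != i]
        worst.append(max(ratios))
    return sum(worst) / k

# Toy data: compact, well separated clusters versus spread-out, close ones.
tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
loose = [[(0.0, 0.0), (0.0, 4.0)], [(3.0, 0.0), (3.0, 4.0)]]
db_tight = davies_bouldin(tight)
db_loose = davies_bouldin(loose)
```

The compact, well separated clustering scores lower than the loose one, which is the behaviour we look for when comparing different values of K.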

Validating the Representativeness of the 7 Identified CT

In the course of doing so, we found all 7 CT to be represented in the articles. However, we felt it appropriate to create 2 new CT, ‘Activity/topic discussion’ and ‘Food’, since they represented 17% and 8% of articles respectively. The following pie chart shows the proportion of each CT.
Proportion of articles in each ct.png
City Guides, Inspirational and Activity/topic discussion are the top 3 most represented CT at the moment.


There was a discrepancy in the CT allocation between the results of the article text analysis and the article title analysis. Upon reviewing the series of decisions a reader makes in determining whether to invest time in reading an article, we found article titles to have the greater influence. If the title does not entice the reader to read the article, there would be no metrics to gather. Hence, we base the ensuing recommendations on the article title clustering.
Skytrek CT DocCount.png

Identifying the Top Performing 3 CT

Under UPV, City Guides, Trending and Food are ranked highest, in descending order. Under ATOP, Inspirational, Food and Trending are ranked highest, in descending order. It is interesting to note that only Food made it into the top 3 across both metrics. The client has expressed her opinion that strong performance on both metrics would be a good indicator of high-quality traffic, since such content both attracts high viewership and sustains viewer interest in reading the article. Thus, Skyscanner might choose to focus its efforts on writing food articles.

Understanding each CT Performance (Z score)

We used the z-score to evaluate the relative performance of each CT against one another, represented in Figure 4 below. Taking the mean as the baseline for comparison, it is interesting to note that a CT that fares well in one metric tends to do badly in the other. ‘Food’, ‘Product’, ‘Trending’, ‘Domestic/Local’, ‘City Guides’ and ‘Inspirational’ are such CT. This negative correlation between UPV and ATOP is also represented in the correlation analysis done in Figure 5 below.

We would recommend that Skyscanner direct more resources to the Food CT in view of its strong metric performance. Conversely, we would recommend avoiding the Product CT in view of its weak performance. However, if Product articles are written purposefully for product brand awareness, another CT to avoid would be Inspirational Topics.
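
As a reminder of how a z-score places each CT relative to the mean, here is a minimal sketch. It assumes the population standard deviation (our write-up does not state which variant was used), and the theme names and numbers are made up rather than project data.

```python
import statistics

def z_scores(values):
    """Standardize values: (x - mean) / population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Made-up metric values for three hypothetical themes.
upv_by_theme = {"Theme A": 120.0, "Theme B": 80.0, "Theme C": 100.0}
scores = dict(zip(upv_by_theme, z_scores(list(upv_by_theme.values()))))
```

A theme sitting exactly at the mean scores 0, themes above the mean score positive, and themes below it score negative, which is why the mean works as the baseline for comparison.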

Skytrek CT zscore.png

Article Attributes Analysis

The aim of this analysis is to understand, via logistic regression, how news article attributes perform in terms of organic viewership.

Splitting Data

One important consideration while building a predictive model such as a logistic regression model is to split your data into train and test data so that you can evaluate the model. RapidMiner provides a node to do this splitting based on the user input. The two main decisions to be made are:

  1. Partition Ratio
  2. Sampling Type

Partition Ratio

The user, after analyzing the size and nature of the dataset, can decide the partition ratio. There are two competing concerns: with less training data, your parameter estimates have greater variance; with less testing data, your performance statistic will have greater variance. Broadly speaking, you should divide the data such that neither variance is too high, which has more to do with the absolute number of instances in each partition than with the percentage.
If you have a total of 100 instances, you are probably stuck with cross validation, as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it does not really matter whether you choose an 80:20 split or a 90:10 split (indeed, you may choose to use less training data if your method is particularly computationally intensive). One good practice is to try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times and observe performance on the validation data, then do the same with 40%, 60% and 80%. You should see both greater performance with more data and lower variance across the different random samples. This is the approach we have considered for our dataset.

Sampling Type

RapidMiner offers three different options for sampling of the data when dividing into partitions. Each one has its own pros and cons depending on the nature of the dataset and goal of the analysis. Linear sampling simply divides the data into partitions without changing the order of the examples i.e. subsets with consecutive examples are created. Shuffled sampling builds random subsets of the data. Examples are chosen randomly for making subsets.
Stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole dataset. For example in the case of a binominal classification, Stratified sampling builds random subsets such that each subset contains roughly the same proportions of the two values of the class labels.
We proceeded to use stratified sampling so that the values of the target variable, Unique Page Views, appear in equal proportions in both the test and train data. This ensured that both the test and train sets contain the same proportion of ‘Success’ and ‘Failure’ values.
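
The idea behind stratified sampling can be sketched as follows. This is an illustration in plain Python, not RapidMiner's implementation; the function name, the 70:30 ratio, the seed and the toy rows are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, train_ratio=0.7, seed=42):
    """Split rows into train/test while preserving each label's share."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)                      # random order within a class
        cut = round(len(group) * train_ratio)   # same ratio for every class
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy data: 30 'Success' and 70 'Failure' rows (made-up proportions).
rows = [("a%d" % i, "Success") for i in range(30)] + \
       [("b%d" % i, "Failure") for i in range(70)]
train, test = stratified_split(rows, label_of=lambda r: r[1])
```

Because each class is split separately with the same ratio, 30% of the rows in both the train and the test partition carry the 'Success' label, matching the full dataset.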

Converting Nominal Attributes to numerical

The Nominal to Numerical operator is used for changing the type of non-numeric attributes to a numeric type. In logistic regression, this is required to convert the categorical attributes into dummy values. In general, for any attribute with ‘k’ values, there will be ‘k-1’ dummy variables. This operator provides several options for recoding nominal attributes as numerical ones. The options provided in RapidMiner are listed below:

  • unique_integers: If this option is selected, the values of nominal attributes can be seen as equally ranked; the nominal attribute is therefore simply turned into a real-valued attribute, and the old values result in equidistant real values.
  • dummy_coding: If this option is selected, a new attribute is created for every value of the nominal attribute, excluding the comparison group. The comparison group can be defined using the ‘comparison groups’ parameter. In every example, the new attribute that corresponds to the actual nominal value of that example gets value 1 and all other new attributes get value 0. If the value of the nominal attribute of the example corresponds to the comparison group, all new attributes are set to 0. Note that the comparison group is an optional parameter with dummy coding. If no comparison group is defined, a new attribute is created for every value, so there will be no example where all new attributes get value 0.
  • effect_coding: If this option is selected, a new attribute is created for every value of the nominal attribute, excluding the comparison group. The comparison group can be defined using the ‘comparison groups’ parameter. In every example, the new attribute that corresponds to the actual nominal value of that example gets value 1 and all other new attributes get value 0. If the value of the nominal attribute of the example corresponds to the comparison group, all new attributes are set to -1.

For the purposes of our analysis, we selected dummy coding, as the logistic regression requires it as input. This turns every category value into a column with a binary value, indicating the presence or absence of the given category.
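
The difference between dummy coding and effect coding described above can be shown with a small sketch. The `encode` helper and the category values are hypothetical and only mimic the behaviour described for the RapidMiner operator.

```python
def encode(value, categories, comparison_group=None, effect=False):
    """Encode one nominal value as a list of 0/1 (or -1) indicators.

    One column is created per category except the comparison group.
    """
    columns = [c for c in categories if c != comparison_group]
    if value == comparison_group:
        # Comparison group: all 0 (dummy coding) or all -1 (effect coding).
        return [-1 if effect else 0] * len(columns)
    return [1 if c == value else 0 for c in columns]

themes = ["Food", "Product", "Trending"]   # illustrative category values
dummy = encode("Product", themes, comparison_group="Food")            # [1, 0]
baseline = encode("Food", themes, comparison_group="Food")            # [0, 0]
effect = encode("Food", themes, comparison_group="Food", effect=True) # [-1, -1]
no_group = encode("Food", themes)                                     # [1, 0, 0]
```

With a comparison group, 3 category values produce 2 columns (the ‘k-1’ rule); without one, every value gets its own column and no row is all zeros.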

Installing the WEKA Extension in RapidMiner

As the goal of our analysis is to better understand the effects of the different attributes on the dependent variable, we want to be able to interpret the coefficients and odds ratios of the regression. The default logistic regression operator in RapidMiner does not provide the odds ratios and probability outputs of the regression. Hence, we installed the WEKA extension for RapidMiner in order to run a traditional logistic regression similar to what is found in SAS Enterprise Miner.
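
For reference, the relationship between a logistic regression coefficient, its odds ratio and the predicted probability can be sketched as below. The function names are our own and the coefficient values are made up; they are not outputs of our model.

```python
import math

def odds_ratio(coefficient):
    """Odds ratio of an attribute: e raised to its regression coefficient."""
    return math.exp(coefficient)

def probability(intercept, coefficients, features):
    """Predicted probability of 'Success' from the logistic function."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Made-up coefficients for two dummy-coded attributes.
coefs = [0.7, -0.4]
ratios = [odds_ratio(b) for b in coefs]
p_baseline = probability(0.0, coefs, [0, 0])   # both dummies absent
p_first = probability(0.0, coefs, [1, 0])      # first attribute present
```

An odds ratio above 1 (positive coefficient) means the attribute raises the odds of success relative to the comparison group, while a ratio below 1 lowers them.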

Skytrek LR weka.png

Discretization of continuous variables

While the original model results do solve the issue of having attribute-value-specific coefficients in the form of odds ratios, there is still a problem with our numerical attributes. Because of their incremental nature, the odds ratios of the numerical attributes do not provide us with insights that can be interpreted meaningfully. In making recommendations based on article length, it would not be meaningful to say that articles with 1 more word would increase success rates by 1 percent. With no upper or lower bound, the client would be inclined to create lengthy articles. Rather, recommending that articles of 900-1200 words fare better than those of 2000-2500 words would be more meaningful. To do this, we need to divide each numerical variable into ranges so that we can compare the performance of each range and provide a recommendation. This process is known as discretization.

We chose Discretization by Frequency as it provided us with the highest model accuracy and the lowest squared error. It is also a good fit for our overall business problem. We used 3 partitions per numerical variable, with the rationale that each partition should have more than 100 data points. After running the discretization operator, we get our output in the form of ranges. We then plotted the average Unique Page Views of each partition for a given independent variable. These values have been visualized in Tableau and can be found in the image below.
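
Discretization by frequency (equal-frequency binning) can be sketched as follows. This is a simplified illustration with made-up word counts; unlike a production operator, it may place tied values in different bins.

```python
def discretize_by_frequency(values, n_bins=3):
    """Assign each value a bin index so bins hold (almost) equal counts."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Made-up article lengths in words.
word_counts = [350, 2400, 900, 1200, 600, 1800, 1500, 2100, 450]
bins = discretize_by_frequency(word_counts)
```

With 9 values and 3 bins, each bin receives exactly 3 articles: the shortest three fall in bin 0, the middle three in bin 1 and the longest three in bin 2.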

Skytrek LR binning.png

RapidMiner model

Skytrek LR rapidminer.png

Result

Skytrek LR result.png

Software Tools Assessment

We considered the pros and cons of 3 data analysis tools, namely RapidMiner, SAS Enterprise Miner (EM) and R. While EM offers greater visualization capabilities, these are not necessary for our analysis. Considering that we only had 399 articles to deal with, each K-means clustering run (the most expensive analysis) averages a mere 7 minutes. Hence, it would not be reasonable to expect the client to invest in a license.

Besides being free, open-source software, RapidMiner has a wide selection of operators available for immediate and relevant use in our project. More specifically, they can be used for the ETL process in preparation for K-means clustering. RapidMiner is capable of accomplishing our project objectives with a gentler learning curve and at no monetary cost, and was hence selected as our tool of choice.

Software: RapidMiner
Advantages:
  • Fully sufficient ecosystem for the complete ETL process.
  • The drawing board is editable during runtime. Hence you can prepare a new experiment while RapidMiner is still performing calculations on the last experiment.
  • Open source software with many plugin options for extending the feature base.
Disadvantages:
  • It is too easy to set up flows/operators that would take an eternity to calculate, leading to loops.
  • The whole flow has to be rerun even for a small refinement.
  • You can't stop a running node. You can only tell RapidMiner to prevent execution of subsequent nodes. Hence forced application restarts are common.

Software: SAS Enterprise Miner
Advantages:
  • Nice interactive visualizations. Whenever you click on a label, the corresponding data in the chart gets highlighted.
  • Selective execution in EM is a great feature, which accelerates development: if you make a mistake at the end of the flow, you don't have to recalculate everything from scratch.
Disadvantages:
  • Uninformative error messages. You often have to blindly test many different things until you find the root of the problem.
  • Commercial software, hence high cost of setup and service.

Software: R
Advantages:
  • Rich functionalities, packages and development tools, especially for statistical analysis.
  • Flexible and easy to extend, with unmatched charting capabilities.
  • Open source software.
Disadvantages:
  • Issues related to security, speed, efficiency and memory management, since its design descends from the S language of the 1970s.
  • Steep learning curve and disorganized documentation.