AY1516 T2 Team SkyTrek Analysis

Content Theme Analysis

We had previously highlighted the 7 Content Themes (CT) Skyscanner believes its articles belong to. The aim of this analysis is threefold: to validate whether these 7 CT are representative of the article content being written, to identify the top 3 CT with the greatest yield, and to understand the performance of each CT. As mentioned earlier, we measure yield and performance by the metrics UPV and ATOP.

It would not have been possible to read every single article in order to identify the various CT and verify our client's list of CT. We therefore employed the K-means clustering algorithm to identify the latent groups of CT within our dataset.

Preparing the Dataset

Our database contains the HTML for each of the 399 articles hosted on Skyscanner Singapore's travel news site. RapidMiner was used to clean this data. HTML tags were removed, leaving only the article content. The content was then tokenized, transformed to lowercase, filtered for stop words from the English dictionary, and filtered for tokens with character lengths between 3 and 41. Following this, a TF-IDF matrix was generated for every token in each article. TF-IDF was used because it accentuates the value of rare words in distinguishing one article from another, supporting our goal of discovering the latent CT.
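This preparation was done entirely in RapidMiner; purely as an illustration, the sketch below shows an equivalent pipeline in Python, assuming scikit-learn is available and that raw_html is a hypothetical list holding one HTML string per article.

  import re
  from sklearn.feature_extraction.text import TfidfVectorizer

  # Hypothetical input: one raw HTML string per article.
  raw_html = ["<html><body><p>A sample travel article about Tokyo...</p></body></html>"]

  def strip_tags(html):
      # Drop HTML tags, keeping only the article text.
      return re.sub(r"<[^>]+>", " ", html)

  docs = [strip_tags(h) for h in raw_html]

  # Lowercase, tokenize, drop English stop words, keep tokens of 3-41 characters,
  # then weight every remaining token by TF-IDF.
  vectorizer = TfidfVectorizer(lowercase=True,
                               stop_words="english",
                               token_pattern=r"(?u)\b\w{3,41}\b")
  tfidf = vectorizer.fit_transform(docs)   # articles x terms matrix
  print(tfidf.shape)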

Applying K-Means Clustering and Discovering more CT

In clustering, we seek to reduce the intra-cluster distance while maximizing the inter-cluster distance. The Davies-Bouldin index (DB) captures this trade-off, with lower values being better. We see a general improvement in DB as the number of clusters (K) increases; from 70 clusters onward, this improvement starts to taper off significantly. This analysis is based on the article text content. A similar analysis was done on the article titles, where K = 20 was found to be a good value.
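As a rough illustration of this sweep (not our actual RapidMiner process), the sketch below runs K-means for a range of K on a stand-in feature matrix and reports the Davies-Bouldin index for each, assuming scikit-learn.

  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.metrics import davies_bouldin_score

  # Stand-in for the 399-article TF-IDF matrix produced above.
  X = np.random.RandomState(42).rand(399, 200)

  # Sweep the number of clusters K and record the Davies-Bouldin index
  # (lower is better) to see where the improvement tapers off.
  for k in (10, 30, 50, 70, 90):
      labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
      print(k, round(davies_bouldin_score(X, labels), 3))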

Validating the Representativeness of the 7 Identified CT

In the course of doing so, we found all 7 CT to be represented in the articles. However, we felt it appropriate to create 2 new CT, 'Activity/topic discussion' and 'Food', since they represented 17% and 8% of articles respectively. The following pie chart shows the proportion of each CT.
[Figure: Proportion of articles in each CT]
City Guides, Inspirational and Activity/topic discussion are the top 3 most represented CT at the moment.


There was a discrepancy in the CT allocation between the article text and article title results. Upon reviewing the series of decisions a reader makes when deciding whether to invest time in reading an article, we found article titles to have the greater influence: if the title does not entice the reader to read the article, there would be no metrics to gather. Hence, we base the ensuing recommendations on the article title clustering.
[Figure: Skytrek CT document counts]

Identifying the Top 3 Performing CT

Under UPV, City Guides, Trending and Food rank highest, in descending order. Under ATOP, Inspirational, Food and Trending rank highest, in descending order. It is interesting to note that only Food made it into the top 3 across both metrics. The client has expressed the opinion that strong performance in both fields is a good indicator of high-quality traffic, since such an article both attracts high viewership and sustains viewer interest. Thus, Skyscanner might choose to focus its efforts on writing food articles.

Understanding Each CT's Performance (Z-score)

We used the z-score to evaluate the relative performance of each CT against one another, represented in the figure below. Taking the mean as the baseline for comparison, it is interesting to note that a CT that fares well in one metric tends to do badly in the other; 'Food', 'Product', 'Trending', 'Domestic/Local', 'City Guides' and 'Inspirational' are such CT. This negative correlation between UPV and ATOP is also reflected in our correlation analysis.
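A minimal pandas sketch of this standardization, using made-up per-article values and a hypothetical 'ct' column, shows how each CT's mean UPV and ATOP are converted to z-scores against the average CT:

  import pandas as pd

  # Made-up per-article data: assigned content theme plus the two metrics.
  df = pd.DataFrame({
      "ct":   ["Food", "Food", "City Guides", "Trending", "Product", "Product"],
      "upv":  [1200, 900, 1500, 800, 300, 450],
      "atop": [210, 260, 90, 120, 60, 75],
  })

  # Mean metric value per content theme, standardized against the mean and
  # standard deviation across CTs, so each z-score shows how far a theme
  # sits above or below the average CT.
  ct_means = df.groupby("ct")[["upv", "atop"]].mean()
  z_scores = (ct_means - ct_means.mean()) / ct_means.std()
  print(z_scores.round(2))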

We would recommend that Skyscanner direct more resources to the Food CT in view of its strong metric performance. Conversely, we would recommend avoiding the Product CT in view of its weak performance. However, if the Product CT is purposely retained for product brand awareness, another CT to avoid would be Inspirational Topics.

[Figure: Skytrek CT z-scores]

Article Attributes Analysis

The aim of this analysis is to understand the performance of news article attributes with respect to organic viewership, using logistic regression.

Splitting Data

One important consideration when building a predictive model such as a logistic regression model is to split your data into training and test sets so that the model can be evaluated. RapidMiner provides an operator to do this splitting based on user input. The two main decisions to be made are:

  1. Partition Ratio
  2. Sampling Type

Partition Ratio

The user, after analyzing the size and nature of the dataset, can decide the partition ratio. There are two competing concerns: with less training data, your parameter estimates have greater variance; with less testing data, your performance statistic will have greater variance. Broadly speaking, you should divide the data such that neither variance is too high, which has more to do with the absolute number of instances in each partition than with the percentage.
If you have a total of 100 instances, you are probably stuck with cross-validation, as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 split or a 90:10 split (indeed, you may choose to use less training data if your method is particularly computationally intensive). One good practice is to try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times and observe performance on the validation data, then do the same with 40%, 60% and 80%. You should see both greater performance with more data and lower variance across the different random samples. This is the approach we have considered for our dataset.
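A rough Python sketch of such a run series, using scikit-learn and synthetic stand-in data rather than our actual article attributes, could look like this:

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import accuracy_score
  from sklearn.model_selection import train_test_split

  # Synthetic stand-in for the article attributes and a Success/Failure label.
  X, y = make_classification(n_samples=399, n_features=10, random_state=0)
  X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=0)

  # For each training fraction, draw 10 random samples and watch how the mean
  # validation accuracy and its spread change as more data is used.
  for frac in (0.2, 0.4, 0.6, 0.8):
      accs = []
      for seed in range(10):
          X_sub, _, y_sub, _ = train_test_split(X_train, y_train,
                                                train_size=frac, random_state=seed)
          model = LogisticRegression(max_iter=1000).fit(X_sub, y_sub)
          accs.append(accuracy_score(y_val, model.predict(X_val)))
      print(f"{frac:.0%} of training data: mean={np.mean(accs):.3f} std={np.std(accs):.3f}")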

Sampling Type

RapidMiner offers three different options for sampling the data when dividing it into partitions. Each has its own pros and cons depending on the nature of the dataset and the goal of the analysis:

  • Linear sampling simply divides the data into partitions without changing the order of the examples, i.e. subsets with consecutive examples are created.
  • Shuffled sampling builds random subsets of the data; examples are chosen randomly for each subset.
  • Stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole dataset. For example, in the case of a binominal classification, stratified sampling builds random subsets such that each subset contains roughly the same proportions of the two class label values.
We used stratified sampling, with equal proportions of the values of the target variable (Unique Page Views) placed in both the test and training data. This ensured that both the test and training sets contain the same proportion of 'Success' and 'Failure' values.
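Outside RapidMiner, an equivalent stratified split can be sketched with scikit-learn; the 'success' column below is a hypothetical binary label derived from Unique Page Views:

  import pandas as pd
  from sklearn.model_selection import train_test_split

  # Hypothetical article attributes with a binary 'success' label derived from UPV.
  df = pd.DataFrame({
      "total_words": [800, 1500, 600, 2200, 1100, 950, 1800, 700],
      "images":      [3, 10, 1, 7, 5, 2, 8, 4],
      "success":     [1, 1, 0, 0, 1, 0, 1, 0],
  })

  # Stratified 70:30 split: both partitions keep the same Success/Failure ratio.
  train_df, test_df = train_test_split(df, test_size=0.3,
                                       stratify=df["success"], random_state=42)
  print(train_df["success"].value_counts(normalize=True))
  print(test_df["success"].value_counts(normalize=True))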

Converting Nominal Attributes to numerical

The Nominal to Numerical operator is used to change the type of non-numeric attributes to a numeric type. In logistic regression, this is required to convert categorical attributes into dummy variables. In general, for any attribute with k values there will be k-1 dummy variables. The operator provides several options for recoding nominal variables into numerical ones. The options provided in RapidMiner are:

  • unique_integers: If this option is selected, the values of nominal attributes can be seen as equally ranked, so the nominal attribute is simply turned into a real-valued attribute; the old values result in equidistant real values.
  • dummy_coding: If this option is selected, a new attribute is created for every value of the nominal attribute except the comparison group. The comparison group can be defined using the 'comparison groups' parameter. In every example, the new attribute corresponding to the actual nominal value of that example gets value 1 and all other new attributes get value 0. If the value of the nominal attribute of the example corresponds to the comparison group, all new attributes are set to 0. Note that the comparison group is an optional parameter with dummy coding. If no comparison group is defined, the new attribute corresponding to the actual nominal value of each example still gets value 1 and all other new attributes get value 0; in this case, there will be no example where all new attributes are 0.
  • effect_coding: If this option is selected, a new attribute is created for every value of the nominal attribute except the comparison group. The comparison group can be defined using the 'comparison groups' parameter. In every example, the new attribute corresponding to the actual nominal value of that example gets value 1 and all other new attributes get value 0. If the value of the nominal attribute of the example corresponds to the comparison group, all new attributes are set to -1.

For the purposes of our analysis, we select dummy coding, as it is the required input for the logistic regression. This encodes each categorical value into a column with a binary value indicating the presence or absence of the given category.
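The effect of dummy coding can be illustrated with pandas, where drop_first plays the role of the comparison group; the 'content_theme' attribute below is purely illustrative:

  import pandas as pd

  # Purely illustrative nominal attribute.
  df = pd.DataFrame({"content_theme": ["Food", "City Guides", "Trending", "Food"]})

  # k nominal values become k-1 binary columns; the dropped first level acts as
  # the comparison group (all dummy columns are 0 for that value).
  dummies = pd.get_dummies(df["content_theme"], prefix="ct", drop_first=True)
  print(dummies)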

Installing the WEKA Extension in RapidMiner

As the goal of our analysis is to better understand the effects of the different attributes on the dependent variable, we want to be able to interpret the coefficients and odds ratios of the regression. The default logistic regression operator in RapidMiner does not provide the odds ratios and probability outputs of the regression, so we install the WEKA extension for RapidMiner in order to run a traditional logistic regression similar to what is found in SAS Enterprise Miner.
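The odds ratios themselves are simply the exponentiated regression coefficients. As an illustration (a statsmodels stand-in on synthetic data, not the WEKA operator we actually used), the sketch below fits a logistic regression and reports the odds ratios:

  import numpy as np
  import pandas as pd
  import statsmodels.api as sm
  from sklearn.datasets import make_classification

  # Synthetic stand-in for two numeric article attributes and a success label.
  X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                             n_redundant=0, flip_y=0.05, random_state=0)
  features = pd.DataFrame(X, columns=["facebook_shares", "total_words"])

  model = sm.Logit(y, sm.add_constant(features)).fit(disp=0)

  # Odds ratio = exp(coefficient): the multiplicative change in the odds of
  # 'Success' for a one-unit increase in the attribute.
  print(np.exp(model.params))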

[Figure: Skytrek LR WEKA operator]

Discretization of continuous variables

While the odds ratios in the model output do address the need for attribute-value-specific coefficients, there is still a problem with our numerical attributes: their odds ratios are not very interpretable because of their incremental, per-unit nature. In making recommendations based on article length, it would not be meaningful to say that an article with one more word would increase the success rate by 1 percent; with no upper or lower bound, the client would be inclined to create ever-lengthier articles. Rather, recommending that articles of 900-1200 words fare better than articles of 2000-2500 words would be more meaningful. To do this, we need to divide each numerical variable into ranges so that we can compare the performance of each range and provide a recommendation. We therefore consider the process of discretization.

We chose discretization by frequency, as it gives us the highest model accuracy and the lowest squared error, and it is also a good fit for our overall business problem. We used 3 partitions per numerical variable, with the rationale that each partition should contain well over 100 data points. After running the discretization operator we get our output in the form of ranges. We then plotted the average Unique Page Views of each partition for a given independent variable; these values have been visualized in Tableau and can be found in the image below.
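Equal-frequency binning into three ranges, followed by the average UPV per range, can be sketched in pandas; the column names and values below are hypothetical:

  import numpy as np
  import pandas as pd

  # Hypothetical per-article values: word count and unique page views.
  rng = np.random.default_rng(1)
  df = pd.DataFrame({
      "total_words": rng.integers(300, 3000, size=399),
      "unique_page_views": rng.integers(50, 5000, size=399),
  })

  # Discretize by frequency: 3 ranges with roughly 133 articles each.
  df["words_range"] = pd.qcut(df["total_words"], q=3)

  # Average Unique Page Views per range, as plotted for each attribute below.
  print(df.groupby("words_range", observed=True)["unique_page_views"].mean())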

[Figure: Skytrek LR binning results]

RapidMiner model

[Figure: Skytrek LR RapidMiner process]

Result

[Figure: Skytrek LR results]

Software Tools Assessment

We considered the pros and cons of 3 data analysis tools, namely RapidMiner, SAS Enterprise Miner (EM) and R. While EM offers greater visualization capabilities, these are not necessary for our analysis. Considering that we only had 399 articles to deal with, each K-means clustering run (our most expensive analysis) averaged a mere 7 minutes. Hence, it would not be reasonable to expect the client to invest in a license.

On top of being free, open-source software, RapidMiner has a wide selection of operators available for immediate and relevant use in our project; more specifically, they can be used for the ETL process in preparation for K-means clustering. RapidMiner is capable of accomplishing our project objectives with a lower learning curve and at no monetary cost, and was hence selected as our tool of choice.

RapidMiner

  Advantages:
  • Fully sufficient ecosystem for the complete ETL process.
  • The drawing board is editable during runtime, so you can prepare a new experiment while RapidMiner is still performing calculations on the last one.
  • Open-source software with many plugin options for extending the feature base.

  Disadvantages:
  • It is too easy to set up flows/operators that take an eternity to calculate, leading to long-running loops.
  • The whole flow has to be rerun even for a small refinement.
  • You can't stop a running node; you can only tell RapidMiner to prevent execution of subsequent nodes, so forced application restarts are common.

SAS Enterprise Miner

  Advantages:
  • Nice interactive visualizations: whenever you click on a label, the corresponding data in the chart gets highlighted.
  • Selective execution is a great feature that accelerates development, because a mistake at the end of the flow does not force you to recalculate everything from scratch.

  Disadvantages:
  • Uninformative error messages; you often have to blindly test many different things until you find the root of the problem.
  • Commercial software, hence a high cost of setup and service.

R

  Advantages:
  • Rich functionality, packages and development tools, especially for statistical analysis.
  • Flexible and easy to extend, with unmatched charting capabilities.
  • Open-source software.

  Disadvantages:
  • Issues related to security, speed, efficiency and memory management, since it emanates from languages built in the 1960s.
  • Steep learning curve and disorganized documentation.