ISSS608 2016-17 T3 Group8 Arules Final Project

A Visual Application for Better Business Decision Making

Project Motivation

Association Rule Mining is Powerful
Association rule mining is usually applied in market basket analysis to mine the relationships between products, but it is a powerful technique that can be applied to almost any dataset to discover associations and correlations between variables. Given a target variable of interest, we can find the relevant association rules within a dataset even if it is not transaction data. Association rules are discussed further in our proposal.

However, not all datasets are ready for this kind of data mining in R. Continuous variables, for example, are not easy to handle, which is a barrier for users stepping into the world of data analytics. We intend to develop an association rule mining application that non-statistician users can easily understand, play with and draw insights from, and that brings them into data analytics through a fundamental yet powerful concept: probabilities.

Room for Improvement of Current Packages

The R packages currently available for association rule mining (ARM) are helpful for data analysts, but their limitations make the analysis results difficult to interpret:

1) Static Visualizations
Visualizations provided in current ARM packages are mostly static, so users cannot see how different settings change the rules without editing code. For example, it is hard to see how many rules are dropped if the support threshold is increased by 0.005.

2) Lack of Interactivity
There is little interactivity in the current ARM packages. Users are unable to select or zoom in on the visualizations.

3) Manual Calibration
Users can only calibrate the generated rules manually, by changing the thresholds for support, confidence and lift, or the antecedent and consequent.

R Packages Used

For interactive application: R Shiny and Shiny Dashboard

Shiny is an R package from RStudio for developing interactive charts, data visualizations and applications hosted on the web using the R programming language. It enables developers to build interactive applications that let users understand a model or explore data. In our case, we visualize the rules underlying a given dataset to show clearly how the items are associated with each other. Packages: ‘shiny’, ‘shinydashboard’

For mining association rules: arules family of packages

“arules” is the foundation on which we built this application. It enables users to apply association rule mining algorithms to transaction data or any other data that meets certain requirements. It is powerful at manipulating and transforming data, pruning redundant rules, and filtering the generated association rules. Users can filter the rules by setting thresholds for support, confidence and lift, or by antecedent and consequent, and sort the rules by support, confidence or lift. Packages: ‘arules’, ‘arulesViz’
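As a rough illustration of the workflow just described, the sketch below mines rules with arules, prunes redundant ones, filters by lift and consequent, and sorts the result. The thresholds and the choice of "whole milk" as the consequent are arbitrary assumptions for the example, not the settings used in the application.

```r
library(arules)

# "Groceries" ships with the arules package.
data("Groceries")

# Mine rules; the thresholds here are illustrative assumptions.
rules <- apriori(
  Groceries,
  parameter = list(supp = 0.01, conf = 0.3, minlen = 2, maxlen = 4),
  control = list(verbose = FALSE)
)

# Drop rules that are redundant given a more general rule.
rules <- rules[!is.redundant(rules)]

# Keep rules with lift above 2 whose consequent is "whole milk".
strong <- subset(rules, subset = lift > 2 & rhs %in% "whole milk")

# Sort by lift (descending by default) and keep at most five rules.
top <- head(sort(strong, by = "lift"), 5)
inspect(top)
```

The same pattern works with `by = "support"` or `by = "confidence"` in `sort()`.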

For interactive scatterplot: ggplot2 and plotly

For network visualization: visNetwork and igraph

For data preparation: sqldf (for SQL operations in R), dplyr and stringr

Choice of Visualizations and Critiques

This section discusses the visualizations chosen for our application and their usefulness. It also critiques the default visualizations provided by the arulesViz package to identify areas for improvement in our own visualization designs.

Discussion

Visualization

Benefits of scatterplots

Good for multivariate comparison of support and confidence, coloured by lift
Good as a statistics explorer for general market basket analysis (MBA): since all the itemsets in the transactions matter to the user, it helps to have an overview of the statistics first, before users decide which rules to investigate further or act upon.

The scatterplot on the left is generated by the arulesViz package. Its limitations are:

Dots are too clustered and overlap
Loses the information on the associations themselves
Manual calibration of the 3 interestingness statistics
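For readers new to the three interestingness statistics (support, confidence and lift), the following base R sketch computes them by hand for a single rule. The baskets are made-up toy data and the helper names `support` and `rule_stats` are ours, introduced only for illustration.

```r
# Toy transactions (hypothetical data, not from the project).
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("bread", "butter"),
  c("milk", "butter"),
  c("milk", "bread", "butter")
)

# Fraction of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Statistics for the rule lhs => rhs.
rule_stats <- function(lhs, rhs, baskets) {
  supp <- support(c(lhs, rhs), baskets)   # P(lhs and rhs)
  conf <- supp / support(lhs, baskets)    # P(rhs | lhs)
  lift <- conf / support(rhs, baskets)    # P(rhs | lhs) / P(rhs)
  c(support = supp, confidence = conf, lift = lift)
}

stats <- rule_stats("milk", "bread", baskets)
print(stats)
```

A lift below 1, as in this toy rule, means the consequent is slightly less likely when the antecedent is present than it is overall.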

Benefits of network diagrams

Clearly shows the direction of the relationship between the LHS items and RHS items (from and to)
Differentiates rules from itemsets, as rules and items are represented separately
Shows the interaction of rules, itemsets and the 3 statistics, which lets us see which rules are more important than others and which items are more or less popular

The network graph on the left is generated by the arulesViz package. Its limitations are:

Confusing
Little room for user interaction
Loses information on the 3 statistics (only one can be shown at a time, not all three together)
Manual configuration

Application Design in Detail

Click here to view our application dashboard designs at a glance

1. Load data

1) Choose File to Upload

Users can upload any dataset as long as it is in one of the following formats:

1. Market Basket Analysis

Single format: Col1 = Transaction ID, Col2 = Item Name (single)
Basket format: Col1 = Transaction ID, Col2 = Item Names (multiple)
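To make the difference between the two layouts concrete, here is a small base R sketch that turns single-format rows into basket-format rows. The column names `tid` and `item` and the data are hypothetical.

```r
# Hypothetical "single" format input: one row per (transaction, item) pair.
single <- data.frame(
  tid  = c(1, 1, 2, 3, 3, 3),
  item = c("milk", "bread", "milk", "bread", "butter", "milk"),
  stringsAsFactors = FALSE
)

# Group the items by transaction ID: one character vector per transaction.
baskets <- split(single$item, single$tid)

# Collapse each basket into one comma-separated string ("basket" format).
basket_format <- data.frame(
  tid   = names(baskets),
  items = vapply(baskets, paste, character(1), collapse = ","),
  stringsAsFactors = FALSE
)
print(basket_format)
```

The arules package also provides `read.transactions()`, whose `format` argument accepts `"single"` or `"basket"`, for reading either layout directly from a file.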

2. Targeted ARM

Any dataset that contains a target variable the user is interested in.

Users can specify whether the file contains a header and which separator it uses.

2) Import Data

Once the data file is uploaded and users are ready to mine association rules from it, they click the “Import data” button. The data is then imported into our server and saved as a data frame, which is used for all subsequent data transformation, analyses and visualizations.

This is done with the “eventReactive” function in R Shiny. To save a data frame that depends on the user’s uploaded file, we create a reactive data frame that is stored only when a specific event happens, which in this case is a click on the “Import data” button.
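A minimal sketch of this pattern is shown below. The input IDs (`file`, `header`, `sep`, `import`) and the preview table are assumptions made for the example; the application's actual code may differ.

```r
library(shiny)

ui <- fluidPage(
  fileInput("file", "Choose File to Upload"),
  checkboxInput("header", "File has a header row", TRUE),
  radioButtons("sep", "Separator",
               c(Comma = ",", Semicolon = ";", Tab = "\t")),
  actionButton("import", "Import data"),
  tableOutput("preview")
)

server <- function(input, output, session) {
  # Re-evaluated only when the "Import data" button is clicked,
  # not every time the file or the options change.
  imported <- eventReactive(input$import, {
    req(input$file)  # stop quietly if no file has been uploaded yet
    read.csv(input$file$datapath,
             header = input$header,
             sep = input$sep,
             stringsAsFactors = FALSE)
  })

  # Show the first rows of the imported data frame.
  output$preview <- renderTable(head(imported()))
}

# shinyApp(ui, server)  # uncomment to launch the app
```

Every downstream step would then call `imported()` to get the stored data frame.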

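3) Variable Transformation

Check binary:

# TRUE for columns whose non-missing values are all 0 or 1
binary_check <- vapply(HRdata, function(x) all(na.omit(x) %in% 0:1), logical(1))

Transform binary:

# recode 0/1 to FALSE/TRUE and store the result as factors
change_to_logical <- as.data.frame(lapply(HRdata[binary_check], function(x) as.factor(as.logical(x))))

Check numeric:

# assumed: df1 holds the remaining (non-binary) columns
df1 <- HRdata[!binary_check]
numeric_check <- sapply(df1, is.numeric)

Transform numeric:

# ntile() comes from dplyr; it splits each numeric column into 3 rank-based bins
library(dplyr)
change_to_factor <- as.data.frame(lapply(df1[numeric_check], function(x) factor(ntile(x, 3), levels = c(1, 2, 3), labels = c("low", "mid", "high"))))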

a) Transforming binary columns: columns containing only the numbers 1 or 0 (NA is allowed) are considered binary columns. Binary variables are recoded to “TRUE” or “FALSE”.

b) Transforming numeric columns: from the remaining columns, numeric columns are binned into 3 bins (“low”, “mid”, “high”) based on quantiles.

This step can be improved by allowing the user to choose the number of bins and the naming convention for each bin.

c) Categorical columns: unchanged. The user guide tells users not to use numbers to represent categorical information; otherwise those columns will be binned in step b).

d) Combining the transformed columns and the original categorical columns to form the new data frame for association rule mining.
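The improvement suggested under step b), letting the user choose the number of bins and their names, could look roughly like the base R helper below. `bin_numeric` is a hypothetical name, and its rank-based bucketing is similar in spirit to `ntile()` rather than a copy of it.

```r
# Assign each value to one of `n_bins` rank-based buckets of (nearly)
# equal size and label the buckets. NA values stay NA.
bin_numeric <- function(x, n_bins = 3, labels = c("low", "mid", "high")) {
  stopifnot(length(labels) == n_bins)
  n_valid <- sum(!is.na(x))
  ranks <- rank(x, ties.method = "first", na.last = "keep")
  bucket <- floor(n_bins * (ranks - 1) / n_valid) + 1
  factor(bucket, levels = seq_len(n_bins), labels = labels)
}

# Made-up monthly hours for nine employees.
hours <- c(120, 150, 160, 180, 200, 210, 250, 260, 300)

# Default: three bins named low / mid / high.
binned <- bin_numeric(hours)

# User-defined: four bins with custom names.
quartiles <- bin_numeric(hours, n_bins = 4,
                         labels = c("q1", "q2", "q3", "q4"))
print(table(binned))
```

Ranking with `ties.method = "first"` keeps the buckets balanced even when many values are tied, which cutting at quantile breakpoints does not guarantee.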

2. Scatterplot Visualizations

Design Concepts

Visualization

2.1 Overview Scatterplot with Box Selection

Once rules are generated, they are saved as a data frame on the server. Each row represents one rule and consists of the following variables: ruleID, Antecedent, Consequent, support, confidence, lift. The first visualization that users see is an overview scatterplot.

We plot all generated rules on a scatterplot where the X axis is support and the Y axis is confidence, coloured by lift. Ideally, we focus on the points in the upper-right corner of the plot in darker red, which indicate high support, high confidence and high lift. This scatterplot is static, but users can make a box selection on it and will then see: 1) a zoomed-in interactive scatterplot on the right-hand side for further investigating the rules they are more interested in; 2) a table of rule details for browsing all selected rules; 3) a rule network visualizing all selected rules and the relationships between the related items. To achieve this, we add one parameter to plotOutput called “brush”. The brush reads where users click, drag and release on the plot, shows the box selection on the plot, and stores the box’s value range on the X and Y axes.

The box’s value range thus serves as the constraints on support and confidence used to subset the rules for further visualization.
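The subsetting step can be sketched in base R as follows. The `brush` list mimics the `xmin`, `xmax`, `ymin` and `ymax` values that Shiny stores for a box selection; the rules table and the helper name `filter_by_brush` are hypothetical.

```r
# Hypothetical rules table with the columns described above.
rules_df <- data.frame(
  ruleID     = 1:5,
  support    = c(0.002, 0.010, 0.015, 0.030, 0.050),
  confidence = c(0.90, 0.40, 0.75, 0.20, 0.60),
  lift       = c(4.1, 1.2, 2.8, 0.9, 1.7)
)

# Keep only the rules whose support (X axis) and confidence (Y axis)
# fall inside the selected box.
filter_by_brush <- function(rules_df, brush) {
  keep <- rules_df$support >= brush$xmin & rules_df$support <= brush$xmax &
    rules_df$confidence >= brush$ymin & rules_df$confidence <= brush$ymax
  rules_df[keep, , drop = FALSE]
}

# A box covering support 0.005 to 0.04 and confidence 0.3 to 0.8.
brush <- list(xmin = 0.005, xmax = 0.04, ymin = 0.3, ymax = 0.8)
selected <- filter_by_brush(rules_df, brush)
print(selected)
```

Inside a Shiny server, `brushedPoints()` performs this kind of subsetting directly from the brush input.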

2.2 Zoomed-in Scatterplot

Our zoomed-in scatterplot is interactive, built using “ggplot2” wrapped in “plotly”.
Within the X and Y value range defined by the box selection, users can zoom in or out further on the zoomed-in scatterplot. More importantly, when the user hovers over any point, a tooltip pops up showing the support, confidence and lift of the rule together with its LHS and RHS content, which gives the user a clearer picture of what is inside the selected rules.
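One common way to get such a tooltip is to pass the hover text through a `text` aesthetic and let `ggplotly()` display it. The sketch below assumes a small made-up rules table; the colours and labels are illustrative choices, not the application's exact settings.

```r
library(ggplot2)
library(plotly)

# Hypothetical selected rules (values are made up for the example).
rules_df <- data.frame(
  lhs        = c("{yogurt}", "{butter}", "{curd}"),
  rhs        = c("{whole milk}", "{whole milk}", "{yogurt}"),
  support    = c(0.056, 0.028, 0.017),
  confidence = c(0.40, 0.50, 0.32),
  lift       = c(1.57, 1.95, 2.33)
)

# The `text` aesthetic is not drawn by ggplot2; plotly uses it as hover text.
p <- ggplot(
  rules_df,
  aes(x = support, y = confidence, colour = lift,
      text = paste0(lhs, " => ", rhs,
                    "<br>support: ", support,
                    "<br>confidence: ", confidence,
                    "<br>lift: ", lift))
) +
  geom_point(size = 3) +
  scale_colour_gradient(low = "grey80", high = "darkred")

# Convert to an interactive widget that shows only the custom text on hover.
widget <- ggplotly(p, tooltip = "text")
```

Zooming and panning come with the plotly widget, so no extra code is needed for them.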

2.3 Rule Details

We also provide a data table below the zoomed-in scatterplot, which gives users a full-screen view of the details of the selected rules. This is done with the “brushedPoints” function, which subsets data based on the brushed points.

3. Network Visualizations

The network visualizations were built using the “visNetwork” and “igraph” packages. The trick to creating a network visualization from the association rules is to first visualize the rules with the default network visualization provided by the “arulesViz” package (method = “graph”). arulesViz uses the “igraph” package to draw the rules as a network diagram. Next, the “get.data.frame” function from the igraph package is used to extract two data frames, one for the nodes and one for the edges. visNetwork can then be used to create the network visualizations. visNetwork adds customization to the network diagrams, such as highlighting the nearest nodes or letting users select certain nodes to zoom in on.

It should be highlighted that the vertices data frame created by the “get.data.frame” function has a small bug: the quality measures are not correctly matched to the rules, and the last rule always shows blank quality measures. This affects the visualization if we wish to reflect the interestingness of the rules in the network diagram, e.g. sizing the rule nodes by “lift”. We corrected the bug by realigning the data frames so that the quality measures line up with the individual rules. We also created additional columns to colour the rules and the items differently, to better differentiate them in the network diagrams.

The table below explains how to interpret the network visualizations.

Descriptions

Visualization

Grey nodes: individual rules

Size of the grey nodes: indicates the interestingness of the rule; the bigger the better. By default the size is set to “lift”, which can be changed to “support” or “confidence”.

Red nodes: individual items, linked to the rules by edges with arrows

When a rule is clicked, the items linked to that rule are highlighted and the rest are faded.
The arrows show the direction of influence.

LHS (antecedent) items have arrows pointing to the rule nodes, indicating that they are the influencing factors.

RHS (consequent) items have arrows pointing from the rule nodes, indicating that they are the outcome.
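The node and edge layout described above can also be built by hand. The sketch below is an alternative to the arulesViz and igraph route, written under our own assumptions: a made-up rules table whose `lhs` column holds comma-separated item names, and a hypothetical helper `rules_to_network`.

```r
# Hypothetical rules table; `lhs` holds comma-separated item names.
rules_df <- data.frame(
  ruleID = c("r1", "r2"),
  lhs    = c("yogurt,butter", "curd"),
  rhs    = c("whole milk", "yogurt"),
  lift   = c(2.5, 1.8),
  stringsAsFactors = FALSE
)

rules_to_network <- function(rules_df) {
  lhs_items <- strsplit(rules_df$lhs, ",", fixed = TRUE)
  items <- unique(c(unlist(lhs_items), rules_df$rhs))

  nodes <- rbind(
    # Grey rule nodes, sized by lift.
    data.frame(id = rules_df$ruleID, label = rules_df$ruleID,
               group = "rule", value = rules_df$lift, color = "grey"),
    # Red item nodes with a constant size.
    data.frame(id = items, label = items,
               group = "item", value = 1, color = "red")
  )

  edges <- rbind(
    # Antecedent item -> rule.
    data.frame(from = unlist(lhs_items),
               to = rep(rules_df$ruleID, lengths(lhs_items)),
               arrows = "to"),
    # Rule -> consequent item.
    data.frame(from = rules_df$ruleID, to = rules_df$rhs, arrows = "to")
  )

  list(nodes = nodes, edges = edges)
}

net <- rules_to_network(rules_df)

# Draw the network only if visNetwork is installed.
if (requireNamespace("visNetwork", quietly = TRUE)) {
  graph <- visNetwork::visOptions(
    visNetwork::visNetwork(net$nodes, net$edges),
    highlightNearest = TRUE,
    nodesIdSelection = TRUE
  )
}
```

Because the quality measures are attached to the rule nodes when the data frame is built, this layout avoids any later realignment between rules and measures.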

4. User Interactivity of the Application

1. Model calibration (available for both the generic MBA dashboard and the targeted ARM dashboard)

Select the “support” threshold with a slider
Select the “confidence” threshold with a slider
Select the number of items to show on the LHS (antecedent) with a slider. This defines the length of the rule.

The calibrations below are only available on the targeted ARM dashboard.

Filter the RHS (consequent) to display only the target of interest, by entering free text. This can be improved further by populating a list of the items in the user-defined consequent column.

Choose one quality measure to size the rule nodes by from an option list (default: lift)

Choose the number of rules to display with a slider
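In Shiny, controls like these map onto standard input widgets. The sketch below is one possible arrangement; the input IDs, ranges and default values are assumptions for illustration only.

```r
library(shiny)

# Hypothetical input IDs and ranges; adjust them to the data at hand.
controls <- tagList(
  sliderInput("support", "Minimum support",
              min = 0.001, max = 0.1, value = 0.002, step = 0.001),
  sliderInput("confidence", "Minimum confidence",
              min = 0, max = 1, value = 0.5, step = 0.05),
  sliderInput("max_lhs", "Number of LHS items",
              min = 1, max = 5, value = 3, step = 1),
  # Targeted ARM only: free-text filter on the consequent.
  textInput("rhs", "Target consequent (RHS)", value = "left=TRUE"),
  # Targeted ARM only: quality measure used to size the rule nodes.
  selectInput("size_by", "Size rule nodes by",
              choices = c("lift", "support", "confidence"),
              selected = "lift"),
  sliderInput("n_rules", "Number of rules to display",
              min = 1, max = 100, value = 15, step = 1)
)
```

Replacing the `textInput()` with a `selectInput()` filled from the levels of the consequent column would implement the improvement mentioned above.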

Use Cases

1. Market Basket Analysis

The data used was the default “Groceries” dataset contained in the arules package.
Thresholds: support > 0.002, confidence > 0.05, length of rule: 3, size of rule nodes by: lift

1. Select the rules with highest confidence and lowest support

Items that are likely to be purchased together but are not purchased often

The rule with highest lift is: {tropical fruits,whole milk,grapes} => {other vegetables}

2. Select the rules with highest support at varying confidence

Most popular combinations of items, but not always in the same combinations

The rule with highest lift is: {yoghurt,other vegetables,whole milk} => {tropical fruit}

3. Select the rules with highest lift

Most useful rule, but not a very popular choice (low support)

The rule with highest lift is: {yoghurt,root vegetables,bottled water} => {tropical fruit}

2. Targeted ARM Using HR Data – Why Did the Employees Leave?

The data used was the Human Resources Analytics dataset downloaded from Kaggle, on why employees left the company.
Thresholds: support > 0.001, confidence > 0.5, length of rule: 3, RHS: left=“TRUE” (showing only employees who have left)
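Restricting the consequent to a single target can be expressed with the `appearance` argument of `apriori()`. The sketch below uses a tiny made-up HR table and looser thresholds so that it runs quickly; it is not the Kaggle data or the application's exact call.

```r
library(arules)

# Tiny made-up HR table; every column is already a factor.
hr <- data.frame(
  satisfaction = factor(c("low", "low", "low", "high",
                          "high", "mid", "low", "high")),
  salary       = factor(c("low", "mid", "low", "high",
                          "mid", "mid", "low", "high")),
  left         = factor(c(TRUE, TRUE, TRUE, FALSE,
                          FALSE, FALSE, TRUE, FALSE))
)

# Each "column=value" pair becomes one item, e.g. "left=TRUE".
trans <- as(hr, "transactions")

# Only allow "left=TRUE" on the RHS; every other item may appear on the LHS.
rules <- apriori(
  trans,
  parameter  = list(supp = 0.1, conf = 0.5, minlen = 2, maxlen = 3),
  appearance = list(rhs = "left=TRUE", default = "lhs"),
  control    = list(verbose = FALSE)
)

inspect(sort(rules, by = "lift"))
```

Setting `minlen = 2` excludes the rule with an empty antecedent, which would otherwise only restate how common the target is.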

1. The rules with highest lift
Which combination of factors leads to employees leaving? {average monthly hours=low, satisfaction level=low, number of projects=low}

2. The most popular reason why people left
The item linked to the most rules: {satisfaction level=low}

3. Zoom to a specific item (reason)
Contrary to common belief, {salary=low} is not one of the most popular reasons why employees quit. It is associated with only one rule, whose lift (3.1) ranks 13th out of the 15 rules generated.

Acknowledgements

Thanks to Prof Kam for the guidance on our project! Special thanks to the kind souls on Stack Overflow for clearing the obstacles blocking our way.
