Difference between revisions of "AY1516 T2 Team13 Natasha Studio Findings RuleMining"


Revision as of 13:39, 17 April 2016


Process Flow

Process Flow of ARM Analysis

Using SAS® Enterprise Miner 12.1, we performed association rule mining (ARM) with the Association node, using “Member” as the ID variable and three different Target variables: 1) “Price Package”, 2) “Package (Genre)” and 3) “Course/Open & Level”. This allowed us to identify key associations between the different packages that customers purchase.
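To illustrate what the ID and Target roles mean in practice, the following is a minimal Python sketch, not the SAS Enterprise Miner implementation. It groups purchases by member and computes support, confidence and lift for simple one-to-one rules. The records, the item labels and the function names `build_baskets` and `pair_rules` are hypothetical and are not taken from our dataset.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical (member_id, price_package) records; not the project's data.
purchases = [
    ("M1", "A"), ("M1", "B"),
    ("M2", "A"), ("M2", "B"),
    ("M3", "A"),
    ("M4", "C"),
]

def build_baskets(records):
    """Group Target items by the ID variable: one set of items per member."""
    baskets = defaultdict(set)
    for member_id, item in records:
        baskets[member_id].add(item)
    return dict(baskets)

def pair_rules(baskets):
    """Compute support, confidence and lift for every rule lhs => rhs."""
    n = len(baskets)
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for items in baskets.values():
        for item in items:
            item_count[item] += 1
        for lhs, rhs in permutations(sorted(items), 2):
            pair_count[(lhs, rhs)] += 1
    rules = {}
    for (lhs, rhs), both in pair_count.items():
        support = both / n                        # share of members with both items
        confidence = both / item_count[lhs]       # share of lhs buyers who also have rhs
        lift = confidence / (item_count[rhs] / n) # confidence relative to rhs baseline
        rules[(lhs, rhs)] = {
            "support": support,
            "confidence": confidence,
            "lift": lift,
        }
    return rules

baskets = build_baskets(purchases)
rules = pair_rules(baskets)
```

In this toy data, the rule A => B holds for two of the four members, so its support is 0.5, and two of the three members who bought A also bought B, so its confidence is 2/3.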

Because of a gap in our data for 2013, we first split our preliminary analysis into two periods: 2010-2012 and 2014-2015. This allowed us to check whether the associations differ between the two periods and whether our subsequent analysis needed to be split. Preliminary results showed that the associations discovered do differ between the two periods. Consequently, we proceeded to analyze them separately.

We also applied sequence discovery to enhance our ARM model. Adding “Date Purchased” as a Sequence variable takes the time of purchase into account. This enhancement is necessary because customers typically do not buy more than one package at the same time. Instead, they buy one package, use it and then buy another. Hence, we believe that the findings from sequence discovery are more applicable.
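The idea of taking purchase order into account can be sketched in Python as follows. This is only an illustration under simplifying assumptions and not how SAS Enterprise Miner implements sequence discovery. The records, the package names "Trial" and "Regular", and the functions `ordered_histories` and `sequence_rule` are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (member_id, date_purchased, package) records; not the project's data.
records = [
    ("M1", date(2014, 1, 5), "Trial"),
    ("M1", date(2014, 3, 2), "Regular"),
    ("M2", date(2014, 2, 1), "Trial"),
    ("M2", date(2014, 4, 9), "Regular"),
    ("M3", date(2014, 6, 1), "Regular"),
    ("M3", date(2014, 7, 1), "Trial"),
    ("M4", date(2015, 1, 1), "Trial"),
]

def ordered_histories(rows):
    """Return each member's packages in order of purchase date."""
    histories = defaultdict(list)
    for member_id, _purchased_on, package in sorted(rows, key=lambda r: (r[0], r[1])):
        histories[member_id].append(package)
    return dict(histories)

def sequence_rule(histories, first, then):
    """Support and confidence of the sequence 'first, later followed by then'."""
    n = len(histories)
    has_first = 0
    has_sequence = 0
    for packages in histories.values():
        if first in packages:
            has_first += 1
            start = packages.index(first)
            # Only purchases made after the first occurrence count.
            if then in packages[start + 1:]:
                has_sequence += 1
    support = has_sequence / n
    confidence = has_sequence / has_first if has_first else 0.0
    return support, confidence

histories = ordered_histories(records)
trial_then_regular = sequence_rule(histories, "Trial", "Regular")
regular_then_trial = sequence_rule(histories, "Regular", "Trial")
```

Unlike a plain association rule, the two directions are counted separately here: member M3 bought "Regular" before "Trial", so M3 supports only the second sequence.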

Calibration of ARM Analysis

The above shows the final calibration of our model, which was designed to give us the most suitable set of rules.

Comparing our association results with our sequence results, we find that focusing on the sequence results is sufficient because, in general, the same rules are flagged under both analyses. The key difference is that, as mentioned, the sequence analysis takes the time factor into account. As such, we proceeded to focus on our sequence analysis.


Statistical Analysis

Tables: 2010-12ARMStats (top), 2014-15ARMStats (bottom)

From the two tables above, with the top table covering 2010-2012 and the bottom table 2014-2015, we see that the averages of the three measures (support, confidence and lift) are about the same in both periods. The average lift is just above 1, indicating that the rules are, on average, of limited strength. The average support is also fairly low at 5.46%, meaning that, on average, these rules do not occur very often in our dataset. Thus, some rules are likely to be irrelevant and would be removed from our analysis if their support is too low or their lift is less than 1.
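The screening described above can be sketched in Python as follows. The rule statistics below are invented for illustration and are not the values in our tables. The minimum support threshold of 5.0 and the function names `summarise` and `keep_relevant` are likewise assumptions, since no specific cut-off has been fixed here.

```python
from statistics import mean

# Hypothetical rule statistics (support and confidence in %); not from the tables.
rules = [
    {"rule": "A => B", "support": 12.0, "confidence": 40.0, "lift": 2.5},
    {"rule": "B => C", "support": 1.5, "confidence": 20.0, "lift": 1.8},
    {"rule": "C => A", "support": 8.0, "confidence": 10.0, "lift": 0.9},
]

def summarise(rule_list, measure):
    """Mean and maximum of one measure across all rules."""
    values = [r[measure] for r in rule_list]
    return {"mean": mean(values), "max": max(values)}

def keep_relevant(rule_list, min_support, min_lift=1.0):
    """Drop rules whose support is too low or whose lift is less than min_lift."""
    return [
        r for r in rule_list
        if r["support"] >= min_support and r["lift"] >= min_lift
    ]

lift_summary = summarise(rules, "lift")
support_summary = summarise(rules, "support")
kept = keep_relevant(rules, min_support=5.0)  # assumed threshold
```

With these made-up numbers, only "A => B" survives: "B => C" is dropped for low support and "C => A" because its lift is below 1.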

However, the maximum statistics show promising results, with support as high as 22.63%, confidence as high as 55.09% and lift as high as 6.11 in 2014-2015. This means that there are rules that cover a large proportion of the dataset and rules whose high lift indicates a strong association. Thus, even though the number of rules may be small, the rules are not necessarily of low quality.