AY1516 T2 Team13 Natasha Studio Project Overview Methodology
Contents
SCOPE OF WORK
METHODOLOGY
Relational Database
Currently, Natasha Studio uses a flat-file database, a simple Excel spreadsheet, to keep all its records. In day-to-day operations, the counter staff enter information into the spreadsheet whenever there is a new sale. This method of storing data invites inconsistency and could result in many problems. In addition, Natasha Studio’s move to hardcopy records kept in log books further increased the data inconsistency. Missing data also became much more apparent, as seen in our data cleaning and exploratory data analysis. For example, the log book contains many entries in which the genre of an open classes package is not recorded. As such, data storage is a more pertinent analytical problem, and we would advise our client to use a relational database management system (RDBMS) instead, for its many advantages explained below.
Evaluation of database tools
Based on our client’s requirements, our team finds that SQLite would be the better choice for Natasha Studio. Although MySQL is arguably the most popular solution and is used by most of the big players, it requires a server to connect to the database. Our client has mentioned that different counter staff use different computers and that Wi-Fi might not be readily available at the studio. Thus, the serverless SQLite would be a more appropriate choice. Even though PostgreSQL is highly customizable, it is also not suitable for our client due to its complexity. Its steep learning curve for daily usage is likely to pose a barrier for our client too.
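To illustrate how a serverless relational database could address the missing-genre and inconsistency problems described above, the following is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are hypothetical assumptions for illustration only and do not reflect the studio's actual record layout.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE member (
    member_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE package (
    package_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    genre      TEXT NOT NULL           -- NOT NULL rejects packages with no genre
);
CREATE TABLE purchase (
    purchase_id   INTEGER PRIMARY KEY,
    member_id     INTEGER NOT NULL REFERENCES member(member_id),
    package_id    INTEGER NOT NULL REFERENCES package(package_id),
    purchase_date TEXT NOT NULL        -- ISO 8601 date string
);
"""

def open_database(path=":memory:"):
    """Open (or create) a single-file SQLite database; no server is needed."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clauses
    conn.executescript(SCHEMA)
    return conn

conn = open_database()

# Made-up sample rows, for illustration only.
conn.execute("INSERT INTO member (member_id, name) VALUES (1, 'Example Member')")
conn.execute(
    "INSERT INTO package (package_id, name, genre) "
    "VALUES (1, '08 Open Classes', 'Example Genre')"
)
conn.execute(
    "INSERT INTO purchase (member_id, package_id, purchase_date) "
    "VALUES (1, 1, '2016-01-15')"
)

# A package without a genre is rejected instead of being stored silently.
try:
    conn.execute("INSERT INTO package (package_id, name, genre) VALUES (2, 'No Genre', NULL)")
    rejected_missing_genre = False
except sqlite3.IntegrityError:
    rejected_missing_genre = True

# A purchase that points at a non-existent member is rejected as well.
try:
    conn.execute(
        "INSERT INTO purchase (member_id, package_id, purchase_date) "
        "VALUES (99, 1, '2016-01-16')"
    )
    rejected_unknown_member = False
except sqlite3.IntegrityError:
    rejected_unknown_member = True
```

Passing a file path instead of ":memory:" would keep the whole database in one file that can be copied between counter computers, which is the property that makes SQLite attractive when no network is available.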
Association Rule Mining
Currently, our client does not have a formal sales monitoring system. Customers’ purchases of dance packages are simply recorded in either an Excel spreadsheet or a log book, without much further analysis. Our preliminary data exploration highlighted the transactional nature of the purchase data: each package is bought by a particular member on a particular date. Thus, given the availability of transaction data, we apply Association Rule Mining (ARM) to this dataset, hoping to identify purchasing patterns which would be useful to our client in marketing and promoting dance packages to their future members. The results will enable us to implement suitable sales promotions, cross-selling promotions and recommendations in order to capture more sales per customer. Furthermore, sequence discovery will be used to analyze the purchasing patterns of customers. This allows us to understand the trends and decisions made by customers after the expiration of each package, so that suitable sales and marketing efforts can be implemented to drive repeat sales.
Sequence discovery
ARM can also be extended to include a time dimension, known as sequence discovery. Standard ARM does not take time into account and identifies rules based only on the collection of transactions. Sequence discovery goes a step further by adding time into the mix. The rules it discovers describe not just a bundle of products but also when the products would be bought. For example, for the group that purchases the “Unlimited Any Classes: 1 Month” and “08 Open Classes” packages, the sequence of package purchases might reveal deeper trends in their package preferences and learning needs.
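As a rough illustration of the idea, the sketch below counts how many members bought one package and later bought another. It is a simplified stand-in for sequence discovery, not the algorithm used by any particular tool. The function name and the purchase records are hypothetical, and only the two package names come from the example above.

```python
from collections import defaultdict
from itertools import combinations

OPEN = "08 Open Classes"
UNLIMITED = "Unlimited Any Classes: 1 Month"

def sequence_support(purchases):
    """Return {(earlier, later): share of members who bought `earlier` before `later`}.

    `purchases` is an iterable of (member, iso_date, package) tuples.
    ISO 8601 date strings sort chronologically when compared as text.
    """
    histories = defaultdict(list)
    for member, date, package in sorted(purchases, key=lambda p: (p[0], p[1])):
        histories[member].append(package)

    counts = defaultdict(int)
    for packages in histories.values():
        # combinations() keeps positional order, so each pair is (earlier, later);
        # the set ensures a member is counted at most once per pair.
        for pair in set(combinations(packages, 2)):
            counts[pair] += 1

    n_members = len(histories)
    return {pair: count / n_members for pair, count in counts.items()}

# Made-up purchase records, for illustration only.
purchases = [
    ("m1", "2016-01-05", OPEN),
    ("m1", "2016-03-01", UNLIMITED),
    ("m2", "2016-01-10", OPEN),
    ("m2", "2016-02-20", UNLIMITED),
    ("m3", "2016-02-01", UNLIMITED),
    ("m3", "2016-03-05", OPEN),
]

support = sequence_support(purchases)
```

In this made-up data, two of the three members move from the open classes package to the unlimited package, while one moves in the opposite direction. Plain ARM would treat all three members identically, since each bought both packages.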
ARM Tools
Several ARM tools are available in the market and the suitability of each tool in the context of this project will be considered.
Excel
Excel is able to perform ARM using the SQL Server Data Mining Add-in. However, the ability to calibrate the model is fairly limited, only allowing one to adjust the “minimum support” and “minimum rule probability”.
SAS Enterprise Miner
SAS Enterprise Miner (SAS EM) is a powerful data mining tool. It has a user-friendly interface which allows the user to easily adjust the thresholds at which an association would be identified. SAS EM also has the option to perform sequence discovery.
Data Mining using R
R is an open source tool which is very versatile in applying many different data mining techniques. Packages like Rattle provide a useful graphical user interface for data mining in R. However, for analysing the results, Rattle generates fewer graphical charts than SAS EM.
RapidMiner
RapidMiner is open source data mining software with a user-friendly interface. It also allows the user to adjust thresholds such as minimum support and minimum confidence.
In light of the software available, our group has chosen to use SAS EM for our market basket analysis. One of the major advantages of SAS EM is its ability to perform sequence discovery easily alongside ARM. Our preliminary data exploration has shown that members often do not buy more than one package at the same time. Instead, members would buy one package, use it and subsequently buy another package. Thus, sequence discovery would give deeper insight into customers’ purchasing patterns. Although SAS EM is not open source software, we find that the open source criterion is not critical for our ARM analysis, as our client would not be performing such analysis on its own, unlike with the relational database. Through SAS EM, we will build an ARM model, keeping in mind the 3 measures (support, confidence, lift) and 3 types (useful, trivial, inexplicable) mentioned above while modelling.
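For reference, the three measures can be computed by hand for a single rule A → B using their standard definitions: support is the share of baskets containing both A and B, confidence is that share divided by the share containing A, and lift is confidence divided by the share containing B. The sketch below is illustrative only and is not how SAS EM is implemented. The function name, the baskets and "Hypothetical Package X" are assumptions, and each basket is taken to be the set of packages one member has bought.

```python
OPEN = "08 Open Classes"
UNLIMITED = "Unlimited Any Classes: 1 Month"
OTHER = "Hypothetical Package X"  # placeholder name, not a real package

def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent <= set(b))
    n_b = sum(1 for b in baskets if consequent <= set(b))
    n_ab = sum(1 for b in baskets if (antecedent | consequent) <= set(b))

    support = n_ab / n                           # P(A and B)
    confidence = n_ab / n_a if n_a else 0.0      # P(B | A)
    lift = confidence / (n_b / n) if n_b else 0.0  # P(B | A) / P(B)
    return support, confidence, lift

# Made-up baskets, one per member, for illustration only.
baskets = [
    {OPEN, UNLIMITED},
    {OPEN, UNLIMITED},
    {OPEN},
    {UNLIMITED},
    {OTHER},
]

support, confidence, lift = rule_metrics(baskets, {OPEN}, {UNLIMITED})
```

With these made-up baskets the rule has a support of 2/5, a confidence of 2/3 and a lift of 10/9. A lift above 1 indicates that the two packages appear together more often than they would if purchases were independent.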
Logistic Regression
Currently, Natasha Studio does not explicitly keep track of customers’ utilization of its packages. Thus, in order to help Natasha Studio improve its business competitiveness, we are looking to analyse whether there is a relationship between customers’ utilization of their purchased packages and their subsequent repurchase rate. We will be applying logistic regression for this purpose.
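To make the technique concrete, the sketch below fits a one-variable logistic regression by plain gradient descent, modelling the probability of repurchase as a function of utilization. It is a minimal illustration in Python rather than the SAS workflow used in the project. The function names, learning rate, utilization values and repurchase outcomes are all made-up assumptions.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.5, epochs=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1

def predict(b0, b1, x):
    """Estimated probability of repurchase at utilization x."""
    return sigmoid(b0 + b1 * x)

# Made-up data: share of purchased classes actually attended (0 to 1)
# and whether the member bought another package afterwards (1 = yes).
utilization = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
repurchased = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(utilization, repurchased)
```

A positive slope b1 would suggest that members who use more of their package are more likely to repurchase. In practice, the significance of the coefficient would also need to be checked, which statistical packages report directly.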
Statistical Analysis Tools
Several statistical analysis tools are available in the market. Given that logistic regression is a widely known analytics technique, the majority of statistical analysis tools are capable of performing it, from Excel to R to SAS to SPSS. We find that these tools do not differ significantly in the application of logistic regression. Thus, our team has decided to use SAS Enterprise Guide for the purpose of our project.