Difference between revisions of "AY1516 T2 Team13 Natasha Studio Project Overview Methodology"

From Analytics Practicum
Jump to navigation Jump to search
Line 44: Line 44:
 
==<div style="background: #B0C4DE; padding: 10px; font-family:Arial; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 3px; padding 0px 18px 0 px 18 px;"><font color="white">METHODOLOGY</font></div>==
 
==<div style="background: #B0C4DE; padding: 10px; font-family:Arial; font-size: 14px; font-weight: bold; line-height: 1em; text-indent: 3px; padding 0px 18px 0 px 18 px;"><font color="white">METHODOLOGY</font></div>==
  
<b><u>Tools used</u></b>
+
<h3>Relational Database</h3>
<br>The main tools that will be used are Microsoft Excel and R. Microsoft Excel will be used for the data preparation and cleansing process as it is preferred by our sponsor. R will be used mainly to build and evaluate the models. The open source nature of R would also allow our client to also use it with future data.  
+
Currently, Natasha Studio uses a flat file database, though a simple excel spreadsheet to keep all their records. Day-to-day operations would require the counter staff to enter information into the excel spreadsheet whenever there is a new sale. This method of storing data encourages inconsistency and could potentially result in many problems. In addition, Natasha Studio’s move over to hardcopy data through logs book further increased the data inconsistency. Missing data was also much more apparent as seen in our data cleaning and exploratory data analysis. For example, in the log book, there are many occurrences in which the type of genre for open classes packages is not recorded. As such, this is more pertinent analytical problem and we would advise our client to use a relational database management system (RDMS) instead for its many advantages explained below.
<br><br>
+
 
<b><u>Data Extraction</u></b>
+
<h3>Evaluation of database tools</h3>
<br>At present, data from July 2010 – August 2012 is available to us on Microsoft Excel. At Fall 2012, the business owner decided to stop using the system. Thus, purchases and attendance were recorded on paper. As a result, work has to be done during the data assessment process to enter data into the spreadsheet so that there are more data to work with.
+
[[File:evaluation.png|700px]]
<br><br>
+
<p>Based on our client’s requirements, our team finds that SQLite would be a better choice for Natasha Studio. Although MySQL seems like the most popular solution that is used by most of the big players, it requires a server to connect to the database. Our client has mentioned to us that different counter staff uses different computers and there might not be Wi-Fi readily available at the studio. Thus, the server-less SQLite would be a more appropriate choice. Even though PostgreSQL is highly customizable, it is also not suitable for our client due to its overly complex nature. Its steep learning curve with regards to daily usage is likely to pose as a barrier for our client too.</p>
<b><u>Data Preparation</u></b>
+
 
<br>In addition to the data that is missing from 2013-2015, the current data set that was presented to us would require significant efforts in data cleansing due to the inconsistency, missing data fields and duplicates that has been resulted from bad practices throughout the 3 years. We foresee that huge amount of time will be used for this process.
+
<h3>Association Rule Mining</h3>
<br><br>
+
Currently, our client does not have a formal sales monitoring system. Customers’ purchases of dance packages are simply recorded either onto an Excel spreadsheet or a log book without much further analysis. Our preliminary data exploration highlighted the transactional nature of the purchases data. In the purchases data, packages are bought by a particular member at a particular date. Thus, given the availability of transaction data, we apply Association Rule Mining (ARM) to this dataset, hoping to identify purchasing patterns which would be useful to our client in marketing and promoting dance packages to their future members. The results will enable us to implement suitable sales promotions, cross-selling promotions and recommendations in order to capture more sales per customer. Furthermore, sequence discovery will be used to analyze the purchasing patterns of customers. This allows us to understand the trends and decisions made by the customers after the expiration of every package. Following which, suitable sales and marketing efforts can be implemented in order to drive repeat sales.
<b><u>Data Validity</u></b>
 
<br>As of the data that was presented to us, there were a total of 1717 members and 3044 purchases. The numbers are not final as data cleansing has yet to been done during this stage of the project. However, based on visual observation of the data, we are confident that it should not deviate too far away from the reported numbers as shown above.  
 
In order to perform a substantial analysis on the data, a rough estimate of 5000 data points is required for the proposed techniques shown below. The current data set is not sufficient as it does not meet the required sample size. However, the business owner has informed us that he has a rough estimate of 5300 members. As such, we are confident that after the data assessment process, we will have sufficient data to work on.
 

Revision as of 19:34, 28 February 2016

HOME

TEAM

PROJECT OVERVIEW

FINDINGS & ANALYSIS

PROJECT MANAGEMENT

DOCUMENTATION

BACKGROUND DATA METHODOLOGY


SCOPE OF WORK

Scope.png

METHODOLOGY

Relational Database

Currently, Natasha Studio uses a flat file database, though a simple excel spreadsheet to keep all their records. Day-to-day operations would require the counter staff to enter information into the excel spreadsheet whenever there is a new sale. This method of storing data encourages inconsistency and could potentially result in many problems. In addition, Natasha Studio’s move over to hardcopy data through logs book further increased the data inconsistency. Missing data was also much more apparent as seen in our data cleaning and exploratory data analysis. For example, in the log book, there are many occurrences in which the type of genre for open classes packages is not recorded. As such, this is more pertinent analytical problem and we would advise our client to use a relational database management system (RDMS) instead for its many advantages explained below.

Evaluation of database tools

Evaluation.png

Based on our client’s requirements, our team finds that SQLite would be a better choice for Natasha Studio. Although MySQL seems like the most popular solution that is used by most of the big players, it requires a server to connect to the database. Our client has mentioned to us that different counter staff uses different computers and there might not be Wi-Fi readily available at the studio. Thus, the server-less SQLite would be a more appropriate choice. Even though PostgreSQL is highly customizable, it is also not suitable for our client due to its overly complex nature. Its steep learning curve with regards to daily usage is likely to pose as a barrier for our client too.

Association Rule Mining

Currently, our client does not have a formal sales monitoring system. Customers’ purchases of dance packages are simply recorded either onto an Excel spreadsheet or a log book without much further analysis. Our preliminary data exploration highlighted the transactional nature of the purchases data. In the purchases data, packages are bought by a particular member at a particular date. Thus, given the availability of transaction data, we apply Association Rule Mining (ARM) to this dataset, hoping to identify purchasing patterns which would be useful to our client in marketing and promoting dance packages to their future members. The results will enable us to implement suitable sales promotions, cross-selling promotions and recommendations in order to capture more sales per customer. Furthermore, sequence discovery will be used to analyze the purchasing patterns of customers. This allows us to understand the trends and decisions made by the customers after the expiration of every package. Following which, suitable sales and marketing efforts can be implemented in order to drive repeat sales.