Teppei Syokudo - Improving Store Performance: PPA Data Analysis Methodology


Data Exploration

Looking at both MW and RP’s product sales, Main - Meal shows a decreasing trend, as does Main - Drink. This is likely due to the introduction of Set Menus, as customers tend to prefer purchasing sets rather than à la carte items. The inverse relationship between Main - Meal and Set Menu is much more obvious in RP: the moment the Set Menu was introduced in November, Main - Meal sales started dropping.

The most popular Main - Meal in both outlets is the Kaisendon, which is sold about 60 times daily in MW and 50 to 60 times daily in RP on average. Its related set is the Seafood Feast, which averages 40 and 24 sales daily in MW and RP respectively.

We can also see that sales of Main - Onigiri and Main - Fried show flat to decreasing trends in both outlets, suggesting that the onigiris and fried items are not particularly popular. To boost sales of these items, Teppei Syokudo may want to consider introducing onigiri sets and fried item sets.

1.png 2.png 3.png 4 apsm.png


Data

Traditionally, data related to Market Basket Analysis is three-dimensional: Customers, Orders (i.e. item sets, purchases or baskets), and Items (Berry and Linoff, 2004). A sales order is the most essential and basic piece of information, representing a single purchase or transaction made by a customer. Besides core information such as the products bought, the quantity of each product and the total amount of the purchase, the order data also stores the store number, cashier number, type of payment and even the cashier who served. The items, or rather the contents of the order, are most important and form the basis for identifying association rules. Last but not least, customer information provides a deeper level of analysis by revealing associations between particular customer traits and profiles and particular items, allowing the store to carry out market segmentation (Ting, Pan and Chou, 2010).

A market basket database typically consists of a large number of transaction records, each listing all the items purchased during a single customer transaction. The objective of this data mining exercise is to identify whether certain groups of items are usually purchased together, yielding meaningful association rules.

Analysis Tool Selection

In carrying out Market Basket Analysis, certain considerations have to be made. One important factor is the software or tool used. Based on the client requirements for this project, the tool must be open-source and easy to use. While the team understands that there is far greater utility in paid software such as Clementine (SPSS), Enterprise Miner (SAS), GhostMiner 1.0, Quadstone or XLMiner, this requirement considerably narrows down the tools the team is able to use (Haughton et al., 2003). The open-source candidates were narrowed down to three tools: RapidMiner, R and Tanagra.

Ppa-table1.jpg

After evaluating the three tools, the team realized that although R provides rich measures and customizability, its learning curve is extremely steep and it may not suit the client, given the non-programming nature of their background. Both RapidMiner and Tanagra are extremely lightweight and easy to use; however, the more extensive set of interestingness measures led the team to choose RapidMiner.

Analysis Breakdown

The study will carry out the analysis in the following flow:

  1. Products that are already in a set are analysed first. Since there is already an association between items found in a set, the confidence of the set components is analysed to identify the most popular side dishes and drinks as well as the least popular ones.
  2. For products that are not within a set, Market Basket Analysis is carried out to identify association rules between products.
  3. Products with low profitability are identified using cost and revenue figures provided by the store; products whose removal could contribute to the “profitable-product death spiral” are flagged so that they are not dropped.

Data Cleaning and Procedures

Most Point-of-Sale (POS) systems provide transaction data consisting of information such as transaction id, product id, quantity, price per product, date, etc. The data given by the POS system for this example is in the following format:

Ppa-table2.jpg

The team formatted and exported the data as a “Comma Separated Values” (CSV) file, where “t_id” is the transaction id, “p_id” is the product id and “qty” is the quantity of that product sold in the given transaction.

Ppa-table3.jpg

However, this data requires further transformation from transactional data into market basket data. Ideally, market basket data should be represented in a binary format where each row is a separate transaction id and each column corresponds to a product or item sold. Although quantity is present in the data, RapidMiner does not take quantity into account when generating the results.

Ppa-fig2.jpg

Note: our team assumes that the data provided by the POS system is in the basic relational format shown above. If the data has already been transformed in a way that makes certain steps redundant (e.g. the data is already in binomial form), the respective steps can be skipped.

To transform the data into a format suitable for identifying frequent itemsets and generating rules, the team carried out the following process (Deshpande, 2012) in RapidMiner:

Ppa-fig3.jpg

Insert Data. First, we loaded the CSV file into the RapidMiner repository by clicking on “Add Data”. RapidMiner should detect the file’s format and automatically split each row into columns at the “,” delimiter. Once the file is successfully formatted and stored, the following result should be seen:

Ppa-fig4.jpg

Now that the data is in the local repository, we can drag it into the process canvas to mark the start of the analysis; a “Retrieve <name of file>” operator should appear. From the transaction data, we kept only the columns that we require: we connected a “Select Attributes” operator to the “out” port of the “Retrieve” operator and selected the filter type “subset” to retain only the relevant columns “t_id”, “p_id” and “qty”.

Ppa-fig5.jpg
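For readers who prefer a scriptable equivalent, each of the following steps is also accompanied by a short Python/pandas sketch. These sketches are not part of the team’s RapidMiner process; the file name transactions.csv is illustrative and simply stands for the hypothetical POS export described above.

  import pandas as pd

  # Load the hypothetical POS export (columns include t_id, p_id, qty)
  df = pd.read_csv("transactions.csv")

  # Equivalent of the "Select Attributes" step: keep only the relevant columns
  df = df[["t_id", "p_id", "qty"]]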

Pivot Transformation. Next, we group the transactions by transaction id, with products represented in subsequent columns; this operation produces the result previously shown in Figure 2. The “Pivot” operator is used and connected to the “out” port of the “Select Attributes” operator; under the “group attribute” parameter we select “t_id”, and under “index attribute” we select “p_id”. The output from this step is:

Ppa-fig6.jpg
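Continuing the pandas sketch from the previous step, the same grouping can be expressed as a pivot. Note that using t_id as the index also plays the part of the later “Set Role” step, since the identifier is then excluded from the analysed columns.

  # Equivalent of the "Pivot" step: one row per transaction, one column per product
  basket = df.pivot_table(index="t_id", columns="p_id", values="qty", aggfunc="sum")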

Replacing Missing Values. Since some products have zero quantity in particular transactions, a “?” (missing value) remains after the “Pivot” operation. We connect a “Replace Missing Values” operator to the “exa” port of the “Pivot” operator to replace these missing values with “0”, correctly reflecting the quantity of the product bought in each transaction. Under the “default” option, “zero” is selected.

Ppa-fig7.jpg
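In the pandas sketch, the same clean-up is a single call on the pivoted DataFrame from above.

  # Equivalent of "Replace Missing Values": products not bought in a transaction
  # appear as NaN after the pivot; treat them as quantity 0
  basket = basket.fillna(0)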

Set Role ID. After the above transformation, all the new columns produced are attributes with the same role. However, to better represent the data and to prepare it for later parts of the process, we have to give some of the columns a particular “role”; the role defines the part an attribute plays in a data set or process. In this example, we specifically change the role of “t_id” to id, since it is used as a unique identifier for each transaction. This removes “t_id” from the analysis and leaves it as an identification attribute in the later parts of the process. The “Set Role” operator is selected and connected to the “exa” port of the “Pivot” operator. Under the “attribute name” parameter, “t_id” is selected, and as the “target role”, “id” is selected.

Ppa-fig8.jpg

Binomial Representation. Within RapidMiner, the quantity of products purchased in a transaction is not relevant; what matters is simply whether a product was purchased or not. The algorithm used later requires the transaction data to be represented in “binomial” values – meaning each analysed attribute has exactly two possible values: “true” if the product is purchased in a transaction and “false” if it is not. The “Numerical to Binomial” operator in RapidMiner transforms the attributes by checking whether they fall between a minimal and maximal value; if an attribute falls between these values it takes the value “false”, otherwise “true”. By default, the minimal and maximal values are both set to “0.0”, so an attribute whose original value is 0 is transformed to “false”.

Ppa-fig9.jpg
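The pandas equivalent of the binomial conversion, continuing from the basket DataFrame above, is a simple comparison.

  # Equivalent of "Numerical to Binomial": True if the product appears in the
  # transaction at all, False otherwise (quantities are deliberately discarded)
  basket = basket > 0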

Now that the data is prepared, we select the two main operators that carry out the analysis – “FP-Growth” and “Create Association Rules”. FP-Growth is one of the algorithms used to generate frequent itemsets, and “Create Association Rules” is used to find association rules among the frequent itemsets generated. The following is the completed process in RapidMiner:

Ppa-fig10.jpg

Frequent Itemset Generation. This problem requires us to find sets of items that appear in more than a given percentage of the total number of transactions; as mentioned, this is defined as the “minimum support” criterion. Frequent itemset generation is the prelude to association rule discovery: in RapidMiner, the “FP-Growth” operator finds the common itemsets, and the more complex discovery carried out by “Create Association Rules” is derived from the results of “FP-Growth”.

The reason RapidMiner uses FP-Growth is that, by building an FP-tree data structure over the data set, a very compressed copy of the data is created. This allows the computation to fit into main memory even for large data sets. Compared to the Apriori algorithm, the major advantage of FP-Growth is that it takes only two scans of the data and is hence more suitable for larger data sets. The “FP-Growth” operator only accepts binomial attributes, which is why the data was converted as described above.

There are two ways of using the “FP-Growth” operator:

  1. The first way allows the user to specify a number of products with the highest support to be selected; this is regardless of the minimum support threshold. This mode is used when we do not have a clear idea of a minimum support to set.
  2. The second way relies on the minimum support threshold and returns the itemsets with a support greater than the support value provided.

The two modes are determined by the “find min number of itemsets” parameter: when it is set to true, the first way is used; otherwise the second. In this example, the “FP-Growth” operator is connected to the “exa” port of the “Numerical to Binomial” operator, “find min number of itemsets” is deselected, and a minimum support threshold of “0.05” is set. We can also set the minimum and maximum size of the itemsets; a “Min. Size” of “1” and a “Max. Size” of “3” are selected.

Ppa-fig11.jpg
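As a scriptable counterpart, the mlxtend library provides an FP-Growth implementation that works on the boolean basket DataFrame built in the sketches above. The sketch below uses the same minimum support of 0.05 and caps itemsets at 3 items; mlxtend has no separate minimum-size parameter, so itemsets of size 1 simply remain in the output.

  from mlxtend.frequent_patterns import fpgrowth

  # Frequent itemsets with support >= 0.05 and at most 3 items per set;
  # use_colnames=True reports product ids rather than column positions
  frequent_itemsets = fpgrowth(basket, min_support=0.05, use_colnames=True, max_len=3)
  print(frequent_itemsets.sort_values("support", ascending=False).head(10))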

Association Rule Discovery. Finally, the operator “Create Association Rules” is added, and the “fre” output of “FP-Growth” is connected to the “fre” input of this operator; doing this delivers both the frequent itemsets “fre” and the association rules “rul” to the result ports “res” on the right side. This is the step where the data is analysed to produce association patterns based on the frequent itemsets provided, with the support and confidence criteria used to identify the most important relationships. In this operator, the antecedents are represented as “Premises” and the consequents as “Conclusions”. The confidence of a rule is the proportion of transactions containing the premise that also contain the conclusion. The “min confidence” parameter sets the minimum confidence for a rule to be selected; a “min confidence” of “0.8” is chosen. Six interestingness measures are provided in the output – we go into a deeper analysis of these measures in the next part of the report.

Ppa-fig12.jpg
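The corresponding mlxtend step, continuing from the frequent_itemsets produced above, generates the rules with the same 0.8 confidence threshold.

  from mlxtend.frequent_patterns import association_rules

  # Rules with confidence >= 0.8; "antecedents" correspond to RapidMiner's
  # "Premises" and "consequents" to its "Conclusions"
  rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.8)
  print(rules[["antecedents", "consequents", "support", "confidence", "lift", "leverage"]])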

Analysis Measures

While association analysis aims to detect relationships between items in a data set, the real value of the analysis lies in finding connections between items that do not intuitively seem related. However, in order to establish that a rule has statistical value and is not purely due to chance, we have to analyse the measures used to ascertain the interestingness of an association rule. In RapidMiner, besides support and confidence, there are five other common measures: “LaPlace”, “Gain”, “p-s”, “Lift” and “Conviction”. The formulas for these measures are as follows:

Ppa-table4.jpg
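For readers without access to the table image, the standard definitions of the measures most relevant to the discussion below are, for a rule A → B (RapidMiner’s exact forms for “LaPlace” and “Gain” are given in the table above):

  support(A → B)    = P(A, B)
  confidence(A → B) = P(A, B) / P(A)
  lift(A → B)       = P(A, B) / (P(A) × P(B))
  leverage(A → B)   = P(A, B) − P(A) × P(B)   (the “p-s”, Piatetsky-Shapiro measure)
  conviction(A → B) = (1 − P(B)) / (1 − confidence(A → B))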

In order to ascertain which interestingness measures are more meaningful, we analysed the measures based on three key properties (Piatetsky-Shapiro, 1991):

  • Property 1: M = 0 if A and B are statistically independent;
  • Property 2: M monotonically increases with P(A, B) when P(A) and P(B) remain the same;
  • Property 3: M monotonically decreases with P(A) (or P(B)) when the rest of the parameters (P(A, B) and P(B) or P(A)) remain unchanged.

Below is an analysis of the 5 measures based on the above 3 properties:

Ppa-table5.jpg

Based on the analysis of the 3 properties, the team decided upon Lift and Leverage as the analysis measures to be used.

Lift, Interest. Lift measures how many times more likely two products are to be bought together than they would be if the products were independent. A lift of 1 represents independence; any value greater than 1 indicates a dependency between the two products, and the stronger the dependency, the larger the lift. Consider the following example:

Ppa-table6.jpg

Here, the probability of Bread and Coffee being bought together is about six times higher than it would be if the two products were independent. Assume there were 100,000 transactions, 500 of which contain both Bread and Coffee. The individual probabilities are 0.0235 (Bread) and 0.0351 (Coffee), so if the products were totally independent, the probability of them occurring together would be 0.0235 x 0.0351 ≈ 0.000825, or roughly 82 out of the 100,000 transactions. The actual number of transactions (500) is about six times this expected value. In other words, when either product is bought, the other product is about six times more likely to be bought than its individual probability alone would suggest.
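Putting these figures into the lift formula:

  lift(Bread → Coffee) = P(Bread, Coffee) / (P(Bread) × P(Coffee))
                       = 0.005 / (0.0235 × 0.0351)
                       ≈ 0.005 / 0.000825 ≈ 6.1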

Leverage, Piatetsky-Shapiro Measure (PS). Similar to lift, leverage measures the additional probability of products X and Y being bought together over the probability of X and Y being bought independently. A leverage of 0 indicates that the purchases of the two products are independent, and values below 0 indicate a negative association; the larger the positive value, the stronger the indication of association between the two products.

In the example above, the leverage is 0.00417. This means that the actual increase in probability is about 0.004, or roughly 417 additional transactions out of 100,000.
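Substituting the example figures into the leverage formula:

  leverage(Bread → Coffee) = P(Bread, Coffee) − P(Bread) × P(Coffee)
                           = 0.005 − (0.0235 × 0.0351)
                           ≈ 0.005 − 0.000825 ≈ 0.00417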

While lift and leverage seem to carry a similar implication, leverage provides the clearer business implication, since it expresses the actual increase in purchase probability (and hence in transactions) attributable to the association between the selected products.

Analysis Measures for Sets

With regard to the assortments within sets, the team found that the two selected measures, lift and leverage, are not applicable. This is simply because the products are statistically dependent by construction, i.e. side dishes have to be bought together with a main dish. As such, the team considered the conditional probability of main dishes being bought together with particular sides and analysed the popularity of these combinations.
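One straightforward way to express this (an assumption about the exact computation, which follows from the description above) is the conditional probability of a side s given a main m within set orders:

  P(s | m) = (number of set orders containing both m and s) / (number of set orders containing m)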