ANLY482 AY2017-18T2 Group10 Project Overview: Data

From Analytics Practicum
 
<center>
[[File:tennet_logo.png|200px]]
</center>
  
</center>

<!------- End of Secondary Navigation Bar---->
'''''Due to confidentiality, we will not be able to upload any charts onto the wiki. The fully disclosed analysis report is available in our Interim Report submission.'''''
  
 
<!-- Body -->
  
 
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; margin-bottom:5px"><font color= #000000><strong>Data Overview</strong></font></div>==

<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
The data provided by the sponsor is in Microsoft Excel format, one file per outlet per month. The team used Python to clean and prepare the data. So far, the sponsor has provided a total of 24 months of data, from Dec 2015 to Dec 2017, comprising Inventory, Purchase, Daily PLU (Price Look-Up) and Sales data. One limitation is that the company recently changed the format of the inventory data, so we work with 2 different formats of inventory data. Below is a short description of each dataset:
<table border=1>

<tr><th>Dataset Name</th><th>Dataset Description</th></tr>

<tr><td>Inventory</td><td>Describes the inventory orders for each outlet; updated via monthly stock-taking.</td></tr>

<tr><td>Sales</td><td>Describes the daily sales for each outlet.</td></tr>

<tr><td>DailyPLU</td><td>Describes the daily number of patrons for each outlet, by type of meal.</td></tr>

<tr><td>Purchase Data</td><td>Describes the quantity of ingredients ordered by each outlet daily.</td></tr>

</table>
It is noted that the sponsor only performs stock-taking once a month, and hence only monthly inventory data is available; there are no daily or weekly stock levels recorded. However, the objective of the project is to forecast the demand for each ingredient on a daily basis. The proposed idea is to first forecast the demand for a particular month, then use the PLU data, which contains the daily number of pax, to break the forecasted monthly usage down into daily values.
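As a toy illustration of this proposed breakdown (all numbers and dates below are hypothetical, not from the sponsor's data), a monthly forecast can be split across days in proportion to each day's share of the month's pax:

```python
# Hypothetical monthly forecast for one ingredient at one outlet (kg),
# and hypothetical daily pax counts taken from the PLU data.
monthly_forecast_kg = 300.0
daily_pax = {"2017-11-01": 120, "2017-11-02": 80, "2017-11-03": 200}

# Each day's forecast is its share of the month's total pax.
total_pax = sum(daily_pax.values())
daily_forecast = {day: monthly_forecast_kg * pax / total_pax
                  for day, pax in daily_pax.items()}
# By construction, the daily values sum back to the monthly forecast.
```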
 
</div>
<br/>
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; margin-bottom:5px"><font color= #000000><strong>Purchase Data Preparation Process</strong></font></div>==

<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
Purchase data refers to the quantity of ingredients ordered by each outlet on a daily basis. The data we have spans December 2015 to December 2017. In the purchase data file, we identified many errors.<br/>
Firstly, the dates could not be used as structured: each row contained multiple critical pieces of information (for example, the quantity and amount purchased on 1 Jan, 2 Jan, and so on), whereas for data analysis each row should contain only one. Secondly, ProductCode and ProductName had missing values; upon clarification with our sponsor, we were informed that we could ignore these rows. Additionally, the purchase data would later be merged with the inventory data. Lastly, not all rows had an OutletNameLocation or SupplierName, and we removed all OutletNameLocations irrelevant to our project. As such, our group removed missing and irrelevant data, filled in the empty OutletNameLocation and SupplierName rows, melted the data by date so that it could be analysed, and concatenated all the files into a single file so that all the months could be analysed together. The diagram below visualises the entire data cleaning process.
[[File:Tennet Purchase Cleaning diagram.png|border|center|300px]]
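The steps above can be sketched in pandas roughly as follows. This is a minimal sketch assuming a layout where each date is a column; the column names are illustrative, not the sponsor's exact schema.

```python
import pandas as pd

def clean_purchase(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the purchase cleaning steps (hypothetical column names)."""
    df = df.copy()
    # Fill empty OutletNameLocation / SupplierName rows (merged-cell gaps)
    df[["OutletNameLocation", "SupplierName"]] = (
        df[["OutletNameLocation", "SupplierName"]].ffill()
    )
    # Drop rows with missing ProductCode/ProductName (sponsor: safe to ignore)
    df = df.dropna(subset=["ProductCode", "ProductName"])
    # Melt so each row holds one outlet-product-date observation
    id_cols = ["OutletNameLocation", "SupplierName", "ProductCode", "ProductName"]
    date_cols = [c for c in df.columns if c not in id_cols]
    return df.melt(id_vars=id_cols, value_vars=date_cols,
                   var_name="Date", value_name="Quantity")

# The monthly files would then be concatenated into one:
# combined = pd.concat([clean_purchase(pd.read_excel(f)) for f in files],
#                      ignore_index=True)
```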
</div>
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; margin-bottom:5px"><font color= #000000><strong>Inventory Data Preparation Process</strong></font></div>==
 
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
As the format of the Monthly Inventory Data changed from October 2017 onwards, there are two main formats of the Monthly Inventory Data, with different column names and different numbers of columns. Hence, to perform our EDA and analysis, we had to process the two formats separately. Using Python scripts, we extracted the necessary columns from each file, standardised the column names and compiled them into a single CSV data file, 'Inventory_Processed_2016-2017.csv'.

[[File:Tennet inventory processing diagram.png|border|center|600px]]

For each month and every outlet, the sponsor uses an Excel file to record various information about the inventory. The columns include:
  
<table border=1>
 
<tr><th>Column Name</th><th>Description</th></tr>
 
<tr><td>Outlet</td><td>Outlet ordering the product</td></tr>
 
<tr><td>Supplier</td><td>Supplier from which product is being ordered from</td></tr>
 
<tr><td>Product Code</td><td>Product unique identifier</td></tr>
 
<tr><td>Product Name</td><td>Name of product</td></tr>
 
<tr><td>Packaging</td><td>How product is packaged and sold. E.g. by KG, PKT or GRAMS</td></tr>
 
<tr><td>Case</td><td>Packet, Container or Box</td></tr>
 
<tr><td>CaseBal</td><td>Balance of Cases</td></tr>
 
<tr><td>Unit</td><td>KG, PKT or GRAM</td></tr>
 
<tr><td>UnitBal</td><td>Balance of individual units</td></tr>
 
<tr><td>PriceByCash</td><td>Unit cost for product</td></tr>
 
<tr><td>LastMthValue</td><td>Dollar value of product from previous month quantity</td></tr>
 
<tr><td>LastMthCase</td><td>Quantity of cases of product carried over from the previous month</td></tr>
 
<tr><td>CaseOpnStk</td><td>Amount of product from previous month in terms of case</td></tr>
 
<tr><td>CasePurQTY</td><td>Amount of cases bought in current month</td></tr>
 
<tr><td>UnitBalCal</td><td>Combined balance in terms of units</td></tr>
 
<tr><td>CaseBalCombine</td><td>Combined quantity from both cases and individual units</td></tr>
 
<tr><td>CaseUsageQTY</td><td>Quantity of cases used in the month, calculated by [CaseOpnStk+CasePurQTY-CaseBalCombine]</td></tr>
 
<tr><td>CaseUsageAmt</td><td>Amount used in dollars</td></tr>
 
<tr><td>AcctTerm</td><td>Category of product mainly for accounting purposes (Beverage, Food, etc)</td></tr>
 
<tr><td>Counter</td><td>Category of product (Vegetable, Sauce, Sushi)</td></tr>
 
<tr><td>Type</td><td>Type of product (Vegetable, Sauce, Ready-to-eat)</td></tr>
 
<tr><td>CaseBalCombine$</td><td>CaseBalCombine in terms of dollars</td></tr>
 
</table>
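The compilation step can be sketched roughly as follows. The old-format column mapping here is a hypothetical example (the real mapping depends on the sponsor's files); the usage calculation follows the CaseUsageQTY definition in the table above.

```python
import pandas as pd

# Hypothetical mapping from old-format (pre-Oct 2017) column names to the
# standardised names used after the format change.
OLD_TO_STD = {"ProdCode": "Product Code", "ProdName": "Product Name"}

def standardise(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Rename a file's columns to the standard set before concatenation."""
    return df.rename(columns=mapping)

def add_case_usage(inv: pd.DataFrame) -> pd.DataFrame:
    """CaseUsageQTY = CaseOpnStk + CasePurQTY - CaseBalCombine (per the table)."""
    inv = inv.copy()
    inv["CaseUsageQTY"] = (inv["CaseOpnStk"] + inv["CasePurQTY"]
                           - inv["CaseBalCombine"])
    return inv

# Standardised frames from both formats would then be concatenated and saved:
# pd.concat(frames, ignore_index=True).to_csv("Inventory_Processed_2016-2017.csv")
```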
 
 
</div>
<br/>
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; margin-bottom:5px"><font color= #000000><strong>Sales Data Preparation Process</strong></font></div>==
<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
Sales data refers to the daily sales that ABC Company records for each outlet, broken down into various categories such as card, NETS and cash. We have sales data from December 2015 to November 2017. Sales, which is exclusive of tax, is calculated using the formula "Sales = Nett Sales + Service Charge".

Since the sales data is relatively clean and ready to be analysed, there was very little to do. Our group prepared the data by adding a calculated column, "revenue", which is the sum of the columns "nett_sales" and "service", and concatenated all 4 Excel files into one single file. Using Python scripts, we standardised the column names and compiled them into a single CSV data file.

The sales data contains the following columns:

<table border=1>

<tr><th>Column Name</th><th>Description</th></tr>

<tr><td>Sales_dt</td><td>the sales date in format (dd/mm/yyyy)</td></tr>
 
<tr><td>Sales_day</td><td>the day of the week of the sale</td></tr>
 
<tr><td>Loc_id</td><td>the location ID</td></tr>
 
<tr><td>Loc_name</td><td>the location name</td></tr>
 
<tr><td>G_sales</td><td>gross sales</td></tr>
 
<tr><td>Discount</td><td>discount amount</td></tr>
 
<tr><td>Nett_sales</td><td>the total nett of sales (nett_sales = income - service - tax)</td></tr>
 
<tr><td>Service</td><td>the amount of service charge borne by the customer</td></tr>
 
<tr><td>Tax</td><td>the amount of goods & services tax paid by the customer</td></tr>
 
<tr><td>Income</td><td>the total amount of income collected (income = cash + cards + nets + voucher)</td></tr>
 
<tr><td>Cash</td><td>the amount of income from cash gained</td></tr>
 
<tr><td>Cards</td><td>the amount of income from cards gained</td></tr>
 
<tr><td>Nets</td><td>the amount of income from nets gained</td></tr>
 
<tr><td>Voucher</td><td>the amount of income from voucher gained</td></tr>
 
<tr><td>Pax</td><td>the number of people served</td></tr> 
 
<tr><td>Tv_count</td><td>transaction void count</td></tr>     
 
<tr><td>Tv_amt</td><td>transaction void amount</td></tr> 
 
<tr><td>Country</td><td>the country of the sale</td></tr> 
 
</table>
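The revenue calculation described above amounts to a single derived column. A minimal sketch, using the column names from the table above (the lower-casing step stands in for our column-name standardisation):

```python
import pandas as pd

def prepare_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Add revenue = nett_sales + service (sales exclusive of tax)."""
    df = df.copy()
    df.columns = [c.lower() for c in df.columns]  # standardise column names
    df["revenue"] = df["nett_sales"] + df["service"]
    return df

# The yearly Excel files would then be concatenated into one CSV:
# pd.concat(map(prepare_sales, frames), ignore_index=True).to_csv("sales.csv")
```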
 
 
</div>
<br/>
==<div style="background: #ffffff; padding: 17px;padding:0.3em; letter-spacing:0.1em; line-height: 0.1em;  text-indent: 10px; font-size:17px; text-transform:uppercase; font-weight: light; font-family: 'Century Gothic';  border-left:8px solid #1b96fe; margin-bottom:5px"><font color= #000000><strong>PLU Data Preparation Process</strong></font></div>==

<div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Century Gothic, Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left; font-size: 15px">
Daily PLU (Price Look-Up) data refers to the number of patrons each outlet serves daily, in various categories (Adult, Child, Student, Tourist, Senior and FOC Pax). There were several issues with the data. Firstly, all the days are in the same row, which makes the data hard to analyse. In addition, the column "Dept_Type2" contains values that are irrelevant to us, for example "Soup Base"; according to our sponsor, we are only interested in the 6 categories above. There are also many rows of the same category that differ only in their description: for example, adults from an outlet can carry different descriptions such as "Wkend Adult Dinner Buffet" or "Wkday Adult Dinner Buffet". Since this distinction is irrelevant to our analysis, we grouped all such rows into one row per category for each outlet. After cleaning these data issues, we concatenated the various Daily PLU files into one file to make the daily patron data easier to analyse. In addition, since we wanted to work at the weekly level, we summed the total customer counts within each week.

From our exploratory data analysis, we realised that some days have 0 values. After speaking to our sponsor, we learnt that these days have 0 values because of private events, company events or cleaning days, on which no customers are served. Since these 0 values would affect our analysis, we computed the 2-year average by outlet, customer group and day of week, and replaced the 0 values accordingly.

As such, we cleaned the data using the following steps:

#Remove irrelevant columns. Remove columns that are irrelevant to our analysis, such as transaction type, group_id and dept_id.
#Unpivot the data. Each row contains the customer counts for multiple days; we unpivot the data so that each row contains the count for only one day.
#Filter out irrelevant rows. In the column "Dept_Type2", we keep only the 6 customer categories of interest and discard other values such as "Chope" and "Soup Base".
#Include important columns. Introduce columns such as week number and year to give our models more features to analyse.
#Impute 0s with the 2-year average. Calculate the average for each outlet, customer category and day of the week, and impute the respective value into rows with 0 values.
#Group by week number. Since we will be forecasting weekly customer counts, we sum the customer counts that belong to the same week.

For reference, the monthly PLU file displays the number of people who patronise Seoul Garden by month and by location; below is its metadata:

<table border=1>

<tr><th>Column Name</th><th>Description</th></tr>

<tr><td>Sales_dt</td><td>the sales date</td></tr>

<tr><td>Outlet</td><td>the outlet number</td></tr>

<tr><td>Outletdesc</td><td>the name of the outlet</td></tr>

<tr><td>Transtype</td><td>Transaction Type (Dine-in, Takeaway, HouseAccount)</td></tr>

<tr><td>Group_id</td><td>grouping of type of PLU, setting product purposes</td></tr>

<tr><td>Dept_id</td><td>grouping PLU for POS interface</td></tr>

<tr><td>Shiftcode</td><td>the shift of the day that the transaction took place in</td></tr>

<tr><td>Plu</td><td>the lookup code for the transaction</td></tr>

<tr><td>Desc</td><td>the description of the transaction</td></tr>

<tr><td>Totalqty</td><td>the sum of the total of the type of transactions for the whole year</td></tr>
 
<tr><td>Unitprice</td><td>the price charged for one unit of a type of transaction</td></tr>
 
<tr><td>Totalamt</td><td>the product of totalqty and unitprice</td></tr>
 
<tr><td>Country</td><td>the country the outlet is located in</td></tr>
 
<tr><td>Concept</td><td>the type of restaurant concept</td></tr>
 
<tr><td>OutletName</td><td>the name of the outlet</td></tr>
 
<tr><td>Company</td><td>the parent company of the outlet</td></tr>
 
<tr><td>Shift</td><td>description of the shiftcode</td></tr>
 
<tr><td>Dept_Type</td><td>Department type</td></tr>
 
<tr><td>Dept_Type2</td><td>Product type</td></tr>
 
<tr><td>Dept_Type3</td><td>day type</td></tr>
 
<tr><td>Shift2</td><td>lunch / dinner shift</td></tr>
 
</table>
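The six cleaning steps above can be sketched in pandas roughly as follows; the column layout and names are illustrative assumptions, not the sponsor's exact schema.

```python
import pandas as pd

CATEGORIES = ["Adult", "Child", "Student", "Tourist", "Senior", "FOC Pax"]

def clean_plu(df: pd.DataFrame, day_cols: list) -> pd.DataFrame:
    """Hypothetical sketch of the six PLU cleaning steps."""
    # Steps 1 & 3: keep only relevant columns and customer categories
    df = df[df["Dept_Type2"].isin(CATEGORIES)]
    df = df[["Outletdesc", "Dept_Type2"] + day_cols]
    # Step 2: unpivot so each row holds one outlet/category/day count
    long_df = df.melt(id_vars=["Outletdesc", "Dept_Type2"], value_vars=day_cols,
                      var_name="Date", value_name="Pax")
    long_df["Date"] = pd.to_datetime(long_df["Date"], dayfirst=True)
    # Collapse duplicate descriptions into one row per category per day
    long_df = (long_df.groupby(["Outletdesc", "Dept_Type2", "Date"],
                               as_index=False).agg({"Pax": "sum"}))
    long_df["Pax"] = long_df["Pax"].astype(float)
    # Step 4: add week-number and year features
    iso = long_df["Date"].dt.isocalendar()
    long_df["Week"], long_df["Year"] = iso.week, iso.year
    # Step 5: impute 0s with the mean by outlet, category and day of week
    long_df["DayOfWeek"] = long_df["Date"].dt.dayofweek
    means = (long_df[long_df["Pax"] > 0]
             .groupby(["Outletdesc", "Dept_Type2", "DayOfWeek"])["Pax"].mean())
    keys = zip(long_df["Outletdesc"], long_df["Dept_Type2"], long_df["DayOfWeek"])
    fill = pd.Series([means.get(k, 0) for k in keys], index=long_df.index)
    long_df.loc[long_df["Pax"] == 0, "Pax"] = fill
    # Step 6: aggregate to weekly customer counts
    return (long_df.groupby(["Outletdesc", "Dept_Type2", "Year", "Week"],
                            as_index=False)["Pax"].sum())
```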
 
 
</div>

<!-- End Body --->

Latest revision as of 13:38, 15 April 2018

