ANLY482 AY2016-17 T1 Group5 - Data Analysis

From Analytics Practicum
Jump to navigation Jump to search

CafeAnalyticsLogo.png

CafeAnalytics Home button.png CafeAnalytics Aboutus button.png CafeAnalytics Projectoverview button.png CafeAnalyticsDataanalysis button active.png CafeAnalytics Projectmanagement button.png CafeAnalytics Documentation button.png
Data Cleaning Data Exploration Storyboard Data Manipulation Dashboard Insights


Data Description

Database Diagram

CafeAnalytics ER.png

Metadata

Category Table

Variable Data Format Variable Type
id String Nominal
main_category_id String Nominal
display_name String Nominal

Category_item Table

Variable Data Format Variable Type
id String Nominal
category_id String Nominal
item_id String Nominal

Item Table

Variable Data Format Variable Type
id String Nominal
plu String Nominal
display_name String Nominal
price1 Numeric Continuous
station1 String Nominal
station2 String Nominal
station3 String Nominal
station12 String Nominal

Modifier_group_item Table

Variable Data Format Variable Type
id String Nominal
item_id String Nominal
modifier_group_id String Nominal

Item_modifier_group Table

Variable Data Format Variable Type
id String Nominal
parent_item_id String Nominal
modifier_group_id String Nominal

Modifer_group Table

Variable Data Format Variable Type
id String Nominal
display_name String Nominal
price_level String Nominal

Order Table

Variable Data Format Variable Type
id String Nominal
created_timestamp Numeric Continuous
total Numeric Continuous
remarks String Nominal
Day of Week Numeric Ordinal
Day Numeric Ordinal
Month Numeric Ordinal
Month Year Numeric Ordinal
Year Numeric Continuous
Date Numeric Continuous
Hour Numeric Continuous

Order_item_parent Table

Variable Data Format Variable Type
id String Nominal
order_id String Nominal
quantity Numeric Continuous
remarks String Nominal
item_name String Nominal
modified_timestamp Numeric Continuous
item_price Numeric Continuous
item_id String Nominal
plu String Nominal

Order_item_parent_option Table

Variable Data Format Variable Type
id String Nominal
item_parent_id String Nominal
option_name String Nominal
modified_timestamp Numeric Continuous
item_price Numeric Continuous
item_id String Nominal
plu String Nominal

Full List Table

Variable Data Format Variable Type
order_id String Nominal
item_parent_id String Nominal
option_id String Nominal
item_parent_name String Nominal
item_parent_itemID String Ordinal
option_itemID String Nominal
item_parent_plu String Nominal
option_plu String Nominal
quantity Numeric Continuous
item_price Numeric Continuous
item_parent_modified_timestamp Numeric Continuous
option_modified_timestamp Numeric Continuous
created_timestamp Numeric Continuous
total Numeric Continuous
category_id String Nominal
main_category_id String Nominal
display_nameofcategory String Nominal
Date[created_timestamp] Numeric Continuous
Time of Day[created_timestamp] Numeric Continuous
Date[item_parent_modified_timestamp] Numeric Continuous
Time of Day[item_parent_modified_timestamp] Numeric Continuous
Date[option_modified_timestamp] Numeric Continuous
Time of Day[option_modified_timestamp] Numeric Continuous
food_prep_duration Numeric Continuous

Summary

NOTE: Due to the sensitive nature of the data, we will summarize our data cleaning process so as to uphold the NDA. A separate document with more details will be submitted on eLearn.

Item Table

Plu are supposed to be unique but duplicate plu was found in the item table. The team clarify this with the sponsor and understand that this happen because the store side added in the plu without checking that the plu already exist in the system. As there are only 8 of such instance, the sponsor said that the team could manually edit the plu to some other value.

Order Table

In the Order table, we discovered only 1 outlier in the created_timestamp column where the data is from the year 2019. Since this is obviously a data entry error, we removed it from our table. There were 2 other outliers from the food_preparation_duration column where the time taken to prepare the food took more than 600 minutes. In the business context of our dataset, that is not possible. Hence, these rows are removed from the table.

In the total column, there were 2 outliers with the values 500. To verify if this is valid, the team cross-referenced the data to order_item_parent table using the order_id variable. The data rows are missing from the latter table and hence, deeming these data rows to be invalid. These rows could be created as testing data.

Order_item_parent Table

In the order_item_parent table, there were 2 outliers from the quantity column with values 15 and 20. Upon further investigation by checking the item_name, the team decided that this is relevant and should be retained in the data due to the nature of the item that is purchased.

There were 1465 missing PLU codes in this table. After discussing with the sponsor, the team understands that some of the data rows with missing PLU codes are open items that are created on an ad-hoc basis. To verify if all of these data were open items, the team cross-referenced these data rows with their item_id and item_name. Open items should not have item_id as well. However, the team found that only 125 of these selected data are open items. The other data rows with item_ids are cross-referenced and updated with the other present data in the table by matching their item_ids.

Order_item_parent_option Table

In the last table, the team removed 5 rows with null item_id as upon discussion with the sponsor, the team decided that these rows are data entry errors and are anomalies. There were also 213 missing PLU values. Again, to verify whether these data rows are still relevant, the team cross-referenced the other variables. We find that the item_ids are not missing and hence update theses rows’ PLU codes by matching the item_ids with existing data rows in the table. After cleaning, there are no missing values for PLU column.