ANLY482 AY2016-17 T1 Group5 - Data Analysis

From Analytics Practicum
Revision as of 16:48, 16 October 2016 by Shiqi.chen.2012 (talk | contribs)
Jump to navigation Jump to search

CafeAnalyticsLogo.png

CafeAnalytics Home button.png CafeAnalytics Aboutus button.png CafeAnalytics Projectoverview button.png CafeAnalyticsDataanalysis button active.png CafeAnalytics Projectmanagement button.png CafeAnalytics Documentation button.png
Data Cleaning Data Exploration Dashboard


Summary

NOTE: Due to the sensitive nature of the data, we will summarize our data cleaning process so as to uphold the NDA. A separate document with more details will be submitted on eLearn.

Item Table

Plu are supposed to be unique but duplicate plu was found in the item table. The team clarify this with the sponsor and understand that this happen because the store side added in the plu without checking that the plu already exist in the system. As there are only 8 of such instance, the sponsor said that the team could manually edit the plu to some other value.

Order Table

In the Order table, we discovered only 1 outlier in the created_timestamp column where the data is from the year 2019. Since this is obviously a data entry error, we removed it from our table. There were 2 other outliers from the food_preparation_duration column where the time taken to prepare the food took more than 600 minutes. In the business context of our dataset, that is not possible. Hence, these rows are removed from the table.

In the total column, there were 2 outliers with the values 500. To verify if this is valid, the team cross-referenced the data to order_item_parent table using the order_id variable. The data rows are missing from the latter table and hence, deeming these data rows to be invalid. These rows could be created as testing data.

Order_item_parent Table

In the order_item_parent table, there were 2 outliers from the quantity column with values 15 and 20. Upon further investigation by checking the item_name, the team decided that this is relevant and should be retained in the data due to the nature of the item that is purchased.

There were 1465 missing PLU codes in this table. After discussing with the sponsor, the team understands that some of the data rows with missing PLU codes are open items that are created on an ad-hoc basis. To verify if all of these data were open items, the team cross-referenced these data rows with their item_id and item_name. Open items should not have item_id as well. However, the team found that only 125 of these selected data are open items. The other data rows with item_ids are cross-referenced and updated with the other present data in the table by matching their item_ids.

Order_item_parent_option Table

In the last table, the team removed 5 rows with null item_id as upon discussion with the sponsor, the team decided that these rows are data entry errors and are anomalies. There were also 213 missing PLU values. Again, to verify whether these data rows are still relevant, the team cross-referenced the other variables. We find that the item_ids are not missing and hence update theses rows’ PLU codes by matching the item_ids with existing data rows in the table. After cleaning, there are no missing values for PLU column.