ANLY482 AY2016-17 T1 Group5 - Data Manipulation

From Analytics Practicum
Revision as of 22:30, 1 December 2016 by Shiqi.chen.2012 (talk | contribs) (→‎Popularity Item and Modifier Analysis)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

CafeAnalyticsLogo.png

CafeAnalytics Home button.png CafeAnalytics Aboutus button.png CafeAnalytics Projectoverview button.png CafeAnalyticsDataanalysis button active.png CafeAnalytics Projectmanagement button.png CafeAnalytics Documentation button.png
Data Cleaning Data Exploration Storyboard Data Manipulation Dashboard Insights


Sales Performance Analysis

The final data use for Sala Analysis consist of the following fields:

Variable Data Format Variable Type
date Numeric Continuous
close Numeric Continuous

The date will be formatted using Javascript to extract the month and day for use in the dashboard. Close refers to the total sales amount generated for the day and Excel pivot table is used to aggregate this value.

Association Analysis

The final data we use for Association Analysis consists of the following fields:

Variable Data Format Variable Type
categories String Nominal
size Numeric Continuous
date Numeric Continuous

The categories variable refers to the combination of categories ordered e.g 31-32-end and size refers to the number of orders containing this set of combination. Using JMP Pro, we first perform a simple distribution analysis of order_id as the y-axis and grouping them by the category_id. This shows us the category_id for each order_id. Using JMP Pro, we can right click the results of the distribution and convert it into the data table. We then replace the count with the category_id itself as we are only interested if the category_id is present or not. Next, we remove all rows with only a single category_id as there is no association analysis for those orders.

This will give us the following outcome:

AnalyticsCafe AA1.jpg


We then use Microsoft Excel to perform a =CONCATENATE() of all category_id columns which produces the following final data:

AnalyticsCafe AA2.jpg


Each categories variable has an additional “-end” due to a limitation of the library. Without the –end, the library is not able to create hierarchies of items with multiple children nodes.

Productivity Analysis

The final data use for Productivity Analysis consist of the following fields:

Variable Data Format Variable Type
categoryid String Nominal
formattedDate String Nominal
fulldate Numeric Continuous
PrepTime Numeric Continuous
Hour Numeric Continuous

Using JMP Pro, we extracted date and time from created timestamp and modified timestamp. From the extracted Time variable, we extracted the hour to a new variable named Hour. Next, we get the day of the week from the date. To display the day in an orderly way in our dashboard, we created a formattedDate variable. The formattedDate variable value example is “1.Sun”. As this is a relatively larger dataset, we created this variable instead of using the code to do it as compared to the data for sales. The PrepTime variable is obtained by subtracting the time of the modified_timestamp with the time of the created_timestamp.

Popularity Item and Modifier Analysis

The final data we use for Association Analysis consists of the following fields:

Variable Data Format Variable Type
items String Nominal
size Numeric Continuous
date Numeric Continuous

Similar to the Association Analysis set, the items refer to the combination of option_plus (i.e item modifier) tied to each item_parent and the size referring to the number of orders containing that set of combination. Unlike the association analysis data, there are too many item_parent_plus to do a distribution analysis. Hence, we used Microsoft Excel and the =IF() function to iterate through the data. We first sort the data according to the order of: 1. Item_parent_id, 2. Item_parent_plu and 3. Option_plu. Next, using =IF(), we check if the next row of option_plu belongs to the first item_parent_id by matching the item_parent_id and item_parent_plus. This is method is repeated comparing the 1st row and the 3rd, 4th, 5th row and so on. Once there are no more option_plus being matched, the process stops and we have the following data:

AnalyticsCafe PIMA1.jpg

Finally, like the data for Association Analysis, we use the =CONCATENATE() function and combine all option_plu columns together and added an “-end” to the end of each combination. This results in the following final dataset:

AnalyticsCafe PIMA2.jpg

The first item in each items row is always an item_parent_plu value.