ANLY482 AY2016-17 T1 Group5 - Data Manipulation
Data Cleaning | Data Exploration | Storyboard | Data Manipulation | Dashboard | Insights |
Contents
Sales Performance Analysis
The final data use for Sala Analysis consist of the following fields:
Variable | Data Format | Variable Type |
---|---|---|
date | Numeric | Continuous |
close | Numeric | Continuous |
The date will be formatted using Javascript to extract the month and day for use in the dashboard. Close refers to the total sales amount generated for the day and Excel pivot table is used to aggregate this value.
Association Analysis
The final data we use for Association Analysis consists of the following fields:
Variable | Data Format | Variable Type |
---|---|---|
categories | String | Nominal |
size | Numeric | Continuous |
date | Numeric | Continuous |
The categories variable refers to the combination of categories ordered e.g 31-32-end and size refers to the number of orders containing this set of combination. Using JMP Pro, we first perform a simple distribution analysis of order_id as the y-axis and grouping them by the category_id. This shows us the category_id for each order_id. Using JMP Pro, we can right click the results of the distribution and convert it into the data table. We then replace the count with the category_id itself as we are only interested if the category_id is present or not. Next, we remove all rows with only a single category_id as there is no association analysis for those orders.
This will give us the following outcome:
We then use Microsoft Excel to perform a =CONCATENATE() of all category_id columns which produces the following final data:
Each categories variable has an additional “-end” due to a limitation of the library. Without the –end, the library is not able to create hierarchies of items with multiple children nodes.
Productivity Analysis
The final data use for Productivity Analysis consist of the following fields:
Variable | Data Format | Variable Type |
---|---|---|
categoryid | String | Nominal |
formattedDate | String | Nominal |
fulldate | Numeric | Continuous |
PrepTime | Numeric | Continuous |
Hour | Numeric | Continuous |
Using JMP Pro, we extracted date and time from created timestamp and modified timestamp. From the extracted Time variable, we extracted the hour to a new variable named Hour. Next, we get the day of the week from the date. To display the day in an orderly way in our dashboard, we created a formattedDate variable. The formattedDate variable value example is “1.Sun”. As this is a relatively larger dataset, we created this variable instead of using the code to do it as compared to the data for sales. The PrepTime variable is obtained by subtracting the time of the modified_timestamp with the time of the created_timestamp.
Popularity Item and Modifier Analysis
The final data we use for Association Analysis consists of the following fields:
Variable | Data Format | Variable Type |
---|---|---|
items | String | Nominal |
size | Numeric | Continuous |
date | Numeric | Continuous |
Similar to the Association Analysis set, the items refer to the combination of option_plus (i.e item modifier) tied to each item_parent and the size referring to the number of orders containing that set of combination. Unlike the association analysis data, there are too many item_parent_plus to do a distribution analysis. Hence, we used Microsoft Excel and the =IF() function to iterate through the data. We first sort the data according to the order of: 1. Item_parent_id, 2. Item_parent_plu and 3. Option_plu. Next, using =IF(), we check if the next row of option_plu belongs to the first item_parent_id by matching the item_parent_id and item_parent_plus. This is method is repeated comparing the 1st row and the 3rd, 4th, 5th row and so on. Once there are no more option_plus being matched, the process stops and we have the following data:
Finally, like the data for Association Analysis, we use the =CONCATENATE() function and combine all option_plu columns together and added an “-end” to the end of each combination. This results in the following final dataset:
The first item in each items row is always an item_parent_plu value.