Difference between revisions of "Uncovering Market-Insights for Charles & Keith: Data Cleaning"

From Analytics Practicum
Jump to navigation Jump to search
 
(4 intermediate revisions by the same user not shown)
Line 29: Line 29:
 
{| style="background-color:#ffffff ; margin: 3px 11px 3px 11px;" width="80%"|
 
{| style="background-color:#ffffff ; margin: 3px 11px 3px 11px;" width="80%"|
 
| style="font-family:Trebuchet MS; font-size:11px; text-align: center; border:solid 1px #f5f5f5; background-color: #f5f5f5" width="200px" |   
 
| style="font-family:Trebuchet MS; font-size:11px; text-align: center; border:solid 1px #f5f5f5; background-color: #f5f5f5" width="200px" |   
[[Uncovering Market-Insights for Charles & Keith: Data Preparation|<font color="#3c3c3c"><strong>INTIAL DATASET </strong></font>]]
+
[[Uncovering Market-Insights for Charles & Keith: Data Preparation|<font color="#3c3c3c"><strong>INITIAL DATASET </strong></font>]]
  
 
| style="font-family:Trebuchet MS; font-size:11px; text-align: center; border-top:solid #f5f5f5; background-color: #fff" width="200px" |  
 
| style="font-family:Trebuchet MS; font-size:11px; text-align: center; border-top:solid #f5f5f5; background-color: #fff" width="200px" |  
Line 54: Line 54:
 
[[File:Country.jpg|none| ]]
 
[[File:Country.jpg|none| ]]
 
Since the dataset is the record from China, our group has decided to remove the attribute “Country”.
 
Since the dataset is the record from China, our group has decided to remove the attribute “Country”.
 +
  
 
<b>TransactionId</b>
 
<b>TransactionId</b>
 +
 
[[File:TID.png|none| ]]
 
[[File:TID.png|none| ]]
 
For attribute “TransactionId”, since it is a identifier, our group changed the data type from Numeric to Character.
 
For attribute “TransactionId”, since it is a identifier, our group changed the data type from Numeric to Character.
 +
  
 
<b>Date</b>
 
<b>Date</b>
  
 +
[[File:Date.png|none| ]]
 
To prepare the data correctly, the Date attribute has to be changed. Using JMP Pro, we changed the setting of Date, Data Type from “Character” to “Numeric”. We also set the date format to “m/d/y”.
 
To prepare the data correctly, the Date attribute has to be changed. Using JMP Pro, we changed the setting of Date, Data Type from “Character” to “Numeric”. We also set the date format to “m/d/y”.
  
[[File:Date.png|none| ]]
 
  
<b>Materials</b>
+
<b>Class</b>
 +
 
 +
For attribute “CLASS”, upon further investigation, our group realise that PASSPORT HOLDER and PP HOLDER are the same thing. The same could be said about SHOULDER and SHOULDER BAG as well as SLING and SLING BAG. Hence, We recoded this names, to ensure that our analysis will be accurate in the future.
 +
[[File:Class.png|none| ]]
 +
Hence, We recoded these names, to ensure that our analysis will be accurate in the future.
  
[[File:Materials.png|none| ]]
 
In the figure below, even though Ankle Boot and ANKLEBOOT are the same name, they are classified differently. This also applies to Ballerina and BALLERINA. Hence, our group has recoded attribute “Material” into a separate column named “Material 2” to ensure that materials of the same name are being grouped together
 
  
 
<b>Subclass</b>
 
<b>Subclass</b>
  
Besides Materials, similar recoding work was also done to attribute “Subclass”, “Class” and “Size”. For “Subclass”, the PF in PF COVERED, PF OPEN TOE and PF PEEP TOE are all referring to PLATFORM.  
+
Besides CLASS, similar recoding work was also done to attribute “Subclass”, “Size” and “Material”. For “Subclass”, the PF in PF COVERED, PF OPEN TOE and PF PEEP TOE are all referring to PLATFORM. Hence, we replaced all “PF” to “PLATFORM”.
 
[[File:Subclass.png|none| ]]
 
[[File:Subclass.png|none| ]]
 
Hence, We replaced all PF to PLATFORM.  
 
Hence, We replaced all PF to PLATFORM.  
 
<b>Class</b>
 
 
For attribute “CLASS”, upon further investigation, our group realise that PASSPORT HOLDER and PP HOLDER are the same thing. The same could be said about SHOULDER and SHOULDER BAG as well as SLING and SLING BAG.
 
[[File:Class.png|none| ]]
 
Hence, We recoded this names, to ensure that our analysis will be accurate in the future.
 
  
 
<b>Size</b>
 
<b>Size</b>
  
For attribute “SIZE”, all the numerical values belong to Shoe sizes, while the rest refers to accessory sizes such as Necklace, Bags and Wallets etc. To prevent any confusion, our group recoded the size from “340” to “34” for all shoes sizes.  
+
For attribute “SIZE”, all the numerical values belong to Shoe sizes, while the rest refers to accessory sizes such as Necklace, Bags and Wallets etc. To prevent any confusion, our group re-coded the size from “340” to “34” for all shoes sizes.  
 
[[File:Size.png|none| ]]
 
[[File:Size.png|none| ]]
 +
 +
<b>Materials</b>
 +
[[File:Materials.png|none| ]]
 +
In the figure below, even though Ankle Boot and ANKLEBOOT are the same name, they are classified differently. This also applies to Ballerina and BALLERINA. Hence, our group has re-coded attribute “Material” into a separate column named “Material 2” to ensure that materials of the same name are being grouped together. For example, by grouping the name respectively, ankle boots will have a total of 30859 rows which is a combination of Ankle Boot and ANKLEBOOT. The same has been done to Subclass and Class attributes.
  
 
<b> TransactionStoreID</b>
 
<b> TransactionStoreID</b>
 
To ensure that our market basket analysis to be accurate for the next phase of our practicum, our group has created a new variable name that is unique. The attribute “TransactionId” is not a unique identifier for each row of data because different StorName of different Region could have used the same TransactionID. Hence, our group has concatenated the StoreName and TransactionId to create a unique identifier.
 
 
[[File:TSID.jpg|none| ]]
 
[[File:TSID.jpg|none| ]]
 +
To ensure that our market basket analysis to be accurate for the next phase of our practicum, our group has created a new variable name that is unique. The attribute “TransactionId” is not a unique identifier for each row of data because different Store Name of different Region could have used the same TransactionID. Hence, our group has concatenated the StoreName and TransactionId to create a unique identifier.
 
</div>
 
</div>
  
  
 
<div style="border-style: solid solid none; border-color: #35383c; border-width: 1px 1px; padding: 5px; font-size: 120%; font-weight: bold; background-color: #{{LibreOfficeColor2}}; color: #{{LibreOfficeColor3}}; border-radius: 3px 3px 0 0;">PHASE 2</div>
 
<div style="border-style: solid solid none; border-color: #35383c; border-width: 1px 1px; padding: 5px; font-size: 120%; font-weight: bold; background-color: #{{LibreOfficeColor2}}; color: #{{LibreOfficeColor3}}; border-radius: 3px 3px 0 0;">PHASE 2</div>
 +
<div style="border: 1px solid #35383c; padding: 15px 15px 20px; border-radius: 0 0 3px 3px;">
 +
<b>Merging of Area Groups</b>
 +
[[File:MergeAreaGroups.png|none]]
 +
During Phase 2 of Exploratory Data Analysis, a new grouping of store, Area Group, was given. To combine this with our existing dataset, we use the join function in JMP and match the store name from the area group to our existing store name in the dataset. Looking at the data below, we can observed that different stores of different city region can be considered as the same area group. Hence, giving us another dimension of analysis.
 +
 +
<b>Sales Quantity vs Ordered Quantity</b>
 +
[[File:AYEOrderQ.png|none]]
 +
The order quantity per article has been provided as well. We have included it with the comparison of the sales quantity to see the conversion rate.
 +
</div>

Latest revision as of 20:48, 17 April 2016

DATAPREP.jpg
HOME   OVERVIEW   DATA PREPARATION   ANALYSIS   PROJECT MANAGEMENT   DOCUMENTATION


INITIAL DATASET

DATA CLEANING

FINAL DATASET

AYE InitialDataset


PHASE 1

Invalid Transactions

InvalidTransaction.png

For attribute “SaleQty”, it contains negative value from 0 to -25. Upon confirming with our sponsors, these sales quantity are actually recorded wrongly and sale have taken place. Hence, our group removed all the sales quantity that are less than 1.


Country

Country.jpg

Since the dataset is the record from China, our group has decided to remove the attribute “Country”.


TransactionId

TID.png

For attribute “TransactionId”, since it is a identifier, our group changed the data type from Numeric to Character.


Date

Date.png

To prepare the data correctly, the Date attribute has to be changed. Using JMP Pro, we changed the setting of Date, Data Type from “Character” to “Numeric”. We also set the date format to “m/d/y”.


Class

For attribute “CLASS”, upon further investigation, our group realise that PASSPORT HOLDER and PP HOLDER are the same thing. The same could be said about SHOULDER and SHOULDER BAG as well as SLING and SLING BAG. Hence, We recoded this names, to ensure that our analysis will be accurate in the future.

Class.png

Hence, We recoded these names, to ensure that our analysis will be accurate in the future.


Subclass

Besides CLASS, similar recoding work was also done to attribute “Subclass”, “Size” and “Material”. For “Subclass”, the PF in PF COVERED, PF OPEN TOE and PF PEEP TOE are all referring to PLATFORM. Hence, we replaced all “PF” to “PLATFORM”.

Subclass.png

Hence, We replaced all PF to PLATFORM.

Size

For attribute “SIZE”, all the numerical values belong to Shoe sizes, while the rest refers to accessory sizes such as Necklace, Bags and Wallets etc. To prevent any confusion, our group re-coded the size from “340” to “34” for all shoes sizes.

Size.png

Materials

Materials.png

In the figure below, even though Ankle Boot and ANKLEBOOT are the same name, they are classified differently. This also applies to Ballerina and BALLERINA. Hence, our group has re-coded attribute “Material” into a separate column named “Material 2” to ensure that materials of the same name are being grouped together. For example, by grouping the name respectively, ankle boots will have a total of 30859 rows which is a combination of Ankle Boot and ANKLEBOOT. The same has been done to Subclass and Class attributes.

TransactionStoreID

TSID.jpg

To ensure that our market basket analysis to be accurate for the next phase of our practicum, our group has created a new variable name that is unique. The attribute “TransactionId” is not a unique identifier for each row of data because different Store Name of different Region could have used the same TransactionID. Hence, our group has concatenated the StoreName and TransactionId to create a unique identifier.


PHASE 2

Merging of Area Groups

MergeAreaGroups.png

During Phase 2 of Exploratory Data Analysis, a new grouping of store, Area Group, was given. To combine this with our existing dataset, we use the join function in JMP and match the store name from the area group to our existing store name in the dataset. Looking at the data below, we can observed that different stores of different city region can be considered as the same area group. Hence, giving us another dimension of analysis.

Sales Quantity vs Ordered Quantity

AYEOrderQ.png

The order quantity per article has been provided as well. We have included it with the comparison of the sales quantity to see the conversion rate.