Difference between revisions of "ANLY482 AY2017-18T2 Group32: Project Overview/Methodology"
Jump to navigation
Jump to search
Zycheng.2014 (talk | contribs) |
Zycheng.2014 (talk | contribs) |
||
Line 54: | Line 54: | ||
HCP | HCP | ||
# Removed dummy variables with suspicious looking values in the “postal code” column such as (000001,000002, etc) | # Removed dummy variables with suspicious looking values in the “postal code” column such as (000001,000002, etc) | ||
+ | [[File:Methedology 1.png | 400px | center]] | ||
+ | |||
HCO | HCO | ||
Line 70: | Line 72: | ||
* Click on “Replace All” to apply the changes | * Click on “Replace All” to apply the changes | ||
− | + | # For 5-digits values in “postal code” column, a new column is created to store the converted ‘postal code’ in 6-digits format, with data type changed from numerical to categorical and formula function used to add the missing ‘0’. | |
Revision as of 22:29, 25 February 2018
Return to ANLY482 AY2017-18 Home Page
Description | Data | Methodology |
Methodology
Before performing any data analysis, it is critical to perform exploratory data analysis to understand the data better. Errors such as incomplete and invalid values could lead to inaccurate results. Hence, the data must be cleaned so that it is suitable for any further analysis. The following summarizes the data cleaning process we have taken for each dataset:
HCP
- Removed dummy variables with suspicious looking values in the “postal code” column such as (000001,000002, etc)
HCO
- Removed dummy variables with suspicious looking values in the “postal code” column such as (000001,000002, etc) or Test Accounts.
Invoice Detail
- Filter transactions to obtain TCE products using TCE_Brands sheet
- For negative values in the “sales qty” and “amount$” columns, we kept the values as it is. This is because after checking with our sponsor, these records are products that have been returned back to them, so they served as sales that have been void. Upon aggregation by quarters, no negative values will be present.
- For the missing values in “price$” columns, we imputed their values to “0” as these rows contain records where free samples are given out to the customers.
- Using Ctrl-F function to show “search data tables” interface, then enter the following fields:
- Find what: “.”
- Replace with: “0”
- Tick “Match entire cell value”
- Tick “Restrict to selected column” of “Price$”
- Tick “Search data”
- Click on “Replace All” to apply the changes
- For 5-digits values in “postal code” column, a new column is created to store the converted ‘postal code’ in 6-digits format, with data type changed from numerical to categorical and formula function used to add the missing ‘0’.