Difference between revisions of "ANLY482 AY2016-17 T2 Group20 Findings"
Revision as of 23:36, 19 February 2017
Working documents
- Example of analysis: File:RegressionBMW.xlsx
- Example of R code: File:RegressionBMW.txt
Data Exploration and Cleaning
Issues
Dataset too large
The full dataset was too large to process efficiently in SAS Enterprise Miner: a single node took approximately 6 hours to run.
We reduced the dataset to used BMW cars by filtering on “makeName = BMW” and “used = 1, 2”. We then categorized the BMW cars by ‘modelName’; the cleaning and categorization of ‘modelName’ are explained below. Throughout the project, we build the model on BMW cars while keeping it general, so that the same model can serve as a template for other car brands by changing the ‘makeName’ filter.
Restricting the analysis to BMW cars reduces the load on the system.
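The filtering step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the sample rows and the helper name `filter_used_cars` are hypothetical, and only the column names `makeName`, `modelName` and `used` and the filter values come from the text.

```python
# Hypothetical sample rows; only the column names come from the text.
rows = [
    {"makeName": "BMW", "modelName": "325i", "used": 1},
    {"makeName": "BMW", "modelName": "X5", "used": 0},
    {"makeName": "Audi", "modelName": "A4", "used": 1},
    {"makeName": "BMW", "modelName": "3 Series", "used": 2},
]


def filter_used_cars(rows, make_name, used_codes=(1, 2)):
    """Keep rows of one brand whose 'used' code is in used_codes."""
    return [
        r for r in rows
        if r["makeName"] == make_name and r["used"] in used_codes
    ]


# Same filter as in the text: makeName = BMW and used = 1, 2.
bmw_used = filter_used_cars(rows, "BMW")
print([r["modelName"] for r in bmw_used])
```

Passing a different `make_name` reuses the same filter for another brand, which is the template idea described above.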
Customers typically already have a preferred car brand, based on their budget and personal preference. With the general model, dealers can select the customer's preferred brand (“makeName”), which then leads them to the specific model (“modelName”) of the car.
Unclean variable – modelName
The summary statistics of modelName show that several categories belong in another category. For example, in the screenshot above, many 3 Series trims such as ‘318’, ‘323’ and ‘325i’ appear as separate categories; these belong in the ‘3 Series’ category and should be relabeled to ‘3 Series’.
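A relabeling of this kind could look like the sketch below. The matching rule (a 3 followed by two digits and optional letters) is an assumption made for illustration; it covers the three trims named above, but the source does not state the rule the team actually used. The function name `clean_model_name` is likewise hypothetical.

```python
import re

# Assumed rule, not from the source: "3" + two digits + optional letters
# (e.g. 318, 323, 325i) is treated as a 3 Series trim.
TRIM_3_SERIES = re.compile(r"^3\d{2}[A-Za-z]*$")


def clean_model_name(model_name):
    """Map 3 Series trim labels to '3 Series'; leave other names unchanged."""
    name = model_name.strip()
    if TRIM_3_SERIES.match(name):
        return "3 Series"
    return name


raw = ["318", "323", "325i", "3 Series", "X5"]
cleaned = [clean_model_name(m) for m in raw]
print(cleaned)
```

Other series would need their own rules, and any pattern like this should be checked against the full list of modelName values before it is applied.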
Missing Values
Column | Number Missing (before) | Percentage Missing (before) | Number Missing (after) | Percentage Missing (after) |
---|---|---|---|---|
year | 0 | 0.00% | 0 | 0.00% |
minDate | 0 | 0.00% | 0 | 0.00% |
maxDate | 0 | 0.00% | 0 | 0.00% |
daysCount | 0 | 0.00% | 0 | 0.00% |
currentMsrp | 74357 | 53.31% | 14127 | 26.69% |
currentPriceOther | 86550 | 62.05% | 0 | 0.00% |
minMsrp | 65831 | 47.20% | 9232 | 17.44% |
maxMsrp | 66243 | 47.49% | 9312 | 17.59% |
msrpCount | 0 | 0.00% | 0 | 0.00% |
minPriceOther | 78926 | 56.58% | 154 | 0.29% |
maxPriceOther | 78937 | 56.59% | 164 | 0.31% |
priceOtherCount | 0 | 0.00% | 0 | 0.00% |
sold | 0 | 0.00% | 0 | 0.00% |
used | 0 | 0.00% | 0 | 0.00% |
mileage | 49438 | 35.44% | 1924 | 3.63% |
Sq Root[mileage] | 49438 | 35.44% | 1924 | 3.63% |
Log[mileage] | 54259 | 38.90% | 4482 | 8.47% |
Log10[mileage] | 54259 | 38.90% | 4482 | 8.47% |
Cube Root[mileage] | 49438 | 35.44% | 1924 | 3.63% |
vehicleAge | 0 | 0.00% | 0 | 0.00% |
Square Root[vehicleAge] | 9 | 0.01% | 3 | 0.01% |
Log10[vehicleAge] | 4301 | 3.08% | 2705 | 5.11% |
Log[vehicleAge] | 4301 | 3.08% | 2705 | 5.11% |
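The counts and percentages in the table can be computed as sketched below. The sketch also illustrates one plausible reason, which the source does not state, why the log-transformed columns have more missing values than their base columns: the logarithm is undefined for zero or negative values, so such rows become missing after the transform, while a square root of zero stays valid. The sample values and the helper names are hypothetical.

```python
import math


def missing_summary(values):
    """Return (number missing, percentage missing) for a column."""
    n_missing = sum(1 for v in values if v is None)
    pct = 100.0 * n_missing / len(values) if values else 0.0
    return n_missing, round(pct, 2)


def safe_log(v):
    """Natural log; missing input or a non-positive value gives missing."""
    if v is None or v <= 0:
        return None
    return math.log(v)


def safe_sqrt(v):
    """Square root; missing input or a negative value gives missing."""
    if v is None or v < 0:
        return None
    return math.sqrt(v)


# Hypothetical mileage column: two missing values and two zeros.
mileage = [12000, None, 0, 45000, None, 0, 8000, 30000]

base = missing_summary(mileage)
sqrt_col = missing_summary([safe_sqrt(v) for v in mileage])
log_col = missing_summary([safe_log(v) for v in mileage])
print(base, sqrt_col, log_col)
```

In this sample the square-root column keeps the same missing count as the base column, while the log column gains the two zero-mileage rows, which is the same pattern seen for mileage in the table.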