Difference between revisions of "ANLY482 AY2016-17 T2 Group20 Findings"

Revision as of 23:36, 19 February 2017

Working documents

Example of analysis: File:RegressionBMW.xlsx

Example of R code: File:RegressionBMW.txt

Data Exploration and Cleaning

Issues

Too large dataset

The dataset was so large that running nodes in SAS Enterprise Miner was impractically slow: a single node took approximately 6 hours to run.

We reduced the dataset to used BMW cars by filtering on “makeName = BMW” and “used = 1,2”. We then categorized the BMW cars by ‘modelName’; the cleaning and categorization of ‘modelName’ is explained below. Throughout the project, we use BMW cars to build the model while keeping it generalised, so that the same model can later serve as a template for other car brands. Switching brands only requires changing the filter on the ‘makeName’ column, which identifies the brand.

Restricting the analysis to BMW cars reduces the load on the system.

Customers typically already have a preferred car brand, based on their buying capacity and personal preference. With the general model, dealers can select the customer’s preferred brand (“makeName”), which then leads them to the specifications (“modelName”) of the car.
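The brand filter described above can be sketched in a few lines of Python. This is only an illustration, not the SAS Enterprise Miner workflow we used: the sample rows and the function name filter_used_by_make are hypothetical, and we assume that “used = 1,2” means the ‘used’ column holds the code 1 or 2 for used cars. Only the column names come from the dataset.

```python
# Hypothetical sample rows; only the column names (makeName, modelName, used)
# come from the dataset described in the text.
rows = [
    {"makeName": "BMW", "modelName": "325i", "used": 1},
    {"makeName": "BMW", "modelName": "X5", "used": 0},
    {"makeName": "Audi", "modelName": "A4", "used": 2},
    {"makeName": "BMW", "modelName": "3 Series", "used": 2},
]


def filter_used_by_make(rows, make, used_codes=(1, 2)):
    """Keep rows of one brand whose 'used' code is in used_codes.

    Assumption: codes 1 and 2 both mark a car as used.
    """
    return [r for r in rows if r["makeName"] == make and r["used"] in used_codes]


# Build the BMW subset; passing another brand reuses the same template.
bmw_used = filter_used_by_make(rows, "BMW")
audi_used = filter_used_by_make(rows, "Audi")
```

Because the brand is a parameter, the same function can be pointed at any other ‘makeName’ value, which mirrors the idea of using the BMW model as a template for other brands.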

Unclean variable – modelName

The summary statistics of modelName show that some categories belong in another. For example, in the screenshot above, there are many 3 Series trims such as ‘318’, ‘323’ and ‘325i’, which belong in the ‘3 Series’ category and should be relabeled ‘3 Series’.
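A minimal Python sketch of this relabeling is shown here. It uses an explicit lookup table containing only the three trims named above; the table name TRIM_TO_SERIES, the function relabel_model and the sample values are hypothetical, and a full cleaning pass would need an entry for every trim found in the data.

```python
from collections import Counter

# Lookup table with only the trims named in the text; extend as needed.
TRIM_TO_SERIES = {
    "318": "3 Series",
    "323": "3 Series",
    "325i": "3 Series",
}


def relabel_model(model_name):
    """Map a trim label to its series; leave unknown labels unchanged."""
    name = model_name.strip()
    return TRIM_TO_SERIES.get(name, name)


# Hypothetical modelName values, for illustration only.
model_names = ["318", "3 Series", "325i", "X5", "323"]
relabeled = [relabel_model(m) for m in model_names]

# Category counts after relabeling, comparable to the summary statistics.
counts = Counter(relabeled)
```

An explicit table is slower to build than a pattern-based rule, but every relabeling decision stays visible and can be reviewed.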

Missing Values

Column	Number Missing (before)	Percentage Missing (before)	Number Missing (after)	Percentage Missing (after)
year	0	0.00%	0	0.00%
minDate	0	0.00%	0	0.00%
maxDate	0	0.00%	0	0.00%
daysCount	0	0.00%	0	0.00%
currentMsrp	74357	53.31%	14127	26.69%
currentPriceOther	86550	62.05%	0	0.00%
minMsrp	65831	47.20%	9232	17.44%
maxMsrp	66243	47.49%	9312	17.59%
msrpCount	0	0.00%	0	0.00%
minPriceOther	78926	56.58%	154	0.29%
maxPriceOther	78937	56.59%	164	0.31%
priceOtherCount	0	0.00%	0	0.00%
sold	0	0.00%	0	0.00%
used	0	0.00%	0	0.00%
mileage	49438	35.44%	1924	3.63%
Sq Root[mileage]	49438	35.44%	1924	3.63%
Log[mileage]	54259	38.90%	4482	8.47%
Log10[mileage]	54259	38.90%	4482	8.47%
Cube Root[mileage]	49438	35.44%	1924	3.63%
vehicleAge	0	0.00%	0	0.00%
Square Root[vehicleAge]	9	0.01%	3	0.01%
Log10[vehicleAge]	4301	3.08%	2705	5.11%
Log[vehicleAge]	4301	3.08%	2705	5.11%
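The two quantities reported per column, the number missing and the percentage missing, can be computed as in the Python sketch below. It is an illustration under assumptions rather than the tool we used to produce the table: the function name missing_summary and the sample rows are hypothetical, and a value is treated as missing when it is None or an empty string.

```python
def missing_summary(rows, columns):
    """Return {column: (number_missing, percentage_missing)}.

    Assumption: None and the empty string count as missing; 0 does not.
    """
    total = len(rows)
    summary = {}
    for col in columns:
        n_missing = sum(1 for r in rows if r.get(col) in (None, ""))
        pct = 100.0 * n_missing / total if total else 0.0
        summary[col] = (n_missing, round(pct, 2))
    return summary


# Hypothetical rows, for illustration only.
rows = [
    {"mileage": 42000, "currentMsrp": None},
    {"mileage": None, "currentMsrp": None},
    {"mileage": 15500, "currentMsrp": 38900},
    {"mileage": 0, "currentMsrp": ""},
]

summary = missing_summary(rows, ["mileage", "currentMsrp"])
```

Running the same function on the dataset before and after a cleaning step gives the paired before and after columns. Note that the percentage depends on the row count at each stage, so a column can have fewer missing values yet a higher percentage once the dataset shrinks.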

@@ Line 78: / Line 78: @@
 | msrpCount ||  0 || 0.00% || 0 || 0.00%
 |-
-| Example || Example || Example || Example || Example
+| minPriceOther || 78926 || 56.58% || 154 || 0.29%
 |-
-| Example || Example || Example || Example || Example
+| maxPriceOther || 78937 || 56.59% || 164 || 0.31%
 |-
-| Example || Example || Example || Example || Example
+| priceOtherCount ||  0 || 0.00% || 0 || 0.00%
 |-
-| Example || Example || Example || Example || Example
+| sold ||  0 || 0.00% || 0 || 0.00%
 |-
-| Example || Example || Example || Example || Example
+| used ||  0 || 0.00% || 0 || 0.00%
 |-
-| Example || Example || Example || Example || Example
+| mileage || 49438 || 35.44% || 1924 || 3.63%
 |-
-| Example || Example || Example || Example || Example
+| Sq Root[mileage] || 49438 || 35.44% || 1924 || 3.63%
 |-
-| Example || Example || Example || Example || Example
+| Log[mileage] || 54259 || 38.90% || 4482 || 8.47%
 |-
-| Example || Example || Example || Example || Example
+| Log10[mileage] ||  54259 || 38.90% || 4482 || 8.47%
 |-
-| Example || Example || Example || Example || Example
+| Cube Root[mileage] || 49438 || 35.44% || 1924 || 3.63%
 |-
-| Example || Example || Example || Example || Example
+| vehicleAge ||  0 || 0.00% || 0 || 0.00%
 |-
-| Example || Example || Example || Example || Example
+| Square Root[vehicleAge] || 9 || 0.01% || 3 || 0.01%
 |-
-| Example || Example || Example || Example || Example
+| Log10[vehicleAge] || 4301 || 3.08% || 2705 || 5.11%
 |-
-| Example || Example || Example || Example || Example
+| Log[vehicleAge] || 4301 || 3.08% || 2705 || 5.11%
 |-
-| Example || Example || Example || Example || Example
-|-
-| Example || Example || Example || Example || Example
-|-
-| Example || Example || Example || Example || Example
 |}
