Automotive Angels - Analysis

From Analytics Practicum
Jump to navigation Jump to search

HOME

 

PROJECT OVERVIEW

 

FINDINGS

 

PROJECT DOCUMENTATION

 

PROJECT MANAGEMENT

Analysis

Exploratory Data Analysis

For data exploration, we will be conducting exploratory data analysis. Exploratory data analysis includes the following steps:

  1. Analyzing the distribution of variables
  2. Treating outliers that were identified
  3. Compare models’ “Current Selling Price”
  4. Examine relationships between variables
  5. Develop hypotheses based on literature and hypothesis testing

Relationship between Current Selling Price and Current Suggested Retail Price

Before plotting a scatterplot, we removed the extreme outliers from “Suggested Retail Price” to get more detailed scatterplot. Extreme outliers are more than 3 times of the interquartile range past the tail quantile and 1- tail quantile where the tail quantile is defined as 0.1. From the Figures below, “Current Selling Price” is strongly correlated to “Current Suggested Retail Price” with a R2 of 0.888 (Figure 9) for BMW. This shows that “Current Suggested Retail Price” is a good predictor of “Current Selling Price” of Used Cars.

Figure 9: Scatterplot of Current Selling Price and Current Suggested Retail Price for BMW

Figure 9: Scatterplot of Current Selling Price and Current Suggested Retail Price for BMW

Relationship between Current Selling Price and other variables

Figure 10: Parameter estimates based on fit of least squares for BMW

Figure 10: Parameter estimates based on fit of least squares for BMW

From the results of the fit of least squares in Figure 10, we know that the “Maximum Historical Suggested Retail Price” have a high VIF value of close to 84 and “Maximum of Historical Selling Price” have a high VIF value of 82. As such, there is multicollinearity and we will take it out to reduce noise and redundant predictor variables.

Number of listings by dealer

We can see that most of the used cars listed from BMW (Figure 11) listings are mostly from the East of North America other than some areas of concentration in cities in the East. One reason might be that more data is collected from dealers situated in the East of North America.

Figure 11: Map of BMW Dealers’ locations by Number of Listings for BMW

Figure 11: Map of BMW Dealers’ locations by Number of Listings for BMW

Relationship between Current Selling Price and Corrected Vehicle Models

Figure 12: Box Plot of Current Selling Price by Corrected Vehicle Model for BMW

Figure 12: Box Plot of Current Selling Price by Corrected Vehicle Model for BMW

From Figure 12, we observed that there are models which have high median price as compared to other models such as i3 and i8. Also, we observed that models such as X5 and X5 M have significantly different prices as well and hence although both models are similar, it should not be merged as the same model.

Vehicle Models by Frequency and Current Selling Price

Figure 13: Table of Corrected Vehicle Models by Number of Rows and Mean of Current Selling Price for BMW

Figure 13: Table of Corrected Vehicle Models by Number of Rows and Mean of Current Selling Price for BMW

We decided to choose the top 4 models to focus on in terms of frequency. They are: 3 Series, 5 Series, X3 and X5. From Figure 11, these 3 car models have similar price range of between 26000-36000USD.

Application of Methodology

According to Figure 3: Process Flow Overview, below the detailed steps taken at each stage.

  1. Filter out “Car Make” = BMW from the full dataset.
  2. Filter out “Used Status” = 0 from the remaining dataset since we are only predicting at used cars.
  3. Filter out “Vehicle Category” = ‘Not Car’ from the remaining dataset since we are only predicting cars.
  4. Filter out extreme outliers (Interquartile Range Multiplier = 3, Tail Quantile = 0.1) for “Current Selling Price” and “Vehicle Mileage”.
  5. Filter out 99995 < “Current Selling Price” < 99999 since those are likely to be wrongly recorded, as seen in the exploratory data analysis.

After the data is prepared, we can pass it through the above process flow shown in Figure 49 in the appendix:

  1. Data Import – we imported the prepared data into the process flow.
  2. Data Partition – we would be partitioning it randomly at 60% training data and 40% testing data to help prevent overfitting compared to 70% training data and 30% testing data, as said previously.
  3. Filter out each “Vehicle Model” – We will be creating a regression equation customized to each vehicle model.
  4. Filter out extreme outliers of below the 10th percentile and above the 90th percentiles. This also filters out “Current Selling Price” that are close to 0 which is unlikely because used cars would not be listed for no value.
  5. Linear regressions using 3 methods – Stepwise, Backward and Forward. We also include an option to minimize the difference in errors between the prediction result from train and prediction result validation using cross validation.
  6. Model selection helps us to choose the best model for linear regression based on the least mean squared errors.


Results

Using the same methodology, we redid the same steps for other vehicle makes, Ford and Chevrolet, and the results are as follows.

Results Summary

Vehicle Model Median Min Max RMSE Train RMSE Test Adjusted R2 Chosen Regression Model
Ford Focus 11625 6600 17999 1248.25 1672.34 0.7824 Forward
Ford Explorer 25971 7995 35000 2179.76 3070.73 0.9431 Forward
Ford Fusion 14871 8750 20437 1204.1 1594.91 0.8595 Stepwise
Ford Edge 20996 11200 32880 2108.81 2500.24 0.8987 Stepwise
Chevrolet Equinox 16998 10996 23898 1199 2647.01 0.8335 Stepwise
Chevrolet Malibu 13998 8000 18878 892.84 1942.97 0.8771 Forward
Chevrolet Cruze 12599 9000 15998 844 2362.063 0.7876 Forward
BMW 3 Series 25993 18888 36840 2859.33 3292.121 0.8608 Forward
BMW 5 Series 34994 23981 47600 2941.032 3352.48 0.8733 Stepwise
BMW X3 35811 25982 45645 2612.88 2874.16 0.8856 Backward
BMW X5 42991 23995 55981 2108.81 2500.24 0.8556 Stepwise

Figure 14: Table of Summary of Results from Regressions

Vehicle Model Effects of Chosen Regression Model
Ford Focus Intercept, Current_Suggested_Retail_Price, Vehicle_Model_Age
Ford Explorer Intercept, Current_Suggested_Retail_Price, Number_of_Changes_in_Selling_Price, Number_of_Changes_in_Suggested_Retail_Price, Vehicle_Model_Age
Ford Fusion Intercept, Current_Suggested_Retail_Price
Ford Edge Intercept, Current_Suggested_Retail_Price, Dealer_State, Vehicle_Model_Age
Chevrolet Equinox Intercept, Current_Suggested_Retail_Price, Vehicle_Mileage, Vehicle_Model_Age
Chevrolet Malibu Intercept, Current_Suggested_Retail_Price, Number_of_Changes_in_Selling_Price, Number_of_Changes_in_Suggested_Retail_Price, Vehicle_Mileage, Vehicle_Model_Age
Chevrolet Cruze Intercept, Current_Suggested_Retail_Price, Vehicle_Model_Age
BMW 3 Series Intercept, Current_Suggested_Retail_Price, Vehicle_Mileage
BMW 5 Series Intercept, Current_Suggested_Retail_Price, Days_On_Sale, Dealer_City, Vehicle_Model_Age
BMW X3 Intercept, Current_Suggested_Retail_Price, Days_On_Sale, Dealer_City, Number_of_Changes_in_Selling_Price, Number_of_dealers_in_50_miles, Number_of_similar_cars_in_50_miles, Size_of_Dealership, Vehicle_Mileage, Vehicle_Model_Age
BMW X5 Intercept, Current_Suggested_Retail_Price, Dealer_City Dealer_State, Number_of_Changes_in_Selling_Price, Number_of_Changes_in_Suggested_Retail_Price, Vehicle_Model_Age

Figure 15: Table of Summary of Effects from Regressions

From our results, we observed the best performing model to be Ford Fusion with a RMSE of 1248 for Train and 1672 for Test. The difference of approximately 400 between train and test set shows good predictability of the model. The results also show that for some models, there are signs of overfitting, such as Ford Explorer and BMW X3 where we observe high adjusted R2 of 0.88 and above. Also, the variable selection for various models vary individually which suggested that a generalized model with common pool of predictors was not suitable. The results also show that the Current Suggested Retail price offers a good base for price prediction, and common predictors are Days on Sale, Vehicle Model Age and Vehicle Mileage. Effects of depreciation are observed where models incorporated Vehicle Model Age and Vehicle Mileage as predictors. Further explorations can be done on Days on Sale with Current Selling Price. Specifically, for BMW, we observe that the created variables such as Number of dealers in 50 miles and Number of Similar Cars in 50 miles are selected as predictors in BMW X3, which propose that effects of competition and geographical location affects pricing in certain models.