Automotive Angels - Discussion
Discussion
Technical Discussion Analysis
In general, we found that a single generalized model covering all makes and models is a poor predictor, as it produces a large RMSE. Modelling should therefore be done for each specific vehicle model. Within a model there may also be variants, such as the X5 and X5M, that differ significantly and should not be rolled up into one model. This poses a difficulty: modelling every vehicle model separately is costly, because there are already many models and more are released each year.
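The effect of pooling versus per-model fitting can be illustrated with a minimal sketch. It is not the SAS Enterprise Miner workflow used in the project: the listings, the single mileage predictor and the helper names `fit_line` and `rmse` are all assumptions made up for illustration.

```python
from collections import defaultdict
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(xs, ys, intercept, slope):
    """Root mean squared error of the fitted line on (xs, ys)."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(squared) / len(squared))

# Hypothetical listings: (vehicle model, mileage in thousands of miles, price in USD)
listings = [
    ("X5", 20, 40000), ("X5", 40, 35000), ("X5", 60, 30500), ("X5", 80, 25000),
    ("X5M", 20, 62000), ("X5M", 40, 55000), ("X5M", 60, 47500), ("X5M", 80, 41000),
]

# One pooled regression across both variants
all_x = [mileage for _, mileage, _ in listings]
all_y = [price for _, _, price in listings]
pooled_rmse = rmse(all_x, all_y, *fit_line(all_x, all_y))

# One regression per vehicle model
by_model = defaultdict(lambda: ([], []))
for model, mileage, price in listings:
    by_model[model][0].append(mileage)
    by_model[model][1].append(price)

per_model_rmse = {
    model: rmse(xs, ys, *fit_line(xs, ys)) for model, (xs, ys) in by_model.items()
}

print(f"pooled RMSE: {pooled_rmse:.0f}")
for model, value in per_model_rmse.items():
    print(f"{model} RMSE: {value:.0f}")
```

In this made-up data the two variants sit at different price levels, so the pooled line falls between them and its RMSE is far larger than either per-model RMSE.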
Ford
From the above results, it seems that we have a workable model. The fit is good, with R2 of about 0.85-0.9, and the errors are relatively low, with RMSE between 2000 and 3000. Our residual plots also show a good spread around 0. However, there is room for improvement for Ford Edge and Ford Explorer, which have some extreme outliers. There is also slight heteroscedasticity, which is most obvious for Ford Focus. Most of the residuals also seem to be negative, which means that the model overpredicts the actual value more often than it underpredicts it.
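The two residual observations, the sign of the residuals and the change in spread, can be checked numerically as well as visually. The sketch below assumes the convention residual = actual - predicted, so a negative residual means overprediction. The function name `residual_summary` and the prices are hypothetical, and comparing the spread of the lower and upper halves is only a crude stand-in for a formal heteroscedasticity test.

```python
from statistics import pstdev

def residual_summary(actual, predicted):
    """Return the share of negative residuals and the residual spread
    in the lower and upper halves of the predicted values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    share_negative = sum(r < 0 for r in residuals) / len(residuals)

    # Order residuals by predicted price, then compare the two halves
    ordered = [r for _, r in sorted(zip(predicted, residuals))]
    half = len(ordered) // 2
    spread_low = pstdev(ordered[:half])
    spread_high = pstdev(ordered[half:])
    return share_negative, spread_low, spread_high

# Hypothetical actual and predicted prices in USD
actual = [9800, 9900, 11700, 12300, 13000, 13500, 18500, 15800]
predicted = [10000, 10000, 12000, 12000, 14000, 14000, 17000, 17000]

share_negative, spread_low, spread_high = residual_summary(actual, predicted)
print(f"share of negative residuals: {share_negative:.2f}")
print(f"spread (low predictions):  {spread_low:.0f}")
print(f"spread (high predictions): {spread_high:.0f}")
```

A share well above 0.5 points to systematic overprediction, and a spread that grows with the predicted price is the pattern described as heteroscedasticity.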
Chevrolet
From the above results, the models show an adjusted R2 ranging from 0.79 to 0.88, which suggests a good fit for each specific vehicle model. The RMSE ranges from 2000 to 3000 and, given that the median price is within the band of 12000-16000 USD, an error of this size is considered acceptable.
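To put the RMSE in context, it can be expressed as a fraction of the median price. The bounds below are the figures quoted above. How the RMSE values pair with the median prices of individual vehicle models is not stated, so the two ratios are only the extremes of what is possible, not results from the analysis.

```python
def relative_rmse_bounds(rmse_low, rmse_high, median_low, median_high):
    """Smallest and largest possible RMSE-to-median-price ratio."""
    best_case = rmse_low / median_high    # smallest error, highest median
    worst_case = rmse_high / median_low   # largest error, lowest median
    return best_case, worst_case

best_case, worst_case = relative_rmse_bounds(2000, 3000, 12000, 16000)
print(f"RMSE is between {best_case:.1%} and {worst_case:.1%} of the median price")
```

On these figures the typical error lies somewhere between 12.5% and 25% of the median price, which is the range a reader should have in mind when judging whether it is acceptable.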
BMW
For BMW, we see an RMSE ranging from 2000 to 3000. Considering that prices range from 25000 to 43000 USD, this is a small error margin for a predictive model. In addition, the adjusted R2 ranges from 0.85 to 0.89, which indicates a good fit. However, the residual plots show patterns that suggest biased results, especially for BMW X3 as shown in Figure 55. There is a very clear horizontal line, possibly because the current selling price crawled from the websites is an approximation set by dealers at a specific price.
Managerial communication on Analysis
Steve Greenfield, CEO of Automotive Ventures, commented: “This looks very impressive! Nice work, team!” David Chisnell, the resident Data Scientist for Automotive Ventures, mentioned: “On nearby and far dealers, especially the dealers beyond 50 miles and less than 100 miles, we had been assuming they were too far away to matter but may revise that based on your work and change the production algorithm that handles all makes and models. It's one of the things I’d like to hear more about from the team. You've also started a conversation [on] the business implications of using days to sell (meaning prices decline over time), and using number of price changes, (we read that as a suggestion they over-corrected, although it's also possible it's acting as a proxy for hidden information like cars with a bad car-fax or that smell like smoke take longer to sell and are more likely to be dropped in price). On Friday, I'd be interested if your team has thoughts on which factors are direct and which might be acting as a proxy. I also noticed there was a negative impact attached to historical suggested retail price and wanted to discuss how that was calculated. It's good work and helps make our business smarter!” Specific comments on the analysis are not included in this paper due to time conflicts and organizational changes in the company.
Recommendations for client
We recommend further exploration and validation before implementing the current workflow in Automotive Ventures' business model. With the current work, the process steps and regression calculations will be exported out of SAS Enterprise Miner into a format that the programmers can implement in Python or Ruby. Existing data filters will reduce the computational power required to compute the optimal regression model for each vehicle model and predict a price range. However, extensive data cleaning is required for other vehicle makes. This poses an issue, as data crawled daily may not be clean enough to use for analysis immediately. It will affect the company's selling proposition, because the data used will not be real time and will need time for processing.
Conclusion
Lessons learnt
During data preparation, one of our key takeaways was that although parametric tests usually rest on the assumptions of normality, homogeneity of variances, linearity and independence, not all variables require transformation. The data points themselves do not have to be normal or be transformed to normality. One example is the log10 transformations that we made, which were not required.
Another lesson that we have learnt is that business objectives have strong implications for the predictors that we use. For instance, from our primary data, we discovered that customers do not travel more than 100 miles to search for better prices, and we can incorporate this information in our analysis.
Limitations
Actual transaction data
Having actual transaction data would improve the accuracy of our data. However, actual transaction data is limited in the used car market. Hence the current selling price obtained from dealership websites, which is updated once each day, is used as a proxy for the actual selling price. There are also other factors, such as insurance and personal credit rating, that might affect the actual selling price of a used car, yet there is no consolidated data market for such transactions.
Observing effects of seasonality
In the primary data collection stage, the dealers expressed that seasonality is an important factor when setting prices for used cars. However, we were unable to observe the effects of seasonality, as the dataset does not cover a full year. The dataset stored in AWS S3 is not complete. Our client mentioned that, because the dataset kept changing, the stable and reliable portion only starts from September. Since we do not have a full yearly cycle of data, we are unable to analyze the effects of seasonality on pricing.
Sample size of each vehicle model
Some vehicle models are quite rare, so the sample size available for analysis is small. For these vehicle models, the fitted model can be highly inaccurate.
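One simple guard is to count listings per vehicle model and only fit a regression where enough data exists. The sketch below is an illustration only: the threshold of 30 listings, the function name `split_by_sample_size` and the listing counts are assumptions, not values taken from the project.

```python
from collections import Counter

MIN_LISTINGS = 30  # hypothetical threshold, not taken from the project

def split_by_sample_size(models, min_listings=MIN_LISTINGS):
    """Split vehicle models into those with enough listings and those without."""
    counts = Counter(models)
    enough = sorted(m for m, n in counts.items() if n >= min_listings)
    too_few = sorted(m for m, n in counts.items() if n < min_listings)
    return enough, too_few

# Hypothetical listing counts: one list entry per crawled listing
listed_models = ["Focus"] * 120 + ["Explorer"] * 45 + ["GT"] * 4

enough, too_few = split_by_sample_size(listed_models)
print("fit a model for:", enough)
print("too few listings:", too_few)
```

Vehicle models in the second group could be reported without a predicted price range, or handled separately, rather than receiving an unreliable estimate.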
Crawled dataset seems to be biased
From our initial data exploration stage, we see that a higher proportion of the data was crawled from the east of the USA than from the west. Used car volume in states with high populations, such as California, was unexpectedly sparse.
Future Explorations
We recommend that the company explore creating other predictors that could further improve prediction of the price range. The Huff model offers a way to calculate how strongly different dealerships attract consumers at each location. The sales potential of each dealership can be explored further by using its total listings in the data as a proxy for its pull. We can also investigate whether customers purchasing a vehicle of a specific make go to franchise dealerships to look for that make. In addition, we can investigate whether dealers whose inventory has short days on sale price higher or lower, and examine the relationship between days on sale and pricing. These investigations are not commonly done in the used car market and can bring valuable insights.
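In its common form, the Huff model gives the probability that a consumer at location i chooses dealership j as P_ij = (S_j^alpha / D_ij^beta) / sum over k of (S_k^alpha / D_ik^beta), where S is the attractiveness of a dealership and D is the distance to it. The sketch below uses total listings as S, as suggested above. The exponents alpha = 1 and beta = 2, the dealer names and all numbers are assumptions for illustration, and in practice the exponents would need to be calibrated on real data.

```python
def huff_probabilities(attractiveness, distances, alpha=1.0, beta=2.0):
    """Huff model: probability that a consumer at one location picks each dealer.

    attractiveness: dealer -> total listings (proxy for dealership pull)
    distances:      dealer -> distance in miles from the consumer (must be > 0)
    """
    utilities = {
        dealer: (attractiveness[dealer] ** alpha) / (distances[dealer] ** beta)
        for dealer in attractiveness
    }
    total = sum(utilities.values())
    return {dealer: utility / total for dealer, utility in utilities.items()}

# Hypothetical dealers seen from a single consumer location
listings = {"dealer_a": 200, "dealer_b": 50, "dealer_c": 400}
miles = {"dealer_a": 10, "dealer_b": 5, "dealer_c": 40}

probs = huff_probabilities(listings, miles)
for dealer, p in probs.items():
    print(f"{dealer}: {p:.3f}")
```

In this example the largest dealer is also the farthest away and receives the smallest share, which shows how the distance exponent can outweigh inventory size.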
Acknowledgements
Special thanks to Professor Kam Tin Siong and Prakash Chandra Sukhwal from Singapore Management University, and to Steve Greenfield and David Chisnell from Automotive Ventures.