Difference between revisions of "ANLY482 AY2016-17 T2 Group20 Project Overview"

From Analytics Practicum
Jump to navigation Jump to search
Line 59: Line 59:
 
# The main application uses EC2 instances to access data from the AWS RDS MySQL database.
 
# The main application uses EC2 instances to access data from the AWS RDS MySQL database.
  
* Putty and PowerBI is used to ensure data consistency and raise any failures of the application.  
+
Putty and PowerBI is used to ensure data consistency and raise any failures of the application. AWS S3 is used as a storage backup for all data in the servers and databases.
* AWS S3 is used as a backup for all data in the servers and databases.
 
 
 
  
 
== Data Provided ==
 
== Data Provided ==

Revision as of 22:36, 8 January 2017

HOME

 

PROJECT OVERVIEW

 

FINDINGS

 

PROJECT DOCUMENTATION

 

PROJECT MANAGEMENT

Motivation and Business Problem

Background Information

On average, car dealers spend between $4000 and $20,000 on SaaS products, and spend thousands to find out more about their competitors. The industry itself is highly competitive with franchise dealers (Legacy, Jim Ellis) dominating the market followed by smaller individual dealers.

Business Problem

The resale market makes up about 80% of the total automobile market1. The main problem we are addressing is that resale dealers do not know the best way to price their used cars. Even though other dealers’ pricing information can be found online, it is difficult for retailers to compare prices, mileage, make, customizations along with other factors to come up with the most competitive price. There is also a highly competitive market of resale dealers which shows the seriousness of this problem. We think that by using existing data, modelling and visualizing it can help these resale dealers make better pricing decisions.

Project Objectives

Deliverable
Easy to use web interface that automatically links the daily scraping of data, analyzing the data and coming up with conclusions and result sets that are useful to our sponsors.

Functionality
We model the best suggested price range for the dealer to resell the used cars. These factors include the car model, mileage, make, year, state, city and ZIP input.

Output / Results

  • Suggested optimal price range for car resale value
    • Improved accuracy of prediction of price based on mileage, depreciation of cars in different states, colour, car model etc.
    • Pricing tool with geolocation of competitors
  • Visualization of the data in an understandable manner for dealerships to make decisions regarding pricing

Project Details

CompetitorPRO's Architecture

System architecture.png

  1. Data is collected using Ruby workers by crawling and scraping dealers’ websites each day.
  2. All the collected raw data is stored in AWS RDS in a MySQL database.
  3. After collection of data, it is cleaned and the output is stored in another table in AWS RDS.
  4. The main application uses EC2 instances to access data from the AWS RDS MySQL database.

Putty and PowerBI is used to ensure data consistency and raise any failures of the application. AWS S3 is used as a storage backup for all data in the servers and databases.

Data Provided

Data is obtained through crawling of individual dealership websites and other service providers such as autotrader.com and edmunds.com. About 30,000,000 rows, approximately 6GB of data is obtained each day.

The data is saved on MySQL Database and stored on S3 on Amazon Web Services as a .csv file. Below are the description of the 24 factors in the data provided.

Variables.png

Methodology

Data Exploration and Cleaning

Exploratory Data Analysis techniques were used to do a preliminary data sanity check. Using R, statistical description function were utilised to find number of missing values; min; max; range; sum of all non-missing values; median; average; standard error; variance and variance coefficient.

There were quite a few anomalies that need to be further explored.

1. Vin

Vin are unique identifiers of each car in the United States. There were at least 20,000 of vin duplicates found. These vin duplicates came from various websites where the same car can be listed on the individual dealer websites and other trading websites such as autotrader.com.

Each of these vin duplicate may have a different price listed with the same Year, Make, Model, Trim listed. However a defining difference is where the timestamp where the data is crawled. From the maximum date, the latest updated data on each vin can be obtained.

2. Year

The min and max of year values found were 1981-2019. The variable “year” which identifies the year that the car was released was beyond 2016 and 2017. The client clarified and ensured that the reason for the data was that some car dealers may choose to not include the year the car was made, or to update the website with future releases of car models.

3. trimName

Current trimNames are not clearly bucketed and there are too many levels in a factor. This poses a difficulty in terms of analysing trimNames as a factor for price.

Histograms were also used to check the frequency of prices of cars and mileage, and backed upon experts who have deep knowledge of the automotive industry.

Histograms were found to be skewed with 80% of values of a certain number for variables of price and mileage.

Histogram of data mileage.png
Histogram of data price other.png

Others

Upon investigation with the software developers, we found that the delimiter that they have used is in the form of a “,” which may have confused the crawlers as numbers can also contain a comma in the figure. And we suggested to change the “,” to a pipe or tab delimiter.

Lastly, and the most important step, is to ensure that the crawlers are grabbing an accurate dataset, checks on 300 individual vins on each website where the data was grabbed was done manually, ensure that the data we had was accurate.

Through this exercise, we found that there were links which are broken and websites where the data was not grabbed accurately. We informed our client regarding these websites and data was fixed.

We repeated the above steps once again with the same dataset obtained at different dates to do a data sanity check.

Through the exercise, we realised that the programmers do not put in enough interim checks to ensure that daily data is good. And hence we recommended them to put in certain checks such as visualisations. We helped to link PowerBI with their mySQL data using Putty (http://www.putty.org/) where it generates simple illustrations in the forms of graphs. For example, we used unique dealerCount to ensure that they have grabbed all data from all dealers in the United States.

These visualisations are automatically generated each day and early detection of system failure or errors is made possible.

Data Modelling

Application Visualization

Technology

Risks & Limitations

Risks & Limitations Mitigation Strategy
Data received unusable for analysis

Clear and effective communication with client to manage expectations and ensure that data required is given

Lack of experience with analytical modelling (eg. randomForest)

Explore and familiarise with the analytical modelling prior to using them to perform actual analyses. Read previous projects done by community and consult advisors.

Changes in work schedule due to unexpected events, delays and time zone difference

Communicate with client and set regular meeting timings. Team members to stay up to date on deliverables and project timeline.


References

1http://www.niada.com/uploads/dynamic_areas/tRRlH6fX2WoqiCcaonlq/33/2015ManheimUsedCarMarketReport.pdf


Unique identifiers of car, VIN :


Research on randomForest modeling: