Difference between revisions of "Team Accuro Project Overview"
Revision as of 10:48, 28 September 2015
Introduction and Background
The Yelp Dataset Challenge provides data on ratings for several businesses across 4 countries and 10 cities to give students an opportunity to explore and apply analytics techniques to design a model that improves the pace and efficiency of Yelp’s recommendation systems. Using the dataset provided for existing businesses, we aim to identify the main attributes of a business that make it a high performer (highly rated) on Yelp. Since restaurants form a large share of the businesses reviewed on Yelp, we decided to build a model specifically to advise new restaurateurs on how to become their customers’ favourite food destination.
With Yelp’s increasing popularity in the United States, businesses are starting to care more and more about their ratings, as “an extra half star rating causes restaurants to sell out 19 percentage points more frequently”. This effect of Yelp ratings on the success of a business makes our analysis all the more relevant for new restaurant owners. Why do some businesses rank higher than others? Do customers give ratings purely based on food quality, does ambience outweigh service, or does the geographic location of a business affect how customers rate it? Through our project we hope to analyse such questions and thereby advise restaurant owners on what factors to look out for.
Review of Similar Work
The aim of the study is to aid businesses to compare performances (Yelp ratings) with other similar businesses based on location, category, and other relevant attributes.
The visualization focuses on three main parts:
a) Distribution of ratings: A bar chart showing the frequency of each star rating (1 through 5) for a single business.
b) Number of useful votes vs. star rating: A scatter plot showing every review for a given business, with the x-position representing the “useful” votes received and the y-position representing the star rating given to the business.
c) Ratings over time: This chart is the same as chart (b), but with the date of the review on the x-axis.
The final product is designed as an interactive display, allowing users to select a business of interest and indicate the radius in miles to filter the businesses for comparison. We will use this as a base and address some of its shortcomings in terms of usability and UI. We will further supplement this with our own analysis, using other statistical methods to help derive meaning from the dataset.
2) Your Neighbors Affect Your Ratings: On Geographical Neighborhood Influence to Rating Prediction
This study focuses on the influence of geographical location on user ratings of a business assuming that a user’s rating is determined by both the intrinsic characteristics of the business as well as the extrinsic characteristics of its geographical neighbors.
The authors use two kinds of latent factors to model a business: one for its intrinsic characteristics and the other for its extrinsic characteristics (which encodes the neighborhood influence of this business to its geographical neighbors).
The study shows that by incorporating geographical neighborhood influences, much lower prediction error is achieved than the state-of-the-art models including Biased MF, SVD++, and Social MF. The prediction error is further reduced by incorporating influences from business category and review content.
We can look to extend our analysis by looking at geographical neighbourhood as an additional factor (that is not mentioned in the dataset) to reduce the variance observed in the data and improve the predictive power of the model.
3) Spatial and Social Frictions in the City: Evidence from Yelp
This paper highlights the effect of spatial and social frictions on consumer choices within New York City. Evidence from the paper suggests that factors such as travel time, difference in demographic features etc. tend to influence consumer choice when deciding what restaurant to go to.
Motivation
We believe that our topic of analysis is crucial for the following reasons:
1) It will make the redirection of customers to high quality restaurants much easier and more efficient.
2) It can encourage low quality restaurants to improve in response to insights about customer demand.
3) The rapid growth in the number of users who trust online review sites and incorporate them into their everyday lives makes this an important avenue for future research.
4) Prospective restaurant openers (or restaurant chain extenders) can intelligently decide the location based on the proximity factor to other restaurants around them.
Key guiding Questions
1) What constitutes the restaurant industry on Yelp?
2) What are the salient features of these inherent groupings?
3) How important is location within all of this?
4) What are some of the trends that have emerged recently?
5) Can we predict the ratings of new restaurants?
Project Scope and Methodology
Primary requirements
Step 1: Descriptive Analysis - Analysing Restaurants specifically for what differentiates High performers, low performers and Hit or Miss restaurants. For each of the 3 segments mentioned, the following analysis will be done:
- Clustering to analyse business profiles that characterize the market. Explore various algorithms and evaluate each of the algorithms to decide which works best for the dataset.
Step 2: Identification of key factors (feature extraction) for prescriptive analysis, showing what new restaurants in each region need in order to succeed. Regression will be used to identify the most important factors, and the model will be validated so that we can assess how good it is. This will constitute the explanatory regression exercise.
Step 3: Spatial Lag regression model. This section will focus on Geospatial Analysis to examine the effect of location of a business on its rating. The goal of this will be to modify the regression model in Step 2 by adding the geospatial components as additional variables to the model. This section will explore the three spatial regression models and use the model that best fits the dataset:
- Checking for Spatial Autocorrelation: The existence of spatial dependencies will be checked using Moran’s I (or any other spatial autocorrelation index) to see whether they are significant.
- Weight Matrix Calibration: Developing the model will involve choosing the Neighbourhood Criteria and consequently developing an appropriate weight matrix to highlight the effect of the lag term in the equation.
- Appropriate model for Spatial dependencies: The Spatial Lag Regression Model and the Spatial Error Regression Models can both be used to understand the effect of location and whether the Dependent variable has dependence, or whether the Error Term does.
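As an illustration of the autocorrelation check, the following is a minimal sketch of Moran's I in plain Python. The ratings and the binary neighbourhood weight matrix are hypothetical toy values, not project data, and the function name `morans_i` is our own. The sketch computes only the index itself; a significance test (for example by permutation) would still be needed.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]            # deviations from the mean
    s0 = sum(sum(row) for row in weights)       # sum of all weights
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


# Hypothetical example: four restaurants along a street, where each one is a
# neighbour of the restaurant(s) directly next to it.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([2.0, 2.0, 4.0, 4.0], W)     # similar ratings sit together
alternating = morans_i([2.0, 4.0, 2.0, 4.0], W)   # neighbours always differ
```

In this toy case the clustered ratings give a positive index and the alternating ratings give a negative one, which is the pattern the check is meant to detect.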
Step 4: Build a visualization tool for the client to support continual updates on business strategy. The focus will be on building a robust tool that helps the client recreate the same analysis in Tableau.
Secondary requirements
A. Time series analysis of whether any major trends have emerged in restaurants by region, to further decipher the dos and don’ts for success
B. As an extension, we will also attempt to predict the rating for new restaurants, thereby informing existing restaurants of potential competition from new openings.
Future research
Evaluating the importance of review ratings for restaurants – Are they effective to improve ratings? Do restaurants that utilize recommended changes succeed?
Descriptive Analysis
Data Cleaning and Manipulation
- Sampling subset (used Arizona for current analysis)
- Filtering out unnecessary fields, rows and columns:
- Changing data types and imputation of missing data
- Excluding Restaurants with small review count
- Log-transformation for review count
- Included category variables
- Mean rating for categories
- Divided into low/high performer
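A few of the steps above can be sketched in Python as follows. This is only an illustration: the record fields, the minimum review count of 10 and the 4-star cut-off for a "high performer" are assumptions made for the example, not the values used in the project.

```python
import math

# Hypothetical toy records; the field names are illustrative only.
restaurants = [
    {"name": "A", "stars": 4.5, "review_count": 120},
    {"name": "B", "stars": 2.5, "review_count": 3},
    {"name": "C", "stars": 3.0, "review_count": 45},
    {"name": "D", "stars": 4.0, "review_count": 800},
]

MIN_REVIEWS = 10   # assumed cut-off for a "small review count"
HIGH_STARS = 4.0   # assumed threshold for a "high performer"


def prepare(records, min_reviews=MIN_REVIEWS, high_stars=HIGH_STARS):
    """Drop sparsely reviewed restaurants, log-transform counts, label performers."""
    cleaned = []
    for r in records:
        if r["review_count"] < min_reviews:
            continue  # exclude restaurants with too few reviews
        row = dict(r)
        row["log_review_count"] = math.log(r["review_count"])
        row["performer"] = "high" if r["stars"] >= high_stars else "low"
        cleaned.append(row)
    return cleaned


prepared = prepare(restaurants)
```

The log transformation compresses the long right tail of review counts, so a restaurant with 800 reviews does not dominate distance calculations in later steps.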
Exploratory Data Analysis
Clustering
a) For K-means and K-Medoids Clustering, all variables must be in numeric form. Therefore, the following changes were made to the different variable types to convert them to numeric form.
b) For Mixed Clustering, no data conversions were required as the algorithm recognises all types of data. Missing values are also acceptable.
However, due to lack of meaningfulness of some variables in the clustering process, such as name, business id, the variables were assigned a weight of 0 to exclude them from analysis.
K-Means Clustering
After converting all variables into numeric form and imputing the missing values with average value, k-means clustering technique was used to cluster the businesses.
However, due to the nature of the data, k-means is not the ideal clustering algorithm. The issues with the technique are as follows:
a) As binary variables were converted into numeric, the resulting clustering means may not be as representative.
b) Due to presence of outliers in the data, the clustering will be skewed.
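Issue (b) can be seen in a small Python sketch. The review counts below are made up for illustration, and `impute_mean` is a hypothetical helper that mirrors the mean imputation described above.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical review counts for one cluster: four typical values, one missing,
# and one extreme outlier (500).
review_counts = [10, 12, None, 11, 500]

filled = impute_mean(review_counts)
centroid = sum(filled) / len(filled)   # the k-means cluster centre is a mean
```

Here the single outlier pulls both the imputed value and the cluster centre well away from the typical values of 10 to 12, so the centre no longer resembles most members of the cluster.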
K-Medoids Clustering
After converting all variables into numeric form and imputing the missing values with average value, k-medoids clustering technique was used to cluster the businesses.
K-Medoids clustering is a variation of k-means clustering in which the cluster centres (or “medoids”) are actual points in the dataset. The algorithm begins in a similar way to k-means, by assigning random cluster centres. A total cost is then calculated by summing the following function over all non-medoid and medoid pairs:
$\mathrm{cost}(x, c) = \sum_{i=1}^{d} |x_i - c_i|$
where x is any non-medoid data point, c is a medoid data point, and d is the number of variables.
In each iteration, the medoid of each cluster is swapped with a non-medoid data point in the same cluster. If the swap lowers the overall cost (usually measured with Manhattan distance), the swapped-in point is declared the new medoid of the cluster.
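The cost and swap step can be sketched in Python as below. This is a simplified illustration, not the implementation used in the project: it assumes each point is charged to its nearest medoid, and the points and the helper names `total_cost` and `try_swap` are hypothetical.

```python
def manhattan(x, c):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(xi - ci) for xi, ci in zip(x, c))


def total_cost(points, medoids):
    """Sum, over all points, of the distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)


def try_swap(points, medoids, old, new):
    """Replace medoid `old` with point `new` only if the total cost drops."""
    candidate = [new if m == old else m for m in medoids]
    if total_cost(points, candidate) < total_cost(points, medoids):
        return candidate
    return medoids


# Hypothetical 2-D toy data with two obvious groups.
points = [(1, 1), (2, 1), (3, 1), (10, 10), (11, 10), (12, 10)]
start = [(1, 1), (10, 10)]

medoids = try_swap(points, start, old=(1, 1), new=(2, 1))
```

In this toy case the swap is accepted because moving the first medoid to the middle of its group lowers the total cost from 6 to 5. A full algorithm would repeat such swaps until no swap reduces the cost.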
Although k-medoids protects the clustering process from the skew caused by outliers, it still has other disadvantages. The issues with the k-medoids technique are:
a) As binary variables were converted into numeric, the resulting clustering means may not be as representative.
b) The computational complexity is large.
Mixed Clustering
Partitioning around medoids (PAM) with Gower Dissimilarity Matrix
Our dataset is a combination of different types of variables. Therefore, a more robust clustering process is needed, one which does not require the variables to be converted to numeric form.
Gower dissimilarity technique is able to handle mixed data types within a cluster. It identifies different variable types and uses different algorithms to define dissimilarities between data points for each variable type.
For dichotomous and categorical variables, the dissimilarity is 0 if the values for the two data points are the same, and 1 otherwise.
For numerical variables, distance is calculated using the following formula:
$1 - s_{ijk} = \frac{|x_i - x_j|}{R_k}$
where $s_{ijk}$ is the similarity between data points $x_i$ and $x_j$ in the context of variable k, and $R_k$ is the range of values of variable k.
The daisy() function in the cluster library in R is used for the above steps.
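To make the calculation concrete, here is a simplified Python sketch of the dissimilarity between two records. It is not the daisy() implementation: it assumes equal weights for all variables, averages the per-variable dissimilarities, and skips any variable that is missing in either record. The variable kinds, ranges and records are hypothetical.

```python
def gower(a, b, kinds, ranges):
    """Average per-variable dissimilarity between records a and b.

    kinds[k] is "num" for numeric variables and anything else for
    categorical or binary ones; ranges[k] is the range of numeric variable k.
    """
    total, used = 0.0, 0
    for k, kind in enumerate(kinds):
        if a[k] is None or b[k] is None:
            continue                      # missing value: skip this variable
        if kind == "num":
            d = abs(a[k] - b[k]) / ranges[k]
        else:
            d = 0.0 if a[k] == b[k] else 1.0
        total += d
        used += 1
    if used == 0:
        raise ValueError("no variable is observed in both records")
    return total / used


# Hypothetical records: (star rating, cuisine, accepts credit cards)
kinds = ["num", "cat", "bin"]
ranges = {0: 4.0}                         # star ratings span 1 to 5

r1 = (4.5, "Mexican", True)
r2 = (2.5, "Mexican", False)
r3 = (4.5, None, True)                    # cuisine is missing

d12 = gower(r1, r2, kinds, ranges)        # (0.5 + 0 + 1) / 3
d13 = gower(r1, r3, kinds, ranges)        # cuisine skipped: (0 + 0) / 2
```

Because every per-variable dissimilarity lies between 0 and 1, the averaged value does too, which keeps numeric and categorical variables on a comparable scale.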
The dissimilarity matrix generated is used to cluster with k-medoids (or PAM) as described earlier. The dissimilarity matrix obtained serves as the new cost function for k-medoids clustering.
We call this two-step process “Mixed Clustering”. This method has a number of advantages:
a) As k-medoids method is used, the clustering is not affected by outliers.
b) Clustering can be done without changing the data types.
c) Missing data can also be handled by the Gower dissimilarity algorithm.
Elbow Plots:
The following elbow plot was generated using R.
As there is a clear break at 4 clusters, we proceeded to carry out clustering with 4 clusters.
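The idea behind an elbow plot can be sketched in Python as follows. This is not the R code used to produce the plot: the one-dimensional points are hypothetical, and the best medoids for each number of clusters are found by exhaustive search, which is only feasible on toy data of this size.

```python
from itertools import combinations


def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)


def best_cost(points, k):
    """Lowest total cost over every possible choice of k medoids."""
    return min(total_cost(points, ms) for ms in combinations(points, k))


# Hypothetical data with four tight groups.
points = [1, 2, 10, 11, 20, 21, 30, 31]

costs = {k: best_cost(points, k) for k in range(1, 7)}
# Improvement gained by adding the k-th cluster.
drops = {k: costs[k - 1] - costs[k] for k in range(2, 7)}
```

Plotting `costs` against the number of clusters gives the elbow plot. In this toy case the cost falls sharply up to 4 clusters and barely changes afterwards, which is the kind of break used to choose the number of clusters.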
Mixed Clustering
Feature Extraction and Regression Analysis
Approach
Findings
Assumptions
Spatial Lag Analysis
Approach
Spatial Autocorrelation
Limitations and Assumptions
Limitations | Assumptions |
Limited data points on businesses and cities | Project methodology will be scalable for looking at regional trends |
Limited action-ability of insights since companies may not care about Yelp ratings | Project findings will help set priorities for improvement for business owners |
Business attributes may not be completely accurate | Assuming that data has been updated as accurately as possible |
Defining business categories | Assuming business tags under categories are comprehensive for the competitive set |
Deliverables
- Project Proposal
- Mid-term presentation
- Mid-term report
- Final presentation
- Final report
- Project poster
- Project Wiki
- Visualization tool on Tableau
Work Scope
Through this project we hope to build an interactive dashboard as our solution to the Yelp Dataset Challenge on ratings and recommendation systems. This will be in addition to the insights developed from statistical and machine learning techniques that can support decision making for businesses. Some areas of research we are looking into are:
- Cultural Trends
- Seasonal Trends
- Spatial Lag Regression Analysis
- Time Series Analysis
- K-Means, K-Medoids and Gower's Method for Clustering
- Explanatory Regression analysis
- Predictive Regression analysis