Team Accuro Project Overview
 
<br>

<div align="left">
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Introduction and Background</font></div>==
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
  
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">In the past decade, we have witnessed the rapid proliferation of social media worldwide. Since Twitter launched in 2006, the social networking microblogging service has grown rapidly to become the second largest social network after Facebook. Twitter now boasts 284 million monthly active users and they send out 500 million tweets per day as of December 2014.<ref>About Twitter. (2014, December). Retrieved from https://about.twitter.com/company</ref> It has become a real-time information network generated by people around the world that let users share their thoughts about various topics in short updates or tweets in 140 characters of text or less. According to a report by We Are Social<ref>Kemp, S. (2015, January 21). Digital, Social & Mobile in 2015. Retrieved from http://wearesocial.sg/blog/2015/01/digital-social-mobile-2015/</ref>, Twitter is growing the fastest in Asia Pacific and Singaporeans are one of the most active social media consumers in the world, with the world’s second highest social penetration rate in Singapore at 59%, more than double the global average of 26%. Singaporeans are also more connected to the Internet as compared to the rest of the world on average, with an Internet penetration rate is 73%, above the global average of 35%. There are an estimated 200,000 Twitter users in Singapore.<ref>Yap, J. (2014, June 4). How many Twitter users are there in Singapore? Retrieved April 22, 2015, from https://vulcanpost.com/10812/many-twitter-users-singapore/</ref> This represents a great source of data that we can analyse and derive valuable insights from.  
+
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">We live today, in what could be best described as the age of consumerism, where, what the consumer increasingly looks for, is information to distinguish between products. With this rising need for expert opinion and recommendations, crowd-sourced review sites have brought forth one of the most disruptive business forces of modern age. Since Yelp was launched in 2005, it has been helping customers stay away from bad decisions while steering towards good experiences via a 5-star rating scale and written text reviews. With its vast database of reviews, ratings and general information, Yelp not only makes decision making for its millions of users much easier but also makes its reviewed businesses more profitable by increasing store visits and site traffic.<br>
<br>
The Yelp Dataset Challenge provides data on ratings for several businesses across 4 countries and 10 cities, giving students an opportunity to explore and apply analytics techniques to design a model that improves the pace and efficiency of Yelp’s recommendation systems. Using the dataset provided for existing businesses, we aim to identify the main attributes of a business that make it a high performer (highly rated) on Yelp. Since restaurants form a large chunk of the businesses reviewed on Yelp, we decided to build a model specifically to advise new restaurateurs on how to become their customers’ favourite food destination.
With Yelp’s increasing popularity in the United States, businesses are starting to care more and more about their ratings, as “an extra half star rating causes restaurants to sell out 19 percentage points more frequently”. This profound effect of Yelp ratings on the success of a business makes our analysis even more crucial and relevant for new restaurant owners. Why do some businesses rank higher than others? Do customers give ratings purely based on food quality? Does ambience triumph over service, or do the geographic locations of businesses affect the rating pattern of customers? Through our project we hope to analyse such questions and thereby be able to advise restaurant owners on what factors to look out for. </div>
 
</div>
 
</div>
  
<div align="left">
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Review of Similar Work</font></div>==
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
  
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">In the paper ''Sentiment in Twitter Events'' (Thelwall et al., 2011)<ref>[12] Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406-418.</ref>, more than 34 million tweets are assessed based on whether popular events are typically associated with increases in sentiment strength. Using the top 30 events, determined by a measure of relative increase in term usage, the results give strong evidence that popular events are normally associated with increases in negative sentiment strength and some evidence that peaks of interest in events have stronger positive sentiment than the time before the peak. It seems that many positive events, such as the Oscars, are capable of generating increased negative sentiment in reaction to them. In our project, we can observe whether main events on the national calendar like National Day and Formula 1 Grand Prix Night Race evoke a positive (e.g. national pride) or negative sentiment (e.g. grievances and inconvenience caused by road closures) among Singaporeans.
+
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">1) Visualizing Yelp Ratings: Interactive Analysis and Comparison of Businesses:
  
The aim of the study is to help businesses compare their performance (Yelp ratings) with other similar businesses based on location, category, and other relevant attributes.
  
The visualization focuses on three main parts:<br>
a) Distribution of ratings: A bar chart showing the frequency of each star rating (1 through 5) for a single business. <br>
b) Number of useful votes vs. star rating: A scatter plot showing every review for a given business, with the x-position representing the “useful” votes received and the y-position representing the star rating given. <br>
c) Ratings over time: The same as the second chart, but with the date of the review on the x-axis.<br>
The final product is designed as an interactive display, allowing users to select a business of interest and indicate a radius in miles to filter the businesses for comparison. We will use this as a base, address some of its shortcomings in terms of usability and UI, and supplement it with our own analysis using other statistical methods to derive meaning from the dataset.
<br>

<br>
2) Your Neighbors Affect Your Ratings: On Geographical Neighborhood Influence to Rating Prediction <br>
This study focuses on the influence of geographical location on user ratings of a business, assuming that a user’s rating is determined both by the intrinsic characteristics of the business and by the extrinsic characteristics of its geographical neighbors. <br>

The authors use two kinds of latent factors to model a business: one for its intrinsic characteristics and the other for its extrinsic characteristics (which encodes the neighborhood influence of this business on its geographical neighbors).<br>

The study shows that by incorporating geographical neighborhood influences, much lower prediction error is achieved than with state-of-the-art models including Biased MF, SVD++, and Social MF. The prediction error is further reduced by incorporating influences from business category and review content.<br>

We can extend our analysis by treating the geographical neighbourhood as an additional factor (not captured in the dataset) to reduce the variance observed in the data and improve the predictive power of the model.
  
3) Spatial and Social Frictions in the City: Evidence from Yelp
This paper highlights the effect of spatial and social frictions on consumer choices within New York City. Evidence from the paper suggests that factors such as travel time and differences in demographic features tend to influence consumer choice when deciding which restaurant to visit.
“Everything is related to everything else, but near things are more related than distant things” (Tobler 1970).</div>
 
</div>
 
</div>
  
<div align="left">
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Motivation</font></div>==
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
  
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">Being able to identify what make people happy is arguably one of the most important parts of socio-economic development. Increasingly, many public-opinion polls and government agencies have asked citizens the questions related to happiness and wellbeing in their surveys. In the Gallup poll in 2012, Singapore was ranked the most emotionless society<ref>Clifton, J. (2012, November 21). Singapore Ranks as Least Emotional Country in the World. Retrieved from http://www.gallup.com/poll/158882/singapore-ranks-least-emotional-country-world.aspx</ref> and the least positive country  in the world.<ref>Clifton, J. (2012, December 19). Latin Americans Most Positive in the World, Singaporeans are the least positive worldwide. Retrieved February 9, 2015, from http://www.gallup.com/poll/159254/latin-americans-positive-world.aspx</ref> What makes Singaporeans (un)happy? The goal of this project is to apply the use of social media to measure happiness as a less expensive (in terms of time and resources) method to traditional surveys. The motivation behind this project is to be able to visually represent the data that Living Analytics Research Centre (LARC) has collected from Twitter. The data-set provided comprises of social media data in form of tweets published by Singapore-based Twitter users over several months. The key scope to the project would be to create a dashboard that would distinctively represent the Twitter data to us. The Twitter data will consists of information that is provided Twitter, and on top of that, additional predicted user attributes. We will come up with their granular analysis of change in mood trends, periods of significance (may be weekends or any weekday) and other noteworthy actionable insights coming out of analysis done. The dashboard allows users to quickly view and understand what the data is telling them without delving into the data itself. The focus is therefore to create a replicable and scalable dashboard that can accommodate the large amount of data collected by the LARC team.</div>
+
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">Our personal interest in the topic has motivated us to choose this as our area of research. When planning trips abroad, we explore sites like HostelWorld and TripAdvisor that make planning trips a lot faster and easier; not only is this helpful to customers planning trips but also to the businesses that have been given honest ratings. Since the team consisted students from a Management university, our motivation when choosing this project was more business focused. Our perspective on recommendations was more catered towards how a business can improve its standing on Yelp, and thereby improve its turnover through more visits by customers.<br/>
We believe that our topic of analysis is crucial for the following reasons:<br>
1) It will make the redirection of customers to high quality restaurants much easier and more efficient. <br/>
2) It can encourage low quality restaurants to improve in response to insights about customer demand. <br/>
3) The rapid proliferation of users trusting online review sites and incorporating them into their everyday lives makes this an important avenue for future research.<br/>
4) Prospective restaurant owners (or chains looking to expand) can intelligently choose a location based on proximity to other restaurants around them.
</div>
 
</div>
 
</div>
  
<div align="left">
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Key Guiding Questions</font></div>==
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">

<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">
1) What constitutes the restaurant industry on Yelp?<br>
2) What are the salient features of these inherent groupings?<br>
3) How important is location within all of this? <br>
4) What are some of the trends that have emerged recently?<br>
5) Can we predict the ratings of new restaurants? <br>
</div>
</div>
 
<div align="left">
 
<div align="left">
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Work Scope</font></div>==
+
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Project Scope and Methodology</font></div>==
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
  
The system will include the following:
+
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">
===Primary requirements===
<br>
Step 1: Descriptive analysis – analysing restaurants for what differentiates high performers, low performers, and hit-or-miss restaurants. For each of the three segments, the following analysis will be done:<br/>
* Clustering to identify the business profiles that characterise the market. We will explore various algorithms and evaluate each to decide which works best for the dataset.<br>
<br>
Step 2: Identification of the key factors (feature selection) that new restaurants in each region need in order to succeed. Regression will be used to identify the most important factors, and the model will be validated so that we can assess how good it is. This constitutes the explanatory regression exercise.<br>
<br>
Step 3: Spatial lag regression model. This section will focus on geospatial analysis to examine the effect of a business's location on its rating. The goal is to modify the regression model from Step 2 by adding geospatial components as additional variables. This section will explore the spatial regression models below and use the one that best fits the dataset:<br>
* Checking for spatial autocorrelation: The existence of spatial dependencies will be checked using Moran’s I (or another spatial autocorrelation index) to see if they are significant.<br>
* Weight Matrix Calibration: Developing the model will involve choosing the Neighbourhood Criteria and consequently developing an appropriate weight matrix to highlight the effect of the lag term in the equation.<br>
* Appropriate model for spatial dependencies: The spatial lag regression model and the spatial error regression model can both be used to understand the effect of location, depending on whether the spatial dependence lies in the dependent variable or in the error term.<br>
<br>
Step 4: Build a visualisation tool for the client for continual updates on business strategy. The focus will be on building a robust tool that helps the client recreate the same analysis in Tableau.<br>
<br>
===Secondary requirements===
<br>
A. Time series analysis of whether any major trends have emerged in restaurants by region – to further decipher the dos and don'ts for success. <br>
B. As an extension, we will also attempt to predict the rating for new restaurants, thereby informing existing restaurants of potential competition from new openings.<br>
<br>
===Future research===
<br>
Evaluating the importance of review ratings for restaurants – Are they effective in improving ratings? Do restaurants that adopt recommended changes succeed?<br>
Can the ratings and reviews of local experts be assimilated into feature extraction to help improve the predictability of rating success? We realize that people are social entities and can be heavily influenced by the criticism of local experts on Yelp. Future research in this area could enrich our analysis of a business as well.</div>
 
</div>
 
</div>
  
  
 
<div align="left">
 
<div align="left">
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Work Scope</font></div>==
+
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Descriptive Analysis</font></div>==
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
  
* Data Collection – Collect Twitter data to be analysed from LARC
+
===Exploratory Data Analysis, Data Cleaning and Manipulation===
We realized that the dataset contained records stretching back more than 10 years. Since we did not want our model to be skewed by factors that were only important in the past, we narrowed the dataset down to businesses with more than 5 reviews in the past two years (2013 to 2015). Given that the supplied mean rating was a rounded average over all years, we recomputed a recent mean rating by filtering the review dataset to recent ratings and mapping the result back to the businesses dataset, yielding a more current and precise mean-rating variable. <br>
<br>
We also suspected that ratings would vary over time, and that including this variance in our analysis would show whether highly rated restaurants earn their high ratings consistently. We therefore again used the user review dataset and calculated the variance in rating for each business between 2013 and 2015.<br>
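This aggregation is straightforward in R (the language we use throughout); below is a minimal sketch assuming data frames named <code>reviews</code> and <code>businesses</code> with columns <code>business_id</code>, <code>stars</code> and <code>date</code> — illustrative names, not necessarily those in our scripts.
<syntaxhighlight lang="r">
library(dplyr)

# Data-frame and column names (reviews, businesses, stars, date) are illustrative.
recent_stats <- reviews %>%
  filter(date >= as.Date("2013-01-01"), date <= as.Date("2015-12-31")) %>%
  group_by(business_id) %>%
  summarise(review_count    = n(),
            recent_mean     = mean(stars),
            recent_variance = var(stars)) %>%
  filter(review_count > 5)

# Map the recomputed statistics back onto the business table
businesses <- inner_join(businesses, recent_stats, by = "business_id")
</syntaxhighlight>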
[[Image:ReviewCount.png|center|Review Count]]
<br>
The Review Count variable was also recomputed to reflect the number of reviews for a particular restaurant between 2013 and 2015; as mentioned above, only restaurants with more than 5 reviews were retained.<br>
[[Image:Recent Mean Rating.png|center|Recent Mean Rating]]
<br>
[[Image:Recent Ratings Variance.png|center|Recent Ratings Variance]]
<br>
We also ventured into basic text analytics to analyse the review text for the restaurants in our dataset. Using R, we cleaned the review data and created word clouds of reviews for all restaurants, high performing restaurants, and low performing restaurants, in order to gain an overview of the high-frequency words associated with each group. We generated three word clouds.
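A minimal sketch of the cleaning and word-cloud generation, assuming the review bodies sit in a character vector <code>review_text</code>; the tm and wordcloud packages shown here are one common way to do this, not necessarily the exact script we used.
<syntaxhighlight lang="r">
library(tm)        # text cleaning
library(wordcloud) # plotting

# 'review_text' is an assumed character vector of review bodies
corpus <- VCorpus(VectorSource(review_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE)
</syntaxhighlight>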
<br>
The following are the most frequently used words for ALL the restaurants:
<br>
[[Image:wordcloud-all.jpg|center|Wordcloud3]]
The following are the most frequently used words for low performing restaurants, i.e. reviews where users gave a rating of 2 or below:
<br>
[[Image:wordcloud-lowrating.png|center|Wordcloud1]]
The following are the most frequently used words for high performing restaurants, i.e. reviews where users gave a rating of 4 or above:
<br>
[[Image:wordcloud-highrating.png|center|Wordcloud2]]
<br>
Given that there was a substantial number of missing values (>50%) for some of the variables, we decided to remove these variables.
Overall, we removed the 50 variables pertaining to music, payments, hair types, BYOB, and other miscellaneous attributes. The opening-hour variables were combined into two new variables for weekday and weekend opening hours. As can be seen, many salient attributes that could contribute to how customers view a restaurant had to be removed from the analysis due to poor data quality.<br>
<br>
Since most of the fields consisted of binary data and many records still had missing fields, replacing missing values was essential for the clustering and regression analysis. We therefore imputed missing values with the average score for each attribute; since binary variables had been converted to continuous data, the imputed value is simply the column mean. <br>
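A sketch of the imputation, assuming the attribute columns sit in a data frame <code>attrs</code> already coerced to numeric:
<syntaxhighlight lang="r">
# Replace each missing value with the column mean; for binary attributes
# coerced to 0/1 this is simply the observed proportion of 1s.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
attrs[] <- lapply(attrs, impute_mean)  # 'attrs' is an assumed name
</syntaxhighlight>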
<br>
Restaurants were tagged under a string variable called “Categories”. This variable consisted of tags for a particular business, with fields like “Greek”, “Pizzas”, “Bars”, etc. We found that these categories might be useful in determining the level of success or failure of restaurants. Since we had 192 different categories, we grouped them into high performing and low performing ones, and created two numerical variables titled “high performing categories” and “low performing categories”. This will hopefully lend greater credibility to the analysis and provide a better explanation for the performance of restaurants.
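A sketch of this grouping, assuming <code>high_cats</code> and <code>low_cats</code> are character vectors of category names already labelled by their average rating (hypothetical helper objects), and that the tags are comma-separated:
<syntaxhighlight lang="r">
# "Categories" holds comma-separated tags such as "Greek, Pizzas, Bars"
tags <- strsplit(businesses$Categories, ",\\s*")

# Count how many of a business's tags fall in each performance group
businesses$high_performing_categories <-
  sapply(tags, function(t) sum(t %in% high_cats))
businesses$low_performing_categories <-
  sapply(tags, function(t) sum(t %in% low_cats))
</syntaxhighlight>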
===Clustering===
a) For k-means and k-medoids clustering, all variables must be in numeric form, so binary and categorical variables were converted to numeric form. <br>
b) For Mixed Clustering, no data conversions were required, as the algorithm recognises all types of data. Missing values are also acceptable. <br>
However, variables that are not meaningful in the clustering process, such as the name and business id, were assigned a weight of 0 to exclude them from the analysis.
====K-Means Clustering====
After converting all variables into numeric form and imputing the missing values with the average value, the k-means clustering technique was used to cluster the businesses (see the sketch below).<br>
<br>
However, due to the nature of the data, k-means is not the most suitable clustering algorithm. The issues with the technique are as follows:<br>
a) As binary variables were converted to numeric form, the resulting cluster means may not be representative. <br>
b) Due to the presence of outliers in the data, the clustering will be skewed.<br>
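A minimal k-means sketch, assuming the imputed numeric attribute data frame is the <code>attrs</code> object from above; <code>centers = 4</code> matches the elbow-plot choice reported later:
<syntaxhighlight lang="r">
set.seed(42)                 # k-means starts from random centres
km <- kmeans(scale(attrs), centers = 4, nstart = 25)
table(km$cluster)            # cluster sizes
km$centers                   # cluster means on the scaled variables
</syntaxhighlight>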
====K-Medoids Clustering====
After converting all variables into numeric form and imputing the missing values with the average value, the k-medoids clustering technique was used to cluster the businesses.<br>
<br>
K-medoids clustering is a variation of k-means in which the cluster centres (or “medoids”) are actual points in the dataset. The algorithm begins in a similar way to k-means by assigning random cluster centres, except that in k-medoids these centres are actual data points. A total cost is calculated by summing the following function over all non-medoid/medoid pairs:<br>
<br>
<math>\text{cost}(x,c)=\sum_{i=1}^{d}|x_i-c_i|</math><br>
<br>
where x is any non-medoid data point, c is a medoid data point, and d is the number of variables.<br>
In each iteration, the medoid within each cluster is swapped with a non-medoid data point in the same cluster. If the overall cost (usually defined by Manhattan distance) decreases, the swapped non-medoid is declared the new medoid of the cluster (see the sketch after this list). <br>
<br>
Although k-medoids does protect the clustering process from skewing caused by outliers, it still has other disadvantages. The issues with the technique are:<br>
a) As binary variables were converted to numeric form, the resulting cluster means may not be representative.<br>
b) The computational complexity is large.
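A sketch using <code>pam()</code> from the cluster package, which implements the partitioning-around-medoids algorithm described above, on the same assumed <code>attrs</code> data frame:
<syntaxhighlight lang="r">
library(cluster)
pm <- pam(scale(attrs), k = 4)  # PAM: cluster centres are actual data points
pm$medoids                      # the chosen representative observations
</syntaxhighlight>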
====Mixed Clustering====
<b>Partitioning around medoids (PAM) with Gower Dissimilarity Matrix</b><br>
As our dataset combines different types of variables, a more robust clustering process is needed that does not require the variables to be converted to numeric form. <br>
<br>
The Gower dissimilarity technique is able to handle mixed data types within a cluster. It identifies each variable's type and uses a different algorithm to define the dissimilarity between data points for each variable type.<br>
<br>
For dichotomous and categorical variables, if the values for two data points are the same, the dissimilarity is 0, and 1 otherwise.
For numerical variables, the distance is calculated using the following formula:
<br><math>1 - s_{ijk} = \frac{|x_i - x_j|}{R_k}</math>
<br>
where <math>s_{ijk}</math> is the similarity between data points <math>x_i</math> and <math>x_j</math> in the context of variable k, and <math>R_k</math> is the range of values of variable k.<br>
<br>The daisy() function in the cluster library in R is used for the above steps.<br>
<br>
The dissimilarity matrix generated is then used to cluster with k-medoids (PAM) as described earlier, with the Gower dissimilarities serving as the cost function for the clustering. <br>
<br>
We call this two-step process “Mixed Clustering”. This method has a number of advantages:
a) As the k-medoids method is used, the clustering is not affected by outliers.<br>
b) Clustering can be done without changing the data types. <br>
c) Missing data can also be handled by the Gower dissimilarity algorithm. A minimal sketch of the two-step process follows.
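The sketch below assumes the mixed-type attribute data frame is called <code>biz_attrs</code> and that <code>name</code> and <code>business_id</code> are the identifier columns to be zero-weighted (illustrative names):
<syntaxhighlight lang="r">
library(cluster)

# Step 1: Gower dissimilarities straight from the mixed-type data;
# weight 0 excludes identifier columns from the distance computation.
w <- ifelse(names(biz_attrs) %in% c("name", "business_id"), 0, 1)
gower_d <- daisy(biz_attrs, metric = "gower", weights = w)

# Elbow values: PAM's "swap" objective (total dissimilarity to medoids) per k
obj <- sapply(2:8, function(k) pam(gower_d, k = k, diss = TRUE)$objective["swap"])
plot(2:8, obj, type = "b", xlab = "number of clusters", ylab = "total cost")

# Step 2: PAM on the precomputed dissimilarity matrix with the chosen k
mixed_cl <- pam(gower_d, k = 4, diss = TRUE)
table(mixed_cl$clustering)
</syntaxhighlight>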
Elbow Plots:
The following elbow plot was generated using R.<br>
<br>
[[Image:ElbowPlot_for_clustering.png|center|Elbow Plot for clustering]]<br>
As there is a clear break at 4 clusters, we proceeded to carry out clustering with 4 clusters.<br>
<br>
Mixed Clustering:
<br>
[https://public.tableau.com/profile/piyush.pritam.sahoo#!/vizhome/ClusteringenTableau/Story1 Published Tableau page]
  
 
</div>
 
</div>
<div align="left">
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Sentiment Analysis</font></div>==
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
  
* Project Proposal
+
===Motivation===
Upon preliminary regression analysis, we found the following results:
<br>
[[Image:Regression interim.png|center|Regression Interim Results]]
<br>
We felt that the adjusted R-square could be improved. Furthermore, this calculation did not consider the content of the reviews themselves, so we wanted to include some of the salient features that make up a review. To do this, we decided to undertake basic sentiment analysis.
===Approach===
There are various methods that can be used for sentiment analysis; the following Wikipedia article provides a good summary:<br>
https://en.wikipedia.org/wiki/Sentiment_analysis<br>
<br>
While keyword spotting was initially a good enough heuristic, we sought to expand further from there. Among the competing methods, we chose the Lexical Affinity model. Specifically, we computed a polarity variable for each review provided for a business. As before, a subset of reviews was chosen (restaurants in Arizona, with reviews from 2013 to 2015).<br>
<br>
To build on this lexical prediction incrementally, our first step was a simple polarity score: the number of positive words minus the number of negative words. It was important to select the positive and negative word lists carefully, so we used the lexicons developed in widely cited papers on the lexical method. The positive and negative word lists can be found here:<br>
[[File:Positive words.txt|Positive Words]]<br>
[[File:Negative words.txt|Negative words]]<br>
The lists contain around 2,006 positive words and around 4,783 negative words. They also include commonly misspelled variants that past research papers have associated with opinion-based reviews. The two files above cite the source of these words as well.<br>
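A sketch of the polarity computation, assuming the two lexicons are saved as plain one-word-per-line text files (the file names below are illustrative):
<syntaxhighlight lang="r">
# Load the opinion lexicons; comment.char skips any header lines starting with ";"
pos_words <- scan("positive-words.txt", what = "character", comment.char = ";")
neg_words <- scan("negative-words.txt", what = "character", comment.char = ";")

# Polarity = count of positive words minus count of negative words
polarity <- function(text) {
  words <- strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+")[[1]]
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

reviews$polarity <- vapply(reviews$text, polarity, numeric(1))
</syntaxhighlight>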
The results of the additional variable were as follows:<br>
* Mean Sentiment - 4.15
* Variance in sentiment - 22.02
* Correlation with Mean Average Rating - 0.58<br>
  
===Limitations===
There are a number of limitations to the aforementioned approach. For instance, it does not handle negators, amplifiers, and diminishers, which reverse, inflate, or deflate the emotion in an opinion. This means the final solution is not as robust as it could be. <br>
Furthermore, the analysis does not account for sarcasm or emoticons, both of which convey emotion, which limits its ability to explain the rating. <br>
Another limitation is that we have not included any topic extraction from the reviews. If the sentiments could be linked to components of a business, the salient features of that business could be extracted. For example, if a business is rated 2 and its service is consistently criticised, that could go a long way towards explaining the variation between this business and another business rated 2.5.<br>
===Future Work===
Continuing from the limitations above, future work should look to address each of them. It should also draw on the variety of mixed models developed by researchers in academic papers to predict restaurant ratings, so that more of the variance is explained.<br>
  
</div>
 
<div align="left">
 
<div align="left">
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Introduction and Project Background</font></div>==
+
 
 +
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Feature Extraction and Regression Analysis</font></div>==
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
  
===Overview===
+
===Approach===
<b>Step 1: Stepwise Regression</b>
  
With over 50 variables in our dataset, we realized that simply running a regression might not yield the most representative results. Furthermore, the model might be over-fitted and hence cause problems when predicting ratings for the entire dataset.
  
Therefore we started with stepwise regression: a semi-automated process of building a model by successively adding or removing variables based solely on some criterion applied to their estimated coefficients. There are various techniques that set this criterion:
  
* Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.
* Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) whose removal improves the model the most, and repeating this process until no further improvement is possible.
* Bidirectional elimination, a combination of the above, testing at each step for variables to be included or excluded.<br>
 +
We chose backward elimination for our model because we had a very large set of variables, and the purpose of the subset selection was to remove variables that did not significantly add to our analysis. As the selection criterion we chose BIC; a sketch follows.<br>
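The sketch below assumes the cleaned modelling data frame is <code>model_df</code> with the recomputed recent mean rating as the response (illustrative names):
<syntaxhighlight lang="r">
full <- lm(recent_mean ~ ., data = model_df)  # all candidate variables

# direction = "backward" drops one variable at a time;
# k = log(n) makes step() penalise by BIC rather than the default AIC
reduced <- step(full, direction = "backward", k = log(nrow(model_df)))
summary(reduced)
</syntaxhighlight>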
  
Since we knew that this was not going to be our final model, we did not spend much time delving into other stepwise techniques to pick the best one. <br>
  
<b>Step 2: Multiple Regression</b>
  
The next step is to use multiple regression to test the significance of the variables that came out of the stepwise regression.<br>
<br>
When we first examined the results, we realized that some of the coefficients were not significant. This is understandable, since stepwise regression tends to be prone to overfitting. We iteratively removed the insignificant variables (as sketched below) to arrive at the final results for each of the clusters identified.
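The pruning itself is a small refit-and-inspect loop; a sketch, with a hypothetical variable name for the dropped term:
<syntaxhighlight lang="r">
fit <- lm(formula(reduced), data = model_df)
summary(fit)$coefficients           # inspect p-values

# Drop the least significant term and refit; repeat until all remain significant
fit <- update(fit, . ~ . - music_karaoke)  # "music_karaoke" is illustrative
</syntaxhighlight>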
  
===Findings===
  
====Iteration 2: Final findings====
After including the wordcount and sentiment score as two additional variables in the equation, the results changed quite significantly:<br>
[[File:Regression Results 1.png|250px|center|Regression Model Results]]<br>
[[File:Regression results 2.png|500px|center|Regression coefficient results]]<br>
  
<div align="left">
+
*Adjusted R-square increases from 54% to 65% by virtue of including wordcount and sentiment scores into the model. This is a dramatic increase and expected since the sentiment plays a big part in explaining the ratings offered to a business. This is further substantiated where a high correlation of the sentiment score (r = 0.58) is observed with Recent Mean Ratings.
==<div style="background: #c0deed; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #0084b4 solid 32px;"><font color="black">Methodology</font></div>==
+
*The lesser the variance, the better your score. This basically indicates that good restaurants have consistently good reviews, owing possibly to consistency in service or network effect of users in affecting the perception of other users when they visit the restaurant. Either way, if you're in the good books, you'll tend to stay there.
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px;  font-family:Helvetica Neue; font-size: 15px">
+
*Groups and outdoor seating have negative coefficients, possibly suggesting that more exclusive restaurants tend to have higher ratings. Typically, fine dining or up-scale restaurants are known to have these traits, and we hence hypothesize that this may well be the case here.<br>
  
====Testing Robustness====
In order to test the robustness of our prediction formula, we performed the regression analysis on a training partition comprising 60% of the dataset, with 20% each allocated to validation and test; a sketch of the split follows.
[[File:Training validation test.png|center|Training, Validation, Test]]
The results of the 3 runs are as follows:<br>
[[File:Results regression final.png|800px|center|Regression results]]
As we can see, the R-square is consistently around 65%, indicating that the prediction formula is robust.
  
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">The key aim of this project is to allow the user to be able to explore and analyse the happiness level of the targeted subjects based on a given set of tweets. Tweets are a string of text made up of 140 characters. Tweets may contain shorten URLs, tags (@xyz) or trending topics (#xyz).
+
===Assumptions===
  
In analysing the results of the regression, we needed to check whether the assumptions we made along the way in fact held true at the end of the regression. This helps us moderate our findings so that we do not overstate the robustness of the derived results. The assumptions of a multiple linear regression are as follows:<br>
1) Linear relationship and additivity <br>
2) Multivariate Normality<br>
3) No or little multicollinearity<br>
4) No auto-correlation<br>
5) Homoscedasticity
  
</div>
  
<div align="left">
  
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Spatial Lag Analysis</font></div>==
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
===Approach===
We believe that neighbourhood and location have a role to play in the overall star ratings of a restaurant in the Yelp dataset. This is why our group has forayed into exploring the spatial lag model for our project. Tobler's first law of geography encapsulates this situation: ''everything is related to everything else, but near things are more related than distant things''. In the context of our project, we suspect that the average rating of a neighbourhood affects the star rating of any restaurant within that area.
  
<b> Step 1: Set Neighbourhood criteria </b>
  
Deciding the neighbourhood criterion is critical for building the weights matrix. We have chosen distance as our criterion, which takes the distance between two data points as a relative measure of proximity between neighbours, so the weights matrix is populated by values in miles, kilometres, or any other unit of distance. The contiguity criterion, on the other hand, divides the data points into blocks and creates binary values for the weights matrix, with 1 referring to 'neighbours' sharing a common boundary (adjacency) and 0 referring to distant businesses, or 'not neighbours'. A third criterion is a more complex combination of the first two, which need only be used if neither of them works.
Once the criterion has been decided based on the needs of the dataset, we move on to the next step.
<b> Step 2: Create Weights Matrix </b>
  
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">Given the  limitations of the Happiness index score by Hedonometer, we are attempting to use these sample tweets to learn and generate a more robust set of lexicon/dictionary. Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using that to make predictions or decisions rather than following a strictly static program.
+
The weights matrix summarises the relationship between n spatial units. Each spatial weight W<sub>ij</sub> represents the "spatial influence" of unit j on unit i. In our case, the rows and columns of the square matrix each correspond to a restaurant, with zeros on the diagonal. Once the matrix has been created, it needs to be row-standardised. Row standardisation creates proportional weights where businesses have unequal numbers of neighbours: each cell in a row is divided by the sum of all neighbour weights (all values in that row) for that business.
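A sketch using the spdep package, assuming restaurant coordinates in <code>longitude</code>/<code>latitude</code> columns of a data frame <code>biz</code>, with an illustrative 5 km neighbourhood cut-off:
<syntaxhighlight lang="r">
library(spdep)

coords <- as.matrix(biz[, c("longitude", "latitude")])
nb <- dnearneigh(coords, d1 = 0, d2 = 5, longlat = TRUE)  # neighbours within 5 km

# Inverse-distance weights; style = "W" row-standardises each row
dists <- nbdists(nb, coords, longlat = TRUE)
inv_d <- lapply(dists, function(x) 1 / x)
listw <- nb2listw(nb, glist = inv_d, style = "W", zero.policy = TRUE)
</syntaxhighlight>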
  
<b> Step 3: Check Spatial Autocorrelation</b>
  
The next step involves checking the need for a spatial model: when do we decide that a linear regression is not enough to predict our ratings, and that our dependent variable may be spatially lagged? We use Moran's I or Geary's C to make this decision. The index of spatial autocorrelation we use is Moran's I, which computes cross-products of mean-adjusted values of geographic neighbours (i.e., covariation). It ranges from roughly –1 (in some cases –0.5) to nearly 0 for negative spatial autocorrelation, and from nearly 0 to approximately 1 for positive spatial autocorrelation, with an expected value of –1/(n – 1) under zero spatial autocorrelation, where n denotes the number of units.
  
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">Another limitation of the Hedometer is that it only considers the score of one word at a time, which can paint a very different picture if we were to look at words association. Take for example, a tweet " I dislike New York" may seem negative to a human observer, but neutral to the machine as the scores of "dislike" and "new" cancels one another out, or in the case of "That dog is pretty ugly", where "pretty" and "ugly" cancels one another out. Thus, we need to associate words together to understand the tweet a little better.
+
We used R to compute the index (0.9409), which turns out to be significant for our model. Thus we can conclude that there is some spatial interaction in the data.
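With a row-standardised weights list such as the <code>listw</code> object sketched in Step 2, the test is a one-liner in spdep (variable names illustrative):
<syntaxhighlight lang="r">
# Reports the observed I, its expectation -1/(n-1), and a p-value
moran.test(biz$recent_mean, listw, zero.policy = TRUE)
</syntaxhighlight>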
  
<b> Step 4: Choose the appropriate Model  </b>
Now that we know we have strong spatial autocorrelation, we must choose an appropriate model to explain it. The table below summarises the main differences between the spatial lag and spatial error models. The Lagrange Multiplier test is used to compute the significance of using each model. So far, we suspect that the spatial lag model will be more relevant for our project.
[[Image:LagvsError.png|center|Spatial Lag vs Spatial Error]]
  
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">To determine the accuracy of the dictionary, human test subjects will be employed to judge whether or not the dictionary is in fact effective in determining the polarity of the tweet. Each human subject will be given 2 tweets to judge, with each of these tweets having a pre-defined score after running through the new dictionary. If the human subject's perception of the 2 tweets coincides with that of the dictionary, the test will be given a positive, else a negative is awarded. A random sample of 100 users will be chosen to do at least 10 comparisons each. At the end of these tests, we will calculate the number of positives over the total tests done. The proportion will determine the accuracy of our dictionary.</div>
+
<b> Step 5: Build Spatial Regression Model </b>
The final and conclusive step would be to build the Spatial Regression model which incorporates a spatial dependence. This
 +
is done by addng a 'spatially lagged' dependent variable on the right hand side of the regression equation. The model now looks like this:
 +
y= ρWy + xβ + ε
 +
(1-ρW)y= xβ + ε
  
<div align="left">
+
where y= restaurant rating
==<div style="background: #c0deed; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #0084b4 solid 32px;"><font color="black">Limitations & Assumptions</font></div>==
+
ρ= spatial correlation parameter
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; font-family:Helvetica Neue; font-size: 15px ">
+
W= Spatial weights
 +
x= other attributes
 +
β= coefficient of correlation
 +
ε = error term
  
===What Hedonometer Cannot Detect===
+
===Spatial Autocorrelation===
  
==== Negation Handling ====
+
The output for Moran's I test using the distance criteria 1/d (inverse of distance in km) to construct the weights matrix can be seen below. Since the number for Moran's I is close to zero (0.028), this suggests that there is almost no spatial autocorrelation in our dataset.
 +
In order to further explore our results, we changed the criteria for the distance matrix several times in order to check for spatial dependencies. Following are some of the criteria used:
 +
* inverse of distance squared
 +
* inverse of distance raised to the power of 6
 +
* contiguity matrix
 +
Despite changing the weights criteria, Moran's index only increased marginally. Essentially, this meant that the star ratings of Yelp restaurants were spatially independent of their neighbour's ratings. To completely rule out any chances of spatial interaction we tested for correlation in other measures like count of ratings, ratio of high/low ratings, variance of ratings. As expected, once again, there was no spatial dependence. <br>
  
 +
[[Image:Moran'sI.png|center|Moran's I]]
  
 +
'''Conclusion:''' <br>
  
 +
Due to the low value of Moran's Index for the different depenedent variables we contructed the model for, we can conclude that our dataset has no spatially correlated data-points i.e. Geographical location in terms of proximity with certain neighbours does not play much of a role in star ratings of restaurants. <br>
  
==== Abbreviations, Smileys/Emoticons & Special Symbols ====
+
In summary, the steps achieved for the Spatial Lag model are as follows:
  
 +
[[Image:Spatial_lag_steps_achieved.png|center|Spatial lag steps achieved so far]]
  
 +
<div align="left">
  
==== Local Languages & Slangs (Singlish) ====
+
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Time Series and Trend Analysis</font></div>==
 +
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
 +
====Overview====
 +
Our team delved into time series to examine the structure of the Yelp reviews and identify relevant patterns in the same. We also explored possible patterns in sentiments and ratings for key topics and identified patterns in the occurrences of Cool, Funny and Useful reviews as voted by users over the years. We believe that understanding the patters in the Yelp reviews will enable businesses to understand customer preferences over time, the changing preferences of the customers as well as conduct basic forecasting. To simply the above, we built an Application using R Shiny which lets the customer to understand patterns in reviews for any topic (input by the user) and visualise the corresponding forecasts. <br>
 +
====Understanding Patterns – Decomposition====
 +
Decomposition aims to separate the underlying patterns in the data from randomness. Generally, decomposition procedures separate the time into three major components that influence the value in the time series:<br>
 +
*Trend: Represents the increase, decrease, or stationarity of the time series
 +
*Seasonal: Represents the variation of the time series by seasons (usually, months)
 +
*Randomness: The remaining unexplained component of the time series after removing trend and seasonality
 +
The time series is decomposed as follows:<br>
 +
<br>
 +
[[File:Picture1.png|650px|center|Formula]]
 +
<br>
 +
<b>Types of Decomposition Models:</b><br>
 +
*Additive Model: This model is used when the magnitude of the seasonal fluctuation does not vary with the level of the series. In the example shown below for the sale of general merchandise in the US, the magnitude of variation remain the same over the years. Hence, it is an additive model.
 +
For additive models, the time series is a sum of its components.
 +
[[File:Picture2.png|400px|center|Chart1]]
 +
<br>
 +
*Multiplicative Model: This model is used when the magnitude of the seasonal fluctuation varies with the level of the series. In the example shown below for the Number if DVDs sold in the US, the magnitude of variation varies over the years. Hence, it is an multiplicative model.
 +
For multiplicative models, the time series is a product of its components.
 +
[[File:Picture3.png|400px|center|Chart2]]
 +
<br>
 +
<b>Forecasts from Decomposition – Naïve Method:</b>
 +
Decomposition not only helps to understand the composition of the time series but also helps to project the time series into the future and forecast. For our forecast, we used the combination naïve method and seasonal naïve method.
 +
*Forecast of the De-Seasonalised Component: The Naïve method first conducts forecasting for the de-Seasonalised component. De-Seasonalised Component refers the remaining time series after removing the seasonal component. The forecast can be done either by random walk drift method or Holt method.<br>
 +
[[File:Picture4.png|400px|center|Deseasonalized Component]]<br>
 +
*Forecast of the Seasonal Component: The Seasonal Naïve method does forecasting for the Seasonal Component based on the past values of the seasonal data. It assumes that the seasonality is constant or is changing very slowly.
 +
[[File:Picture5.png|400px|center|Seasonalized Component]]<br>
  
 +
====Analysis====
 +
Our analysis involved exploration and forecasting of the time series by type of review and by four attributes related to reviews, ratings, and sentiments.<br>
 +
*Types of Reviews
 +
*All
 +
*Cool Reviews
 +
*Funny Reviews
 +
*Useful Reviews
 +
The Cool, Funny, and Useful Reviews were calculated using a 25% rule. As users vote for reviews based on the three criteria, we assumed that if the number of votes for a type (let say, cool) is at least 25% of the total votes received (cool votes + funny votes + useful votes), it is considered as a Cool Review. This is true for Funny and Useful Reviews.<br>
 +
<b>Attributes</b><br>
 +
*Total Review Count: Sum of the reviews by month and year
 +
*Proportion of Total Reviews: Proportion of the reviews containing the input word (topic) by month and year
 +
*Average Stars: Mean of rating by month and year
 +
*Average Sentiments: Mean of Sentiments by month and year
 +
<b>Assumptions:</b><br>
 +
We have made two major assumptions in pour analysis of the time series. <br>
 +
1) The analysis assumes that all the time series are Additive. This is because, in our initial exploration of the time series, most of the time series were additive.<br>
 +
[[File:Picture6.png|500px|center|Chart3]]<br>
 +
2) The analysis assumes that the time series is Non-Stationary. Non-Stationary time series refers to the time series which contains perceptible trends and seasonality over time. A time series is Stationary when there is no such perceptible trends or seasonality. <br>
 +
[[File:Picture7.png|400px|center|Chart4]]<br>
 +
[[File:Picture8.png|400px|center|Chart5]]<br>
 +
<b>R Functions:</b><br>
 +
*stl(){stats}: Seasonal Decomposition of Time Series by Loess
 +
*forecast{forecast} ; method = ‘naïve’
 +
**R Shiny: We chose R Shiny over Tableau because
 +
**Tableau has limited functionalities for time series
 +
**There is no control over forecast method in Tableau.
  
==== Ambiguity ====
 
  
 +
</div>
 +
<div align="left">
  
==== Sarcasm ====
+
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Limitations and Assumptions</font></div>==
 +
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
  
=== Project Overall ===
+
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px"> In doing our analysis, we have overall concluded below some of the major limitations we can foresee (and are experiencing) from this project:
  
 
{| class="wikitable" style="margin-left: 10px; font-family:Helvetica Neue; font-size: 15px"
 
{| class="wikitable" style="margin-left: 10px; font-family:Helvetica Neue; font-size: 15px"
|-! style="background: #0084b4; color: white; text-align: center;" colspan= "2"
+
|-! style="background: #efefef; color: black; text-align: center;" colspan= "2"
 
| width="50%" | '''Limitations'''
 
| width="50%" | '''Limitations'''
 
| width="50%" | '''Assumptions'''
 
| width="50%" | '''Assumptions'''
 
|-
 
|-
| Insufficient predicted information on the users (location, age etc.)
+
| Limited data points on businesses and cities
| Data given by LARC is sufficiently accurate for the user
+
| Project methodology will be scalable for looking at regional trends
 
|-
 
|-
| Fake Twitter users
+
| Limited action-ability of insights since companies may not care about Yelp ratings
| LARC will determine whether or not the users are real or not
+
| Project findings will help set priorities for improvement for business owners
 
|-
 
|-
| Ambiguity of the emotions
+
| Businesses attribute may not be completely accurate
| Emotions given by the dictionary (as instructed by LARC) is conclusive for the Tweets that is provided
+
| Assuming that data has been updated as accurately as possible
 
|-
 
|-
| Dictionary words limited to the ones instructed by LARC
+
| Defining business categories
| A comprehensive study has been done to come up with the dictionary
+
| Assuming business tags under categories are comprehensive for the competitive set
 
|}
 
|}
 +
</div>
 
</div>
 
</div>
  
  
 
<div align="left">
 
<div align="left">
==<div style="background: #c0deed; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #0084b4 solid 32px;"><font color="black">ROI Analysis</font></div>==
+
 
 +
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Deliverables</font></div>==
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
 +
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">
 +
* Project Proposal
 +
* Mid-term presentation
 +
* Mid-term report
 +
* Final presentation
 +
* Final report
 +
* Project poster
 +
* Project Wiki
 +
* Visualization tool on Tableau:
 +
[https://public.tableau.com/profile/publish/finalmasterv/RegionalAnalysis#!/publish-confirm Viz 1]
 +
[https://public.tableau.com/profile/publish/finalmasterv/Yourefunnycooluseful#!/publish-confirm Viz 2]
 +
[https://public.tableau.com/profile/publish/finalmasterv/Dashboard4#!/publish-confirm Viz 3]
 +
</div>
 +
</div>
  
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">As part of LARC’s initiative to study the well-being of Singaporeans, this dashboard will be used as springboard to visually represent Singaporeans on the Twitter space and identify the general sentiments of twitter users based on a given time period. This may provide one of the useful information about people's subjective well-being which helps realise the visions of the smart nation initiative by the Singapore government to understand the well-being of Singaporeans. This project may be a standalone or a series of projects done by LARC.</div>
+
<div align="left">
 +
 
 +
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">Work Scope</font></div>==
 +
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px; ">
 +
<div style="text-align: justify; font-family:Helvetica Neue; font-size: 15px">
 +
Through this project we are hoping to build to an interactive dashboard as a solution to the ratings and recommendations system Dataset Challenge by Yelp. This will be in addition to the insights developed from statistical and machine learning techniques that can support decision making for businesses. Some areas of research we are looking into are:
 +
* Seasonal Trends
 +
* Spatial Lag Regression Analysis
 +
* Time Series Analysis
 +
* K-Means, K-Medoids and Gower's Method for Clustering
 +
* Explanatory Regression analysis
 +
* Sentiment Analysis
 +
</div>
 
</div>
 
</div>
  
  
<div align="left">
 
==<div style="background: #c0deed; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #0084b4 solid 32px;"><font color="black">Future Work</font></div>==
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px;  font-family:Helvetica Neue; font-size: 15px">
 
  
* Scalable larger sets of data without hindering on time and performance
 
* Able to accommodate real-time data to provide instantaneous  analytics on-the-go
 
</div>
 
  
 
<!--End of Content-->
 
<!--End of Content-->
  
 
<div align="left">
 
<div align="left">
==<div style="background: #c0deed; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #0084b4 solid 32px;"><font color="black">References</font></div>==
+
 
 +
==<div style="background: #B22222; padding: 15px; font-family:Helvetica Neue; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #800000 solid 32px;"><font color="white">References</font></div>==
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px;  font-family:Helvetica Neue; font-size: 15px">
 
<div style="border-left: #EAEAEA solid 12px; padding: 0px 30px 0px 18px;  font-family:Helvetica Neue; font-size: 15px">
 
<references />
 
<references />

Latest revision as of 14:11, 19 November 2015

Introduction and Background

We live today in what could best be described as the age of consumerism, where what the consumer increasingly looks for is information to distinguish between products. With this rising need for expert opinion and recommendations, crowd-sourced review sites have brought forth one of the most disruptive business forces of the modern age. Since Yelp launched in 2005, it has been helping customers stay away from bad decisions and steer towards good experiences via a 5-star rating scale and written text reviews. With its vast database of reviews, ratings and general information, Yelp not only makes decision making much easier for its millions of users but also makes its reviewed businesses more profitable by increasing store visits and site traffic.

The Yelp Dataset Challenge provides data on ratings for several businesses across 4 countries and 10 cities to give students an opportunity to explore and apply analytics techniques to design a model that improves the pace and efficiency of Yelp's recommendation systems. Using the dataset provided for existing businesses, we aim to identify the main attributes of a business that make it a high performer (highly rated) on Yelp. Since restaurants form a large chunk of the businesses reviewed on Yelp, we decided to build a model specifically to advise new restaurateurs on how to become their customers' favourite food destination.

With Yelp's increasing popularity in the United States, businesses are starting to care more and more about their ratings, as "an extra half star rating causes restaurants to sell out 19 percentage points more frequently". This profound effect of Yelp ratings on the success of a business makes our analysis even more crucial and relevant for new restaurant owners. Why do some businesses rank higher than others? Do customers give ratings purely based on food quality, does ambience triumph over service, or do the geographic locations of businesses affect the rating patterns of customers? Through our project we hope to analyse such questions and thereby be able to advise restaurant owners on what factors to look out for.


Review of Similar Work

1) Visualizing Yelp Ratings: Interactive Analysis and Comparison of Businesses:

The aim of the study is to help businesses compare their performance (Yelp ratings) with other similar businesses based on location, category, and other relevant attributes.

The visualization focuses on three main parts:
a) Distribution of ratings: A bar chart showing the frequency of each star rating (1 through 5) for a single business.
b) Number of useful votes vs. star rating: A scatter plot showing every review for a given business, with the x-position representing the number of "useful" votes received and the y-position representing the star rating given to the business.
c) Ratings over time: This chart is the same as Chart 2, but with the date of the review on the x-axis.
The final product is designed as an interactive display, allowing users to select a business of interest and indicate a radius in miles to filter the businesses for comparison. We will use this as a base and expand on some of its shortcomings in terms of usability and UI. We will further supplement this with analysis of our own using other statistical methods to help derive meaning from the dataset.

2) Your Neighbors Affect Your Ratings: On Geographical Neighborhood Influence to Rating Prediction
This study focuses on the influence of geographical location on user ratings of a business assuming that a user’s rating is determined by both the intrinsic characteristics of the business as well as the extrinsic characteristics of its geographical neighbors.
The authors use two kinds of latent factors to model a business: one for its intrinsic characteristics and the other for its extrinsic characteristics (which encodes the neighborhood influence of this business to its geographical neighbors).
The study shows that by incorporating geographical neighborhood influences, much lower prediction error is achieved than with state-of-the-art models including Biased MF, SVD++, and Social MF. The prediction error is further reduced by incorporating influences from business category and review content.
We can look to extend our analysis by looking at geographical neighbourhood as an additional factor (that is not mentioned in the dataset) to reduce the variance observed in the data and improve the predictive power of the model.


3) Spatial and Social Frictions in the City: Evidence from Yelp
This paper highlights the effect of spatial and social frictions on consumer choices within New York City. Evidence from the paper suggests that factors such as travel time and differences in demographic features tend to influence consumer choice when deciding which restaurant to go to.

“Everything is related to everything else, but near things are more related than distant things” (Tobler 1970).


Motivation

Our personal interest in the topic has motivated us to choose this as our area of research. When planning trips abroad, we explore sites like HostelWorld and TripAdvisor that make planning trips a lot faster and easier; not only is this helpful to customers planning trips but also to the businesses that have been given honest ratings. Since the team consists of students from a management university, our motivation when choosing this project was more business focused. Our perspective on recommendations was catered more towards how a business can improve its standing on Yelp, and thereby improve its turnover through more customer visits.

We believe that our topic of analysis is crucial for the following reasons:
1) It will make the redirection of customers to high quality restaurants much easier and more efficient.
2) It can encourage low quality restaurants to improve in response to insights about customer demand.
3) The rapid proliferation of users trusting online review sites and incorporating them in their everyday lives makes this an important avenue for future research.
4) Prospective restaurant openers (or restaurant chain extenders) can intelligently decide the location based on the proximity factor to other restaurants around them.

Key guiding Questions

1) What constitutes the restaurant industry on Yelp?
2) What are the salient features of these inherent groupings?
3) How important is location within all of this?
4) What are some of the trends that have emerged recently?
5) Can we predict the ratings of new restaurants?

Project Scope and Methodology

Primary requirements


Step 1: Descriptive Analysis - Analysing restaurants specifically for what differentiates high performers, low performers and hit-or-miss restaurants. For each of the 3 segments mentioned, the following analysis will be done:

  • Clustering to analyse business profiles that characterize the market. Explore various algorithms and evaluate each of the algorithms to decide which works best for the dataset.


Step 2: Identification of the key factors (feature selection) that new restaurants in each region need in order to succeed. Regression will be used to identify the most important factors, and the model will be validated so that we can assess how good it is. This will constitute the explanatory regression exercise.

Step 3: Spatial Lag regression model. This section will focus on Geospatial Analysis to examine the effect of location of a business on its rating. The goal of this will be to modify the regression model in Step 2 by adding the geospatial components as additional variables to the model. This section will explore the three spatial regression models and use the model that best fits the dataset:

  • Checking for Spatial Autocorrelation: The existence of spatial dependencies will be checked using Moran’s I (or another spatial autocorrelation index) to see if they are significant.
  • Weight Matrix Calibration: Developing the model will involve choosing the Neighbourhood Criteria and consequently developing an appropriate weight matrix to highlight the effect of the lag term in the equation.
  • Appropriate model for Spatial dependencies: The Spatial Lag Regression Model and the Spatial Error Regression Models can both be used to understand the effect of location and whether the Dependent variable has dependence, or whether the Error Term does.


Step 4: Build a visualization tool for the client for continual updates on business strategy. The focus will be on building a robust tool that helps the client recreate the same analysis on Tableau.

Secondary requirements


A. Time series analysis of whether any major trends have emerged in restaurants by region – further decipher the dos and don’ts for success
B. As an extension, we will also attempt to predict the rating for new restaurants, thereby informing existing restaurants of potential competition from new openings.

Future research


Evaluating the importance of review ratings for restaurants – Are they effective to improve ratings? Do restaurants that utilize recommended changes succeed?

Can the ratings and reviews of local experts be assimilated in feature extraction to help improve the predictability of ratings success? We realize that people are social entities and can be heavily influenced by reviews from local experts in their criticism on Yelp. Future research in this area can enrich our analysis for a business as well.


Descriptive Analysis

Exploratory Data Analysis, Data Cleaning and Manipulation

We realized that the dataset actually contained records beyond the past 10 years. Since we did not want our model to be skewed by factors that were only important in the past, we chose to narrow down the dataset by only taking companies with greater than 5 reviews in the past 2 years (from 2013 to 2015), and changed the dataset to reflect that. Given that the mean rating was a rounded average for the ratings for all years, we had to compute the recent mean rating by combining the dataset containing reviews and filtering it by recent ratings, and subsequently mapping it back to the businesses dataset to develop a more recent and precise variable in mean ratings.

We suspected that we would see variance in the ratings, and including it in our analysis would allow us to see whether highly rated restaurants receive high ratings consistently. For that purpose, we again used the user review dataset and calculated the variance in rating for each business between 2013 and 2015 according to how users rated it.
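As an illustration, the sketch below shows how the recent review count, mean rating and rating variance could be recomputed in R with dplyr; the data frame and column names (reviews, businesses, date, stars, business_id) are assumptions for illustration rather than the exact fields of our working dataset.

  library(dplyr)

  # keep only recent reviews, then summarise ratings per business
  recent <- reviews %>%
    filter(date >= as.Date("2013-01-01")) %>%
    group_by(business_id) %>%
    summarise(review_count = n(),
              recent_mean  = mean(stars),
              recent_var   = var(stars)) %>%
    filter(review_count > 5)   # drop businesses with 5 or fewer recent reviews

  # map the recomputed variables back onto the businesses dataset
  businesses <- inner_join(businesses, recent, by = "business_id")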

Review Count


Review Count as a variable was also manipulated to reflect number of reviews for a particular restaurant between 2013 and 2015, and as mentioned above, only restaurants with greater than 5 reviews were included in the dataset.

Recent Mean Rating


Recent Ratings Variance


We also ventured into basic text analytics to analyse the review text for the restaurants on our dataset. Using R, we cleaned the review data and created word-clouds of reviews for all restaurants, high performing restaurants and low performing restaurants. This was done in order to gain an overview of the high frequency words associated with these restaurant categories. We generated three wordclouds.
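A minimal sketch of how such wordclouds can be produced in R with the tm and wordcloud packages follows; review_text stands in for a character vector of review texts and is an assumed name.

  library(tm)
  library(wordcloud)

  # build and clean a corpus from the raw review text
  corpus <- VCorpus(VectorSource(review_text))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))

  # rank terms by frequency and plot the 100 most frequent
  tdm  <- TermDocumentMatrix(corpus)
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  wordcloud(names(freq), freq, max.words = 100)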
Following are the most frequently used words for ALL the restaurants:

Wordcloud3

Following are the most frequently used words for low performing restaurants, i.e. those given a rating of 2 or below by reviewers.

Wordcloud1

Following are the most frequently used words for high performing restaurants, i.e. those given a rating of 4 or above by reviewers.

Wordcloud2


Given that there was a substantial number of missing values (>50%) for some of the variables, we decided that we needed to remove these variables. Overall, we removed the 50 variables pertaining to Music, payments, hair types, BYOB, and other miscellaneous variables. Opening hour variables were computed into two new variables for Weekday opening hours and Weekend opening hours. As can be seen, many salient attributes that could contribute to how customers view the restaurant have been removed from the analysis due to bad data quality.

Since most of the fields consisted of binary data and many records still did not have values for all fields, we decided that replacing missing values was essential for clustering and regression analysis. We therefore proceeded to impute missing values with the average score for each category; as the binary variables had been converted to continuous data, we essentially took the column average and imputed the values as such.
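A sketch of this mean imputation in base R (assuming the attribute columns have already been converted to numeric form):

  # replace missing values in every numeric column with that column's mean
  impute_mean <- function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  }

  num_cols <- sapply(businesses, is.numeric)
  businesses[num_cols] <- lapply(businesses[num_cols], impute_mean)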

Restaurants were tagged under a string variable called “Categories”. This variable consisted of tags for a particular business, with fields like “Greek”, “Pizzas”, “Bars”, etc. We found that these categories might be useful in determining the level of success or failure of restaurants. Since there were 192 different categories, we grouped them into high performing and low performing ones, and created two numerical variables titled “high performing categories” and “low performing categories”. This will hopefully lend greater credibility to the analysis and provide a better explanation for the performance of restaurants.

Clustering

a) For K-means and K-Medoids Clustering, all variables must be in numeric form. Therefore, the following changes were made to the different variable types to convert them to numeric form.
b) For Mixed Clustering, no data conversions were required as the algorithm recognises all types of data. Missing values are also acceptable.
However, variables that are not meaningful for the clustering process, such as name and business id, were assigned a weight of 0 to exclude them from the analysis.

K-Means Clustering

After converting all variables into numeric form and imputing the missing values with average value, k-means clustering technique was used to cluster the businesses.

However, due to the nature of the data, k-means clustering is not the most suitable clustering algorithm. The issues with the technique are as follows:
a) As binary variables were converted into numeric form, the resulting cluster means may not be as representative.
b) Due to the presence of outliers in the data, the clustering will be skewed.

K-Medoids Clustering

After converting all variables into numeric form and imputing the missing values with average value, k-medoids clustering technique was used to cluster the businesses.

K-medoids clustering is a variation of k-means. In k-medoids clustering, the cluster centres (or “medoids”) are actual points in the dataset. The algorithm begins in a similar way to k-means by assigning random cluster centres, except that in k-medoids these centres are actual data points. A total cost is calculated by summing the following function over all non-medoid/medoid pairs:

cost(x, c) = Σ_{i=1}^{d} |x_i − c_i|

where x is any non-medoid data point, c is a medoid data point, and d is the number of attributes.

In each iteration, the medoid of each cluster is swapped with a non-medoid data point in the same cluster. If the overall cost (usually based on Manhattan distance) decreases, the swapped non-medoid is declared the new medoid of the cluster.
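In practice the swap loop does not have to be coded by hand: the pam() function in R's cluster package implements this procedure. A minimal sketch, where business_attrs is an assumed numeric attribute matrix:

  library(cluster)

  # k-medoids (PAM) with Manhattan distance, as described above
  fit <- pam(business_attrs, k = 4, metric = "manhattan")

  fit$medoids           # the actual data points chosen as cluster centres
  table(fit$clustering) # cluster sizes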

Although k-medoids protects the clustering process from skewing caused by outliers, it still has other disadvantages. The issues with the k-medoids technique are:
a) As binary variables were converted into numeric form, the resulting cluster centres may not be as representative.
b) The computational complexity is high.

Mixed Clustering

Partitioning around medoids (PAM) with Gower Dissimilarity Matrix
Our dataset is a combination of different variable types, so a more robust clustering process is needed that does not require the variables to be converted to numeric form.

Gower dissimilarity technique is able to handle mixed data types within a cluster. It identifies different variable types and uses different algorithms to define dissimilarities between data points for each variable type.

For dichotomous and categorical variables, the dissimilarity is 0 if the values for two data points are the same, and 1 otherwise. For numerical variables, the dissimilarity is calculated using the following formula:

1 − s_ijk = |x_ik − x_jk| / R_k

where s_ijk is the similarity between data points i and j in the context of variable k, and R_k is the range of values of variable k.

The daisy() function in the cluster library in R is used for the above steps.

The dissimilarity matrix generated is then used to cluster with k-medoids (PAM) as described earlier; it serves as the new cost function for the k-medoids clustering.
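A sketch of this two-step process, using zero weights to exclude the identifier fields noted earlier (the data frame and field names are illustrative assumptions):

  library(cluster)

  # zero-weight the fields that are not meaningful for clustering
  w <- rep(1, ncol(business_data))
  w[names(business_data) %in% c("name", "business_id")] <- 0

  # Gower dissimilarities on the mixed-type data, then PAM on the matrix
  gd  <- daisy(business_data, metric = "gower", weights = w)
  fit <- pam(gd, k = 4, diss = TRUE)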

We call this two-step process “Mixed Clustering”. This method has a number of advantages:
a) As the k-medoids method is used, the clustering is not affected by outliers.
b) Clustering can be done without changing the data types.
c) Missing data can also be handled by the Gower dissimilarity algorithm.

Elbow Plots: The following elbow plot was generated using R.

Elbow Plot for clustering

As there is a clear break at 4 clusters, we proceeded to carry out the clustering with 4 clusters.
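One way such an elbow plot can be generated for the PAM-based clustering is to plot the final PAM objective value (average dissimilarity to the medoids) against the number of clusters; a hedged sketch reusing the gd dissimilarity matrix from the sketch above:

  # PAM objective after the swap phase, for k = 2..10
  obj <- sapply(2:10, function(k) pam(gd, k = k, diss = TRUE)$objective["swap"])
  plot(2:10, obj, type = "b",
       xlab = "number of clusters k", ylab = "PAM objective")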

Mixed Clustering:
Published Tableau page


Sentiment Analysis

Motivation

Upon preliminary regression analysis, we found the following results:

Regression Interim Results


We felt that the Adjusted R-square could be improved. Furthermore, we did not consider the content of a review within this calculation. We therefore wanted to include some of the salient features that make up a review. To do this, we decided to undertake basic sentiment analysis.

Approach

There are various methods that can be used for Sentiment analysis. The following Wikipedia link provides a good summary for the same:
https://en.wikipedia.org/wiki/Sentiment_analysis

While Keyword spotting was initially a good enough heuristic choice, we sought to expand further from there. Among the competing methods used, we therefore chose the Lexical Affinity model. Specifically, we sought to compute a polarity variable for each review provided for the business. As before, a subset of reviews was chosen (for restaurants in Arizona and reviews between 2013 to 2015).

In order to incrementally build on this lexical prediction, we chose a simple polarity as our first step by looking at the difference of positive words and negative words. It was important to choose the right method in selecting the positive and negative words, so we utilized the library developed in widely cited papers on the lexical method. The library for positive and negative words can be found here:
File:Positive words.txt
File:Negative words.txt
As can be seen, the positive-word list contains around 2,006 words and the negative-word list around 4,783. These lists also include commonly misspelled words that past research papers have associated with opinion-based reviews. The two files above cite the source of these words as well.
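A minimal sketch of this polarity computation in R, where pos_words and neg_words stand for the two word lists above and review_text for the review texts:

  # polarity = positive-word matches minus negative-word matches
  score_review <- function(text, pos_words, neg_words) {
    words <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+"))
    sum(words %in% pos_words) - sum(words %in% neg_words)
  }

  sentiment <- sapply(review_text, score_review,
                      pos_words = pos_words, neg_words = neg_words)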
The results of the additional variable were as follows:

  • Mean Sentiment - 4.15
  • Variance in sentiment - 22.02
  • Correlation with Mean Average Rating - 0.58

Limitations

There are a number of limitations to the aforementioned approach. For instance, it does not handle negators, amplifiers and diminishers that reverse, inflate or deflate the emotion in an opinion. This means that the final solution tends to not be as robust as it could be.
Furthermore, this analysis does not account for sarcasm or emoticons used to convey emotion, which limits how well it can explain the rating.
Another limitation is that we have not included any topic extraction from the reviews. If the sentiments could be linked to components of a business, the salient features of that business could be extracted. For example, if a business is rated 2 and its service is consistently criticised, that could be a major source of analysis in explaining the variation between this business and another business rated 2.5.

Future Work

Continuing on from the limitations, future work should look to address all of the limitations mentioned above. Furthermore, it should also incorporate the variety of mixed models developed by researchers in several academic papers to predict restaurant ratings, so that more of the variance is explained.

Feature Extraction and Regression Analysis

Approach

Step 1: Stepwise Regression

Due to over 50 variables being part of our dataset, we realized that simply doing a regression may not yield the most representative results. Furthermore, the models may be over-fitted and may hence cause problems when predicting ratings for the entire dataset.

Therefore we started with stepwise regression. Stepwise regression is a semi-automated process of building a model by successively adding or removing variables based on a chosen criterion applied to their estimated coefficients. There are various techniques that set the criteria for doing so:

  • Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.
  • Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) that improves the model the most by being deleted, and repeating this process until no further improvement is possible.
  • Bidirectional elimination, a combination of the above, testing at each step for variables to be included or excluded.

We chose to use backward elimination for our model because we had a very large set of variables, and the purpose of the subset selection was to remove variables that did not significantly add to our analysis. As the selection criterion, we chose BIC.

Since we knew that this was not going to be our final model, we did not spend a lot of time delving into other stepwise techniques and picking the best one.
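For reference, backward elimination under a BIC criterion can be run with R's built-in step() function by setting the penalty k = log(n); model_data and recent_mean are illustrative names:

  full <- lm(recent_mean ~ ., data = model_data)  # all candidate variables
  n    <- nrow(model_data)

  # k = log(n) turns step()'s default AIC penalty into BIC
  best <- step(full, direction = "backward", k = log(n))
  summary(best)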

Step 2: Multiple Regression

The next step is to use Multiple Regression to test the significance of the variables which came as an output for Stepwise Regression.

When we first saw the results, we realized that some of the coefficients were not significant. This is understandable since Stepwise regression tends to be prone to overfitting. We iteratively removed the variables that were insignificant to finally arrive at the final results for each of the clusters identified.

Findings

Iteration 2: Final findings

After including the wordcount and sentiment score as two additional variables in the equation, the results changed quite significantly:

Regression Model Results

Regression coefficient results

  • Adjusted R-square increases from 54% to 65% by virtue of including wordcount and sentiment scores in the model. This is a dramatic increase and expected, since sentiment plays a big part in explaining the ratings given to a business. This is further substantiated by the high correlation (r = 0.58) between the sentiment score and Recent Mean Ratings.
  • The lower the variance in a restaurant's ratings, the better its mean rating. This basically indicates that good restaurants have consistently good reviews, owing possibly to consistency in service or to a network effect in which existing ratings shape the perception of other users when they visit the restaurant. Either way, if you're in the good books, you'll tend to stay there.
  • Groups and outdoor seating have negative coefficients, possibly suggesting that more exclusive restaurants tend to have higher ratings. Typically, fine dining or up-scale restaurants are known to have these traits, and we hence hypothesize that this may well be the case here.

Testing Robustness

In order to test the robustness of our prediction formula, we performed the regression analysis on a training set comprising 60% of the dataset, with 20% each allocated to validation and testing.
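A sketch of such a 60/20/20 split in base R, with model_data as an assumed name for the modelling dataset:

  set.seed(42)  # reproducible shuffle
  n   <- nrow(model_data)
  idx <- sample(seq_len(n))

  train <- model_data[idx[1:floor(0.6 * n)], ]
  valid <- model_data[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
  test  <- model_data[idx[(floor(0.8 * n) + 1):n], ]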

Training, Validation, Test

The results of the 3 runs are as follows:

Regression results

As we can see, the R-square is consistently around 65% across the three runs, indicating that the prediction formula is robust.

Assumptions

In analysing the results of the regression, we needed to check the assumptions that we made along the way and whether, in fact, they held true at the end of the regression. This would help us moderate our findings so that we don’t overstate the robustness of the results derived. The assumptions of a multiple linear regression are as follows:
1) Linear relationship and additivity
2) Multivariate Normality
3) No or little multicollinearity
4) No auto-correlation
5) Homoscedasticity

Spatial Lag Analysis

Approach

We believe that neighbourhood and location have a role to play in the overall star ratings of a restaurant in the Yelp dataset, which is why our group has forayed into exploring the spatial lag model for this project. Tobler's first law of geography encapsulates this situation: "everything is related to everything else, but near things are more related than distant things." In the context of our project, we suspect that the average rating of a neighbourhood affects the star rating of any restaurant within that area.

Step 1: Set Neighbourhood criteria

Deciding the neighbourhood criterion is critical for building the weights matrix. We have chosen distance as our criterion, which takes the distance between two data points as a relative measure of proximity between neighbours, so the weights matrix is populated by values in miles, kilometres or any other unit of distance. The contiguity criterion, on the other hand, divides the data points into blocks and creates binary values for the weights matrix, with 1 referring to “neighbours” sharing a common boundary (adjacency) and 0 referring to distant businesses, or “not neighbours”. A third, more complex criterion combines the first two and need only be used if neither works on its own.

Once the criterion has been decided based on the needs of the dataset, we move on to the next step.

Step 2: Create Weights Matrix

The weights matrix summarises the relationship between n spatial units. Each spatial weight Wij represents the “spatial influence” of unit j on unit i. In our case, each row and column of the square matrix corresponds to a restaurant, with the diagonal entries being zero. Once the matrix has been created, it needs to be row standardised. Row standardisation creates proportional weights in cases where businesses have an unequal number of neighbours: each cell in a row is divided by the sum of all neighbour weights (all values in that row) for that business.
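A sketch of how such a row-standardised, inverse-distance weights object can be built with the spdep package in R; the 5 km cut-off and the coordinate column names are illustrative assumptions:

  library(spdep)

  coords <- as.matrix(restaurants[, c("longitude", "latitude")])

  # neighbours within 5 km, weighted by 1/distance
  nb <- dnearneigh(coords, d1 = 0, d2 = 5, longlat = TRUE)
  d  <- nbdists(nb, coords, longlat = TRUE)
  w  <- lapply(d, function(x) 1 / x)

  # style = "W" row-standardises each restaurant's weights
  lw <- nb2listw(nb, glist = w, style = "W", zero.policy = TRUE)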

Step 3: Check Spatial Autocorrelation

The next step involves checking the need for a spatial model. When do we decide that a linear regression is not enough to predict our ratings and that our dependent variable may be spatially lagged? We use Moran's I or Geary's C to make this decision. The index of spatial autocorrelation we use is Moran's I, which involves the computation of cross-products of mean-adjusted values that are geographic neighbours (i.e., covariation). It ranges from roughly −1 (in practice often −0.5) to nearly 0 for negative, and from nearly 0 to approximately 1 for positive, spatial autocorrelation, with an expected value of −1/(n − 1) under zero spatial autocorrelation, where n denotes the number of units.

We used R (function daisy) to compute the index (= 0.9409) which turns out to be significant for our model. Thus we can conclude that there is some spatial interaction going on in the data.
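For reference, one common way to run this test in R is spdep's moran.test() against the weights object built in Step 2 (a sketch, not necessarily the exact call we used):

  # Moran's I for the star ratings, using the row-standardised weights
  moran.test(restaurants$stars, listw = lw)

  # Geary's C as an alternative index
  geary.test(restaurants$stars, listw = lw)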

Step 4: Choose the appropriate Model

Now that we know we have strong spatial autocorrelation, we must choose an appropriate model to explain it. The table below summarises the main differences between the Spatial Lag and Spatial Error Models. The Lagrange Multiplier Test is used to mathematically compute the significance of using each model. So far, we suspect that the Spatial Lag Model will be more relevant for our project.

Spatial Lag vs Spatial Error
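The Lagrange Multiplier diagnostics can be obtained in R by running spdep's lm.LMtests() on a baseline OLS fit; a hedged sketch, reusing the illustrative lw and model_data names from earlier:

  # OLS baseline, then LM diagnostics for lag vs. error dependence
  ols <- lm(stars ~ ., data = model_data)
  lm.LMtests(ols, listw = lw, test = c("LMlag", "LMerr"))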

Step 5: Build Spatial Regression Model

The final and conclusive step would be to build the spatial regression model, which incorporates spatial dependence. This is done by adding a “spatially lagged” dependent variable to the right-hand side of the regression equation. The model now looks like this:

y = ρWy + Xβ + ε, or equivalently (I − ρW)y = Xβ + ε

where
y = restaurant rating
ρ = spatial correlation parameter
W = spatial weights matrix
X = other attributes
β = regression coefficients
ε = error term
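One way to fit this lag model in R is with lagsarlm(), which now lives in the spatialreg package (it was historically part of spdep); a minimal sketch under the same assumed names:

  library(spatialreg)

  # spatial lag model: y = rho*W*y + X*beta + epsilon
  lag_fit <- lagsarlm(stars ~ ., data = model_data, listw = lw)
  summary(lag_fit)  # rho is the estimated spatial correlation parameter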

Spatial Autocorrelation

The output of the Moran's I test, using the distance criterion 1/d (inverse of distance in km) to construct the weights matrix, can be seen below. Since the value of Moran's I is close to zero (0.028), this suggests that there is almost no spatial autocorrelation in our dataset. In order to further explore our results, we changed the criteria for the distance matrix several times to check for spatial dependencies. Following are some of the criteria used:

  • inverse of distance squared
  • inverse of distance raised to the power of 6
  • contiguity matrix

Despite changing the weights criteria, Moran's index only increased marginally. Essentially, this meant that the star ratings of Yelp restaurants were spatially independent of their neighbours' ratings. To completely rule out any chance of spatial interaction, we tested for correlation in other measures such as the count of ratings, the ratio of high to low ratings, and the variance of ratings. Once again, there was no spatial dependence.

Moran's I

Conclusion:

Due to the low value of Moran's index for the different dependent variables we constructed the model for, we can conclude that our dataset has no spatially correlated data points, i.e. geographical proximity to certain neighbours does not play much of a role in the star ratings of restaurants.

In summary, the steps achieved for the Spatial Lag model are as follows:

Spatial lag steps achieved so far

Time Series and Trend Analysis

Overview

Our team delved into time series analysis to examine the structure of the Yelp reviews and identify relevant patterns in them. We also explored possible patterns in sentiments and ratings for key topics, and identified patterns in the occurrences of Cool, Funny and Useful reviews as voted by users over the years. We believe that understanding the patterns in the Yelp reviews will enable businesses to understand customer preferences over time and how those preferences change, as well as conduct basic forecasting. To simplify the above, we built an application using R Shiny which lets the user explore patterns in reviews for any topic (input by the user) and visualise the corresponding forecasts.

Understanding Patterns – Decomposition

Decomposition aims to separate the underlying patterns in the data from randomness. Generally, decomposition procedures separate the time series into three major components that influence its values:

  • Trend: Represents the increase, decrease, or stationarity of the time series
  • Seasonal: Represents the variation of the time series by seasons (usually, months)
  • Randomness: The remaining unexplained component of the time series after removing trend and seasonality

The time series is decomposed as follows:

Formula


Types of Decomposition Models:

  • Additive Model: This model is used when the magnitude of the seasonal fluctuation does not vary with the level of the series. In the example shown below for the sale of general merchandise in the US, the magnitude of variation remains the same over the years. Hence, it is an additive model.

For additive models, the time series is a sum of its components.

Chart1


  • Multiplicative Model: This model is used when the magnitude of the seasonal fluctuation varies with the level of the series. In the example shown below for the number of DVDs sold in the US, the magnitude of variation changes over the years. Hence, it is a multiplicative model.

For multiplicative models, the time series is a product of its components.

Chart2


Forecasts from Decomposition – Naïve Method: Decomposition not only helps to understand the composition of the time series but also helps to project the time series into the future and forecast. For our forecast, we used a combination of the naïve method and the seasonal naïve method.

  • Forecast of the De-Seasonalised Component: The naïve method first forecasts the de-seasonalised component, i.e. the time series that remains after removing the seasonal component. The forecast can be done either by the random walk with drift method or by Holt's method.
Deseasonalized Component

  • Forecast of the Seasonal Component: The Seasonal Naïve method does forecasting for the Seasonal Component based on the past values of the seasonal data. It assumes that the seasonality is constant or is changing very slowly.
Seasonalized Component

Analysis

Our analysis involved exploration and forecasting of the time series by type of review and by four attributes related to reviews, ratings, and sentiments.

Types of Reviews:
  • All
  • Cool Reviews
  • Funny Reviews
  • Useful Reviews

The Cool, Funny, and Useful reviews were identified using a 25% rule. As users vote for reviews based on the three criteria, we assumed that if the number of votes of a given type (say, Cool) is at least 25% of the total votes received (cool votes + funny votes + useful votes), the review is considered a Cool review. The same rule applies to Funny and Useful reviews.
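A sketch of the 25% rule in R, with assumed vote column names on the reviews dataset (a review can qualify as more than one type under this rule):

  votes <- reviews[, c("cool_votes", "funny_votes", "useful_votes")]
  total <- rowSums(votes)

  reviews$is_cool   <- total > 0 & votes$cool_votes   / total >= 0.25
  reviews$is_funny  <- total > 0 & votes$funny_votes  / total >= 0.25
  reviews$is_useful <- total > 0 & votes$useful_votes / total >= 0.25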
Attributes

  • Total Review Count: Sum of the reviews by month and year
  • Proportion of Total Reviews: Proportion of the reviews containing the input word (topic) by month and year
  • Average Stars: Mean of rating by month and year
  • Average Sentiments: Mean of Sentiments by month and year

Assumptions:
We have made two major assumptions in our analysis of the time series.
1) The analysis assumes that all the time series are Additive. This is because, in our initial exploration of the time series, most of the time series were additive.

Chart3

2) The analysis assumes that the time series is non-stationary. A non-stationary time series contains perceptible trends and seasonality over time; a time series is stationary when there are no such perceptible trends or seasonality.

Chart4

Chart5

R Functions:

  • stl() {stats}: Seasonal decomposition of time series by Loess
  • forecast() {forecast}, with method = 'naive'

Visualisation: We chose R Shiny over Tableau because
  • Tableau has limited functionality for time series, and
  • there is no control over the forecast method in Tableau.
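Putting the two functions together, a minimal decomposition-plus-forecast sketch in R (monthly_counts and the start date are illustrative assumptions):

  library(forecast)

  # monthly review counts as a time series
  y <- ts(monthly_counts, frequency = 12, start = c(2010, 1))

  # Loess decomposition; forecast() then applies the naive method to the
  # seasonally adjusted series and a seasonal naive forecast to the
  # seasonal component, and recombines the two
  fit <- stl(y, s.window = "periodic")
  fc  <- forecast(fit, method = "naive", h = 12)
  plot(fc)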


Limitations and Assumptions

In doing our analysis, we have summarised below some of the major limitations we foresee (and are experiencing) in this project, together with the assumptions we make in response:

Limitation: Limited data points on businesses and cities
Assumption: Project methodology will be scalable for looking at regional trends

Limitation: Limited actionability of insights, since companies may not care about Yelp ratings
Assumption: Project findings will help set priorities for improvement for business owners

Limitation: Business attributes may not be completely accurate
Assumption: Data has been updated as accurately as possible

Limitation: Defining business categories
Assumption: Business tags under categories are comprehensive for the competitive set


Deliverables

  • Project Proposal
  • Mid-term presentation
  • Mid-term report
  • Final presentation
  • Final report
  • Project poster
  • Project Wiki
  • Visualization tool on Tableau:

Viz 1 Viz 2 Viz 3

Work Scope

Through this project we are hoping to build an interactive dashboard as a solution to the ratings and recommendations Dataset Challenge by Yelp. This will be in addition to the insights developed from statistical and machine learning techniques that can support decision making for businesses. Some areas of research we are looking into are:

  • Seasonal Trends
  • Spatial Lag Regression Analysis
  • Time Series Analysis
  • K-Means, K-Medoids and Gower's Method for Clustering
  • Explanatory Regression analysis
  • Sentiment Analysis



References