Team Accuro Project Overview


Work Scope

  • Data Collection – collect Twitter data to be analysed from LARC
  • Data Preparation – clean and transform the data into a readable CSV for upload
  • Application Calculations and Filtering – perform calculations and filters on the data in the app
  • Dashboard Construction – build the application's dashboard and populate it with data
  • Dashboard Calibration – finalize and verify the accuracy of dashboard visualizations
  • Stress Testing and Refinement – test whether the software's performance meets the clients' minimum requirements and perform any optimizations needed to meet them
  • Literature Study – understand sentiment and text analysis in social media
  • Software Learning – learn how to use and integrate various D3.js / Highcharts libraries, and the dictionary word search provided by the client
 
  
  


 


Introduction and Background

We live today in what could best be described as the age of consumerism, where consumers increasingly look for information to distinguish between products. With this rising need for expert opinion and recommendations, crowd-sourced review sites have become one of the most disruptive business forces of the modern age. Since Yelp was launched in 2005, it has been helping customers steer away from bad decisions and towards good experiences via a 5-star rating scale and written text reviews. With its vast database of reviews, ratings and general information, Yelp not only makes decision making much easier for its millions of users but also makes its reviewed businesses more profitable by increasing store visits and site traffic.

The Yelp Dataset Challenge provides data on ratings for several businesses across 4 countries and 10 cities to give students an opportunity to explore and apply analytics techniques to design a model that improves the pace and efficiency of Yelp's recommendation systems. Using the dataset provided for existing businesses, we aim to identify the main attributes of a business that make it a high performer (highly rated) on Yelp. Since restaurants form a large chunk of the businesses reviewed on Yelp, we decided to build a model specifically to advise new restaurateurs on how to become their customers' favourite food destination.

With Yelp's increasing popularity in the United States, businesses are starting to care more and more about their ratings, as "an extra half star rating causes restaurants to sell out 19 percentage points more frequently". This profound effect of Yelp ratings on the success of a business makes our analysis all the more crucial and relevant for new restaurant owners. Why do some businesses rank higher than others? Do customers give ratings purely based on food quality, does ambience trump service, and do the geographic locations of businesses affect the rating patterns of customers? Through our project we hope to analyse such questions and thereby be able to advise restaurant owners on what factors to look out for.


Review of Similar Work

1) Visualizing Yelp Ratings: Interactive Analysis and Comparison of Businesses:

The aim of the study is to help businesses compare performance (Yelp ratings) with other similar businesses based on location, category, and other relevant attributes.

The visualization focuses on three main parts:
a) Distribution of ratings: a bar chart showing the frequency of each star rating (1 through 5) for a single business.
b) Number of useful votes vs. star rating: a scatter plot showing every review for a given business, with the x-position representing the "useful" votes received and the y-position representing the star rating given.
c) Ratings over time: the same as chart (b), but with the date of the review on the x-axis.
The final product is designed as an interactive display, allowing users to select a business of interest and indicate a radius in miles to filter the businesses for comparison. We will use this as a base and expand on some of its shortcomings in terms of usability and UI. We will further supplement it with analysis of our own using other statistical methods to help derive meaning from the dataset.

2) Your Neighbors Affect Your Ratings: On Geographical Neighborhood Influence to Rating Prediction
This study focuses on the influence of geographical location on user ratings of a business, assuming that a user's rating is determined by both the intrinsic characteristics of the business and the extrinsic characteristics of its geographical neighbors.
The authors use two kinds of latent factors to model a business: one for its intrinsic characteristics and the other for its extrinsic characteristics (which encodes the neighborhood influence of this business to its geographical neighbors).
The study shows that by incorporating geographical neighbourhood influences, much lower prediction error is achieved than with state-of-the-art models such as Biased MF, SVD++, and Social MF. The prediction error is further reduced by incorporating influences from business category and review content.

We can extend our analysis by treating geographical neighbourhood as an additional factor (one not captured in the dataset) to reduce the variance observed in the data and improve the predictive power of the model.


Motivation

Our personal interest in the topic has motivated us to choose this as our area of research. When planning trips abroad, we explore sites like HostelWorld and TripAdvisor that make planning trips a lot faster and easier; this is helpful not only to customers planning trips but also to the businesses that have been given honest ratings. Since the team consists of students from a management university, our motivation when choosing this project was more business focused: our perspective on recommendations was geared towards how a business can improve its standing on Yelp, and thereby improve its turnover through more customer visits.

We believe that our topic of analysis is crucial for the following reasons:
1) It will make the redirection of customers to high quality restaurants much easier and more efficient.
2) It can encourage low quality restaurants to improve in response to insights about customer demand.

3) The rapid proliferation of users trusting online review sites and incorporating them in their everyday lives makes this an important avenue for future research.


Project Scope and Methodology

  • Primary requirements (for “restaurants” and one city only):


Step 1: Descriptive analysis – analysing restaurants specifically for what differentiates high performers, low performers and hit-or-miss restaurants. The analysis will be further segmented by, for example, region, review count and operating hours. For each of the 3 segments mentioned, the following analysis will be done:
A. Clustering to analyse business profiles that characterize the market. Explore various algorithms and evaluate each to decide which works best for the dataset (a clustering sketch follows this list).
B. Time series analysis of whether any major trends have emerged in restaurants by region – further decipher the dos and don'ts for success.
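
As a first illustration of point A above, here is a minimal k-means sketch in JavaScript (used here simply because it is the language the rest of the project is built on). Everything in it is assumed for illustration: the feature profiles, the choice of k, the naive initialisation and the fixed iteration count. In practice features would be normalised and several algorithms compared, as stated above.

  // Minimal k-means sketch. Each business is reduced to a hypothetical numeric
  // profile, e.g. [average star rating, log(review count), hours open per week].
  function kMeans(points, k, iters) {
    // simple initialisation: take the first k points as centroids
    let centroids = points.slice(0, k).map(p => p.slice());
    let labels = new Array(points.length).fill(0);
    for (let it = 0; it < iters; it++) {
      // assignment step: attach each point to its nearest centroid
      labels = points.map(p => {
        let best = 0, bestDist = Infinity;
        centroids.forEach((c, j) => {
          const d = p.reduce((s, v, i) => s + (v - c[i]) ** 2, 0);
          if (d < bestDist) { bestDist = d; best = j; }
        });
        return best;
      });
      // update step: move each centroid to the mean of its members
      centroids = centroids.map((c, j) => {
        const members = points.filter((_, i) => labels[i] === j);
        if (members.length === 0) return c; // keep empty clusters in place
        return c.map((_, dim) => members.reduce((s, m) => s + m[dim], 0) / members.length);
      });
    }
    return { centroids, labels };
  }

  // Toy profiles: [average rating, log(review count), hours open per week]
  const profiles = [[4.5, 5.2, 60], [1.5, 2.1, 40], [4.0, 4.8, 70], [2.0, 2.5, 45]];
  console.log(kMeans(profiles, 2, 10).labels); // e.g. [0, 1, 0, 1]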

Step 2: Key factor identification for prescriptive analysis (feature extraction) for new restaurants, by region, in order to succeed. Regression will be used to identify the most important factors, and the model will be validated so that we can assess how good it is (a regression sketch follows below).
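
The sketch below shows one shape such a regression could take, again with invented data: a linear model fitted by gradient descent, where coefficient magnitudes on standardised features give a rough ranking of factor importance. The feature choices, learning rate and epoch count are assumptions for illustration, not project decisions.

  // Minimal linear-regression sketch fitted by batch gradient descent.
  function fitLinear(X, y, lr = 0.01, epochs = 5000) {
    const n = X.length, d = X[0].length;
    let w = new Array(d).fill(0), b = 0;
    for (let e = 0; e < epochs; e++) {
      const gw = new Array(d).fill(0);
      let gb = 0;
      for (let i = 0; i < n; i++) {
        const pred = X[i].reduce((s, v, j) => s + v * w[j], b);
        const err = pred - y[i]; // gradient of the squared error
        X[i].forEach((v, j) => { gw[j] += err * v; });
        gb += err;
      }
      w = w.map((wj, j) => wj - lr * gw[j] / n);
      b -= lr * gb / n;
    }
    return { w, b };
  }

  // Hypothetical features: [price level, scaled review count] vs. star rating.
  const X = [[1, 0.2], [3, 0.9], [2, 0.5], [3, 0.7]];
  const y = [3.0, 4.5, 3.8, 4.2];
  console.log(fitLinear(X, y).w); // larger |weight| = more influential factor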

Step 3: For each segment (i.e. high performers, low performers and hit-or-miss restaurants), our analysis will include the following:
o Regression to predict the rating for new restaurants by region, through analysis of success factors over time. For example, restaurants that started 2 years ago and achieved high ratings a year later will be used to test against restaurants that started a year ago and have high ratings now, to study the patterns that determine a successful business.

Step 4: Build a visualization tool for the client for continual updates on business strategy. The focus will be on building a robust tool that helps the client recreate the same analysis in Tableau.

  • Secondary requirements:


Expand and recreate the analysis for all other cities. The analysis will also be recreated to include other kinds of businesses, e.g. bars, salons, etc. For some businesses, new methods of analysis such as latent factorization will be employed (especially for those with minimal information on attributes).

  • Future research:


Evaluating the importance of review ratings for restaurants – are they effective in improving ratings? Do restaurants that adopt recommended changes succeed?

Can the ratings and reviews of local experts be incorporated into feature extraction to help improve the prediction of rating success? We realize that people are social entities and can be heavily influenced by the criticism of local experts on Yelp. Future research in this area could enrich our analysis for a business as well.


Limitations and Assumptions

In doing our analysis, we have summarised below some of the major limitations we foresee for this project:


Deliverables

  • Project Proposal
  • Mid-term presentation
  • Mid-term report
  • Final presentation
  • Final report
  • Project poster
  • Project Wiki
  • Visualization tool on Tableau


Introduction and Project Background

Overview

Demographics

Word Association


Methodology

Dashboard

The key aim of this project is to allow the user to explore and analyse the happiness level of the targeted subjects based on a given set of tweets. A tweet is a string of text of up to 140 characters and may contain shortened URLs, tags (@xyz) or trending topics (#xyz).

The interactive visual model prototype should allow the user to see past tweets around certain significant events and derive conclusions from the results shown. To achieve this, we propose the following methodology. Tweet data will be provided to us by the user via the upload of a CSV file containing the tweets in JSON format.

First, we will display an overview of the tweets under examination. Tweets will be aggregated into intervals based upon the span of the tweets' duration as given in the uploaded file. Each tweet will have a 'happiness' score tagged to it. The "happiness" score is derived from the study at Hedonometer.org. Of the 10,100 words that have a score tagged to them, some may not be applicable to words on Twitter (please refer to the study to find out how the score is derived). Words that are not applicable will not be used to calculate the score of the tweet and will be treated as stop/neutral words in the application.
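
A minimal sketch of this per-tweet scoring, in the JavaScript the application is built on. The happinessDict object and its toy values are placeholders standing in for the Hedonometer data, not the actual word list:

  // Score one tweet as the mean Hedonometer score of its dictionary words.
  function tweetHappiness(tweet, happinessDict) {
    const words = tweet.toLowerCase().split(/\s+/);
    const scored = words.filter(w => w in happinessDict); // unknown words act as stop/neutral words
    if (scored.length === 0) return null;                 // no usable words in this tweet
    const total = scored.reduce((sum, w) => sum + happinessDict[w], 0);
    return total / scored.length;                         // mean happiness on the study's 1-9 scale
  }

  const dict = { happy: 8.3, sad: 2.1, coffee: 6.1 };     // toy values for illustration
  console.log(tweetHappiness("Feeling happy over coffee", dict)); // 7.2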

To visualise the words mentioned in these tweets, we will use a dynamically generated word cloud. A word cloud is useful in showing users which words are commonly mentioned in the tweets: the more a particular word is mentioned, the bigger it appears in the word cloud. Stop/neutral words will be removed to ensure that only relevant words show up in the tag cloud. One thing to note is that the source of the text is Twitter, which means that, depending on the users, the tweets may contain localized words that are hard to filter out. The list of stop words that we will use for filtering is based upon this list.
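
The counting behind the word cloud could look like the following sketch; the stopWords set here is only an illustrative stand-in for the actual stop-word list referenced above:

  const stopWords = new Set(["the", "a", "is", "to", "and"]); // illustrative subset
  function wordFrequencies(tweets) {
    const freq = {};
    tweets.forEach(tweet => {
      tweet.toLowerCase().split(/\s+/).forEach(w => {
        if (w.length > 0 && !stopWords.has(w)) freq[w] = (freq[w] || 0) + 1;
      });
    });
    return freq; // the word-cloud renderer sizes each word by its count
  }

  console.log(wordFrequencies(["the food is great", "great coffee"]));
  // { food: 1, great: 2, coffee: 1 }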

Secondly, there is a list of predicted user attributes provided by the client. Each line contains the attributes of one user in JSON format. The fields are shown below:

  • id: refers to twitter id
  • gender
  • ethnicity
  • religion
  • age_group
  • marital_status
  • sleep
  • emotions
  • topics

These predicted user attributes will be displayed in the second segment, where the application allows users to have a quick glance at the demographics of the users.
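
For illustration, here is a hypothetical line of that file and how the application might read it; only the field names come from the list above, every value is invented:

  const line = '{"id": "12345", "gender": "female", "ethnicity": "chinese", ' +
               '"religion": "none", "age_group": "21-30", "marital_status": "single", ' +
               '"sleep": "regular", "emotions": ["joy"], "topics": ["food"]}';
  const user = JSON.parse(line); // one user per line, each parsed independently
  console.log(user.age_group);   // "21-30"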

Third, we will also display the scores of the words mentioned, based upon their happiness level. This will allow the user to quickly identify the words contributing to the negativity or positivity of the set of tweets.

The application will be entirely browser-based, and some of the benefits of this include:

  • The client does not need to download any software to run the application
  • It is clean and fast, as most people who own a computer already have a browser installed by default
  • It is highly scalable: work is done on the front end rather than on a server, which could otherwise choke when handling too many requests

HTML5 and CSS3 will be used primarily for the display. JavaScript will be used for front-end manipulation of the document objects. Some of the open-source plugins that we will be using include (a minimal example follows the list):

  • Highcharts.js – a visualisation plugin to create charts quickly
  • jQuery – a cross-platform JavaScript library designed to simplify the client-side scripting of HTML
  • OpenShift – a free online platform for live deployment
  • Moment.js – a date-manipulation plugin
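
As a taste of how these fit together, a minimal Highcharts call plotting average happiness per interval; the container id, interval labels and values are all placeholders, not the dashboard's actual configuration:

  new Highcharts.Chart({
    chart: { renderTo: 'happiness-chart', type: 'line' },  // <div id="happiness-chart"> on the page
    title: { text: 'Average happiness per interval' },
    xAxis: { categories: ['Week 1', 'Week 2', 'Week 3'] }, // interval labels, e.g. from Moment.js buckets
    yAxis: { title: { text: 'Happiness (1-9)' } },
    series: [{ name: 'Happiness', data: [5.8, 6.1, 5.4] }] // toy values
  });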

Machine Learning

Given the limitations of the happiness index score by Hedonometer, we are attempting to use these sample tweets to learn and generate a more robust lexicon/dictionary. Machine learning is a scientific discipline that explores the construction and study of algorithms that can learn from data; such algorithms operate by building a model from example inputs and using it to make predictions or decisions, rather than following a strictly static program. This dictionary will be built on top of the research done by Hedonometer, using their dictionary as a starting point. To calculate the score of a particular tweet, the words that appear both in the tweet and in the Hedonometer dictionary are used to calculate the overall happiness score of the entire tweet. For a tweet to be considered positive, its overall score has to be more than 5 (the centre score of the happiness index) multiplied by the number of its words found in the dictionary; anything less than that amount is considered negative. Based on a given set of sample tweets, we track the number of times a particular word appears in a "positive" tweet and the number of times it appears in a "negative" tweet. The percentage of its appearances that are positive measures how positive the word is relative to other words. On top of that, words that were previously not documented will also be included and their scores counted as well.
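
A sketch of this counting pass, under the same caveats as before (happinessDict is a placeholder for the Hedonometer scores, and the tokenisation is deliberately naive):

  // Learn each word's positivity as the share of its appearances in "positive" tweets.
  function learnLexicon(tweets, happinessDict) {
    const counts = {}; // word -> { pos, total }
    tweets.forEach(tweet => {
      const words = tweet.toLowerCase().split(/\s+/);
      const matched = words.filter(w => w in happinessDict);
      if (matched.length === 0) return; // no dictionary words, polarity unknown
      const total = matched.reduce((s, w) => s + happinessDict[w], 0);
      const positive = total > 5 * matched.length; // i.e. mean dictionary score above 5
      words.forEach(w => { // count every word, including previously undocumented ones
        counts[w] = counts[w] || { pos: 0, total: 0 };
        counts[w].total += 1;
        if (positive) counts[w].pos += 1;
      });
    });
    const lexicon = {};
    for (const w in counts) lexicon[w] = counts[w].pos / counts[w].total;
    return lexicon;
  }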

Lexical Affinity

Another limitation of the Hedonometer is that it only considers the score of one word at a time, which can paint a very different picture than looking at word associations. Take, for example, the tweet "I dislike New York": it may seem negative to a human observer but neutral to the machine, as the scores of "dislike" and "New" cancel each other out; likewise, in "That dog is pretty ugly", "pretty" and "ugly" cancel each other out. Thus, we need to associate words together to understand the tweet a little better.

Lexical Affinity assigns arbitrary words a probabilistic affinity for a particular topic or emotion. For example, ‘accident’ might be assigned a 75% probability of indicating a negative event, as in ‘car accident’ or ‘hurt in an accident’. There are a few lexical affinity types whose constituents share a high co-occurrence frequency [1]:

  • grammatical constructs (e.g. “due to”)
  • semantic relations (e.g. “nurse” and “doctor”)
  • compounds (e.g. “New York”)
  • idioms and metaphors (e.g. “dead serious”)

To do this, we first determine the support and confidence thresholds that we are willing to accept before associating words with one another. As a rule of thumb, we will go ahead with 75%.

The support of a bigram (2 words) is defined as the proportion of all tweets that contain both words; essentially, it checks whether the 2 words co-occur a sufficient number of times for the pairing to be considered significant. The confidence of a rule is defined as the number of tweets containing both words divided by the number of tweets containing the first of the two words. Each tweet may contain more than one pairing; for example, "It's a pleasant and wonderful experience" yields 3 pairings: [pleasant, wonderful], [pleasant, experience] and [wonderful, experience]. Once we have determined the support and confidence level of each of these pairings, we will be able to generate a new dictionary containing these pairings to be run on new data.
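
A sketch of these two measures for a single candidate pairing, with the 75% thresholds left to the caller; the tokenisation is again simplified:

  // support(a, b): share of tweets containing both words.
  // confidence(a => b): tweets containing both / tweets containing a.
  function pairStats(tweets, a, b) {
    let hasA = 0, hasBoth = 0;
    tweets.forEach(t => {
      const words = new Set(t.toLowerCase().split(/\s+/));
      if (words.has(a)) {
        hasA += 1;
        if (words.has(b)) hasBoth += 1;
      }
    });
    return {
      support: hasBoth / tweets.length,
      confidence: hasA > 0 ? hasBoth / hasA : 0
    };
  }

  // e.g. pairStats(allTweets, "new", "york"); keep the pairing only when both
  // measures clear the chosen thresholds (75% as stated above).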

Testing our New Dictionary

To determine the accuracy of the dictionary, human test subjects will be employed to judge whether the dictionary is in fact effective in determining the polarity of tweets. Each human subject will be given 2 tweets to judge, each of which has a pre-defined score after running through the new dictionary. If the subject's perception of the 2 tweets coincides with that of the dictionary, the test is counted as a positive; otherwise a negative is recorded. A random sample of 100 users will be chosen to do at least 10 comparisons each. At the end of these tests, we will calculate the number of positives over the total tests done; this proportion will determine the accuracy of our dictionary.
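
The final tally is a simple proportion; a sketch, where results is a hypothetical array of booleans recording whether each judgement agreed with the dictionary:

  function accuracy(results) {
    const positives = results.filter(Boolean).length;
    return positives / results.length; // share of agreements over all tests
  }
  // With 100 subjects doing at least 10 comparisons each, results.length >= 1000.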


Limitations & Assumptions

What Hedonometer Cannot Detect

  • Negation handling
  • Abbreviations, smileys/emoticons & special symbols
  • Local languages & slang (Singlish)
  • Ambiguity
  • Sarcasm

Project Overall

Limitation | Assumption
Insufficient predicted information on the users (location, age, etc.) | Data given by LARC is sufficiently accurate for each user
Fake Twitter users | LARC will determine whether or not the users are real
Ambiguity of the emotions | Emotions given by the dictionary (as instructed by LARC) are conclusive for the tweets provided
Dictionary words limited to the ones instructed by LARC | A comprehensive study has been done to come up with the dictionary


ROI Analysis

As part of LARC's initiative to study the well-being of Singaporeans, this dashboard will be used as a springboard to visually represent Singaporeans in the Twitter space and identify the general sentiments of Twitter users over a given time period. This may provide useful information about people's subjective well-being, helping realise the vision of the Singapore government's Smart Nation initiative to understand the well-being of Singaporeans. This project may be a standalone effort or one of a series of projects done by LARC.


Future Work

  • Scale to larger sets of data without compromising time and performance
  • Accommodate real-time data to provide instantaneous analytics on the go


References