Social Media & Public Opinion - Project Overview
Contents
Introduction and Project Background
Motivation & Project Scope
High-Level Requirements
The system will include the following:
- A timeline based on the tweets provided
- The timeline will display the level of happiness as well as the volume of tweets.
- Each point on the timeline will provide additional information, such as the overall happiness score and the sentiment level for each category
- Linked graphical representations based on the timeline
- Graphs representing the aggregated user attributes (gender, age group, etc.)
- Comparison between two different user-defined time periods
- Optional toggling of tweet volume on the sentiment timeline graph
Work Scope
- Data Collection – Collect Twitter data to be analysed from LARC
- Data Preparation – Clean and transform the data into a readable CSV for upload
- Application Calculations and Filtering – Perform calculations and filters on the data in the app
- Dashboard Construction – Build the application’s dashboard and populate with data
- Dashboard Calibration – Finalize and verify the accuracy of dashboard visualizations
- Stress Testing and Refinement – Test whether software performance meets the clients' minimum requirements, and perform any optimizations needed to meet them
- Literature Study – Understand sentiment and text analysis in social media
- Software Learning – Learn how to use and integrate the various D3.js / Highcharts libraries, and the dictionary word search provided by the client
Deliverables
- Project Proposal
- Mid-term presentation
- Mid-term report
- Final presentation
- Final report
- Project poster
- A web-based platform hosted on OpenShift.
Dashboard Prototype
Methodology
Dashboard
The interactive visual model prototype should allow the user to see past tweets around significant events and draw conclusions from the results shown. To do this, we propose the following methodology. Tweet data will be provided by the user, who uploads a CSV file containing the tweets in JSON format.
First, we will display an overview of the tweets under examination. Tweets will be aggregated into intervals based on the time span covered by the uploaded file. Each tweet will have a "happiness" score tagged to it, derived from the study at Hedonometer.org (please refer to the study for how the score is computed). Of the 10,100 words that have scores, some may not be applicable to Twitter text; such words will not be used to calculate a tweet's score and will be treated as stop/neutral words in the application.
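As a minimal sketch of this scoring step (the function and variable names below are illustrative, not taken from the Hedonometer study): average the scores of the words that appear in the dictionary and skip the rest.

```javascript
// Score a single tweet, assuming `scores` maps a Hedonometer word to its
// average happiness rating (1..9). Words missing from the map are treated
// as stop/neutral words and skipped, as described above.
function tweetHappiness(text, scores) {
  var words = text.toLowerCase().split(/\s+/).filter(Boolean);
  var sum = 0, count = 0;
  words.forEach(function (w) {
    if (Object.prototype.hasOwnProperty.call(scores, w)) {
      sum += scores[w];
      count += 1;
    }
  });
  // A tweet with no scored words gets no score (null) rather than zero,
  // so it does not drag interval averages down.
  return count > 0 ? sum / count : null;
}
```

Interval averages on the timeline can then be computed the same way over the tweets falling in each bucket.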
To visualise the words mentioned in these tweets, we will use a dynamically generated word cloud. A word cloud shows users which words are commonly mentioned in the tweets: the more often a particular word is mentioned, the bigger it appears in the cloud. Stop/neutral words will be removed to ensure that only relevant words show up. One thing to note is that the source text comes from Twitter, so depending on the users, the tweets may contain localized words that are hard to filter out. The list of stop words that we will use for filtering will be based upon this list.
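The frequency counting behind the word cloud can be sketched as follows; `STOP_WORDS` here is a small placeholder standing in for the client-provided stop-word list.

```javascript
// Placeholder stop-word list; the real application would load the
// client-provided list instead.
var STOP_WORDS = ["the", "a", "an", "and", "is", "to", "of"];

// Count word frequencies across all tweets, skipping stop/neutral words.
// The resulting counts drive the font size of each word in the cloud.
function wordFrequencies(tweets) {
  var freq = {};
  tweets.forEach(function (tweet) {
    tweet.toLowerCase().split(/\W+/).forEach(function (w) {
      if (w && STOP_WORDS.indexOf(w) === -1) {
        freq[w] = (freq[w] || 0) + 1;
      }
    });
  });
  return freq;
}
```

A word-cloud layout library would then render each key sized proportionally to its count.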
Second, a list of predicted user attributes is provided by the client. Each line contains the attributes of one user in JSON format. The fields are shown below:
- id: the user's Twitter ID
- gender
- ethnicity
- religion
- age_group
- marital_status
- sleep
- emotions
- topics
These predicted user attributes will be displayed in the second segment, where the application gives users a quick glance at the demographics of the tweeting users.
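Aggregating those per-line JSON records for the demographics view might look like the sketch below (`aggregateByAttribute` is an illustrative name; the field names follow the list above).

```javascript
// Parse a client file where each line is one user's attributes in JSON,
// and count how many users fall under each value of a given attribute
// (e.g. "gender" or "age_group") for the demographics charts.
function aggregateByAttribute(fileText, attr) {
  var counts = {};
  fileText.split("\n").forEach(function (line) {
    if (!line.trim()) return; // skip blank lines
    var user = JSON.parse(line);
    var value = user[attr];
    if (value !== undefined) {
      counts[value] = (counts[value] || 0) + 1;
    }
  });
  return counts;
}
```

The resulting counts feed directly into the demographic bar or pie charts.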
Third, we will also display the scores of the words mentioned, based on their happiness level. This will allow the user to quickly identify the words contributing to the negativity or positivity of the set of tweets.
The application will be entirely browser-based; the benefits of this include:
- The client does not need to download any software to run the application
- It is clean and fast, since almost anyone who owns a computer already has a browser installed by default
- It is highly scalable: work is done on the front-end rather than on the server, which could otherwise be choked when handling too many requests
HTML5 and CSS3 will be used primarily for the display, and JavaScript for front-end manipulation of the document objects. The open source plugins that we will be using include:
- Highcharts.js – a visualisation plugin for creating charts quickly
- jQuery – a cross-platform JavaScript library designed to simplify client-side scripting of HTML
- OpenShift – a free online platform for live deployment
- Moment.js – a date-manipulation plugin
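To make the timeline requirements concrete, a minimal sketch of the Highcharts options for the sentiment timeline with an optional tweet-volume series is shown below. The field names follow Highcharts' standard options API; the titles and empty data arrays are placeholders.

```javascript
// Highcharts options sketch: a happiness line on the primary y-axis and a
// tweet-volume column series on a secondary axis, initially hidden so the
// user can toggle it via the chart legend.
var timelineOptions = {
  chart: { zoomType: "x" },
  title: { text: "Average happiness over time" },
  xAxis: { type: "datetime" },
  yAxis: [
    { title: { text: "Happiness score" } },
    { title: { text: "Tweet volume" }, opposite: true }
  ],
  series: [
    { name: "Happiness", type: "line", yAxis: 0, data: [] },
    { name: "Volume", type: "column", yAxis: 1, data: [], visible: false }
  ]
};
// In the app this would be rendered with:
//   Highcharts.chart("container", timelineOptions);
```

Toggling the hidden "Volume" series on and off covers the optional volume-overlay requirement from the list above.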
Machine Learning
Lexical Affinity
Lexical Affinity assigns arbitrary words a probabilistic affinity for a particular topic or emotion. For example, ‘accident’ might be assigned a 75% probability of indicating a negative event, as in ‘car accident’ or ‘hurt in an accident’. There are a few lexical affinity types that share high co-occurrence frequency of their constituents [1]:
- grammatical constructs (e.g. “due to”)
- semantic relations (e.g. “nurse” and “doctor”)
- compounds (e.g. “New York”)
- idioms and metaphors (e.g. "dead serious")
To do this, we first determine the support and confidence thresholds that we are willing to accept before associating words with one another. As a rule of thumb, we will use 75%.
The support of a bigram (a pair of words) is defined as the proportion of all tweets that contain both words; essentially, it checks whether the pair occurs often enough for the pairing to be considered significant. The confidence is defined as the number of tweets containing both words divided by the number of tweets containing the first word. Each tweet may contain more than one pairing. For example, "It's a pleasant and wonderful experience" yields three pairings: [pleasant, wonderful], [pleasant, experience] and [wonderful, experience]. Once we have determined the support and confidence of each pairing, we can generate a new dictionary containing these pairings to be run on new data.
Testing our new dictionary
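The support and confidence defined above can be computed for a candidate pair as in this sketch (`pairStats` is an illustrative name, not part of any library).

```javascript
// Compute support and confidence for a candidate word pair, following the
// definitions above:
//   support    = (tweets containing both words) / (all tweets)
//   confidence = (tweets containing both words) / (tweets containing w1)
function pairStats(tweets, w1, w2) {
  var both = 0, first = 0;
  tweets.forEach(function (tweet) {
    var words = tweet.toLowerCase().split(/\W+/);
    var hasW1 = words.indexOf(w1) !== -1;
    var hasW2 = words.indexOf(w2) !== -1;
    if (hasW1) first += 1;
    if (hasW1 && hasW2) both += 1;
  });
  return {
    support: both / tweets.length,
    confidence: first > 0 ? both / first : 0
  };
}
```

Pairs whose support and confidence clear the chosen thresholds (75% as a rule of thumb) would be admitted into the new dictionary.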
Limitations & Assumptions
What Hedonometer cannot detect
Negation handling
Abbreviations, smileys/emoticons and special symbols
Local languages & slang (Singlish)
Ambiguity
Sarcasm
Project Overall
Limitations | Assumptions
--- | ---
Insufficient predicted information on the users (location, age, etc.) | Data given by LARC is sufficiently accurate for each user
Fake Twitter users | LARC will determine whether the users are real
Ambiguity of the emotions | Emotions given by the dictionary (as instructed by LARC) are conclusive for the tweets provided
Dictionary words limited to the ones instructed by LARC | A comprehensive study has been done to come up with the dictionary
ROI analysis
Future extension
- Scale to larger sets of data without compromising time or performance
- Accommodate real-time data to provide instantaneous analytics on the go
Acknowledgement & credit
- Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE 6(12)
- Companion website: http://hedonometer.org/
- Schwartz HA, Eichstaedt J, Kern M, Dziurzynski L, Agrawal M, Park G, Lakshmikanth S, Jha S, Seligman M, Ungar L. (2013) Characterizing Geographic Variation in Well-Being Using Tweets. ICWSM, 2013
- Helliwell J, Layard R, Sachs J (2013) World Happiness Report 2013. United Nations Sustainable Development Solutions Network.
- Bollen J, Mao H, Zeng X (2010) Twitter mood predicts the stock market. Journal of Computational Science 2(1)
- Happy Planet Index
References
- About Twitter. (2014, December). Retrieved from https://about.twitter.com/company
- Kemp, S. (2015, January 21). Digital, Social & Mobile in 2015. Retrieved from http://wearesocial.sg/blog/2015/01/digital-social-mobile-2015/
- Clifton, J. (2012, November 21). Singapore Ranks as Least Emotional Country in the World. Retrieved from http://www.gallup.com/poll/158882/singapore-ranks-least-emotional-country-world.aspx
- Clifton, J. (2012, December 19). Latin Americans Most Positive in the World, Singaporeans are the least positive worldwide. Retrieved February 9, 2015, from http://www.gallup.com/poll/159254/latin-americans-positive-world.aspx