Social Media & Public Opinion - Final
Change in project scope
Having consulted with our professor, we have decided to shift our focus away from developing a dashboard and instead delve into text analysis of social media data, specifically Twitter data. Social media has changed the way consumers provide feedback on the products they consume. Much social media data can be mined, analysed and turned into value propositions for changing the ways companies brand themselves. Although anyone can easily obtain such data, certain challenges can hamper the effectiveness of the analysis. Through this project, we examine some of these challenges and the ways in which we can overcome them.
Methodology: Text analytics using RapidMiner
Setting up RapidMiner for text analysis
Download RapidMiner from here
Defining a standard
Before we can create a model for classifying tweets based on their polarity, we first have to define a standard for the classifier to learn from. To obtain this standard, we manually tagged a random sample of 1000 tweets with one of 3 categories: Positive (P), Negative (N) and Neutral (X).
With the tweets and their respective classifications, we were ready to create a model for machine learning of tweet sentiment.
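The sampling and tagging step can be illustrated outside RapidMiner with a minimal Python sketch. It is not the process we ran. The function names `draw_sample` and `label_distribution`, the fixed seed, and the in-memory list of tweets are all assumptions made for illustration; only the sample size of 1000 and the P/N/X tags come from the description above.

```python
import random
from collections import Counter

# The three tags used for manual labelling: Positive, Negative, Neutral.
LABELS = {"P", "N", "X"}

def draw_sample(tweets, k=1000, seed=42):
    """Return k tweets chosen uniformly at random, without replacement.

    A fixed seed (an assumption here) makes the sample reproducible,
    so every tagger works on the same set of tweets.
    """
    rng = random.Random(seed)
    return rng.sample(tweets, min(k, len(tweets)))

def label_distribution(tagged):
    """Count how often each tag occurs in (tweet, label) pairs.

    Raises ValueError if a label outside P/N/X slipped in during tagging.
    """
    for _, label in tagged:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
    return Counter(label for _, label in tagged)
```

Checking the label distribution before training is useful because a heavily skewed sample (for example, mostly Neutral tweets) can make a classifier look accurate while it rarely predicts the minority classes.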
Creating the model
Improving accuracy
One of the ways to improve the accuracy of the model is to remove words that do not appear frequently within the given set of documents. By removing these words, we can ensure that the remaining words used for classification are mentioned a significant number of times. However, the challenge is to determine the number of occurrences required before a word is taken into account for classification. It is important to note that the higher the threshold, the smaller the resulting word list will be.
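The effect of the threshold on the word list can be seen in a small Python sketch. It is a simplified illustration, not the RapidMiner process itself: it assumes whitespace tokenisation and counts the number of documents a word appears in (rather than its total occurrences), and the names `document_frequency`, `prune_vocabulary` and the sample tweets are hypothetical.

```python
from collections import Counter

def document_frequency(docs):
    """Count in how many documents each lower-cased token appears."""
    df = Counter()
    for doc in docs:
        # set() so a word repeated within one tweet is counted once
        df.update(set(doc.lower().split()))
    return df

def prune_vocabulary(docs, min_docs):
    """Keep only tokens that occur in at least min_docs documents."""
    df = document_frequency(docs)
    return sorted(token for token, n in df.items() if n >= min_docs)

# Made-up example tweets, for illustration only.
docs = [
    "great phone great battery",
    "battery died again",
    "love this phone",
    "phone battery is great",
]

for threshold in (1, 2, 3):
    print(threshold, prune_vocabulary(docs, threshold))
# 1 ['again', 'battery', 'died', 'great', 'is', 'love', 'phone', 'this']
# 2 ['battery', 'great', 'phone']
# 3 ['battery', 'phone']
```

Raising the threshold from 1 to 3 shrinks the word list from eight words to two, which is the trade-off described above: rare words are removed, but so is some potentially useful signal such as "great".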
Deliverables
- Project Proposal
- Mid-term presentation
- Mid-term report
- Final presentation
- Final report
- Project poster
- A web-based platform hosted on OpenShift.
Limitations & Assumptions
Limitations | Assumptions |
Insufficient predicted information on the users (location, age etc.) | Data given by LARC is sufficiently accurate for the user |
Fake Twitter users | LARC will determine whether the users are real |
Ambiguity of the emotions | Emotions given by the dictionary (as instructed by LARC) are conclusive for the tweets provided |
Dictionary words limited to the ones instructed by LARC | A comprehensive study has been done to come up with the dictionary |
ROI analysis
As part of LARC’s initiative to study the well-being of Singaporeans, this dashboard will be used as a springboard to visually represent Singaporeans on the Twitter space and identify the general sentiments of Twitter users over a given time period. This may provide useful information about people's subjective well-being, which helps realise the vision of the Smart Nation initiative by the Singapore government to understand the well-being of Singaporeans. This project may be a standalone project or one of a series of projects done by LARC.
Future extension
- Scale to larger sets of data without compromising time and performance
- Accommodate real-time data to provide instantaneous analytics on the go
Acknowledgement & credit
- Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE 6(12)
- Companion website: http://hedonometer.org/
- Schwartz HA, Eichstaedt J, Kern M, Dziurzynski L, Agrawal M, Park G, Lakshmikanth S, Jha S, Seligman M, Ungar L. (2013) Characterizing Geographic Variation in Well-Being Using Tweets. ICWSM, 2013
- Helliwell J, Layard R, Sachs J (2013) World Happiness Report 2013. United Nations Sustainable Development Solutions Network.
- Bollen J, Mao H, Zeng X (2010) Twitter mood predicts the stock market. Journal of Computational Science 2(1)
- Happy Planet Index