Social Media & Public Opinion - Final
Contents
Change in project scope
Having consulted our professor, we have decided to shift our focus away from developing a dashboard and to delve deeper into text analysis of social media data, specifically Twitter data. Social media has changed the way consumers provide feedback on the products they consume. Much of this data can be mined, analysed and turned into value propositions that change how companies brand themselves. Although anyone can easily obtain such data, there are challenges that can hamper the effectiveness of the analysis. Through this project, we explore what some of these challenges are and ways in which we can overcome them.
Methodology: Text analytics using RapidMiner
Download RapidMiner here
Setting up RapidMiner for text analysis
To carry out text processing in RapidMiner, we need to download the required plugin from RapidMiner's plugin repository. Click Help > Manage Extensions and search for the text processing module. Once the plugin is installed, it appears in the "Operators" window.
Data Preparation
In RapidMiner, there are a few ways to read a file or data from a database. In our case, we read the tweets provided by the LARC team, which were given in JSON format. RapidMiner can read JSON strings, but it cannot read nested arrays within them. Due to this restriction, we need to extract the text from each JSON string before RapidMiner can perform the text analysis. We did this by converting each JSON string into a JavaScript object, extracting only the id and text of each tweet, and writing them to a comma-separated (.csv) file to be processed later in RapidMiner.
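The write-up above describes converting each JSON string into a JavaScript object; an equivalent sketch in Python shows the same id-and-text extraction (the sample tweets and the output filename are hypothetical, not from the LARC data):

```python
import csv
import json

# Hypothetical input: one JSON-encoded tweet per line, as delivered by LARC.
raw_lines = [
    '{"id": 1, "text": "Loving the new phone!", "entities": {"hashtags": []}}',
    '{"id": 2, "text": "Worst service ever.", "entities": {"hashtags": []}}',
]

rows = []
for line in raw_lines:
    tweet = json.loads(line)                   # parse the JSON string into an object
    rows.append((tweet["id"], tweet["text"]))  # keep only the id and the text

# Write the flattened (id, text) pairs to a .csv file for RapidMiner to read,
# dropping the nested structures RapidMiner cannot handle.
with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text"])
    writer.writerows(rows)
```

Because only two flat fields survive the conversion, the nested-array limitation in RapidMiner no longer applies to the resulting file.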
Defining a Standard
Before we can create a model for classifying tweets by polarity, we first have to define a standard for the classifier to learn from. To obtain such a standard, we manually tagged a random sample of 1,000 tweets with three categories, Positive (P), Negative (N) and Neutral (X), by mutual agreement among the three of us. One of the challenges is irony, as even humans sometimes have difficulty recognising sarcasm. A University of Pittsburgh study found that human annotators agree on whether a sentence has a given sentiment only about 80% of the time.[1]
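One simple way to quantify how often three annotators fully agree is to count the tweets on which all three labels match; the labels below are illustrative, not our actual tags:

```python
# Hypothetical labels from three annotators for five tweets, using the
# categories P (positive), N (negative) and X (neutral).
labels_a = ["P", "N", "X", "P", "N"]
labels_b = ["P", "N", "X", "N", "N"]
labels_c = ["P", "N", "P", "N", "N"]

# A tweet counts as agreed only when all three annotators gave the same label.
agreed = [a for a, b, c in zip(labels_a, labels_b, labels_c) if a == b == c]
agreement_rate = len(agreed) / len(labels_a)
```

Tweets without unanimous labels would then be resolved by discussion, as we did, or dropped from the gold standard.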
With the tweets and their respective classification, we were ready to create a model for machine learning of tweets' sentiments.
Creating a Model
1. Tokenizing the tweet by word
Tokenization is the process of breaking a stream of text into words or other meaningful elements called tokens. Punctuation marks as well as other characters such as brackets and hyphens are removed.
2. Converting words to lowercase
All words are transformed to lowercase, since the same word would otherwise be counted separately in its uppercase and lowercase forms.
3. Eliminating stopwords
The most common words, such as prepositions, articles and pronouns, are eliminated; this improves system performance and reduces the amount of text data.
4. Filtering tokens shorter than 3 characters
Tokens are filtered based on their length (i.e. the number of characters they contain). We set the minimum number of characters to 3.
5. Stemming using the Porter2 stemmer
Stemming is a technique for reducing words to their stem, base or root form. When words are stemmed, we keep the core characters that convey essentially the same meaning.
Porter stemmer vs. Snowball (Porter2)[2]
Porter: without a doubt the most commonly used stemmer, and also one of the most gentle. It is among the most computationally intensive of the algorithms (granted, not by a very significant margin), and it is the oldest stemming algorithm by a large margin.
Snowball (Porter2): nearly universally regarded as an improvement over Porter, and for good reason; Porter himself admits that Snowball is better than his original algorithm. It has slightly faster computation time than Porter, with a fairly large community around it.
We use the Porter2 stemmer.
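The five preprocessing steps above can be sketched in plain Python. The tiny stopword list and the suffix-stripping `crude_stem` are simplified stand-ins for RapidMiner's stopword operator and the Porter2 algorithm, which apply far larger word lists and many ordered rules:

```python
import re

# Small illustrative stopword list; RapidMiner's English stopword operator
# uses a much larger one.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "i", "my", "and", "to", "of"}

def crude_stem(token):
    # Simplified stand-in for the Porter2 (Snowball) stemmer: strip one of a
    # few common suffixes, keeping a stem of at least 3 characters.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(tweet):
    tokens = re.findall(r"[A-Za-z]+", tweet)             # 1. tokenize, dropping punctuation
    tokens = [t.lower() for t in tokens]                 # 2. convert to lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]   # 3. eliminate stopwords
    tokens = [t for t in tokens if len(t) >= 3]          # 4. keep tokens of length >= 3
    return [crude_stem(t) for t in tokens]               # 5. stem
```

For example, `preprocess("Loving the new phones!")` drops "the", lowercases the rest, and strips suffixes; the real Porter2 stemmer would produce cleaner stems such as "love" and "phone" where this toy version leaves "lov" and "phon".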
Improving Accuracy
One way to improve the accuracy of the model is to remove words that appear infrequently in the given set of documents. By removing these words, we ensure that the words used for classification are mentioned a significant number of times. The challenge, however, is to determine how many occurrences are required before a word is taken into account for classification. Note that the higher the threshold, the smaller the resulting word list will be.
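This pruning step can be sketched as a document-frequency threshold; the tokenized documents and the `min_docs` value below are illustrative:

```python
from collections import Counter

# Hypothetical tokenized documents (e.g. preprocessed tweets).
docs = [
    ["battery", "great"],
    ["battery", "dead"],
    ["screen", "great"],
    ["battery", "great", "price"],
]

# Document frequency: the number of documents each word appears in
# (set() avoids double-counting repeats within one document).
doc_freq = Counter(word for doc in docs for word in set(doc))

# Keep only words that occur in at least `min_docs` documents; words below
# the threshold are dropped from the word list used for classification.
min_docs = 2
vocab = {w for w, n in doc_freq.items() if n >= min_docs}
```

Raising `min_docs` shrinks the vocabulary, which is exactly the threshold trade-off described above.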
Limitations & Assumptions
Limitations | Assumptions |
Insufficient predicted information on the users (location, age, etc.) | Data given by LARC is sufficiently accurate for each user |
Fake Twitter users | LARC determines whether or not the users are real |
Ambiguity of emotions | Emotions given by the dictionary (as instructed by LARC) are conclusive for the tweets provided |
Dictionary words limited to the ones instructed by LARC | A comprehensive study has been done to come up with the dictionary |
Future extension
- Scale to larger sets of data without compromising time or performance
- Accommodate real-time data to provide instantaneous analytics on the go
Acknowledgement & credit
- Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE 6(12)
- Companion website: http://hedonometer.org/
- Schwartz HA, Eichstaedt J, Kern M, Dziurzynski L, Agrawal M, Park G, Lakshmikanth S, Jha S, Seligman M, Ungar L. (2013) Characterizing Geographic Variation in Well-Being Using Tweets. ICWSM, 2013
- Helliwell J, Layard R, Sachs J (2013) World Happiness Report 2013. United Nations Sustainable Development Solutions Network.
- Bollen J, Mao H, Zeng X (2010) Twitter mood predicts the stock market. Journal of Computational Science 2(1)
- Happy Planet Index
References
- [1] Wiebe, J., Wilson, T., & Cardie, C. (2005). Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation, 165-210. Retrieved from http://people.cs.pitt.edu/~wiebe/pubs/papers/lre05.pdf
- [2] Tyranus, S. (2012, June 26). What are the major differences and benefits of Porter and Lancaster Stemming algorithms? Retrieved from http://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg