Social Media & Public Opinion - Final
Contents
- 1 Abstract
- 2 Introduction
- 3 Change in Project Scope
- 4 Related Work
- 5 Methodology
- 6 Improving Accuracy
- 7 Comparing the Performance of the 3 Classifiers
- 8 Deriving Insights from Emoticons
- 9 Deriving Word Associations
- 10 “Purified” English-only Training Data
- 11 Pitfalls of Using Conventional Text Analysis on Social Media Data
- 12 Future Work / Improving the Effectiveness of Sentiment Analysis of Social Media Data
- 13 Acknowledgements
- 14 References
Abstract
Introduction
Change in Project Scope
Although anyone can easily obtain such data, certain challenges can hamper the effectiveness of such analysis.
- Can conventional text analysis methods be applied to social media data?
- How effective are these methods?
- What unique features of social media do we need to take note of when performing text analysis on it?
Related Work
A real-time text-based hedonometer was built to measure happiness of over 63 million Twitter users over 33 months, as recorded in the paper Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter (Dodds et al., 2011)[6]. It shows how a highly robust and tunable metric can be constructed with the word list chosen solely by frequency of usage.
In the paper Twitter as a Corpus for Sentiment Analysis and Opinion Mining (Pak & Paroubek, 2010)[7], the authors show how to use Twitter as a corpus for sentiment analysis and opinion mining, perform linguistic analysis of the collected corpus and build a sentiment classifier that is able to determine positive, negative and neutral sentiments for a document. The classifier is a multinomial Naive Bayes classifier that uses N-grams and part-of-speech tags as features, chosen because it yielded better results than Support Vector Machine (SVM) and Conditional Random Field (CRF) classifiers. We will be using the Naive Bayes classifier too.
For the paper Twitter Sentiment Analysis: The Good the Bad and the OMG! (Kouloumpis et al., 2011)[8], the authors investigate the usefulness of linguistic features for detecting the sentiment of tweets. The results show that part-of-speech features may not be useful for sentiment analysis in the microblogging domain, while the microblogging features (i.e. the presence of intensifiers and positive / negative / neutral emoticons and abbreviations) were clearly the most useful.
In the paper Exploiting Emoticons in Sentiment Analysis (Hogenboom et al., 2013)[9], the authors created an emoticon sentiment lexicon in order to improve a state-of-the-art lexicon-based sentiment classification method. They demonstrated that people typically use emoticons in natural language text to express, stress, or disambiguate their sentiment in particular text segments, which makes emoticons potentially better local proxies for people’s intended overall sentiment than textual cues. We will be analysing emoticons too, to improve the accuracy of our model.
In the paper Tokenization and Filtering Process in RapidMiner (Verma et al., 2014)[10], the authors show how text mining is implemented in RapidMiner through tokenisation, stopword elimination, stemming and filtering. We will be using RapidMiner too in our methodology.
Methodology
- Information extraction
- Information retrieval
- Document Summarisation
- Sentence Parsing
Approach: Supervised Learning
Download RapidMiner here
Screenshots | Steps |
Setting up RapidMiner for Text Analysis
To carry out text processing in RapidMiner, we need to download the required plugin from RapidMiner's plugin repository.
Click on Help > Managed Extensions and search for the text processing module. Once the plugin is installed, it should appear in the "Operators" window as seen below. | |
Data Preparation
RapidMiner offers several ways to read data from a file or a database. In our case, we read tweets provided by the LARC team, which were given in JSON format. RapidMiner can read JSON strings, but it cannot read nested arrays within them. Because of this restriction, we had to extract the text from each JSON string before using RapidMiner for text analysis. We did so by converting each JSON string into a JavaScript object, extracting only the ID and text of each tweet, and writing them to a comma-separated values (.csv) file to be processed later in RapidMiner.
|
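The extraction step described above can be sketched as follows. The original work used JavaScript; this is a minimal Python equivalent, offered only as an illustration. It assumes one JSON object per line with top-level `id` and `text` keys, and the sample tweets and function names are hypothetical.

```python
import csv
import io
import json


def extract_id_and_text(json_lines):
    """Yield (id, text) pairs from an iterable of JSON strings, one tweet each."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Nested fields (e.g. "entities") are ignored; only id and text are kept.
        yield tweet["id"], tweet["text"]


def write_csv(rows, out):
    """Write (id, text) rows to a file-like object as comma-separated values."""
    writer = csv.writer(out)
    writer.writerow(["id", "text"])  # header row
    writer.writerows(rows)


# Hypothetical sample input with a nested array that is not needed downstream.
sample = [
    '{"id": 1, "text": "Great day, really!", "entities": {"hashtags": [{"text": "sg"}]}}',
    '{"id": 2, "text": "so tired :(", "entities": {"hashtags": []}}',
]

buffer = io.StringIO()
write_csv(extract_id_and_text(sample), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The `csv` module quotes any text that contains a comma, so tweets with commas still occupy a single column when read back with "," as the separator.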
Defining a Standard
Creating a Model
Screenshots | Steps |
Read CSV
We first used the "Read CSV" operator to read the text from the CSV file prepared earlier. This can be configured via the "Import Configuration Wizard" or set manually. | |
Each column is separated by a ",". | |
Nominal to Text To check the results at any point of the process, right-click on any operator and add a breakpoint.
To process the documents, we convert the data from nominal to text.
| |
Data to Documents We convert the text data into documents. In our case, each tweet is converted into a document. | |
Process Documents The "Process Documents" operator is a multi-step process that breaks down each document into single words. The frequency of each word, as well as the number of documents in which it occurs, is calculated and used when formulating the model. To begin the process, double-click on the operator.
| |
1. Tokenizing the tweet by word Tokenization is the process of breaking a stream of text up into words or other meaningful elements, called tokens, so that the words in a sentence can be explored. Punctuation marks and other characters such as brackets and hyphens are removed.
2. Converting words to lowercase All words are transformed to lowercase, as the same word would otherwise be counted separately in its uppercase and lowercase forms.
3. Eliminating stopwords The most common words, such as prepositions, articles and pronouns, are eliminated, which improves system performance and reduces the amount of text data.
4. Filtering tokens that are smaller than 3 letters in length Tokens are filtered by their length (i.e. the number of characters they contain). We set the minimum number of characters to 3.
5. Stemming using the Porter2 stemmer Stemming is a technique for reducing words to their stem, base or root form. When words are stemmed, we keep the core characters, which convey effectively the same meaning.
Porter Stemmer vs Snowball (Porter2)[12]
Porter: Without a doubt the most commonly used stemmer, and also one of the gentlest. It is one of the most computationally intensive of the algorithms (granted, not by a very significant margin). It is also the oldest stemming algorithm by a large margin.
Snowball (Porter2): Nearly universally regarded as an improvement over Porter, and for good reason. Porter himself admits that Snowball is better than his original algorithm. It has a slightly faster computation time than Porter, with a fairly large community around it.
We use the Porter2 stemmer. | |
Term Weighting
We used TF-IDF (term frequency × inverse document frequency) to weight the importance of each word. TF-IDF takes two things into account: if a term appears in many documents, each individual occurrence is probably not very important; conversely, if a term is seldom used in most documents, it is likely to be important when it does appear. | |
Set Role Return to the main process.
We add the "Set Role" operator to indicate the label of each tweet. The label is taken from a column called "Classification". | |
Cross-Validation The "X-validation" operator creates a model based on our manual classification, which can later be applied to another set of data.
To begin, double-click on the operator.
| |
Naive Bayes Classifier We carry out the X-validation using the Naive Bayes classifier, a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class (i.e. attribute) is unrelated to the presence (or absence) of any other feature.
| |
Apply Model to New Data To apply this model to a new set of data, we repeat the above steps of reading a CSV file, converting the input to text, setting the role and processing each document, before applying the model to the new set of tweets.
| |
Results From the performance output, we achieved 44.2% accuracy when the model was cross-validated with the original 1000 manually tagged tweets. To confirm this accuracy, we randomly extracted 100 tweets from the fresh set of 5000 tweets, manually tagged them, and compared the tags with the values predicted by the model. The model did in fact achieve an accuracy of 46%, close to the 44.2% accuracy obtained using the X-validation module.
|
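For readers without RapidMiner, the document-processing and term-weighting steps described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not a reproduction of RapidMiner's operators: the stopword list is a tiny illustrative one, stemming is left as a pluggable function that defaults to doing nothing (a real Porter2 stemmer would need a third-party library), and the TF-IDF formula is the plain textbook variant, which may differ from RapidMiner's normalisation. The sample tweets are made up.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not RapidMiner's list).
STOPWORDS = {"the", "and", "for", "this", "that", "with", "was", "are"}


def preprocess(text, stem=lambda word: word, min_length=3):
    """Tokenise, lowercase, drop stopwords, drop short tokens, then stem."""
    tokens = re.findall(r"[A-Za-z]+", text)             # 1. tokenise by word
    tokens = [t.lower() for t in tokens]                 # 2. lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]   # 3. remove stopwords
    tokens = [t for t in tokens if len(t) >= min_length]  # 4. length filter
    return [stem(t) for t in tokens]                     # 5. stem (identity by default)


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


tweets = [
    "The weather is GREAT today!",
    "Traffic was bad today, so bad.",
    "Great food and great company",
]
docs = [preprocess(t) for t in tweets]
weights = tf_idf(docs)
print(docs[0])
print(weights[0])
```

In the first tweet, "weather" appears in only one document while "great" appears in two, so "weather" receives the higher weight, which is the behaviour the term-weighting description aims for.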
Improving Accuracy
Pruning
We experimented with multiple values to determine the most appropriate number of words to prune, bearing in mind that we need a sizeable word list that still yields a high enough accuracy.
- Percentage pruned refers to the minimum share of documents in which a word must occur; words that occur in fewer documents are removed from the word list. For example, at 1% pruned out of a set of 1000 documents, words that appear in fewer than 10 documents are removed from the word list.
Percentage Pruned | Percentage Accuracy | Deviation | Size of resulting word list |
---|---|---|---|
0% | 39.8% | 5.24% | 3833 |
0.5% | 44.2% | 4.87% | 153 |
1% | 42.2% | 2.68% | 47 |
2% | 45.1% | 1.66% | 15 |
5% | 43.3% | 2.98% | 1 |
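The pruning rule can be illustrated with a short sketch. It assumes that "percentage pruned" is a minimum document-frequency threshold, as in the 1% example above; the function names and the toy documents are hypothetical.

```python
from collections import Counter


def min_documents(n_documents, percent_pruned):
    """Minimum number of documents a word must appear in to be kept."""
    return n_documents * percent_pruned / 100


def prune_word_list(documents, percent_pruned):
    """Keep only words whose document frequency reaches the threshold."""
    threshold = min_documents(len(documents), percent_pruned)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    return sorted(word for word, df in doc_freq.items() if df >= threshold)


# With 1000 documents, 1% pruned means a word needs at least 10 documents.
print(min_documents(1000, 1))

# Toy corpus of 10 tokenised documents.
toy_docs = (
    [["good", "day"]] * 4
    + [["bad", "traffic"]] * 2
    + [["haze"]]
    + [["good", "food"]] * 3
)
for percent in (0, 20, 30, 50):
    print(percent, prune_word_list(toy_docs, percent))
```

As in the table, raising the threshold shrinks the word list quickly, so the choice is a trade-off between vocabulary size and the reliability of each remaining word.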
Results
Types of Classifiers
Support Vector Machine
K-Nearest Neighbour
Naive Bayes
The advantage of the Naive Bayes classifier is that it only requires a small amount of training data to estimate the means and variances of the variables necessary for classification. Because independent variables are assumed, only the variances of the variables for each label need to be determined and not the entire covariance matrix.
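As a rough illustration of this point, a Gaussian Naive Bayes classifier can be fitted by storing only a prior, a mean and a variance per feature for each label. The sketch below is not RapidMiner's implementation; the function names and the toy feature vectors are made up for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9  # small floor avoids division by zero
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict(model, row):
    """Return the label with the highest log-posterior for one feature vector."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        # Independence assumption: per-feature log-likelihoods are simply added.
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)


# Toy data: two numeric features per example, labels P (positive) and N (negative).
rows = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.7], [0.0, 0.9]]
labels = ["P", "P", "N", "N"]
model = fit_gaussian_nb(rows, labels)
print(predict(model, [0.85, 0.05]))
print(predict(model, [0.05, 0.80]))
```

No covariance between features is ever estimated, which is why so few parameters, and so little training data, are needed.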
A kernel is a weighting function used in non-parametric estimation techniques. Kernels are used in kernel density estimation to estimate random variables' density functions, or in kernel regression to estimate the conditional expectation of a random variable.[15]
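A minimal sketch of kernel density estimation with a Gaussian kernel is shown below. The bandwidth value and the sample points are arbitrary choices for illustration, and this is not necessarily how RapidMiner's kernel-based Naive Bayes estimates densities.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the weighting function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kernel_density(x, samples, bandwidth):
    """Estimate the density at x as an average of kernels centred on the samples."""
    total = sum(gaussian_kernel((x - s) / bandwidth) for s in samples)
    return total / (len(samples) * bandwidth)


samples = [0.8, 1.0, 1.2, 3.0]  # made-up observations
bandwidth = 0.5                 # arbitrary smoothing parameter

near = kernel_density(1.0, samples, bandwidth)  # close to most samples
far = kernel_density(5.0, samples, bandwidth)   # far from all samples
print(near, far)

# The estimate integrates to roughly 1, as a density should.
step = 0.01
area = sum(kernel_density(-10 + i * step, samples, bandwidth) * step for i in range(2000))
print(area)
```

Each observation contributes a small bump to the estimate, so the density is high where observations cluster and low elsewhere, without assuming a particular distribution shape.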
Comparing the Performance of the 3 Classifiers
The classifiers work effectively when 0% of the processed word list is pruned. Support Vector Machine performs well when there is good separation of points on the data plane (since the functional margins of the new tokens are closer to the points obtained from the training data). However, words that occur too rarely should not be taken into account, as the weight they carry in determining a category is too small.
Two other pruning thresholds were tested, namely 0.5% and 1% of the word list pruned. At 2% pruned, the word list shrinks to a size of 15.
The different classifiers perform at similar levels of accuracy, and the Naive Bayes (kernel) method, which achieved 50% accuracy, was chosen as our model classifier.
Deriving Insights from Emoticons
Experiment
- Print the entire list of tweets that we have.
- Identify the ones that have a converted emoticon tag (e.g. "😔").
- Get the list of emoticons from an emoticon library[16] and tag each emoticon as positive (P), negative (N) or neutral (X).
- Manually tag each tweet based on its sentiment.
- Cross-check these tags against the sentiments of the emoticons present in the tweet.
- Calculate the percentage of matches between the 2 tagged values.
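The matching steps above can be sketched as follows. The three-entry lexicon, the sample tweets and their manual tags are hypothetical, and the rule for combining several emoticons in one tweet (majority of positive versus negative) is an assumption rather than the procedure used in the experiment.

```python
# Hypothetical emoticon lexicon: P = positive, N = negative, X = neutral.
EMOTICON_SENTIMENT = {"😊": "P", "😔": "N", "😐": "X"}


def emoticon_sentiment(text, lexicon):
    """Return P, N or X from the emoticons in a tweet, or None if it has none."""
    found = [lexicon[ch] for ch in text if ch in lexicon]
    if not found:
        return None
    positives = found.count("P")
    negatives = found.count("N")
    if positives > negatives:
        return "P"
    if negatives > positives:
        return "N"
    return "X"


def match_percentage(tagged_tweets, lexicon):
    """Percentage of emoticon-bearing tweets whose emoticon tag equals the manual tag."""
    compared = 0
    matched = 0
    for text, manual_tag in tagged_tweets:
        predicted = emoticon_sentiment(text, lexicon)
        if predicted is None:
            continue  # tweets without emoticons are not part of the comparison
        compared += 1
        if predicted == manual_tag:
            matched += 1
    return 100 * matched / compared if compared else 0.0


tagged_tweets = [
    ("Best day ever 😊", "P"),
    ("Missed my bus 😔", "N"),
    ("Fine, whatever 😊", "N"),  # emoticon disagrees with the manual tag
    ("No emoticon here", "X"),
]
percentage = match_percentage(tagged_tweets, EMOTICON_SENTIMENT)
print(round(percentage, 1))
```

Tweets without any emoticon are excluded from the denominator, so the figure describes agreement only where an emoticon is actually present.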
Deriving Word Associations
Screenshots | Steps |
Import Data & Process Document
| |
Frequency Pattern Growth
| |
Results
|
“Purified” English-only Training Data
To test whether a training set that contains only English words works more effectively than one containing multiple languages (Malay, Chinese, and other mainly Asian languages), we compared the accuracy of the resulting classifications to see which produced the better result. We screened 1000 tweets that contain only English words and compared them against the original training data, which was randomly picked. The Naive Bayes classifier was used, as in the earlier test cases.
The training data that contains only English words performed 13 percentage points better, with an accuracy of 63% for the set of training data. See here for the set of training data.
Pitfalls of Using Conventional Text Analysis on Social Media Data
Multiple Languages
Misspelled Words and Abbreviations
Length of Status
Other Media Types
Future Work / Improving the Effectiveness of Sentiment Analysis of Social Media Data
Increasing Size of Training Data
Leveraging on Emoticons
Allowing the User to Tag Their Feelings to Their Status
Analyse Data on an Event/Topic Basis Rather Than on Time
Acknowledgements
References
- About Twitter. (2014, December). Retrieved from https://about.twitter.com/company
- Kemp, S. (2015, January 21). Digital, Social & Mobile in 2015. Retrieved from http://wearesocial.sg/blog/2015/01/digital-social-mobile-2015/
- Yap, J. (2014, June 4). How many Twitter users are there in Singapore? Retrieved April 22, 2015, from https://vulcanpost.com/10812/many-twitter-users-singapore/
- Gaza takes Twitter by storm. (2014, August 20). Retrieved April 22, 2015, from http://www.vocfm.co.za/gaza-takes-twitter-by-storm/
- MasterMineDS. (2014, August 6). 2014 Israel – Gaza Conflict: Twitter Sentiment Analysis. Retrieved April 22, 2015, from http://www.wesaidgotravel.com/2014-israel-gaza-conflict-twitter-sentiment-analysis-mastermineds
- Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE 6(12): e26752. doi:10.1371/journal.pone.0026752
- Pak, A., & Paroubek, P. (2010). Twitter as a Corpus for Sentiment Analysis and Opinion Mining.
- Kouloumpis, E., Wilson, T., & Moore, J. (2011). Twitter Sentiment Analysis: The Good the Bad and the OMG! In International AAAI Conference on Weblogs and Social Media. Retrieved from https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2857/3251
- Hogenboom, A. and Bal, D. and Frasincar, F. and Bal, M. and de Jong, F.M.G. and Kaymak, U. (2013) Exploiting emoticons in sentiment analysis. In: Proceedings of the 28th Annual ACM Symposium on Applied Computing, SAC 2013, 18-22 Mar 2013, Lisbon, Portugal. pp. 703-710. ACM. ISBN 978-1-4503-1656-9
- Verma, T., & Renu, D. G. (2014). Tokenization and Filtering Process in RapidMiner. International Journal of Applied Information Systems (IJAIS)–ISSN, 2249-0868.
- Wiebe, J., Wilson, T., & Cardie, C. (2005). Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation, 165-210. Retrieved from http://people.cs.pitt.edu/~wiebe/pubs/papers/lre05.pdf
- Tyranus, S. (2012, June 26). What are the major differences and benefits of Porter and Lancaster Stemming algorithms? Retrieved from http://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg
- Support Vector Machine (RapidMiner Studio Core). (n.d.). Retrieved April 22, 2015, from http://docs.rapidminer.com/studio/operators/modeling/classification_and_regression/svm/support_vector_machine.html
- K-NN (RapidMiner Studio Core). (n.d.). Retrieved April 22, 2015, from http://docs.rapidminer.com/studio/operators/modeling/classification_and_regression/lazy_modeling/k_nn.html
- Naive Bayes (RapidMiner Studio Core). (n.d.). Retrieved April 22, 2015, from http://docs.rapidminer.com/studio/operators/modeling/classification_and_regression/bayesian_modeling/naive_bayes.html
- Emoticon - emotions library https://github.com/wooorm/emoji-emotion/blob/master/data/emoji-emotion.json