Difference between revisions of "Social Media & Public Opinion - Final"

Revision as of 09:42, 18 April 2015

Change in project scope

Having consulted with our professor, we have decided to shift our focus away from developing a dashboard and delve deeper into text analysis of social media data, specifically Twitter data. Social media has changed how consumers give feedback on the products they consume. Much of this data can be mined, analysed and turned into value propositions that change the way companies brand themselves. Although such data is easy for anyone to obtain, several challenges can hamper the effectiveness of the analysis. Through this project, we explore what some of these challenges are and how we can overcome them.

Methodology: Text analytics using RapidMiner

Download RapidMiner here

Setting up RapidMiner for text analysis

To carry out text processing in RapidMiner, we need to download the required plugin from RapidMiner's plugin repository. Click on Help > Managed Extensions and search for the text processing module. Once the plugin is installed, it should appear in the "Operators" window.

Data Preparation

In RapidMiner, there are a few ways to read data from a file or a database. In our case, we read tweets provided by the LARC team, which were supplied in JSON format. RapidMiner can read JSON strings, but it cannot read nested arrays within a string. Because of this restriction, we need to extract the text from each JSON string before using RapidMiner for text analysis. We did this by converting each JSON string into a JavaScript object, extracting only the id and text of each tweet, and writing them to a comma-separated values (.csv) file to be processed later in RapidMiner.
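The extraction step could be sketched as follows. We used JavaScript for this step; the sketch below uses Python instead, and it assumes one JSON object per line with top-level "id" and "text" fields. The function name and the sample tweets are made up for illustration.

```python
import csv
import io
import json


def tweets_to_csv(json_lines, out_file):
    """Write the id and text of each JSON-encoded tweet to out_file as CSV.

    Returns the number of tweets written (the header row is not counted).
    """
    writer = csv.writer(out_file)
    writer.writerow(["id", "text"])
    written = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Nested fields (arrays, sub-objects) are deliberately ignored.
        writer.writerow([tweet["id"], tweet["text"]])
        written += 1
    return written


# In-memory demonstration with made-up tweets (assumed field layout).
sample = [
    '{"id": 1, "text": "Loving the new MRT line!", "entities": {"hashtags": []}}',
    '{"id": 2, "text": "Stuck in traffic, again", "entities": {"hashtags": []}}',
]
buffer = io.StringIO()
print(tweets_to_csv(sample, buffer))
print(buffer.getvalue())
```

The csv module quotes any tweet that contains a comma, so the resulting file can still be split on "," when it is read back.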

Defining a Standard

Before we can create a model for classifying tweets based on their polarity, we first have to define a standard for the classifier to learn from. To obtain such a standard, we manually tagged a random sample of 1000 tweets with one of 3 categories: Positive (P), Negative (N) and Neutral (X), by mutual agreement among the 3 of us. One of the challenges is understanding irony, as even humans sometimes have difficulty recognising sarcasm. A University of Pittsburgh study reports that humans agree on the sentiment of a sentence only about 80% of the time.^[1]

With the tweets and their respective classification, we were ready to create a model for machine learning of tweets' sentiments.

Creating a Model

Steps

We first used the "Read CSV" operator to read the text from the CSV file prepared earlier. The operator can be configured through the "Import Configuration Wizard" or set manually.

Each column is separated by a ",".
Trim the lines to remove any white space before and after the tweet.
Check "first row as names" if the file has a header row.
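For readers who want to check the prepared file outside RapidMiner, the three settings above can be mirrored in a few lines of Python. This is only a sketch: the function name, the fallback column names and the sample rows are our own assumptions, not RapidMiner behaviour.

```python
import csv
import io


def read_tweets(csv_file, first_row_as_names=True):
    """Read comma-separated rows, trimming white space around every value."""
    reader = csv.reader(csv_file, delimiter=",")
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        return []
    if first_row_as_names:
        names, data = rows[0], rows[1:]
    else:
        # No header row: fall back to generic, made-up column names.
        names = ["column_%d" % (i + 1) for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(names, row)) for row in data]


# Made-up sample with a header row and untrimmed text.
sample = io.StringIO(
    'id,text,Classification\n'
    '1,"  Loving the weather today  ",P\n'
    '2,So tired of the haze,N\n'
)
tweets = read_tweets(sample)
print(tweets[0]["text"])
```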

To check the results at any point in the process, right-click on any operator and add a breakpoint.
To process the documents, we first convert the data from nominal to text.

We then convert the text data into documents. In our case, each tweet is converted into a document.

The "Process Documents" operator is a multi-step process that breaks each document down into single words. The frequency of each word, as well as the number of documents it occurs in, is calculated and used when formulating the model.
To begin the process, double-click on the operator.

1. Tokenizing the tweet by word

Tokenization is the process of breaking a stream of text up into words or other meaningful elements called tokens to explore words in a sentence. Punctuation marks as well as other characters like brackets, hyphens, etc are removed.

2. Converting words to lowercase

All words are transformed to lowercase as the same word would be counted differently if it was in uppercase vs. lowercase.

3. Eliminating stopwords

The most common words, such as prepositions, articles and pronouns, are eliminated. This reduces the amount of text data and helps to improve system performance.

4. Filtering tokens that are smaller than 3 letters in length

Filters tokens based on their length (i.e. the number of characters they contain). We set a minimum number of characters to be 3.

5. Stemming using Porter2’s stemmer

Stemming is a technique for the reduction of words into their stems, base or root. When words are stemmed, we are keeping the core of the characters which convey effectively the same meaning.

Porter Stemmer vs Snowball (Porter2)^[2]

Porter: Without a doubt the most commonly used stemmer, and also one of the most gentle. It is one of the most computationally intensive of the algorithms (granted, not by a very significant margin). It is also the oldest stemming algorithm by a large margin.

Snowball (Porter2): Nearly universally regarded as an improvement over Porter, and for good reason. Porter himself in fact admits that Snowball is better than his original algorithm. It has a slightly faster computation time than Porter, with a fairly large community around it.

We use the Porter2 stemmer.
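The five steps above can be sketched in plain Python to show what happens to a single tweet. This is an illustration under stated assumptions rather than what RapidMiner runs internally: the stopword list is a tiny made-up one, tokens are split on any non-letter character, and toy_stem is a naive suffix stripper that stands in for Porter2, which applies a much longer, ordered set of rules.

```python
import re

# A tiny illustrative stopword list; real stopword lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "are", "is", "of", "to", "in", "it", "this"}

# Toy suffix list for illustration only. These are NOT the Porter2 rules.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split on anything that is not a letter, dropping punctuation and digits."""
    return re.findall(r"[^\W\d_]+", text)


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def preprocess(text, stopwords=STOPWORDS, min_length=3):
    tokens = tokenize(text)                                # 1. tokenize
    tokens = [t.lower() for t in tokens]                   # 2. lowercase
    tokens = [t for t in tokens if t not in stopwords]     # 3. drop stopwords
    tokens = [t for t in tokens if len(t) >= min_length]   # 4. length filter
    return [toy_stem(t) for t in tokens]                   # 5. stem (toy version)


print(preprocess("The trains are DELAYED again, so I'm walking to work!"))
# → ['train', 'delay', 'again', 'walk', 'work']
```

After stemming, "delayed", "delays" and "delaying" all become "delay", so they count as one feature instead of three.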

Return to the main process.
We need to add the "Set Role" operator to indicate the label for each tweet. Our "Classification" column holds the manually assigned label, so we set its role to label.

The "X-validation" operator creates a model based on our manual classification which can later be used on another set of data.
To begin, double click on the operator.

We carry out an X-validation using the Naive Bayes classifier, a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. In simple terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class (i.e. attribute) is unrelated to the presence (or absence) of any other feature. The advantage of the Naive Bayes classifier is that it requires only a small amount of training data to estimate the means and variances of the variables necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each label need to be determined, not the entire covariance matrix.

Model Classifiers ^[3]

Classifier	Characteristics
Naive Bayes	"Generative" classifier; requires less training data; quick and easy
Logistic Regression	"Discriminative" classifier; requires more training data
SVM	Higher accuracy; popular in text classification problems; suited for high dimensional spaces
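To make the independence assumption concrete, here is a minimal Naive Bayes sketch in Python. Note the swap: it uses the multinomial word-count variant with add-one smoothing, a common choice for text, rather than the mean-and-variance estimate described above. The class name and the training tweets are made up for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Word-count Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)      # documents per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, tokens):
        """Return log P(label) + sum of log P(word | label) for each label."""
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][token]
                # Each word contributes independently: the "naive" assumption.
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Made-up, already preprocessed training tweets.
train_docs = [
    ["love", "great", "food"],
    ["great", "service"],
    ["hate", "slow", "service"],
    ["terrible", "food"],
    ["lunch", "today"],
]
train_labels = ["P", "P", "N", "N", "X"]

model = MultinomialNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["great", "food"]))     # → P
print(model.predict(["slow", "terrible"]))  # → N
```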

To apply this model to a new set of data, we repeat the above steps of reading a CSV file, converting the input to text, setting the role and processing each document before applying the model to the new set of tweets.

From the performance output, we achieved 44.2% accuracy when the model was cross-validated with the original 1000 manually tagged tweets. To confirm this accuracy, we randomly extracted 100 tweets from a fresh set of 5000 tweets, manually tagged them and compared our tags with the values predicted by the model. The model had an accuracy of 46% on this sample, close to the 44.2% accuracy reported by the X-validation operator.
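The two accuracy checks above can be sketched as follows. Accuracy is the share of predictions that equal the manual tag, and cross-validation repeats that measurement on held-out folds. The fold count of 10, the shuffling seed and the majority-class baseline are assumptions made for this sketch; they are not taken from our RapidMiner process.

```python
import random
from collections import Counter


def accuracy(predicted, actual):
    """Share of predictions that equal the manually assigned label."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(train_and_predict, documents, labels, k=10, seed=0):
    """Return (mean accuracy, per-fold accuracies) over k held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(documents), k, seed):
        if not held_out:
            continue  # fewer items than folds
        held = set(held_out)
        train_docs = [d for i, d in enumerate(documents) if i not in held]
        train_labels = [l for i, l in enumerate(labels) if i not in held]
        test_docs = [documents[i] for i in held_out]
        test_labels = [labels[i] for i in held_out]
        predicted = train_and_predict(train_docs, train_labels, test_docs)
        scores.append(accuracy(predicted, test_labels))
    return sum(scores) / len(scores), scores


def majority_baseline(train_docs, train_labels, test_docs):
    """Stand-in model: always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common for _ in test_docs]


print(accuracy(["P", "N", "X", "P"], ["P", "N", "N", "P"]))  # → 0.75
```

Any function that takes training documents, training labels and test documents can be passed in place of majority_baseline, which makes it easy to compare a real classifier against the simplest possible one.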

Improving accuracy

One way to improve the accuracy of the model is to remove words that do not appear frequently within the given set of documents. By removing these words, we ensure that the words used for classification are mentioned a significant number of times. The challenge is to determine how many occurrences are required before a word is taken into account for classification. Note that the higher the threshold, the smaller the resulting word list. Practical problems exist when modeling text statistically: we require a reasonably sized corpus to overcome sparseness problems, but at the same time we face the challenge of irrelevant words exerting their weights on an independent set of test data when the model is applied.

We experimented with multiple values to determine the most appropriate number of words to prune, bearing in mind that we need a sizeable word list that still yields a high enough accuracy.

Percentage pruned is the minimum percentage of documents in which a word must occur to stay in the word list; words below this threshold are removed. For example, with 1% pruned on the set of 1000 documents, words that appear in fewer than 10 documents are removed from the word list.
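The pruning rule can be written out as a short Python sketch. It counts, for each word, the number of documents that contain it at least once and keeps only the words that reach the threshold. The function name and the four sample documents are made up for illustration.

```python
from collections import Counter


def prune_vocabulary(documents, percent_pruned):
    """Keep words that occur in at least percent_pruned % of the documents."""
    total = len(documents)
    document_frequency = Counter()
    for tokens in documents:
        # set() so that a word repeated within one tweet counts only once.
        document_frequency.update(set(tokens))
    return {
        word
        for word, n_docs in document_frequency.items()
        if n_docs * 100 >= percent_pruned * total
    }


# Made-up, already preprocessed tweets.
docs = [
    ["good", "food"],
    ["good", "service"],
    ["bad", "food"],
    ["good", "day", "good"],
]
print(sorted(prune_vocabulary(docs, 0)))   # → ['bad', 'day', 'food', 'good', 'service']
print(sorted(prune_vocabulary(docs, 50)))  # → ['food', 'good']
```

With 1000 documents and 1% pruned, the condition becomes n_docs >= 10, which matches the example given above.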

Percentage Pruned	Percentage Accuracy	Deviation	Size of resulting word list
0%	39.3%	5.24%	3833
0.5%	44.2%	4.87%	153
1%	42.2%	2.68%	47
2%	45.1%	1.66%	15
5%	43.3%	2.98%	1

From the results, we can infer that a large number of words (3680) appear in fewer than 5 documents, as the size of the resulting word list falls from 3833 to 153 when we set the percentage pruned to 0.5%.

Results

(Gallery of performance screenshots: 0% pruned, 0.5% pruned, 1% pruned, 2% pruned, 5% pruned.)

Deriving insights from emoticons

An emotion icon, better known as an emoticon, is a metacommunicative pictorial representation of a facial expression that, in the absence of body language and prosody, serves to draw a receiver's attention to the tenor or temper of a sender's nominal verbal communication, changing and improving its interpretation. It expresses a person's feelings or mood, usually by means of punctuation marks (though it can include numbers and letters). As emoticons have become more popular, some devices have provided stylized pictures that do not use punctuation.

Experiment

The data that we have is in plain text. To view the emoticons, we needed a "translator" to render them. This can be done in any browser with a plugin that converts these emoticons. We carried out the following steps:

Print the entire list of tweets that we have.
Identify the ones that have a converted emoticon tag (e.g. "😔").
Get the list of emoticons from an emoticon library^[4] and tag each emoticon as Positive (P), Negative (N) or Neutral (X).
Manually tag each tweet based on its sentiment.
Cross-validate the manual tag against the sentiment of the emoticons present in the tweet.
Calculate the percentage of matches between the 2 tagged values.

We carried out this experiment on 100 random tweets with emoticons and compared the two sets of tags. We achieved an 82% match, i.e. the emoticons alone gave the same sentiment as our manual tag for 82 of the 100 tweets.
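The matching step can be sketched in Python as below. The three emoticon tags, the majority-vote rule for tweets with several emoticons and the sample tweets are all assumptions made for this sketch; the real lexicon would come from the emoticon library^[4].

```python
from collections import Counter

# Hypothetical tags for three emoticons; a real lexicon would be far larger.
EMOTICON_SENTIMENT = {"😊": "P", "😔": "N", "😐": "X"}


def emoticon_label(text, lexicon=EMOTICON_SENTIMENT):
    """Return the most frequent emoticon sentiment in text, or None.

    On a tie, the sentiment whose emoticon appears first in the text wins.
    """
    votes = Counter(lexicon[ch] for ch in text if ch in lexicon)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def match_rate(tagged_tweets, lexicon=EMOTICON_SENTIMENT):
    """Share of emoticon-bearing tweets whose emoticon label equals the manual tag."""
    compared = matched = 0
    for text, manual_tag in tagged_tweets:
        label = emoticon_label(text, lexicon)
        if label is None:
            continue  # tweets without a known emoticon are left out
        compared += 1
        matched += label == manual_tag
    return matched / compared if compared else 0.0


# Made-up tweets with manual tags; the third one is sarcastic.
tagged = [
    ("Missed the bus again 😔", "N"),
    ("Best laksa ever 😊", "P"),
    ("Great, another delay 😊", "N"),
    ("No emoticon here", "X"),
]
print(round(match_rate(tagged), 2))  # → 0.67
```

The sarcastic tweet shows the kind of mismatch that keeps the match rate below 100%.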

Deriving word associations from tweets

Modeling word co-occurrence is important for many natural language applications, such as topic segmentation (Ferret, 2002), query expansion (Vechtomova et al., 2003), machine translation (Tanaka, 2002), language modeling (Dagan et al., 1999; Yuret, 1998), and term weighting (Hisamitsu and Niwa, 2002). We want to know whether certain words within a set of tweets occur together more often than expected by chance.
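One common way to express "more often than expected by chance" is lift: the observed share of tweets containing both words, divided by the share expected if the two words were independent. The sketch below is an illustration of that measure, not part of our RapidMiner process, and the sample tweets are made up.

```python
def lift(documents, word_a, word_b):
    """Observed co-occurrence divided by the co-occurrence expected by chance.

    A value above 1 means the two words appear together more often than
    they would if they were independent. Returns 0.0 if either word is absent.
    """
    n = len(documents)
    has_a = sum(1 for tokens in documents if word_a in tokens)
    has_b = sum(1 for tokens in documents if word_b in tokens)
    has_both = sum(1 for tokens in documents if word_a in tokens and word_b in tokens)
    if has_a == 0 or has_b == 0:
        return 0.0
    return (has_both / n) / ((has_a / n) * (has_b / n))


# Made-up, already preprocessed tweets.
docs = [
    {"haze", "psi"},
    {"haze", "psi", "mask"},
    {"lunch"},
    {"haze"},
]
print(round(lift(docs, "haze", "psi"), 3))  # → 1.333
```

Here "haze" appears in 3 of 4 tweets and "psi" in 2 of 4, so chance alone predicts both in 37.5% of tweets; they actually co-occur in 50%, giving a lift of about 1.33.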

Creating word associations with RapidMiner

Steps

We first import the data in the same way as in the classification process described earlier. For the "Process Documents" operator, we set the "vector creation" option to binary term occurrences, as required by the "FP-Growth" operator used later.
We then convert the word matrix from "Process Documents" to binomial form for the "FP-Growth" operator. This converts all the "0"s and "1"s to "false" and "true" respectively.
Out of the 20000 tweets, we were only able to draw 121 sets of word associations, of which 25 contain 2 words, 12 contain 3 words and 2 contain 1 word. The highest support stands at 0.044, a far cry from the minimum support of 0.75 commonly used for associating words. We conclude that word associations would be irrelevant when it comes to Twitter data.
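To show what the support values mean, the sketch below counts word sets directly. It is a brute-force stand-in that enumerates every combination, not the FP-Growth algorithm, so it is only practical for small samples. As in the steps above, each word is treated as present or absent in a tweet. The function name and sample tweets are made up.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(documents, min_support, max_size=2):
    """Return {word tuple: support} for word sets that meet min_support.

    Support is the share of documents that contain every word in the set.
    """
    n = len(documents)
    counts = Counter()
    for tokens in documents:
        present = sorted(set(tokens))  # binary occurrence: present or not
        for size in range(1, max_size + 1):
            counts.update(combinations(present, size))
    return {
        items: count / n
        for items, count in counts.items()
        if count / n >= min_support
    }


# Made-up, already preprocessed tweets.
docs = [
    ["haze", "psi"],
    ["haze", "psi", "mask"],
    ["lunch"],
    ["haze"],
]
print(frequent_itemsets(docs, min_support=0.5))
```

In this toy sample the pair ("haze", "psi") has a support of 0.5. In our 20000 tweets the highest support was only 0.044, which is why so few associations survive a high minimum support.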

With tweets holding at most 140 characters, it is no surprise that we are unable to derive high volumes of word associations from the data set. It is even harder to derive word associations when the data set is time-based rather than event-based: the topics discussed vary greatly, making it hard to formulate word associations.

Pitfalls of using conventional text analysis on social media data

Multiple languages

Singapore's Twitter sphere is a multilingual community, which makes text analysis more challenging because we have to take different languages into account. For each language, a dictionary is required to translate the text into English before any natural language processing can be done on it. With advanced tools like RapidMiner not being able to accommodate Chinese, Malay or even Korean words, much work has to be done to come up with a localised tool for analysing social media data here.

Misspelled words and abbreviations

With Twitter's limit of 140 characters, users are fond of substituting abbreviations and short forms for the words they want to convey. A huge challenge is to resolve misspelled words and to tell them apart from deliberate abbreviations. This could be addressed with a more robust or aggressive stemmer that deciphers abbreviations, removes unnecessarily repeated characters in a word and maps short forms back to their root words.
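Two of these clean-up ideas, collapsing repeated characters and expanding short forms, can be sketched in Python. The abbreviation table is a made-up example with four entries; a usable one would have to be compiled for local usage. Collapsing runs to two characters is a heuristic, chosen because English words rarely contain three identical letters in a row.

```python
import re

# Hypothetical short-form table for illustration only.
ABBREVIATIONS = {"u": "you", "gr8": "great", "tmr": "tomorrow", "pls": "please"}


def squeeze_repeats(word):
    """Collapse any run of three or more identical characters down to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)


def normalise(tokens, abbreviations=ABBREVIATIONS):
    """Lowercase, squeeze repeated characters, then expand known short forms."""
    result = []
    for token in tokens:
        token = squeeze_repeats(token.lower())
        result.append(abbreviations.get(token, token))
    return result


print(normalise(["C", "u", "tmr", "coool"]))
# → ['c', 'you', 'tomorrow', 'cool']
```

This does not fix every case: "soooo" becomes "soo" rather than "so", so a dictionary lookup would still be needed afterwards.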

Length of status

A status is limited to 140 characters, which makes it difficult to find word associations with strong support and confidence levels. Given such a short length, there may be insufficient space to substantiate a point, and the tweet may lack evidence of its true sentiment.

Other media types

Other media types (URLs, image URLs and video URLs) are common attachments that Twitter users use to convey a message. In certain cases, the media makes up the entire tweet, which nullifies any textual analysis done on the tweet itself. Much more context could be derived if information about the link were embedded in the tweet itself. Unfortunately, such a feature is still not available, which hinders the analysis of tweets.
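A simple way to gauge how much of a data set is affected is to separate the links from the remaining text and flag tweets that contain nothing else. The sketch below assumes links start with http:// or https://; the pattern and function names are our own.

```python
import re

# Assumed link pattern: anything starting with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")


def split_text_and_urls(tweet):
    """Return (text without links, list of links found in the tweet)."""
    urls = URL_PATTERN.findall(tweet)
    text = " ".join(URL_PATTERN.sub(" ", tweet).split())
    return text, urls


def is_media_only(tweet):
    """True when the tweet has at least one link and no other text."""
    text, urls = split_text_and_urls(tweet)
    return bool(urls) and not text


print(split_text_and_urls("Look at this http://t.co/abc123 wow"))
# → ('Look at this wow', ['http://t.co/abc123'])
print(is_media_only("http://t.co/abc123"))  # → True
```

Tweets flagged by is_media_only could be set aside before classification, since they give the classifier no words to work with.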

Improving the effectiveness of sentiment analysis of social media data

Increasing size of training data

In general, the larger the training data set, the more accurate the model is likely to be. However, the time needed to train and apply the model may also increase.

Leveraging on Emoticons

Emoticons provide more insight into how the user is feeling with just a single character. In tweets, where characters are a valuable resource, emoticons come into play quite frequently. By dissecting a tweet based on the emoticons in it and assigning a sentiment score to the emoticons used, we can get a more accurate depiction of the tweet's overall sentiment than by analyzing the text alone.

Allowing the user to tag their feelings to their status

One way in which Facebook makes such analysis easier is by allowing the user to specify how he/she is feeling at the moment of posting a status. With this option, Facebook has effectively increased the probability of determining the right sentiment of the user at that point in time. This mitigates the possibility of misreading sarcasm or other inferred sentiments within the post itself.

Analyse data on an event/topic basis rather than on time

The data that we used covered a time frame of 1 month. Drilling down these tweets to a particular topic (hashtag) or event would bring about more significant results. Brands that want to conduct sentiment analysis on social media data should make it specific to a particular campaign, event or initiative.
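Drilling down by hashtag could start with a grouping step like the one sketched below. It assumes hashtags are a "#" followed by letters, digits or underscores and treats them case-insensitively; the function name and sample tweets are made up.

```python
import re
from collections import defaultdict

# Assumed hashtag pattern: "#" followed by letters, digits or underscores.
HASHTAG = re.compile(r"#(\w+)")


def group_by_hashtag(tweets):
    """Return {lowercased hashtag: [tweets that mention it]}."""
    groups = defaultdict(list)
    for tweet in tweets:
        tags = {tag.lower() for tag in HASHTAG.findall(tweet)}
        for tag in tags:
            groups[tag].append(tweet)
    return dict(groups)


# Made-up tweets for illustration.
tweets = [
    "Great show tonight #NDP2014",
    "#ndp2014 fireworks were amazing!",
    "Lunch time",
]
groups = group_by_hashtag(tweets)
print(len(groups["ndp2014"]))  # → 2
```

Each group can then be preprocessed and classified on its own, so that the sentiment reported relates to one campaign or event rather than to a whole month of unrelated tweets.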

References

[1] Wiebe, J., Wilson, T., & Cardie, C. (2005). Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation, 165-210. Retrieved from http://people.cs.pitt.edu/~wiebe/pubs/papers/lre05.pdf
[2] Tyranus, S. (2012, June 26). What are the major differences and benefits of Porter and Lancaster Stemming algorithms? Retrieved from http://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg
[3] Chen, E. Choosing a machine learning classifier. Retrieved from http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/
[4] Emoticon - emotions library. Retrieved from https://github.com/wooorm/emoji-emotion/blob/master/data/emoji-emotion.json


