Difference between revisions of "Social Media & Public Opinion - Final"

Revision as of 08:06, 17 April 2015

Change in project scope

Having consulted with our professor, we have decided to shift our focus away from developing a dashboard and towards text analysis of social media data, specifically Twitter data. Social media has changed how consumers provide feedback on the products they consume. Much of this data can be mined, analysed and turned into value propositions that change the way companies brand themselves. Although such data is easy to obtain, certain challenges can hamper the effectiveness of the analysis. Through this project, we examine some of these challenges and ways in which we can overcome them.

Text analytics using Rapidminer

Setting up Rapidminer for text analysis

Download Rapidminer from https://rapidminer.com/signup/

	Steps
	To do text processing in Rapidminer, we need to download the plugin from Rapidminer's plugin repository. Click on Help > Managed Extensions and search for the text processing module. Once the plugin is installed, it should appear in the "Operators" window.

Data Preparation

	In Rapidminer, there are a few ways to read data from a file or a database. In our case, we read the tweets provided by the LARC team, which were supplied in JSON format. Rapidminer can read JSON strings, but it cannot read nested arrays within a string. Because of this restriction, we need to extract the text from each JSON string before we can use Rapidminer for text analysis. We did this by converting each JSON string into a JavaScript object, extracting only the id and text of each tweet, and writing them to a comma-separated file (.csv) to be processed later in Rapidminer.
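The extraction step can be sketched as follows. We did this in JavaScript; the version below is a Python illustration of the same idea, not our original script. It assumes one JSON object per line and that the fields are named `id` and `text`; the function name `tweets_to_csv` and the sample tweets are made up for the example.

```python
import csv
import io
import json


def tweets_to_csv(json_lines, out_file):
    """Write the id and text of each tweet (one JSON object per line) as CSV rows."""
    writer = csv.writer(out_file)
    writer.writerow(["id", "text"])  # header row, so "first row as names" can be used
    count = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Collapse line breaks and repeated spaces so each tweet stays on one row.
        text = " ".join(tweet["text"].split())
        writer.writerow([tweet["id"], text])  # nested fields are simply ignored
        count += 1
    return count


# Hypothetical sample input with a nested array that is dropped on export.
sample = [
    '{"id": 1, "text": "Loving the weather today!", "entities": {"hashtags": []}}',
    '{"id": 2, "text": "Stuck in traffic\\nagain", "entities": {"hashtags": [{"text": "jam"}]}}',
]
buffer = io.StringIO()
written = tweets_to_csv(sample, buffer)
print(written, "tweets written")
print(buffer.getvalue())
```

Writing through the `csv` module rather than joining strings by hand keeps commas and quotes inside a tweet from breaking the column layout.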

Defining a standard

Before we can create a model for classifying tweets by polarity, we first have to define a standard for the classifier to learn from. To obtain this standard, we manually tagged a random sample of 1000 tweets with one of three categories: Positive (P), Negative (N) and Neutral (X).

With the tweets and their respective classifications, we were ready to create a machine learning model of tweet sentiment.
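Drawing the sample and checking the manual tags could look like the sketch below. The helper names `draw_sample` and `find_invalid_labels` and the fixed seed are assumptions for illustration; only the sample size and the P/N/X labels come from the text above.

```python
import random

VALID_LABELS = {"P", "N", "X"}  # Positive, Negative, Neutral


def draw_sample(tweet_ids, size, seed=42):
    """Pick `size` distinct tweet ids at random; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(tweet_ids), size)


def find_invalid_labels(tagged):
    """Return the ids whose manual tag is not one of P, N or X."""
    return [tweet_id for tweet_id, label in tagged.items() if label not in VALID_LABELS]


to_tag = draw_sample(range(5000), 1000)
print(len(to_tag), "tweets selected for manual tagging")
print(find_invalid_labels({1: "P", 2: "n", 3: "X"}))  # lower-case "n" is flagged
```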

Creating the model

	We first used the "Read CSV" operator to read the text from the CSV file prepared earlier. This can be configured via the "Import Configuration Wizard" or set manually.
	Each column is separated by a ",".
	Trim the lines to remove any white space before and after the tweet.
	Check "first row as names" if a header is specified.
	To check the results at any point of the process, right-click on any operator and add a breakpoint.
	To process the documents, we need to convert the data from nominal to text.
	We need to convert the text data into documents. In our case, each tweet is converted into a document.
	The "process document" operator is a multi-step process that breaks each document down into single words. The frequency of each word, as well as the number of documents it occurs in, is calculated and used when formulating the model.
	To begin, double-click on the operator.
	Tokenize. This operator splits the document into single words using a space (" ") as the delimiter.
	Transform Cases. Convert each token to lower case so that the processed words are not case sensitive.
	Filter Stopwords (English). Stop words are common words that do not hold significance in search queries. A list of stop words can be found at http://xpo6.com/list-of-english-stop-words/
	Filter Tokens by Content. Tokens such as "http", "tco" and "rt" are common by-products of processing these tweets and add little meaning. We use this operator to exclude such words. Be sure to check the "invert condition" option so that matching tokens are excluded rather than included.
	Filter Tokens by Length. The last operator filters words by length. Words that are shorter than 3 or longer than 15 characters are excluded.
	Return to the main process.
	We need to add the "Set Role" operator to indicate the label for each tweet. We have a column called "Classification", and we assign that column to be the label.
	The "X-validation" operator now creates a model based on our manual classification, which can later be applied to another set of data.
	To begin, double-click on the operator.
	We use the Naive Bayes classifier to model the relationship between our manually tagged tweets and their classifications. To test this model, we cross-validate its predicted values against the manually tagged classifications and check the performance before applying the model to a new set of data.
	To apply this model to a new set of data, we repeat the above steps of reading a CSV file, converting the input to text, setting the role and processing each document before applying the model to the new set of tweets.
	From the performance output, we achieved a 44.6% accuracy when the model was cross-validated with the original 1000 manually tagged tweets. To check this accuracy, we randomly extracted 100 tweets from the fresh set of 5000 tweets, tagged them manually and compared the tags with the values predicted by the model. The model did in fact have an accuracy of 46%, close to the 44.2% accuracy from the X-validation module.
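To make the steps above concrete outside Rapidminer, here is a minimal Python sketch of the same pipeline: token filtering, a Naive Bayes model and k-fold cross-validation. It is an illustration under stated assumptions, not what Rapidminer runs internally. The classifier is a hand-written multinomial Naive Bayes with add-one smoothing, the stop word set is a tiny placeholder, folds are assigned by index, and tokens are split on non-letter characters (rather than only on spaces) so that URLs break into fragments such as "http". The sample tweets are invented.

```python
import math
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "and", "was", "this", "for", "with"}  # placeholder subset
EXCLUDED = {"http", "https", "tco", "rt"}  # tokens removed by content


def preprocess(text):
    """Mirror the operators: tokenize, lower-case, drop stop words, filter by content and length."""
    tokens = [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]
    tokens = [t for t in tokens if t not in STOPWORDS and t not in EXCLUDED]
    return [t for t in tokens if 3 <= len(t) <= 15]


def train(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log-probability for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=10):
    """Hold out every k-th document in turn and return the overall accuracy."""
    correct = 0
    for fold in range(k):
        train_idx = [i for i in range(len(docs)) if i % k != fold]
        test_idx = [i for i in range(len(docs)) if i % k == fold]
        model = train([docs[i] for i in train_idx], [labels[i] for i in train_idx])
        correct += sum(predict(model, docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)


# Invented tweets with manual tags, for illustration only.
tagged = [
    ("Loving the sunny weather today", "P"),
    ("RT @friend: great food, great company", "P"),
    ("Terrible traffic jam again http://t.co/abc", "N"),
    ("Train delayed, terrible morning", "N"),
]
docs = [preprocess(text) for text, _ in tagged]
labels = [label for _, label in tagged]
print(f"Cross-validated accuracy: {cross_validate(docs, labels, k=2):.1%}")
```

The same `predict` function can be used for the second check described above: apply it to freshly tagged tweets and count how often the prediction matches the manual tag.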

Improving accuracy

One way to improve the accuracy of the model is to remove words that do not appear frequently within the given set of documents. By removing these words, we ensure that the words used for classification are mentioned a significant number of times. The challenge, however, is to determine how many occurrences are required before a word is taken into account for classification. It is important to note that the higher the threshold, the smaller the resulting word list will be.
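The pruning idea can be sketched as follows. The function name `prune_rare_words` and the choice to count documents (rather than total occurrences) are assumptions for illustration; the threshold `min_docs` is the value that has to be tuned.

```python
from collections import Counter


def prune_rare_words(docs, min_docs):
    """Keep only words that appear in at least `min_docs` documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    kept = {word for word, n in doc_freq.items() if n >= min_docs}
    pruned = [[t for t in tokens if t in kept] for tokens in docs]
    return pruned, kept


docs = [["happy", "day"], ["happy", "rain"], ["happy", "day", "day"]]
for threshold in (1, 2, 3):
    _, kept = prune_rare_words(docs, threshold)
    print(threshold, sorted(kept))  # the word list shrinks as the threshold rises
```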

High-Level Requirements

The system will include the following:

A timeline based on the tweets provided
The timeline will display the level of happiness as well as the volume of tweets.
Each point on the timeline will provide additional information like the overall happiness scores, the level of sentiments for each specific category etc.
Linked graphical representations based on the timeline
Graphs to represent the aggregated user attributes (gender, age groups etc.)
Comparison between 2 different user defined time periods
Optional toggling of volume of tweets with sentiment timeline graph
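The data behind such a timeline can be sketched as a simple aggregation. The sketch below assumes each tweet already has a timestamp and a happiness score, and it buckets by calendar day; the function name `daily_timeline`, the bucket size and the sample values are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime


def daily_timeline(scored_tweets):
    """Group (timestamp, score) pairs by day and report volume and mean happiness."""
    buckets = defaultdict(list)
    for created_at, score in scored_tweets:
        buckets[created_at.date()].append(score)
    return [
        {
            "date": day.isoformat(),
            "volume": len(scores),
            "happiness": sum(scores) / len(scores),
        }
        for day, scores in sorted(buckets.items())
    ]


# Invented data points for illustration.
tweets = [
    (datetime(2015, 4, 1, 9, 30), 6.0),
    (datetime(2015, 4, 1, 18, 5), 7.0),
    (datetime(2015, 4, 2, 8, 15), 4.0),
]
for point in daily_timeline(tweets):
    print(point)
```

Each dictionary corresponds to one point on the timeline, so the volume series can be toggled independently of the happiness series.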

Work Scope

Data Collection – Collect Twitter data to be analysed from LARC
Data Preparation – Clean and transform the data into a readable CSV for upload
Application Calculations and Filtering – Perform calculations and filters on the data in the app
Dashboard Construction – Build the application’s dashboard and populate with data
Dashboard Calibration – Finalize and verify the accuracy of dashboard visualizations
Stress Testing and Refinement – Test whether the software's performance meets the clients' minimum requirements and perform any optimizations needed to meet them.
Literature Study – Understand sentiment and text analysis in social media
Software Learning – Learn how to use and integrate various D3.js / Highcharts libraries, and the dictionary word search provided by the client.

Methodology

The key aim of this project is to let the user explore and analyse the happiness level of the targeted subjects based on a given set of tweets. A tweet is a string of text of up to 140 characters and may contain shortened URLs, tags (@xyz) or trending topics (#xyz). The interactive visual prototype should let the user view past tweets around certain significant events and draw conclusions from the results shown. To do this, we propose the following methodology. Tweet data will be provided by the user, who uploads a CSV file containing the tweets in JSON format.
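The parts of a tweet mentioned above can be separated with a few regular expressions. This is a rough sketch; the patterns are simplifications chosen for illustration and do not follow Twitter's exact rules for valid user names, hashtags or links.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")


def tweet_parts(text):
    """Split a tweet into URLs, @tags, #topics and the remaining plain text."""
    urls = URL.findall(text)
    without_urls = URL.sub(" ", text)  # remove links before looking for tags
    plain = HASHTAG.sub(" ", MENTION.sub(" ", without_urls))
    return {
        "urls": urls,
        "mentions": MENTION.findall(without_urls),
        "hashtags": HASHTAG.findall(without_urls),
        "plain": " ".join(plain.split()),
    }


print(tweet_parts("RT @xyz: haze again #sghaze http://t.co/abc"))
```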

First, we will display an overview of the tweets under consideration. Tweets will be aggregated into intervals based on the time span covered by the tweets in the uploaded file. Each tweet will have a ‘happiness’ score tagged to it. The “happiness” score is derived from the study at Hedonometer.org (please refer to the study to find out how the score is derived). Of the 10,100 words that have a score, some may not be applicable to Twitter. Words that are not applicable will not be used to calculate the score of a tweet and will be treated as stop/neutral words in the application.
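One plausible way to turn word scores into a tweet score is to average the scores of the words that appear in the dictionary, skipping stop/neutral words. The sketch below assumes that averaging rule; the scores in `SCORES` are made-up placeholders on a 1 to 9 scale, not values from the Hedonometer word list.

```python
# Placeholder scores for illustration only; the real list comes from the study.
SCORES = {"love": 8.0, "happy": 8.0, "rain": 4.0, "hate": 2.0}


def tweet_happiness(tokens, scores, neutral=frozenset()):
    """Average the scores of rated words; return None if no word is rated."""
    rated = [scores[t] for t in tokens if t in scores and t not in neutral]
    if not rated:
        return None  # the tweet contains no scored words
    return sum(rated) / len(rated)


print(tweet_happiness(["love", "rain", "today"], SCORES))
print(tweet_happiness(["love", "rain", "today"], SCORES, neutral={"rain"}))
print(tweet_happiness(["today"], SCORES))
```

Returning `None` instead of a default number keeps unscored tweets from pulling an interval's average towards an arbitrary value.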

To visualise the words mentioned in these tweets, we will use a dynamically generated word cloud. A word cloud is useful for showing users which words are commonly mentioned in the tweets. The more often a particular word is mentioned, the bigger it appears in the word cloud. Stop/neutral words will be removed to ensure that only relevant words show up in the word cloud. One thing to note is that the text comes from Twitter, which means that, depending on the users, the tweets may contain localized words that are hard to filter out. The list of stop words that we will use for filtering will be based upon this list.
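The sizing rule for the word cloud can be sketched as a frequency count followed by a linear scale. The function name `word_cloud_weights`, the font size range and the sample words are assumptions for illustration; a charting library would normally handle the layout itself.

```python
from collections import Counter


def word_cloud_weights(docs, stopwords, top_n=50, min_size=12, max_size=48):
    """Return (word, font size) pairs for the most frequent non-stop words."""
    counts = Counter(t for tokens in docs for t in tokens if t not in stopwords)
    top = counts.most_common(top_n)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = max(highest - lowest, 1)  # avoid dividing by zero when all counts are equal
    return [
        (word, min_size + (n - lowest) * (max_size - min_size) / span)
        for word, n in top
    ]


docs = [["haze", "haze", "the"], ["haze", "exam"], ["exam", "the"]]
print(word_cloud_weights(docs, stopwords={"the"}))
```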

Secondly, there is a list of predicted user attributes that is provided by the client. Each line contains attributes of one user in JSON format. The information is shown below:

id: refers to twitter id
gender
ethnicity
religion
age_group
marital_status
sleep
emotions
topics

These predicted user attributes will be displayed in the second segment of the application, which gives users a quick overview of the demographics of the Twitter users.
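Aggregating these attributes for the demographic charts can be sketched as below. The field names follow the list above, but the attribute values in the sample lines ("female", "18-24" and so on) are invented, and `attribute_counts` is a hypothetical helper.

```python
import json
from collections import Counter


def attribute_counts(json_lines, attribute):
    """Count how many users have each value of the given attribute."""
    counts = Counter()
    for line in json_lines:
        if not line.strip():
            continue  # skip blank lines
        user = json.loads(line)
        counts[user.get(attribute, "unknown")] += 1
    return counts


# Invented sample records, one user per line.
users = [
    '{"id": 101, "gender": "female", "age_group": "18-24"}',
    '{"id": 102, "gender": "male", "age_group": "18-24"}',
    '{"id": 103, "gender": "female"}',
]
print(attribute_counts(users, "gender"))
print(attribute_counts(users, "age_group"))
```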

Third, we will also display the scores of the words mentioned, based upon their happiness level. This will allow the user to quickly identify the words that contribute to the negativity or positivity of the set of tweets.

The application will be entirely browser based. Some of the benefits of doing so include:

Client does not need to download any software to run the application
It is clean and fast, as most people who own a computer would have a browser installed by default
It is highly scalable. Work is done on the front-end rather than on the server which may be choked when handling too many requests.

HTML5 and CSS3 will be used primarily for the display. JavaScript will be used to manipulate the document objects on the front-end. Some of the open source plugins and services that we will be using include:

Highcharts – a visualisation plugin to create charts quickly.
jQuery – a cross-platform JavaScript library designed to simplify the client-side scripting of HTML
OpenShift – free online server for live deployment
Moment.js – date manipulation plugin

Deliverables

Project Proposal
Mid-term presentation
Mid-term report
Final presentation
Final report
Project poster
A web-based platform hosted on OpenShift.

Limitations & Assumptions

Limitations	Assumptions
Insufficient predicted information on the users (location, age etc.)	Data given by LARC is sufficiently accurate for the user
Fake Twitter users	LARC will determine whether the users are real
Ambiguity of the emotions	Emotions given by the dictionary (as instructed by LARC) are conclusive for the tweets that are provided
Dictionary words limited to the ones instructed by LARC	A comprehensive study has been done to come up with the dictionary

ROI analysis

As part of LARC’s initiative to study the well-being of Singaporeans, this dashboard will be used as a springboard to visually represent Singaporeans in the Twitter space and identify the general sentiments of Twitter users over a given time period. This may provide useful information about people's subjective well-being, which supports the Singapore government's smart nation initiative to understand the well-being of Singaporeans. This project may be standalone or one of a series of projects done by LARC.

Future extension

Scale to larger sets of data without compromising on time and performance
Able to accommodate real-time data to provide instantaneous analytics on-the-go

Acknowledgement & credit

Dodds PS, Harris KD, Kloumann IM, Bliss CA, Danforth CM (2011) Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE 6(12)
Companion website: http://hedonometer.org/
Schwartz HA, Eichstaedt J, Kern M, Dziurzynski L, Agrawal M, Park G, Lakshmikanth S, Jha S, Seligman M, Ungar L. (2013) Characterizing Geographic Variation in Well-Being Using Tweets. ICWSM, 2013
Helliwell J, Layard R, Sachs J (2013) World Happiness Report 2013. United Nations Sustainable Development Solutions Network.
Bollen J, Mao H, Zeng X (2010) Twitter mood predicts the stock market. Journal of Computational Science 2(1)
Happy Planet Index

@@ Line 56: / Line 56: @@
 ===Setting up Rapidminer for text analysis===
+'''Download Rapidminer from [https://rapidminer.com/signup/ here]'''
+{| class="wikitable" width="1000px"
+|-
+| || Steps
+|-
+| [[File:Text processing module.JPG|350px|]]||
+To do text processing in Rapidminer, we will need to download the pluging from the Rapidminer's plugin repository.
 Click on Help > Managed Extensions  and search for the text processing module.
 Once the plugin is installed, it should appear in the "Operators" window as seen below.
+|-
+| [[File:Tweets Jso.JPG|500px]] ||
+===Data Preparation===
+In Rapidminer, there are a few ways in which we can read a file or data from a database. In our case, we will be reading from the Tweets provided by the LARC team. The format of the tweets given were in the JSON format. In Rapidminer, JSON strings can be read but it is unable to read nested arrays within the string. Thus, due to this restriction, we need to extract the text from the JSON string before we can use Rapidminer to do the text analysis. We did it by converting each JSON string into an javascript object and extracting only the Id and text of each tweet and write them onto a comma seperated file(.csv) to be process later in Rapidminer.
+|-
+|||
+|}
+===Defining a standard===
+Before we can create a model for classifying tweets based on their polarity, we have to first define a standard for the classifier to learn from. To attain this standard, we manually tag a random sample of 1000 tweets with 3 categories; Positive (P), Negative (N) and Neutral (X).
-===Data Preparation===
+With the tweets and their respective classification, we were ready to create a model for machine learning of tweets sentiments.
-In Rapidminer, there are a few ways in which we can read a file or data from a database. In our case, we will be reading from the Tweets provided by the LARC team. The format of the tweets given were in the JSON format. In Rapidminer, JSON strings can be read but it is unable to read nested arrays within the string. Thus, due to this restriction, we need to extract the text from the JSON string before we can use Rapidminer to do the text analysis.
+===Creating the model===
+{| class="wikitable" width="1000px"
+|-
+|[[File:ReadCsv.JPG|100px]]||
-[[File:Tweets Jso.JPG|thumbnail|500px]]
+#We first used the "read CSV" operator to read the text from the prepared CSV file that was done earlier. This can be done via an "Import Configuration Wizard" or set manually.
+|-
-{| class="wikitable"
+|[[File:ReadCsv configuration.JPG|250px]]||
+#Each column will be seperated by a ","<br>
+#Trim the lines to remove any white space before and after the tweet<br>
+#Check the "first row as names" if there a ehader is specified
+|-
+|[[File:Normtotext.JPG|100px]]||
+#To check on the results at any point of the process, right click on any operators and add a breakpoint.
+#To process the document, we will need to convert the data from norminal to text.
+|-
+|[[File:DataToDoc.JPG|100px]]||
+#We will need to convert the text data into documents. In our case, each tweet will be converted in a document.
 |-
-| Steps
+|[[File:ProcessDocument.JPG|100px]]||
+#The "process document" operator is a multi step process to break down each document into single words. The number of frequency of each word, as well as their occurrences (in documents) will be calculated and used when formulating the model.<br>
+#To begin the process, double click on the operator.
 |-
-|'''Download Rapidminer from [https://rapidminer.com/signup/ here]
+|[[File:Tokenize.JPG|500px]]||
-To do text processing in Rapidminer, we will need to download the pluging from the Rapidminer's plugin repository.
+# Tokenize the document. This operator will split the document into single words using space (" ") as a delimiter
+# Transform Cases. Convert each token into lower case so that the processed word is not case sensitive
+# Filter Stopwords (English). Stop words are common words that do not hold significance in search queries. List of stop words can be found [http://xpo6.com/list-of-english-stop-words/ here]
+# Filter Tokens by Content. Certain words like "http", "tco" and "rt" are common tokens that are derived from the processing of these tweets. They too, hold insignificant meaning to the tweets. We use this operator to exclude such words. Be sure to check the "invert condition" option for exclusion instead of inclusion.
+#Filter Tokens by Length. The last operator is used to filter words of a specific length. Words that are less than 3 or more than 15 characters long are excluded
 |-
-| [[File:Text processing module.JPG|350px|]]||
+|[[File:Setrole.JPG|100px]]||
-[[File:Breeding area NNI stsx.PNG|400px]]
+# Return to the main process.
+# We will need to add the "Set Role" process to indicate the label for each tweet. We have a column called "Classification" and we will be assigning that column to be the label.
 |-
-| '''Results''' ||
+|[[File:Validation.JPG|100px]]||
-Observed mean distance:195.52<br />
+# The "X-validation" operator will now create a model based on our manual classification which can later be used on another set of data.
-Expected mean distance:579.82<br />
+# To begin, double click on the operator.
-'''Nearest neighbour index:0.34<br />'''
-N:450<br />
-'''Z-Score:-26.90'''
 |-
-| '''Null Hypothesis''' || Reject Null Hypothesis
+|[[File:ValidationX.JPG|500px]]||
+# We will be using the Naive Bayes classifier to model our manually tagged tweets to their respective classification. To test this model, we X-validate the model predicted values to the manually tagged classification and check the performance before we apply this model to a new set of data.
 |-
-| '''Distribution''' || Clustered
+|
+[[File:5000Data.JPG|500px]]||
+# To apply this model to a new set of data, we will repeat the above steps of reading a CSV file, converting it the input to text, set the role and processing each document before applying the model to the new set of tweets.
 |-
-| '''Analysis''' || The '''Z-score is -26.90''' which does not falls within the 95% interval. This mean that the null hypothesis is rejected, suggesting that the distribution is not random.<br /><br />The '''NNI value of 0.34 is lower than 1'''. This suggests that the distribution of breeding cases exhibits a clustered pattern.
+|[[File:Prediction.JPG|500px]]||
+# From the performance output, we achieved a 44.6% accuracy when the model was cross validated with the original 1000 tweets that were manually tagged. To affirm this accuracy, we randomly extracted 100 tweets from the fresh set of 5000 tweets and manually tag these tweets and cross validated with the predicted values by the model. The predicted model did in fact have an accuracy of '''46%''', a close percentage to the 44.2% accuracy using the X-validation module.
 |}
+===Improving accuracy===
+One of the ways to improve the accuracy of the model is to remove words that does not appear frequently within the given set of documents. By removing these words, we can ensure that the resulting words that are classified are mentioned a significant number of times. However, the challenge would be to determine what is the number of occurrences required before a word can be taken into account for classification. It is important to note that the higher the threshold, the smaller the result and word list would be.
+{| class="wikitable" width="1000px"
+|-
+|||
 </div>
