Twitter Analytics: Findings

From Analytics Practicum
Revision as of 16:57, 12 October 2014 by Ffortunata.2011 (talk | contribs)

Descriptive analysis

Characters per tweet

Fap101.png

The mean and median tweet length are both around 120 characters, while the maximum is 208 characters.
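The summary statistics above can be reproduced with a few lines of code. The following is a minimal Python sketch, not the R code used in the project; the function name `length_summary` and the sample tweets are hypothetical.

```python
from statistics import mean, median

def length_summary(tweets):
    """Return the mean, median and maximum character count of a list of tweets."""
    lengths = [len(t) for t in tweets]
    return {"mean": mean(lengths), "median": median(lengths), "max": max(lengths)}

# Hypothetical sample; the real analysis used the collected iPhone tweets.
sample = ["New iPhone 6 case on ebay", "Win an iPhone!", "Camera app review"]
summary = length_summary(sample)
```

Note that `len()` counts Unicode code points, so the numbers may differ slightly from a tool that counts bytes or unescaped HTML entities.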

According to research by Buddy Media and Track Social, engagement is highest at around 100 characters per tweet. Hence, companies should aim for the optimal tweet length rather than simply using as many characters as possible.

http://blog.bufferapp.com/the-ideal-length-of-everything-online-according-to-science

Words per Tweet

Fap102.png

From the graph above, most tweets contain either 11 or 17 words. As these are the “usual” word counts for a tweet, companies may want to follow the same pattern.

Length of Words per tweet

Fap103.png

Users tend to use words that are five, seven or ten characters long.

Unique Words per tweet

Fap104.png

Users tend to use 11 or 17 unique words per tweet. This matches the distribution of words per tweet, which also peaks at 11 and 17 words. Hence, the majority of users did not repeat any words in their tweets, which explains the close similarity between the two distributions. However, some users have 22 unique words compared with 24 posted words. This may happen because the longer the tweet, the harder it is to avoid repeating words, so the number of unique words falls below the total number of words.
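The comparison between total and unique words can be sketched per tweet as follows. This is an illustrative Python snippet, not the project's R code; it assumes words are split on whitespace and compared case-insensitively, and `word_counts` is a hypothetical name.

```python
def word_counts(tweet):
    """Return (total words, unique words) for one tweet."""
    # Assumption: whitespace tokenization, case-insensitive comparison.
    words = tweet.lower().split()
    return len(words), len(set(words))

total, unique = word_counts("win win an iPhone")
```

When the two numbers are equal, the tweet contains no repeated words; the gap widens as tweets get longer.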

Distribution of Hashtag per tweet

Fap105.png Fap106.png

The popularity of hashtags on Twitter is worth noting. From the figures above, we can observe that the majority of users included at least one hashtag, and some used as many as 8 hashtags in a post. Hence, companies should recognize the importance of hashtags in grouping their posts with related content to increase their reach.

Mentions per tweet

Fap107.png Fap108.png

Although users tend to use hashtags, retweets do not appear to be popular among these iPhone tweets. Apple may want to launch a marketing campaign on Twitter that encourages people to retweet its posts, which would increase their popularity.

Links per tweet

Fap109.png Fap110.png

Users tend to include one link per tweet. This may be due to Twitter's character limit: users redirect their readers to another website to share more than a single tweet allows.
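The hashtag, mention and link counts discussed in the last three subsections can be extracted with regular expressions. The sketch below is in Python rather than the R used for the project, and the patterns are simplifying assumptions (for example, a link is taken to be anything starting with `http://` or `https://`); `count_entities` is a hypothetical name.

```python
import re

# Simplified patterns; real tweets may need stricter rules.
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
LINK = re.compile(r"https?://\S+")

def count_entities(tweet):
    """Count hashtags, mentions and links in a single tweet."""
    return {
        "hashtags": len(HASHTAG.findall(tweet)),
        "mentions": len(MENTION.findall(tweet)),
        "links": len(LINK.findall(tweet)),
    }

counts = count_entities("RT @apple: new #iPhone6 #case http://t.co/abc")
```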

No of words vs Characters

Fap111.png

A data frame is used to hold the two variables, and the relationship is plotted using ggplot(). Based on the graph, there is a relationship between the number of words and the number of characters. However, the correlation coefficient is 0.73, so the relationship is not perfectly linear. This may be because people use more words, rather than longer words, to convey their message. To verify this claim, we investigate the number of words in a tweet versus the length of its words.

No of words vs Length of Words

Fap112.png

Based on the graph, there is a moderate negative correlation (-0.624) between the number of words and the length of the words. Hence, the more words a tweet contains, the shorter its words tend to be.
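A correlation coefficient of this kind can be computed as below. This is a minimal Python sketch of the Pearson correlation, assuming that is the measure reported above; it is not the project's R code, and the helper names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def words_and_mean_word_length(tweet):
    """Return (number of words, average word length) for one non-empty tweet."""
    words = tweet.split()
    return len(words), sum(len(w) for w in words) / len(words)
```

Applying `words_and_mean_word_length` to every tweet and passing the two resulting columns to `pearson` gives one value for the whole data set; a negative value means that longer tweets tend to use shorter words.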

Lexical Diversity

Lexical diversity reflects the range of vocabulary used by Twitter users. It tells us how much word variety there is in the iPhone tweets. Lexical diversity is measured as the number of unique tokens divided by the total number of tokens.
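The ratio defined above is short enough to write out directly. The following Python sketch is illustrative only (the project itself used R); the function name and the choice to return 0.0 for an empty token list are assumptions.

```python
def lexical_diversity(tokens):
    """Unique tokens divided by total tokens; 0.0 for an empty list (assumption)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Three distinct tokens out of four.
diversity = lexical_diversity("iphone case iphone win".split())
```

A value close to 1 means almost every token is different, while a value close to 0 means the same words are repeated many times.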

Fap113.png

Based on the result, lexical diversity in the iPhone tweets appears to be low, which may suggest that the tweets share a common topic of interest.

20 Most Used Words

Fap114.png

For this analysis, all words are sorted by frequency in decreasing order and the top 20 are selected. Many of the most frequent words mention iPhone and iPhone 6 news. However, as this is raw data, the words have not been cleaned yet. To investigate the words further, topic creation is explored in the next section.
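The sort-and-select step can be sketched as follows. This is a Python illustration rather than the project's R code; it assumes lowercase, whitespace-split tokens with no further cleaning, and `top_words` is a hypothetical name.

```python
from collections import Counter

def top_words(tweets, n=20):
    """Return the n most frequent words as (word, count) pairs, most frequent first."""
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    return counts.most_common(n)

top = top_words(["iphone case", "iPhone win"], n=1)
```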

Topic Creation

Frequent Terms

Now that the data has been properly cleaned, frequent terms can be analyzed further. First, lowfreq is set to 50, which selects the words that appear at least 50 times. The result is shown below:

Fap100.png

If we raise the threshold to at least 500 occurrences, we can concentrate on a few words that may lead to distinct topics, such as “app”, “win” and “camera”.

Fap115.png
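The effect of the frequency threshold can be illustrated with a small filter. The sketch below is written in Python, not the R used in the project; `frequent_terms` is a hypothetical helper whose `lowfreq` parameter is named after the setting mentioned above, and the toy documents are made up.

```python
from collections import Counter

def frequent_terms(documents, lowfreq):
    """Return, in alphabetical order, the terms that occur at least lowfreq times."""
    counts = Counter(term for doc in documents for term in doc.split())
    return sorted(term for term, count in counts.items() if count >= lowfreq)

# Toy corpus: "app" occurs 3 times, "win" twice, "camera" once.
docs = ["app win", "app camera", "app win"]
terms = frequent_terms(docs, lowfreq=2)
```

Raising `lowfreq` shrinks the list, which is why moving from 50 to 500 leaves only a handful of words.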

Topic Groupings

Topic creation is done using a clustering algorithm that assigns each word to the cluster it belongs to. Several values for the number of clusters (k) are tried to find the optimum. Thereafter, the top 10 words in each topic are listed for analysis.

5 Topics

Fap116.png

With 5 topics, distinct topics seem to emerge:

Fap118a.png

4 Topics

Fap118.png Fap119.png

Most of the topics remain, except the topic on Apple reviews. Since it is a fairly important independent topic, 5 topics are deemed more suitable and give more insight to the users. We will use 5 topics for further analysis.

6 Topics

Fap120.png Fap121.png

If 6 topics are created, topics 3 and 5 appear to repeat each other. Hence, 5 topics is deemed the best option for the rest of the analysis.


Topic Comparison and Contrast

How do iPhone tweets differ from Samsung tweets?

Fap122.png

Based on the analysis, Samsung tweets tend to mention the Galaxy product and its price, whereas iPhone tweets talk about apps available for users to download.

How are iPhone tweets similar to Samsung tweets?

Fap123.png

It seems that the common tweets talk about the “new” product and the accessories available for purchase, as shown by words such as “leather”, “case” and “ebay”. If we clean the data further by removing the keywords “leather”, “case”, “ebay” and “new”, the commonality cloud changes as shown below:

Fap124.png

In this cloud, the common topics seem to be the chance of winning an iPhone and the prices of the two brands.
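The idea behind the commonality comparison, including the removal of selected keywords, can be sketched as follows. This is a Python illustration, not the R code behind the clouds above; weighting each shared word by the smaller of its two counts is an assumption made for the sketch, and `common_terms` and the toy tweets are hypothetical.

```python
from collections import Counter

def common_terms(corpus_a, corpus_b, exclude=()):
    """Words that appear in both corpora, minus the excluded keywords.

    Each shared word is weighted by the smaller of its two counts (assumption).
    """
    counts_a = Counter(w for tweet in corpus_a for w in tweet.lower().split())
    counts_b = Counter(w for tweet in corpus_b for w in tweet.lower().split())
    excluded = set(exclude)
    shared = counts_a.keys() & counts_b.keys()
    return {w: min(counts_a[w], counts_b[w]) for w in shared if w not in excluded}

iphone = ["new leather case", "win iphone"]
samsung = ["new case price", "win galaxy"]
shared = common_terms(iphone, samsung, exclude={"new", "case", "leather", "ebay"})
```

Removing dominant accessory words in this way lets less frequent shared words, such as “win”, become visible.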

Time Series Analysis

In order to understand how tweets progress over time, time series analysis is used to find insightful patterns in the tweets.

How does the tweet volume vary over time?

Iteration 1:

Fap125.png

Problem:

  • The chart is difficult to read because of the proportion taken up by the black graph.
  • The aggregation interval is unknown and left at the R default.
  • The chart does not show how each topic evolves.

How does the tweet volume vary over time?

Iteration 2:

Fap126.png Fap127.png

We are able to see the proportion of each topic in the overall volume of tweets and in comparison with other topics. Topic 4 (“accessories”) tends to dominate the tweet volume.

Problem:

  • The values shown are not absolute, so it is difficult to draw conclusions from them.
  • It is not possible to see whether the topics share a common pattern of tweet volume.

What is the volume of tweets for each topic over time?

Iteration 3:

Fap128.png

The graph above shows the absolute volume of each topic over time. Topic 4 (“accessories”) tends to have the largest volume of tweets, followed by topic 5 (“App”), while the lowest is topic 2 (“Apple review”).


Is there any pattern of tweet volume among the topics?

Fap129.png

The graph shows each topic as a percentage of the total tweet volume. This gives a clearer picture of the share of each topic in the tweet volume. It is again apparent that topic 4 (“accessories”) accounts for the majority of the tweets. It is interesting to see that, most of the time, the topics follow a consistent pattern and their proportions remain almost the same over time.
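The percentage view can be derived from per-tweet topic labels as sketched below. This Python snippet is illustrative and is not the project's R code; it assumes each tweet has already been assigned a time period and a topic number, and `topic_share_by_period` and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def topic_share_by_period(records):
    """Map each period to {topic: share of that period's tweets}.

    records is an iterable of (period, topic) pairs, one pair per tweet.
    """
    counts = defaultdict(Counter)
    for period, topic in records:
        counts[period][topic] += 1
    shares = {}
    for period, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[period] = {t: c / total for t, c in topic_counts.items()}
    return shares

# Toy data: three tweets on day 1, one tweet on day 2.
records = [("day1", 4), ("day1", 4), ("day1", 5), ("day2", 2)]
shares = topic_share_by_period(records)
```

Plotting these shares per period, instead of the raw counts, makes the topics comparable even when the overall volume changes.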

Sentiment Analysis

Fap130.png

Sentiment word list

Hu & Liu have published an “opinion lexicon” which categorizes 6800 words as positive or negative and which can be downloaded.

  • Positive: cool, great, good, amazing
  • Negative: sucks, awful, worst, nightmare

A few industry-specific and/or especially emphatic terms are added to the lists.

Fap131.png

Score sentiment for each tweet

The scoring uses an algorithm that simply counts the number of occurrences of “positive” and “negative” words in a tweet. To score all the tweets for a brand, we just need to feed them into score.sentiment().

Fap132.png
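The counting approach can be illustrated with a short function. This is a Python sketch under stated assumptions, not the score.sentiment() function used in the project: it assumes the score is the number of positive matches minus the number of negative matches, and it uses only the example words listed above rather than the full lexicon. The name `score_sentiment` is hypothetical.

```python
import re

# Example words only; the full opinion lexicon is far larger.
POSITIVE = {"cool", "great", "good", "amazing"}
NEGATIVE = {"sucks", "awful", "worst", "nightmare"}

def score_sentiment(tweet, positive=POSITIVE, negative=NEGATIVE):
    """Positive word count minus negative word count for one tweet (assumption)."""
    # Lowercase the text and keep alphabetic tokens, dropping punctuation and digits.
    words = re.findall(r"[a-z']+", tweet.lower())
    pos_hits = sum(word in positive for word in words)
    neg_hits = sum(word in negative for word in words)
    return pos_hits - neg_hits

scores = [score_sentiment(t) for t in
          ["Great camera, amazing screen", "Worst battery, awful nightmare", "iPhone 6"]]
```

A simple count like this ignores negation and sarcasm, so individual scores can be wrong even when the overall distribution is informative.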

Plot Histogram

Fap133.png

Repeat the same procedure for Samsung and plot the histogram:

Fap134.png

What about the extreme values?

Extreme values are defined as very positive (score > 2) and very negative (score < -2) tweets:

Fap135.png

Combine all the scores into one data frame:

Fap136.png

For each brand and code, we use the ratio of very positive to very negative tweets as the overall sentiment score.

Fap137.png Fap138.png
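The ratio can be computed from a list of per-tweet scores as sketched below. This is a Python illustration, not the project's R code; the thresholds follow the definition of extreme values given above (score > 2 and score < -2), and returning `None` when there are no very negative tweets is an assumption of the sketch. The name `sentiment_ratio` is hypothetical.

```python
def sentiment_ratio(scores, threshold=2):
    """Ratio of very positive (> threshold) to very negative (< -threshold) tweets."""
    very_positive = sum(s > threshold for s in scores)
    very_negative = sum(s < -threshold for s in scores)
    if very_negative == 0:
        return None  # Ratio undefined without very negative tweets (assumption).
    return very_positive / very_negative

# Three very positive scores (3, 4, 5) and one very negative score (-3).
ratio = sentiment_ratio([3, 4, -3, 0, 1, 5])
```

A ratio above 1 means very positive tweets outnumber very negative ones, so a higher ratio indicates better overall sentiment for a brand.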

Based on the result, it seems that iPhone has better sentiment than Samsung.

Fap139.png

Is the analysis accurate?

To assess the accuracy of the algorithm, its result is compared with actual customer satisfaction data. The data can be found on the American Customer Satisfaction Index (ACSI) website, which lists scores for products from both Samsung and iPhone. The scores are averaged to obtain each brand's satisfaction index for the purpose of comparison.

Fap140.png

Average: Samsung = 80, iPhone = 81

Hence, iPhone has a better satisfaction index than Samsung, which supports the findings of the algorithm. The difference between the two brands is smaller in the ACSI data than in the algorithm's result. However, the ACSI data reflects May, whereas our Twitter data targets the recent iPhone 6 launch.