Twitter Analytics: Findings
Descriptive analysis
Characters per tweet
The mean and median tweet length are both around 120 characters, while the maximum is 208 characters.
Based on research by Buddy Media and Track Social, engagement peaks at around 100 characters per tweet. Hence, companies should consider the optimal tweet length rather than simply maximizing the number of characters in a post.
http://blog.bufferapp.com/the-ideal-length-of-everything-online-according-to-science
Words per Tweet
From the graph above, most users tend to tweet 11 or 17 words. As this is the “usual” number of words in a tweet, companies may want to follow the same pattern.
Length of Words per tweet
Users tend to use words that are five, seven or ten characters long in their tweets.
Unique Words per tweet
Users tend to use 11 or 17 unique words per tweet. Compared with the distribution of words per tweet, where users also tend to tweet 11 or 17 words, the two distributions are highly similar, so the majority of users did not repeat any words in their tweets. However, users also tend to have 22 unique words against 24 posted words. This may happen because the longer the tweet, the harder it is to avoid repeating words, resulting in fewer unique words.
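The per-tweet measures described so far (characters, words, unique words and word length) can be sketched as follows. This is a minimal Python illustration, not the original R code; the whitespace tokenization, the case-insensitive uniqueness rule, the function name tweet_stats and the sample tweets are all assumptions.

```python
from statistics import mean, median

def tweet_stats(text):
    """Return simple per-tweet counts for one tweet."""
    words = text.split()  # whitespace tokenization (an assumption)
    return {
        "chars": len(text),
        "words": len(words),
        # Uniqueness is case-insensitive here (an assumption).
        "unique_words": len({w.lower() for w in words}),
        "mean_word_len": mean(len(w) for w in words) if words else 0.0,
    }

# Made-up sample tweets, for illustration only.
sample_tweets = [
    "New iPhone case on sale now",
    "I love my new phone and I love the camera",
]
stats = [tweet_stats(t) for t in sample_tweets]

print("mean chars:", mean(s["chars"] for s in stats))
print("median words:", median(s["words"] for s in stats))
```

In the second sample tweet the repeated words lower the unique count below the word count, which is the effect described for longer tweets.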
Distribution of Hashtag per tweet
It is interesting to note the popularity of hashtags on Twitter. From the figure above, we can observe that the majority of users used at least one hashtag, and some used as many as 8 hashtags in a post. Hence, it is important for companies to recognize the value of hashtags in ensuring that their posts are grouped with related content, which increases their reach.
Mentions per tweet
Although users tend to use hashtags, retweeting may not be popular among these particular iPhone tweets. Apple may want to launch a marketing campaign on Twitter that encourages people to retweet its posts, which would increase their popularity.
Links per tweet
Users tend to include 1 link in their tweets. This may be due to Twitter’s character limit: users redirect their readers to another website to share more about their posts.
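Counting hashtags, mentions and links per tweet can be sketched with regular expressions. This is a minimal Python illustration under stated assumptions: the three patterns are simplifications (for example, a "#" inside a URL fragment would be counted as a hashtag), and the function name count_entities and the sample tweet are hypothetical.

```python
import re

# Simplified patterns (assumptions, not Twitter's official entity rules).
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
LINK = re.compile(r"https?://\S+")

def count_entities(text):
    """Count hashtags, mentions and links in one tweet."""
    return {
        "hashtags": len(HASHTAG.findall(text)),
        "mentions": len(MENTION.findall(text)),
        "links": len(LINK.findall(text)),
    }

# Made-up tweet, for illustration only.
sample = "RT @apple_fan: Win an #iPhone6 today! #giveaway http://example.com/win"
entity_counts = count_entities(sample)
print(entity_counts)
```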
No of words vs Characters
A data frame is used to build the relationship, which is plotted using ggplot(). Based on the graph, there is a relationship between the number of words and the number of characters. However, the correlation is 0.73, which may not be strong enough to be conclusive. This may be because people use more words in their tweets to convey their message rather than lengthening their individual words. To verify this claim, we will investigate the number of words in a tweet against the length of the words.
No of words vs Length of Words
Based on the graph, there is a moderate negative correlation (-0.624) between the number of words and the length of the words. Hence, the longer the tweet, the shorter the words in it will be.
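The correlation values quoted in these two sections are presumably Pearson coefficients; that is an assumption, since the text does not name the method. A minimal Python sketch of the calculation, with a hypothetical function name and made-up data, might look like this:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up values: words per tweet and characters per tweet.
words_per_tweet = [5, 8, 11, 14, 17]
chars_per_tweet = [30, 52, 66, 90, 104]
print(round(pearson(words_per_tweet, chars_per_tweet), 3))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates none.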
Lexical Diversity
Lexical diversity reflects the range of vocabulary used by users on Twitter. It tells us how much word variety there is in the iPhone tweets. Lexical diversity is measured as the number of unique tokens divided by the total number of tokens.
Based on the result, there seems to be low lexical diversity in the iPhone tweets, which may suggest that the topic of interest is highly common.
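The ratio of unique tokens to total tokens can be illustrated with a short Python sketch. The function name lexical_diversity and the sample tokens are assumptions, and the tokens are taken as already split and lowercased.

```python
def lexical_diversity(tokens):
    """Number of unique tokens divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Made-up tokens: 3 unique tokens out of 5 in total.
tokens = "iphone case new iphone case".split()
print(lexical_diversity(tokens))
```

A value close to 1 means almost every token is different, while a value close to 0 means the same few tokens are repeated many times.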
20 Most Used Words
To produce this analysis, all words are sorted by frequency in decreasing order and the top 20 words are chosen. Some of the most frequent words mention iPhone and iPhone 6 news. However, as this is the raw data, the words have not been cleaned yet. To investigate the words further, topic creation is explored in the next section.
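Sorting words by frequency and keeping the top 20 can be sketched in Python with collections.Counter. The function name top_words, the whitespace tokenization and the sample tweets are assumptions, and no cleaning is applied beyond lowercasing.

```python
from collections import Counter

def top_words(tweets, n=20):
    """Return the n most frequent words as (word, count) pairs."""
    counts = Counter(w.lower() for t in tweets for w in t.split())
    return counts.most_common(n)  # sorted by count, descending

# Made-up raw tweets, for illustration only.
raw_tweets = ["iPhone news", "iphone6 news today", "iPhone case"]
print(top_words(raw_tweets, n=2))
```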
Topic Creation
Frequent Terms
Once the data has been properly cleaned, frequent terms can be analyzed further. First, lowfreq is set to 50, which selects the words that appear at least 50 times. The result is as below:
If we raise the threshold to a frequency of at least 500, we can concentrate on a few words that may lead to the creation of different topics, such as “app”, “win”, and “camera”.
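The effect of the frequency threshold can be sketched in Python. The name lowfreq follows the text (it resembles the parameter of findFreqTerms in R's tm package, though the text does not confirm which function was used); the function name frequent_terms and every count below are invented for illustration.

```python
from collections import Counter

def frequent_terms(counts, lowfreq):
    """Return, in alphabetical order, the terms that occur at least lowfreq times."""
    return sorted(term for term, n in counts.items() if n >= lowfreq)

# Invented term counts, not results from the report.
term_counts = Counter(
    {"app": 620, "win": 540, "camera": 510, "screen": 120, "battery": 60, "cable": 12}
)
print(frequent_terms(term_counts, lowfreq=50))
print(frequent_terms(term_counts, lowfreq=500))
```

Raising the threshold shortens the list to the few dominant terms, which is what makes them candidates for topic labels.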
Topic Groupings
Topic creation is done using an algorithm that assigns each word to the cluster it belongs to. Various numbers of clusters (k) are tried to find the optimum number. Thereafter, the top 10 words in each topic are listed for analysis.
5 Topics
Based on 5 topics, there seem to be distinct topics emerging:
4 Topics
Most of the topics remain, except the topic about Apple reviews. Since it is a fairly important independent topic, 5 topics are deemed more suitable and give more insight to the users. We will use 5 topics for further analysis.
6 Topics
If 6 topics are created, topics 3 and 5 appear to repeat each other. Hence, 5 topics is deemed the best option for the rest of the analysis.
Topic Comparison and Contrast
How do iPhone tweets differ from Samsung tweets?
Based on the analysis, Samsung tweets tend to mention the Galaxy product and its price, whereas iPhone tweets talk about apps available for users to download.
How are iPhone tweets similar to Samsung tweets?
The common tweets seem to talk about the “new” product and the accessories available for purchase, such as “leather”, “case”, and “ebay”. If we clean the data further by removing the keywords “leather”, “case”, “ebay”, and “new”, the commonality cloud changes accordingly, as below:
Here, the common topics seem to be the chance of winning an iPhone and the prices of the two brands.
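One way to approximate what a commonality cloud displays is to keep only the terms that occur in both sets of tweets, weighted by the smaller of the two counts, and then drop the excluded keywords. The Python sketch below assumes that weighting; the function name common_terms and all counts are invented.

```python
from collections import Counter

def common_terms(counts_a, counts_b, exclude=()):
    """Terms present in both counters, weighted by the smaller count."""
    shared = counts_a & counts_b  # Counter intersection keeps min(count_a, count_b)
    return Counter({t: n for t, n in shared.items() if t not in exclude})

# Invented term counts for the two brands.
iphone_terms = Counter({"new": 5, "case": 3, "win": 2, "app": 4})
samsung_terms = Counter({"new": 2, "case": 1, "win": 3, "galaxy": 6})

before = common_terms(iphone_terms, samsung_terms)
after = common_terms(
    iphone_terms, samsung_terms, exclude={"leather", "case", "ebay", "new"}
)
print(before)
print(after)
```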
Time Series Analysis
In order to understand how tweets progress over time, time series analysis is explored to find insightful tweet patterns.
How does the tweet volume vary over time?
Iteration 1:
Problem:
- The chart is difficult to read because of the proportion of the black graph.
- The aggregation interval is unknown and is left to the R default.
- It does not tell us how each topic evolves.
How does the tweet volume vary over time?
Iteration 2:
We are able to see the proportion of each topic in the overall volume of tweets and in comparison with other topics. Topic 4 (“accessories”) tends to dominate the tweet volume.
Problem:
- What is shown is not the absolute value. Hence, it is difficult to deduce the actual volume.
- It is not possible to see whether the topics share a common pattern of tweet volume.
What is the volume of tweets for each topic over time?
Iteration 3:
The graph above shows the absolute volume of each topic over time. Topic 4 (“accessories”) tends to have the largest volume of tweets, followed by topic 5 (“App”), while the lowest is topic 2 (“Apple review”).
Is there any pattern of tweet volume among the topics?
The graph shows each topic’s percentage of the total tweet volume. This gives a clearer picture of the proportion of each topic within the tweet volume. It is again apparent that topic 4 (“accessories”) accounts for the majority of the tweets. It is interesting to see that, most of the time, the topics follow a certain pattern and their proportions remain almost the same over time.
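The two views used in these iterations, absolute volume per topic and each topic's share of the total, can be sketched in Python. The hourly aggregation interval, the function names and the sample records are assumptions; the report does not state the interval used.

```python
from collections import Counter, defaultdict
from datetime import datetime

def volume_by_hour(tweets):
    """tweets: iterable of (timestamp, topic). Returns {hour: Counter(topic -> count)}."""
    volume = defaultdict(Counter)
    for ts, topic in tweets:
        hour = ts.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
        volume[hour][topic] += 1
    return volume

def topic_shares(volume):
    """Convert absolute counts into each topic's share of the hourly total."""
    shares = {}
    for hour, counts in volume.items():
        total = sum(counts.values())
        shares[hour] = {topic: n / total for topic, n in counts.items()}
    return shares

# Made-up records: (timestamp, topic number).
records = [
    (datetime(2014, 10, 6, 10, 5), 4),
    (datetime(2014, 10, 6, 10, 20), 4),
    (datetime(2014, 10, 6, 10, 45), 5),
    (datetime(2014, 10, 6, 11, 2), 4),
]
volume = volume_by_hour(records)
shares = topic_shares(volume)
```

Choosing the interval explicitly addresses the problem noted in iteration 1, where the aggregation was left to a default.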
Sentiment Analysis
Sentiment word list
Hu & Liu have published an “opinion lexicon” which categorizes 6800 words as positive or negative and which can be downloaded.
Positive: cool, great, good, amazing
Negative: sucks, awful, worst, nightmare
Add a few industry-specific and/or especially emphatic terms.
Score sentiment for each tweet
The scoring uses an algorithm that simply counts the number of occurrences of “positive” and “negative” words in a tweet. To score all the brands, we just need to feed their tweets into score.sentiment().
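The counting approach can be sketched in Python. This is not the score.sentiment() function itself: it assumes that the score is the number of positive words minus the number of negative words, and it uses only the eight example words listed above rather than the full lexicon. The function name score_sentiment and the sample tweets are hypothetical.

```python
import re

# Only the example words from the text; the full opinion lexicon is much larger.
POSITIVE = {"cool", "great", "good", "amazing"}
NEGATIVE = {"sucks", "awful", "worst", "nightmare"}

def score_sentiment(text, positive=POSITIVE, negative=NEGATIVE):
    """Positive word count minus negative word count (assumed scoring rule)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos_hits = sum(w in positive for w in words)
    neg_hits = sum(w in negative for w in words)
    return pos_hits - neg_hits

# Made-up tweets, for illustration only.
print(score_sentiment("Great camera, amazing screen, good battery"))
print(score_sentiment("Worst update ever, an awful nightmare"))
```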
Plot Histogram
Repeat the same procedure for Samsung and plot the histogram:
What about the extreme value?
Extreme values are defined as very positive (>2) and very negative (<-2) tweets:
Combine all the scores into one data frame:
For each brand (brand + code), let’s use the ratio of very positive to very negative tweets as the overall sentiment score.
Based on the result, it seems that iPhone has better sentiment than Samsung.
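The ratio of very positive to very negative tweets can be sketched in Python using the thresholds given above (>2 and <-2). The function name sentiment_ratio is hypothetical and the scores are invented, so the printed values are not the report's results.

```python
def sentiment_ratio(scores):
    """Ratio of very positive (>2) to very negative (<-2) tweet scores."""
    very_pos = sum(1 for s in scores if s > 2)
    very_neg = sum(1 for s in scores if s < -2)
    if very_neg == 0:
        return None  # ratio is undefined without any very negative tweets
    return very_pos / very_neg

# Invented per-tweet scores for each brand.
brand_scores = {
    "iPhone": [3, 4, -3, 0, 3],
    "Samsung": [3, -3, -4, 1],
}
ratios = {brand: sentiment_ratio(s) for brand, s in brand_scores.items()}
print(ratios)
```

A ratio above 1 means very positive tweets outnumber very negative ones for that brand.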
Is the analysis accurate?
To get a better understanding of the accuracy of the algorithm, its results are compared with real customer satisfaction data. The data can be found on the ACSI (American Customer Satisfaction Index) website, which provides scores for products from both Samsung and iPhone. The scores are averaged to calculate the actual satisfaction index for the purpose of comparison.
Average: Samsung = 80, iPhone = 81
Hence, iPhone has a better satisfaction index than Samsung, which supports the findings of the algorithm. Although the difference between the two brands is smaller in ACSI than in the algorithm, the ACSI data reflects May, whereas our Twitter data targets the recent iPhone 6 launch.
Social Network Analysis
In order to understand how users interact with one another, a social network analysis is done with NodeXL, a free software tool.
Data Collection
Data is collected using NodeXL, with 8159 Twitter users (vertices) and 19055 connections (edges), from 5th November 2014 15:52:50 to 6th November 2014 01:36:01 (GMT 0). This is a different data set from the one used in the previous analysis. It is used to validate whether there is a similar pattern between the posts on 6th October and 5th November (1 month apart).
Data Cleansing I
Data is cleaned to remove people who do not mention or retweet anyone in the group. They are classified as disconnected.
After the first round of cleaning, 4556 vertices are left, which is a suitable data size for analysis.
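The cleaning step can be sketched in Python as removing vertices that have no edge to another user. This is an illustration, not the NodeXL procedure: it assumes edges are (source, target) pairs and that a self-loop (a tweet that mentions nobody else) does not count as a connection. The function name and sample data are hypothetical.

```python
def remove_disconnected(vertices, edges):
    """Keep only the vertices that share at least one edge with another vertex."""
    connected = set()
    for source, target in edges:
        if source != target:  # ignore self-loops (an assumption)
            connected.add(source)
            connected.add(target)
    return [v for v in vertices if v in connected]

# Made-up users and connections.
users = ["ann", "bob", "cat", "dan"]
links = [("ann", "bob"), ("cat", "cat")]
kept = remove_disconnected(users, links)
print(kept)
```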