ANLY482 Team wiki: 2015T2 TeamROLL Data Anlysis/Post

From Analytics Practicum


Exploratory Data Analysis: Time of day over average post reach

Teamroll post1.png

The bars show the number of posts released at specific times of the day, while the line shows the average post reach of all posts released at those times.
We observe that the majority of posts were released in the evening, between 6pm and 12am. However, average post reach declines over the evening. The lunch hour (12pm) has the highest post reach, and the going-home hour (6pm) also has a relatively high post reach.
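
The bar-and-line summary described above can be sketched in a few lines of Python. This is only an illustration: the `posts` sample, its timestamp format and the `reach_by_hour` helper are assumptions made for the example, not the data or tool behind the chart.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample: (publish time, post reach). Values are illustrative only.
posts = [
    ("2015-09-01 12:05", 52000),
    ("2015-09-01 18:30", 41000),
    ("2015-09-01 21:10", 18000),
    ("2015-09-02 12:40", 48000),
    ("2015-09-02 22:15", 22000),
    ("2015-09-02 23:00", 20000),
]

def reach_by_hour(rows):
    """Return {hour: (post_count, average_reach)} for each hour of the day."""
    buckets = defaultdict(list)
    for timestamp, reach in rows:
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").hour
        buckets[hour].append(reach)
    return {h: (len(v), sum(v) / len(v)) for h, v in sorted(buckets.items())}

hourly = reach_by_hour(posts)
for hour, (count, avg_reach) in hourly.items():
    # count corresponds to the bars, avg_reach to the line
    print(f"{hour:02d}:00  posts={count}  avg_reach={avg_reach:.0f}")
```

Reporting the count next to the average makes it easier to see when a high average rests on only a handful of posts.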

Exploratory Data Analysis: Top performing posts in Total reach

Teamroll post2.png

The control chart above shows a detailed breakdown of posts released at different times of the day. Posts within the red box performed exceptionally well: they are in the top 1% of posts by total reach. These 22 posts are:

Teamroll post3.png

We do not observe any distinct trends in topics among these posts.
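
A top 1% cut-off of the kind used for the red box could be computed as below. This is a minimal sketch under assumptions: the `reach` values are made up, and `top_percent_of` is a hypothetical helper, not the procedure used to draw the chart.

```python
def top_percent_of(values, percent=1):
    """Return the highest `percent` percent of values (at least one), largest first."""
    k = max(1, len(values) * percent // 100)  # integer arithmetic avoids float rounding
    return sorted(values, reverse=True)[:k]

# Hypothetical reach figures: 198 ordinary posts plus two outliers.
reach = [1000 + 10 * i for i in range(198)] + [90000, 75000]

top_posts = top_percent_of(reach, percent=1)
cutoff = top_posts[-1]  # smallest reach that still falls inside the top 1%
print(top_posts, cutoff)
```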

Exploratory Data Analysis: No. of Negative Feedback Over Likes

Teamroll post4.png

This scatterplot shows the relationship between the number of likes a post receives and the number of unlikes the same post receives.
In general, we observe a linear trend where posts that received many likes also received more unlikes; the two are positively correlated. Using two control lines marking the top 1% of posts in likes and in unlikes, we segment the plot into four quadrants. The posts highlighted in red, in the bottom right quadrant, are posts that had a high number of likes but a comparatively low number of unlikes.

Teamroll post5.png

Among these five posts, we do not observe any distinct trend in topics.
The posts highlighted in yellow on the scatterplot, in the top left quadrant, are posts that had a low number of likes but an exceptionally high number of unlikes.

Teamroll post6.png

Of these 11 posts, we noticed that the 5 posts marked with a "*" shared a common topic: the Chinese seventh month and ghosts. Such content may not have been favoured by superstitious Singaporeans, hence their exceptionally high numbers of unlikes.
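
The quadrant segmentation in this section can be expressed as a small classification rule. The sketch below assumes likes on the horizontal axis and unlikes on the vertical axis, as the quadrant descriptions imply; the cut-off values and sample posts are hypothetical.

```python
def quadrant(likes, unlikes, likes_cutoff, unlikes_cutoff):
    """Classify a post relative to the two top-1% control lines."""
    high_likes = likes >= likes_cutoff
    high_unlikes = unlikes >= unlikes_cutoff
    if high_likes and not high_unlikes:
        return "bottom right"  # many likes, few unlikes (red)
    if high_unlikes and not high_likes:
        return "top left"      # few likes, many unlikes (yellow)
    if high_likes and high_unlikes:
        return "top right"
    return "bottom left"

# Hypothetical cut-offs and (likes, unlikes) pairs, for illustration only.
LIKES_CUTOFF, UNLIKES_CUTOFF = 20000, 150
sample = [(25000, 40), (3000, 300), (30000, 400), (2000, 10)]
labels = [quadrant(l, u, LIKES_CUTOFF, UNLIKES_CUTOFF) for l, u in sample]
print(labels)
```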


Exploratory Data Analysis: Character Design vs No. of Likes

Teamroll post7.png

In general, Trolls/Memes form the majority of the character designs used in SGAG's posts. Of the less common character designs, "Politicians" and "Foreign Celebrities" appear to be more popular among audiences.

Exploratory Data Analysis: No. of Frames Design vs No. of Likes

Teamroll post8.png

Most posts have 1, 2, or 3 frames in their design. Within this range, the average number of likes decreased as the number of frames increased. However, posts with more than 3 frames appear to garner more likes, perhaps because they are richer in information.

Exploratory Data Analysis: No. of Description Lines Design vs No. of Likes

Teamroll post9.png

Many posts had more than three description lines. In general, as the number of description lines increased, so did the average number of likes. However, for posts with more than three description lines, the average number of likes decreased slightly, perhaps because audiences found such posts increasingly wordy.

Topic Modelling

From the identified word clusters, we formulated the following topic names:

Teamroll post10.png

Overall, a total of 39 topics were generated. Of these, 27 are considered overarching topics and 12 are considered sub-topics.


Topic Modelling: Topic over Number of Posts

Teamroll post11.png

The 5 topics that appeared in the most posts were "Singaporean Life", "Breaking News", "Credited", "Making Fun/Puns" and "Relationships".

Teamroll post12.png

The 5 topics with the highest average likes were "Breaking News", "Singaporean Life", "Credited", "Making Fun/Puns" and "Relationships".

Topic Modelling: Topic vs Likes Distribution - Boxplot

Teamroll post13.png

By comparing the boxplots of each topic, we can see how the distribution of likes differs between topics and identify promising topics whose posts receive a higher than median number of likes. Three such topics are "Breaking News", "Name to Shame/Honour", and "Foreign Talents/Workers".
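
One way to screen for such topics numerically is to compare each topic's median likes with the overall median. The sketch below is illustrative only: the like counts are invented, and "median above the overall median" is an assumed reading of the comparison, not necessarily the exact criterion used with the boxplots.

```python
from statistics import median

# Hypothetical like counts per topic (topic names taken from this page).
likes_by_topic = {
    "Breaking News": [1200, 1500, 3000],
    "Commerce": [300, 400, 500],
    "Relationships": [600, 700, 800],
}

all_likes = [v for values in likes_by_topic.values() for v in values]
overall_median = median(all_likes)

# Topics whose own median is strictly above the overall median.
above_median = [
    topic for topic, values in likes_by_topic.items()
    if median(values) > overall_median
]
print(overall_median, above_median)
```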

Topic Modelling: Sub-Topic Breaking News

Teamroll post15.png
Teamroll post16.png

There are five important sub-topics under the topic "Breaking News". Of these, #GE2015 had the most posts. However, in terms of performance, posts relating to #RIPLKY generally had more likes.


Topic Modelling: Sub-Topic Commerce

Teamroll post17.png
Teamroll post18.png

There are also five important sub-topics under the topic "Commerce". Of these, Carousell had the most posts. In terms of performance, posts relating to Carousell also fared the best, with a much higher median.


Topic Modelling: No. of Topics per post vs Average Likes

Teamroll post18.png

Most posts have between 1 and 3 topics. Sometimes, a single post may have multiple topics attached to it.

Teamroll post19.png

The post above, for example, carries both the topics "Breaking News" and "Media Entertainment". The news was trending across media and news outlets that day (hence "Breaking News"), yet it concerns a local artiste and celebrity (hence "Media Entertainment").

Teamroll post20.png

We observe a slight increase in the average number of likes as the number of topics per post increases from 1 to 3. Perhaps audience members enjoy richer posts that cover a variety of topics. Although the average number of likes peaks at No. of Topics = 5, this is probably an anomaly, since extremely few posts encompass five topics.
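
The caution about the five-topic peak can be made explicit by reporting the number of posts behind each average. A minimal sketch, assuming made-up `(number of topics, likes)` pairs and a hypothetical `min_posts` threshold for treating an average as reliable:

```python
from collections import defaultdict

# Hypothetical (number of topics, likes) pairs, for illustration only.
posts = [(1, 900), (1, 1100), (2, 1200), (2, 1400),
         (3, 1500), (3, 1700), (5, 4000)]

def likes_by_topic_count(rows, min_posts=2):
    """Average likes per topic count, flagging groups with too few posts."""
    groups = defaultdict(list)
    for n_topics, likes in rows:
        groups[n_topics].append(likes)
    return {
        n: {
            "posts": len(v),
            "avg_likes": sum(v) / len(v),
            "reliable": len(v) >= min_posts,  # assumed rule of thumb
        }
        for n, v in sorted(groups.items())
    }

summary = likes_by_topic_count(posts)
for n, row in summary.items():
    print(n, row)
```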

Regression Analysis

Our main software tool for analysis was JMP Pro.

First, we analysed the correlation of each variable with the number of likes. The full correlation table may be found in Appendix (I); the 11 variables with significant correlations were:

Teamroll post21.png

Some of the positively correlated variables are "No. of Description Lines", "Name to Shame/Honour", "Breaking News", "Politicians", "Credited" and "Making Fun/Puns".
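
For reference, the correlation of a single variable with the number of likes can be computed as a Pearson coefficient. The sketch below is not the JMP Pro procedure; the `pearson` helper and the sample values are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: a count variable and a 0/1 topic indicator against likes.
description_lines = [1, 2, 3, 4, 5]
is_breaking_news = [0, 0, 1, 1, 0]
likes = [100, 200, 300, 400, 500]

r_lines = pearson(description_lines, likes)
r_topic = pearson(is_breaking_news, likes)
print(round(r_lines, 3), round(r_topic, 3))
```

A 0/1 topic indicator can be passed to the same function, which is how binary topic variables can sit in the same correlation table as counts such as "No. of Description Lines".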
Next, we constructed a multiple linear regression model to test the effect of the above 11 variables. The model results were:

Teamroll post22.png

Overall, these 11 variables accounted for 8.85% of the variation in the number of likes among posts. Of these, the topics "Name to Shame/Honour", "MGAG" and "SGAG Challenge/Troll" appear to have the greatest impact on the number of likes, as seen from their large estimates of 2725, -1235 and -1123 respectively.
Next, we tested a full model with all of our topics and design attributes. The results are summarised as follows:

Teamroll post23.png

The full model had an improved adjusted R-squared of 9.36%, up from the previous 8.85%. Among the significant variables, the topics "Name to Shame/Honour", "MGAG" and "SGAG Challenge/Troll" still have the greatest impact on the number of likes. Their new estimates are 2704, -1336 and -1093. These differ from the estimates in the first regression model, a likely indication that our variables, particularly topics, are not truly independent of each other. Likewise, whereas the character "Politician" was previously a significant variable, it is no longer significant at the 5% level in the full model. Instead, the characters "Troll Faces/Memes" and "Movie Characters" are new significant variables that are negatively related to the number of likes. These changes again indicate likely influences between variables, and thus non-independent variables in our model.
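
The effect of non-independent variables on the estimates can be reproduced on a toy example. The sketch below is not our JMP Pro model: `ols` is a hypothetical helper that fits ordinary least squares by solving the normal equations, and the two overlapping 0/1 topic indicators and the like counts are invented so that the true effects are 200 and 300 likes.

```python
def ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y."""
    X = [[1.0] + [float(v) for v in row] for row in rows]  # prepend intercept
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Hypothetical posts: two topic indicators that often occur together.
topic_a = [0, 0, 1, 1, 1, 0]
topic_b = [0, 0, 0, 1, 1, 1]
likes = [100, 100, 300, 600, 600, 400]  # built as 100 + 200*a + 300*b

reduced = ols([[a] for a in topic_a], likes)     # topic A only
full = ols(list(zip(topic_a, topic_b)), likes)   # topics A and B

print("A alone:  ", [round(b, 2) for b in reduced])
print("A and B:  ", [round(b, 2) for b in full])
```

In this toy data the estimate for topic A is 300 when fitted alone and 200 once topic B is included, because A absorbs part of B's effect whenever B is left out. The same mechanism would explain estimates and significance levels shifting between the 11-variable model and the full model.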