ANLY482 Team wiki: 2015T2 TeamROLL Project Overview/Description

From Analytics Practicum


Topic Framework

SGAG's original topical framework for content creation was segmented by age and occupation. Based on these segments, the team sampled some posts from SGAG and derived a basic list of potential topics for our framework.

Teamroll tag1.png

However, upon testing the above framework on a small collection of SGAG's posts (approximately 50 posts), we realised that it could not capture many other prominent themes. Given the richness of SGAG's content in capturing various aspects of Singaporean life, the above topics were insufficient, and we expanded the selection to include other prominent themes:

Teamroll tag2.png

We then tested our expanded framework on a much larger collection of SGAG's posts (approximately 300 posts). Again, the team realised that although the most salient aspects of a post's content were captured, even the expanded framework was unable to capture smaller nuances in content, such as a play on words or the intent to flag undesirable behaviour.

Continuing to expand the topical framework and dummy coding each individual post against all the possible topics and themes would result in far too many topical categories, possibly with overlapping themes. Another severe limitation of a fixed topical framework is the difficulty of constantly accommodating newly emerging themes and phasing out obsolete ones. Such dynamism is important to SGAG: as a social media content provider, it faces audience preferences that change frequently, and it must stay relevant.

After much discussion, the team decided that a fixed topical framework would not be appropriate for analysing such a diverse and dynamic content dataset. Instead, a different topic identification method is required, and the team proposes to use "tagging".
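To illustrate the dummy coding discussed above, here is a minimal Python sketch. The topic names are hypothetical placeholders rather than the team's actual framework, and both helper functions are made up for this illustration. It shows how a fixed list yields one 0/1 column per topic, and how any theme outside the list is simply lost.

```python
# Hypothetical fixed framework; the real topic list is in the figures above.
TOPICS = ["school", "work", "food", "transport"]


def dummy_code(post_topics, topics=TOPICS):
    """Return one 0/1 indicator per topic in the fixed framework."""
    present = set(post_topics)
    return [1 if topic in present else 0 for topic in topics]


def uncaptured(post_topics, topics=TOPICS):
    """Return the themes observed in a post that have no column in the framework."""
    return sorted(set(post_topics) - set(topics))


# A post about food that also contains a play on words.
observed = ["food", "word pun"]
row = dummy_code(observed)      # one column per framework topic
missed = uncaptured(observed)   # themes the framework cannot record
print(row, missed)
```

Capturing every such nuance would mean adding a new column each time, which is the growth in categories that the fixed framework could not sustain.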

Tagging Posts - Benefits and Limitations
The team turns to "tagging" content as a flexible way to identify and analyse dynamic content. The notion of "tagging" in our project is similar to the technique of metadata tagging. The concept is not uncommon and can be found in various forms of online content. One example is the Korean celebrity gossip website allkpop.com, which uses article tags to identify celebrities mentioned in the article or key programmes that were featured. Through these tags, readers can easily get an overall idea of the content of the article. Similarly, by tagging topics to SGAG's posts, the team is able to identify the topics represented in this pictorial content.

Tagging Posts - Benefits
A key benefit of tagging in our analysis is that tags can be defined flexibly by their authors. Any number of tags may be used, added or dropped, which allows dynamic identification of new and old content without being limited to a fixed framework. For instance, if a new post introduces a new form of content, such as humorous stories related to a new movie, "Star Wars: The Force Awakens", then a new tag can be created to reflect this, e.g. "Star wars, TFA". Details and nuances in content can thus be easily captured and readily used for analysis later on.

Another key benefit of tagging is its ability to represent the main ideas of a picture post in textual form. This is important for the analysis of SGAG's content because very few tools are available in the market for rapid picture analysis and segmentation. While textual data lends itself to deep and insightful analysis through text mining techniques, the same techniques cannot be applied directly to pictorial data. Using tags to convert pictorial data into textual data is an effective way of making it analysable in large quantities, which matters in our study of thousands of pictorial posts.

Limitations & Mitigation
One of the main criticisms of the "tagging" method is its lack of standardisation: there is no controlled vocabulary. Analysts have noted that since tags are defined by users in a flexible manner, the same idea in a piece of content may be tagged with different words (synonyms) by different users. For instance, to identify an article about "Despicable Me", different users may use the terms "movie", "film" or "cartoon" to represent the same idea. Across multiple collaborators, such differences in wording make later textual analysis more difficult.

To be able to perform segmentation analysis and topic modeling later on, our team decided to mitigate this limitation by combining a fixed tagging chart with flexibility in recording details. For each post, we first tag the post according to the fixed set of 20 generic topics developed. We then add specific tags to capture smaller nuances present in the post, such as "word puns", "Star Wars", etc. This standard tagging procedure allows team members to operate within an overarching topical structure (parent themes), which reduces the number of synonyms used, while affording a good level of detail (children themes) that can be used for detailed analysis later on.

Teamroll tag3.png
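The two-level procedure described above can be sketched in Python as follows. This is a minimal illustration, not the team's actual tooling: the `TaggedPost` record and `tag_post` helper are hypothetical, and the three parent topics are category names mentioned elsewhere on this page, assumed here to be part of the fixed set of 20.

```python
from dataclasses import dataclass, field

# Assumed subset of the 20 generic parent topics (the full chart is in the figure above).
PARENT_TOPICS = {"Making Fun/Puns", "Media Entertainment", "Unbelievable/Ridiculous"}


@dataclass
class TaggedPost:
    post_id: str
    parent_tags: list                                # drawn from the fixed framework
    detail_tags: list = field(default_factory=list)  # free-form children themes


def tag_post(post_id, parent_tags, detail_tags=()):
    """Build a tagged post, rejecting parent tags outside the fixed framework."""
    unknown = [t for t in parent_tags if t not in PARENT_TOPICS]
    if unknown:
        raise ValueError(f"not in the fixed framework: {unknown}")
    # Trim and lowercase detail tags so trivial spelling variants collapse into one.
    cleaned = sorted({t.strip().lower() for t in detail_tags if t.strip()})
    return TaggedPost(post_id, list(parent_tags), cleaned)


post = tag_post(
    "example-post",  # placeholder identifier, not a real post ID
    ["Media Entertainment"],
    ["Star Wars", " star wars ", "word puns"],
)
print(post.parent_tags, post.detail_tags)
```

Restricting parent tags to a closed set keeps the overarching structure stable, while the detail tags stay open-ended for nuances.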

Standard Tagging Procedure - Walkthrough Example
This is an example of how the standard tagging procedure is applied. The post may be obtained at (https://www.facebook.com/sgag.sg/posts/1188289691186017:0).

Teamroll tag4.png
Teamroll tag5.png

Topic Modeling

The focus of Topic Modeling is to sieve out prominent themes from our topic tags. The team aims to draw on the work of Lai & To, Social Media Content Analysis, A Grounded Approach (Lai & To, 2015), for techniques on how to refine our topics for content analysis from SGAG's perspective.
We used SAS Enterprise Miner's Text Mining tools to analyse the topics present in SGAG's posts throughout 2015.

Content Analysis and Regression Modeling

With the above two analysis steps completed, the team will use their findings as inputs for content analysis and regression modeling. Our main software tool for this analysis was JMP Pro.

Limitations

Tagging
Although the tagging method gives a lot of flexibility and granularity in analysing topic shifts in posts, it also lacks standardisation. With multiple users, the number of synonym variations increases, which makes it harder for topic modelling to form compact, independent topics. In our own study, the words "credited" and "submissions" were used interchangeably to describe the same group of posts, which were created by SGAG fans and posted on the SGAG website. This resulted in 2 separate topics during Topic Modelling on SAS EM when there was actually only one topic present. Although the team attempted to reduce synonym variations by employing a 2-in-1 tagging framework (fixed framework + detailed tags), these variations still limit the accuracy of software topic modelling in developing significant, independent topics.
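One possible way to reduce this effect before topic modelling is to map known synonyms to a single canonical tag. The Python sketch below is only an illustration under assumptions: the `CANONICAL` map and `normalise_tags` helper are hypothetical, and the choice of "submissions" and "movie" as the canonical forms is arbitrary. It is not something the team ran in SAS EM.

```python
# Hypothetical synonym map: each variant points to one canonical tag.
# The word pairs come from examples on this page; the canonical choice is an assumption.
CANONICAL = {
    "credited": "submissions",
    "film": "movie",
}


def normalise_tags(tags, canonical=CANONICAL):
    """Lowercase and trim tags, replace known synonyms, and drop duplicates in order."""
    seen = []
    for tag in tags:
        key = tag.strip().lower()
        key = canonical.get(key, key)  # unknown tags pass through unchanged
        if key not in seen:
            seen.append(key)
    return seen


print(normalise_tags(["Credited", "submissions", "word puns"]))
```

With such a map, posts tagged "credited" and posts tagged "submissions" would present the same token to the topic model. The map itself still has to be curated by hand.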

Post Categorisation
Observing topics within a post can be considered a subjective exercise - it depends on the person who is exposed to the post content. For instance in the post below:

Teamroll limit1.png
Example of Topic Categorisation Subjectivity

While this post can be interpreted as "Making Fun/Puns", it could also be categorised as "Media Entertainment" (because of the reference to Minions). Users who did not find the post funny or entertaining, but simply unusual, may classify it as "Unbelievable/Ridiculous" instead. Thus, the manual and subjective nature of post categorisation could lead to human error in image analysis and result in inconsistencies during topic modelling.

Limitations in Regression Analysis
Our regression model was only able to explain approximately 9% of the variation in the number of likes, which was lower than we had hoped. Furthermore, we noted that our variables were not independent of each other; some inter-variable dependencies could be observed. This affected the accuracy and effectiveness of our model. Looking forward, we would suggest that the topic modelling be refined further to produce independent topics. Because most of the variation in the number of likes cannot be explained by our model, we also recommend that future studies expand the number of possible variables and influencers on audience preference. Some possible factors include post-specific audience demographics and the presence or absence of other trending content on Facebook.
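For readers unfamiliar with the two diagnostics mentioned above, the following Python sketch shows how they are typically computed. It assumes the 9% figure refers to the coefficient of determination (R-squared), which would correspond to a value of about 0.09. The numbers are made up purely for illustration and are not the project's data, and the team's actual analysis was done in JMP Pro, not with this code.

```python
from math import sqrt


def r_squared(actual, predicted):
    """Share of variation in `actual` explained by `predicted`: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_res / ss_tot


def pearson(x, y):
    """Pearson correlation between two predictors; values near +/-1 signal dependence."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up values for illustration only.
likes = [10, 20, 30, 40]
predicted = [25, 25, 25, 25]  # a model that always predicts the mean explains nothing
print(r_squared(likes, predicted))

topic_a = [1, 0, 1, 0]  # two topic indicators that always co-occur
topic_b = [1, 0, 1, 0]
print(pearson(topic_a, topic_b))
```

Checking pairwise correlations between topic indicators in this way is one simple means of spotting the inter-variable dependencies noted above before fitting a regression.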

Data Size
Lastly, our dataset had more than 1,200 post observations. As advised by Prof. Kam, a larger dataset of around 5,000 posts or more would be more suitable for effective text mining and topic modelling. As discussed above, our regression analysis based on modelled topics was also not conclusive. We believe that with a larger dataset, differences between the various topic groups would be more apparent and our regression model would be more meaningful to interpret. Looking forward, we would advise that further studies be carried out across all of SGAG's historical data, rather than just the year 2015.

Future Work

During our final feedback session with our sponsor, Mr. Karl expressed that many aspects of our study were interesting and useful in helping him better understand how different content designs impacted audience likes, something the analytics tools he currently subscribes to are unable to show. He would bring these findings back to the production team to evaluate and tweak their content strategies. During our final feedback session with our supervisor, Prof. Kam, and guest instructor, Mr. Prakash, they offered much advice on how this study could be expanded and made more comprehensive. Taking this feedback into consideration, we propose some directions for future studies into content analysis for SGAG:

1. Exploring Network Analysis to push content more effectively
a. As noted earlier, Network Analysis is another important and common analysis method for social media studies. Although we were advised to focus on content analysis in this study, network analysis would be an excellent complement to round off this study in both breadth and depth
b. Work in this area can build on our current findings in terms of topics and design layout attributes, and observe how various key nodes in the network respond to the different topics and layout attributes, as well as the direction in which different content spreads across SGAG's audience base

2. Exploring differences in audience preferences and the impact of content design strategies across other platforms
a. As noted earlier, different platforms had widely differing audience responses. Furthermore, Mr. Karl also noted the diffusion of audience segments across different social media channels (e.g. younger audiences no longer regard Facebook as their preferred social media site)
b. It would be insightful to analyse differences in content preference across the channels, and to evaluate the possibility of differentiating content strategies by channel to increase the effectiveness of the posts
c. Future work can build on our current methodology for post categorisation and use our design attributes as variables for analysis

3. Exploring differences in various similar content sites' strategies and their effectiveness in targeting different audience segments
a. It would be meaningful to do a competitor comparative analysis to see how their content strategies differ from SGAG, and thus, if their audiences are different
b. Some potential competitors in the local Singaporean scene may be SMRT Vigilanteh or Mothership.sg