Difference between revisions of "ANLY482 Team wiki: 2015T2 TeamROLL Project Overview/Description"

From Analytics Practicum
 
  
==<div style="background: #2196F3; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size:24px; border-left: #0D47A1 solid 32px;"><font color="white">Cluster Analysis</font></div>==
 
The team expects to use cluster analysis to segment and profile content posts according to their performance indicators. Preliminary attempts to segment our data with the k-means clustering method gave less than ideal results, frequently producing one large supercluster and multiple small clusters. We have therefore decided to return to deeper exploratory analysis to better understand the distribution of the performance indicators in our dataset, and to address it before k-means clustering is attempted again. At the same time, we may explore other clustering methods, such as nearest neighbour or Ward's, to identify outliers or anomalies in the dataset. The team also aims to meet Prof. Kam shortly to seek clarification on these methods so that they can be applied correctly. The main analysis tool will be SAS Enterprise Guide or JMP.<br>
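
One possible reason for a single supercluster is that the performance indicators are heavily right-skewed, with a few viral posts dominating the distance calculation. This is an assumption, not a finding reported above. The sketch below is plain Python rather than the SAS or JMP workflow named above, and the like counts are invented for illustration. It measures skewness before and after a log(1 + x) transform, a common step to consider before re-running k-means.

```python
import math

def skewness(values):
    """Population skewness (third standardised moment) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

def log_scale(values):
    """Apply log(1 + x), which compresses large counts more than small ones."""
    return [math.log1p(v) for v in values]

# Hypothetical like counts for ten posts (invented data): one post went viral.
likes = [12, 30, 45, 80, 150, 260, 500, 900, 2400, 25000]

raw_skew = skewness(likes)
log_skew = skewness(log_scale(likes))
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

If the transformed indicators are closer to symmetric, distance-based methods such as k-means are less likely to isolate a handful of extreme posts into tiny clusters, although whether that holds for the actual dataset would need to be checked.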
 
 

Revision as of 22:37, 18 April 2016



Topic Framework

SGAG's original basic topical framework for content creation was segmented by age and occupation. Based on these segments, the team sampled some posts from SGAG and derived a basic list of potential topics for our framework.

Teamroll tag1.png

However, upon testing the above framework on a small collection of SGAG's posts (approximately 50 posts), we realised that the framework could not capture many other prominent themes. Given the richness of SGAG's content in capturing various aspects of Singaporean life, the above topics were insufficient, and we expanded the selection to include other prominent themes:

Teamroll tag2.png

We then tested our expanded framework on a much larger collection of SGAG's posts (approximately 300 posts). Again, the team realised that although the most salient aspects of a post's content were captured, even the expanded framework was unable to capture smaller nuances in content, such as a play on words or the intent to call out undesirable behaviour.

Continuing to expand the topical framework and dummy coding each individual post against every possible topic and theme would result in far too many topical categories, possibly with overlapping themes. Another severe limitation of a fixed topical framework is that it cannot easily accommodate newly emerging themes or phase out obsolete ones. Such dynamism is important to SGAG: as a social media content provider, it faces frequently changing audience preferences and needs to stay relevant.

After much discussion, the team decided that a fixed topical framework would not be appropriate for analysing such a diverse and dynamic content dataset. Instead, a different topic identification method is required, and the team proposes to use "tagging".

Tagging Posts - Benefits and Limitations
The team turns to "tagging" content as a flexible way to identify and analyse dynamic content. The notion of "tagging" in our project is similar to the technique of metadata tagging. The concept is not uncommon and can be found in various forms of online content. One example is the Korean celebrity gossip website allkpop.com, which uses article tags to identify the celebrities mentioned in an article or the key programs that were featured. Through these tags, readers can easily get an overall idea of the content of the article. Similarly, by tagging SGAG's posts with topics, the team is able to identify the topics represented in this pictorial content.

Tagging Posts - Benefits
A key benefit of using tagging in our analysis is that tags can be defined flexibly by authors. Any number of tags may be used, added or dropped, which allows for dynamic identification of new and old content without being limited to a fixed framework. For instance, if a new post carries a new form of content, such as humorous stories related to a new movie, "Star Wars: The Force Awakens", then a new tag can be created to reflect this, for example the tag "Star wars, TFA". Details and nuances in content can thus be easily captured and readily used for analysis later on.

Another key benefit of tagging is its ability to represent the main ideas of a picture post in textual form. This is important for the analysis of SGAG's content because there are very limited tools available in the market for rapid picture analysis and segmentation. While textual data lends itself to deep and insightful analysis through text mining techniques, the same techniques cannot be applied to pictorial data. Using unique tags to convert pictorial data into textual data is an effective way of making pictorial data easier to analyse in large quantities. This is important in our study of thousands of pictorial posts.

Limitations & Mitigation
One of the main criticisms of the "tagging" method is that it is hard to standardise, since there is no controlled vocabulary system. Analysts have noted that because tags are defined flexibly by users, similar ideas in a piece of content may be tagged with different words by different users, that is, with synonyms. For instance, to identify an article about "Despicable Me", different users may use the terms "movie", "film" or "cartoon" to represent the same idea. With multiple collaborators, such differences in wording make textual analysis more difficult later.

In order to perform segmentation analysis and topic modeling for our project later on, our team has decided to mitigate this limitation with the dual use of a tagging chart and flexibility in recording details. For each post, we first tag the post according to the fixed set of 20 generic topics developed. We then add specific tags to capture smaller nuances present in the post, such as "word puns", "Star Wars", etc. This standard tagging procedure allows team members to operate within an overarching topical structure (parent themes), which reduces the number of synonyms used, while affording a good level of detail (children themes) which can be used for detailed analysis later on.

Teamroll tag3.png
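
The parent and child tagging procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parent topics and the synonym map below are hypothetical stand-ins (the team's actual chart of 20 topics is not reproduced here), and the function names are invented for this example.

```python
# Hypothetical parent topics; the real tagging chart has 20 generic topics.
PARENT_TOPICS = {"entertainment", "food", "transport"}

# Hypothetical synonym map that folds variant child tags into one canonical form.
SYNONYMS = {"film": "movie", "cartoon": "movie", "puns": "word puns"}

def normalise(tag):
    """Lower-case a tag, collapse whitespace, and map known synonyms."""
    cleaned = " ".join(tag.lower().split())
    return SYNONYMS.get(cleaned, cleaned)

def tag_post(parents, children):
    """Return the normalised parent and child tags for one post.

    Parent tags must come from the fixed chart; child tags are free-form.
    """
    parent_tags = sorted({normalise(p) for p in parents})
    unknown = [p for p in parent_tags if p not in PARENT_TOPICS]
    if unknown:
        raise ValueError(f"not in the tagging chart: {unknown}")
    child_tags = sorted({normalise(c) for c in children})
    return {"parents": parent_tags, "children": child_tags}

# Two taggers used "Film" and "cartoon" for the same idea; both become "movie".
post = tag_post(["Entertainment"], ["Film", "Star Wars", "cartoon"])
print(post)
```

Restricting parent tags to a fixed set while leaving child tags open mirrors the trade-off described above: the chart keeps collaborators consistent, and the free-form tags keep the detail.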

Standard Tagging Procedure - Walkthrough Example
This is an example of how the standard tagging procedure is applied. The post may be obtained at (https://www.facebook.com/sgag.sg/posts/1188289691186017:0).

Teamroll tag4.png
Teamroll tag5.png

Topic Modeling

The focus of Topic Modeling is to sieve out prominent themes from our topic tags. The team aims to refer to the work of Lai & To, Social Media Content Analysis, A Grounded Approach (Lai & To, 2015), for techniques to refine our topics for content analysis from SGAG's perspective.
With some of the suggested methodology in mind, we will meet with Prof. Kam shortly to discuss the methods used, and seek further advice on how they may be suitably applied in our project. A key technique likely to be used is text mining, with SAS Enterprise Miner being the main software tool.
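
As a rough illustration of what a first pass over the tags might look like, the sketch below counts how often each tag appears and which tags appear together. It uses plain Python as a stand-in and is not the SAS Enterprise Miner workflow the team intends to use; the tagged posts are invented for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tagged posts (invented data), one list of tags per post.
posts = [
    ["movie", "star wars", "word puns"],
    ["movie", "star wars"],
    ["transport", "word puns"],
]

def tag_frequencies(tagged_posts):
    """Count the number of posts in which each tag appears."""
    return Counter(tag for tags in tagged_posts for tag in set(tags))

def tag_cooccurrences(tagged_posts):
    """Count the number of posts in which each unordered pair of tags appears."""
    pairs = Counter()
    for tags in tagged_posts:
        # Sorting makes ("movie", "star wars") and ("star wars", "movie") one key.
        pairs.update(combinations(sorted(set(tags)), 2))
    return pairs

freq = tag_frequencies(posts)
pairs = tag_cooccurrences(posts)
print(freq.most_common(3))
print(pairs.most_common(3))
```

Frequently co-occurring tags are one simple signal of a candidate theme, though a proper topic model would weigh far more than raw counts.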

Content Analysis and Regression Modeling

With the above two analysis steps completed, the team will use their findings as inputs for content analysis and regression modeling. Further discussion is required to clarify how content analysis and regression modeling are to be carried out and what further data preparation is required. However, the team currently expects that cluster profiles will reflect certain prominent topics that contribute to the performance of such posts.
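
One form of data preparation that regression on tags would likely need is dummy coding, where each tag becomes a 0/1 indicator column. The sketch below shows the idea in plain Python. The data and the function name are hypothetical, and the team's eventual preparation steps may differ.

```python
def dummy_code(tagged_posts):
    """Turn per-post tag lists into a 0/1 indicator matrix.

    Returns the sorted column names and one row of indicators per post.
    """
    columns = sorted({tag for tags in tagged_posts for tag in tags})
    rows = [[1 if col in tags else 0 for col in columns] for tags in tagged_posts]
    return columns, rows

# Hypothetical tagged posts (invented data).
posts = [["movie", "word puns"], ["transport"], ["movie"]]

columns, rows = dummy_code(posts)
print(columns)
for row in rows:
    print(row)
```

Each row could then be paired with a post's performance indicator as the response variable. Rare tags produce sparse columns, so some grouping under parent themes may be needed first.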