Difference between revisions of "ANLY482 Team wiki: 2015T2 TeamROLL Data Analysis"

From Analytics Practicum
Jump to navigation Jump to search
 
Line 39: Line 39:
  
 
| style="padding:0 .3em;  solid #000000;  padding: 10px; text-align:center; background-color:#daeeff; border-right:0px solid #4690cd; " width="33%" | [[ANLY482 Team wiki: 2015T2 TeamROLL Data Analysis| <font face = "Arial" color="#3C2415"><b>Data Cleaning</b></font>]]
 
| style="padding:0 .3em;  solid #000000;  padding: 10px; text-align:center; background-color:#daeeff; border-right:0px solid #4690cd; " width="33%" | [[ANLY482 Team wiki: 2015T2 TeamROLL Data Analysis| <font face = "Arial" color="#3C2415"><b>Data Cleaning</b></font>]]
| style="padding:0 .3em;  solid #000000;  padding: 10px; text-align:center; background-color:white; border-right:0px solid #4690cd; " width="33%" | [[ANLY482 Team wiki: 2015T2 TeamROLL Data Anlysis/Page| <font face = "Arial" color="#3C2415"><b>Page EDA</b></font>]]
+
| style="padding:0 .3em;  solid #000000;  padding: 10px; text-align:center; background-color:white; border-right:0px solid #4690cd; " width="33%" | [[ANLY482 Team wiki: 2015T2 TeamROLL Data Anlysis/Page| <font face = "Arial" color="#3C2415"><b>Page-Level Analysis</b></font>]]
| style="padding:0 .3em;  solid #000000; text-align:center; background-color:white; border-right:0px solid #4690cd; " width="34%" | [[ANLY482 Team wiki: 2015T2 TeamROLL Data Anlysis/Post| <font face = "Arial" color="#101010"><b>Post EDA</b></font>]]
+
| style="padding:0 .3em;  solid #000000; text-align:center; background-color:white; border-right:0px solid #4690cd; " width="34%" | [[ANLY482 Team wiki: 2015T2 TeamROLL Data Anlysis/Post| <font face = "Arial" color="#101010"><b>Post-Level Analysis</b></font>]]
 
|
 
|
  
Line 46: Line 46:
  
 
<!--END OF Sub-Navigation-->
 
<!--END OF Sub-Navigation-->
 +
==<div style="background: #2196F3; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size:24px; border-left: #0D47A1 solid 32px;"><font color="white">Data Collection</font></div>==
 +
Comparing the different number of followers for SGAG’s various social media platforms (namely Facebook, Twitter, Instagram and Youtube), we have noted that the majority of their followers, with up to 53.2%, are Facebook users. We have thus, decide to narrow our project scope to SGAG’s Facebook data. <br>
 +
Subsequently, the team gathered SGAG’s lifetime data (years 2012 to 2016) from facebook insights. However, for our analysis, we have selected page and post data for the year 2015. Since the raw data extracted included a wide number of variables, the team have selected only relevant variables that will be used for our analysis.
 +
 
==<div style="background: #2196F3; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size:24px; border-left: #0D47A1 solid 32px;"><font color="white">Data Cleaning Log</font></div>==
 
==<div style="background: #2196F3; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size:24px; border-left: #0D47A1 solid 32px;"><font color="white">Data Cleaning Log</font></div>==
 
<center><b>  
 
<center><b>  
Line 54: Line 58:
 
==<div style="background: #2196F3; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size:24px; border-left: #0D47A1 solid 32px;"><font color="white">Page Data</font></div>==
 
==<div style="background: #2196F3; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size:24px; border-left: #0D47A1 solid 32px;"><font color="white">Page Data</font></div>==
 
<div style="padding-bottom:25px;"> [[File:teamroll_clean1.png|600px|center]] </div>
 
<div style="padding-bottom:25px;"> [[File:teamroll_clean1.png|600px|center]] </div>
Firstly, attributes were selected for analysis, which were
+
<b>Attribute Selection </b><br>
 +
We have discussed the scope of our analysis with our project sponsor which resulted in the following selected attributes for analysis:
 
<div style="padding-bottom:25px;"> [[File:teamroll_clean2.png|600px|center]] </div>
 
<div style="padding-bottom:25px;"> [[File:teamroll_clean2.png|600px|center]] </div>
Other attributes were removed from analysis after discussion with our project sponsor. The main reasons for doing so would be: 1) our focus on net rather than organic metrics, 2) redundancy of paid posts as no posts were paid for, 3) redundancy of check-in metrics since this was not a hotel or destination services business, and 4) the omission of video formats for current analysis which is limited to pictorial posts.
+
We have added a column titled ‘Engaged Users’ to identify the proportion of users who have engaged with the page over the number of users who have seen content publish from the page. This allow us to identify the day and post that generated higher engagement level. <br>
 +
Other attributes were removed because:<br>
 +
1) Focus on net rather than organic metrics <br>
 +
2) Redundancy of paid posts as no posts were paid for <br>
 +
3) Redundancy of check-in metrics since this was not a hotel or destination services business <br>
 +
4) Omission of video, link and status post types formats for current analysis which is limited to pictorial posts <br>
 
<br>
 
<br>
Secondly, we checked the file for missing data entries. Since data was recorded on a daily basis, we found that all days were accounted for our period of review, 1 Jan 2015 - 31 Dec 2015. Thus, there was no need to treat any missing data points.
+
<b>Merge excel worksheets into one single file </b><br>
<br>
+
Using Microsoft Excel, we merged the various excel files into a single worksheet.
Lastly, with the above two processes in place, we now have our completed Page-Level analytical data cube ready to be used for data analysis.
 
  
 
==<div style="background: #2196F3; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size:24px; border-left: #0D47A1 solid 32px;"><font color="white">Post Data</font></div>==
 
==<div style="background: #2196F3; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size:24px; border-left: #0D47A1 solid 32px;"><font color="white">Post Data</font></div>==
Data preparation for Post Level data was more complicated due to its granularity and larger number of observations recorded.  
+
Similarly, the post-level data was exported in batches of excel files. Due to the larger number of observations recorded, data preparation for post level data was more complicated.  
 
<div style="padding-bottom:25px;"> [[File:teamroll_clean3.png|600px|center]] </div>
 
<div style="padding-bottom:25px;"> [[File:teamroll_clean3.png|600px|center]] </div>
Firstly, post level data extraction resulted in a combination of data sheets each recording different aspects of post performance. Some examples are: Key performance metrics.tab, Lifetime talking about this.tab, Lifetime Negative Feedback.tab, etc. These different metrics are identified for individual posts via unique Post IDs. Thus, the function =index(match) on Excel was used to recombine all these metrics into a single data sheet, matched by Post IDs.<br><br>
+
Post level data extracted from Facebook Insights resulted in a combination of data sheets each recording different aspects of post-performance. These different metrics are identified for individual posts via unique Post IDs.<br>
Secondly, attributes were selected for analysis, which were
+
<b>Attributes Selection </b> <br>
 +
Based on our business and analytical objectives, we have narrowed the large number of tabs and columns and selected the following:
 
<div style="padding-bottom:25px;"> [[File:teamroll_clean4.png|600px|center]] </div>
 
<div style="padding-bottom:25px;"> [[File:teamroll_clean4.png|600px|center]] </div>
Similar to page level attributes selection, other attributes were removed from analysis after discussion with our project sponsor. The main reasons for doing so would also be: 1) focus on net rather than organic metrics, 2) redundancy of paid posts as no posts were paid for, 3) the omission of video formats for current analysis which is limited to pictorial posts, and 4) only direct engagement response metrics "like", "share", "comment" and direct negative feedback metrics "hide_all_clicks", "hide_clicks", "report_spam_clicks" and "unlike_page_clicks", were included in addition to general performance metrics since the other optional additional attributes were already well represented by the general performance metrics.<br><br>
+
Similar to page level attributes selection, other attributes were removed from analysis after discussion with our project sponsor. The main reasons for doing so would also be: <br>
Thirdly, of the three types of post formats recorded ("Types"), we selected only "photo" and removed "links" and "video" formats. "Photo" indicated the bulk of SGAG's content, which are memes. "Link" and "Video" indicated secondary content types of listicles and youtube videos, which will not be the focus of our study.<br><br>
+
1) Focus on net rather than organic metrics<br>
Fourthly, each selected post was tagged according to their topics and design attributes. As discussed above, we used a textual tagging system to indicate post topics and themes. For design attributes, once again in consultation with SGAG, we identified three main areas of design, namely 1) character use, 2) number of frames, and 3) number of description lines within the picture. Character use had 6 main character types, and dummy coding was used. Number of frames was indicated with either "1", "2" "3" or ">3", since most of the posts were designed to attempt to fall within three or less frames. Similarly, number of description lines is also indicated with either "1", "2", "3" or ">3". <br><br>
+
2) Redundancy of paid posts as no posts were paid for<br>
Fifthly, the team noticed that the attributed "Posted" which recorded the date and time of post release contained a large number of posts which were released between 000h-0600h daily. This is strange as these are the sleeping hours of Singaporeans and would thus make no sense for anyone to be releasing the posts so late at night. Through online research, we found that Facebook Insights recorded "Posted" according to Pacific Time rather than local time. As such, there was a need to recode "Posted" forward by 16 hours to match with local SG time. This was done by adding 16 hours ( =+0.667) to the previous recorded time. With a check on the newly calculated local time, majority of posts were released within the expected timings of 0900-2100h in local time.<br><br>
+
3) Omission of video formats for current analysis which is limited to pictorial posts<br>
Sixthly, the team examined the data for missing values and found some observations with missing values for performance attributes. The cause for these missing values is not known, though we suspect the cause to be another limitation in Facebook Insights attribute retrieval for specific types of posts, such as linked posts. However, the number of such missing values is very small, comprising around 2% of our dataset. As such, we have decided to omit these missing values during our analysis. <br><br>
+
4) Only direct engagement response metrics number of the number of "like" and lifetime negative feedback to identify weak and good performing posts. <br><br>
Lastly, with the above processes in place, we now have our completed Post-Level analytical data cube ready to be used for data analysis.
+
 
 +
<b>Columns Creation </b><br>
 +
Implementing the tagging framework, we have added the “Tags” column to specify the topics related to the post. <br>
 +
For design attributes, once again in consultation with SGAG, we identified three main areas of design, namely <br>
 +
1) Character used: Animals, Local Characters, Foreign Celebs, Troll faces/Memes, Movie Characters, Politicians <br>
 +
2) Number of frames; indicated with either "1", "2" "3" or ">3", since most of the posts were designed to attempt to fall within three or less frames <br>
 +
3) Number of description lines within the picture; indicated with either "1", "2", "3" or ">3".<br>
 +
 
 +
<b>Recode Date Posted to Singapore Time </b> <br>
 +
The team noticed that the attributed "Posted" which recorded the date and time of post release contained a large number of posts which were released between 000h-0600h daily. This is strange as these are the sleeping hours of Singaporeans and would thus make no sense for anyone to be releasing the posts so late at night. Through online research, we found that Facebook Insights recorded "Posted" according to Pacific Time rather than local time. As such, there was a need to recode "Posted" forward by 16 hours to match with local SG time. This was done by adding 16 hours ( =+0.667) to the previous recorded time. With a check on the newly calculated local time, majority of posts were released within the expected timings of 0900-2100h in local time.<br><br>
 +
<b>Check for missing entries </b><br>
 +
The team examined the data for missing values and found some observations with missing values for performance attributes. The cause for these missing values is not known, though we suspect the cause to be another limitation in Facebook Insights attribute retrieval for specific types of posts, such as linked posts. However, the number of such missing values is very small, comprising around 2% of our dataset. As such, we have decided to omit these missing values during our analysis.  
 +
With the above processes in place, we now have our completed Post-Level analytical data cube ready to be used for data analysis.<br><br>
 +
 
 +
==<div style="background: #2196F3; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size:24px; border-left: #0D47A1 solid 32px;"><font color="white">Topic Modelling</font></div>==
 +
 
 +
We used SAS Enterprise Miner's Text Mining tools to analyse the topics present in SGAG's post throughout 2015. The steps to topic mine can be found in the SAS EM diagram below:<br>
 +
<div style="padding-bottom:25px;"> [[File:teamroll_clean5.png|600px|center]] </div>
 +
First, we parsed the text tags, followed by text filtering and lastly text topic modelling. The text topic nodes were used twice, firstly to identify overarching topics, and the second time to build user-defined sub topics. Some examples of the latter included "Commerce" related topics such as "scoot" and "carousell". <br><br>
 +
The interactive topic viewer was used to do a brief check on the posts' topic classification. After which, the posts and their identified topics were saved to an excel spreadsheet. Topics which had <50 posts in their category were removed from the list as we felt these could be considered largely insignificant topics. However, we retained some of these topics, namely "National Service", "Police" and "Foreign Talents/Worker" as these were more relevant to SGAG's content creation framework.

Latest revision as of 13:20, 18 April 2016

T(eam)ROLL.png

Teamroll home.png   HOME

 

Teamroll.png   ABOUT US

 

Teamroll overview.png   PROJECT OVERVIEW

 

Teamroll this.png   DATA ANALYSIS

 

Teamroll mgmt.png   PROJECT MANAGEMENT

 

Teamroll doc.png   DOCUMENTATION

Data Cleaning Page-Level Analysis Post-Level Analysis

Data Collection

Comparing the different number of followers for SGAG’s various social media platforms (namely Facebook, Twitter, Instagram and Youtube), we have noted that the majority of their followers, with up to 53.2%, are Facebook users. We have thus, decide to narrow our project scope to SGAG’s Facebook data.
Subsequently, the team gathered SGAG’s lifetime data (years 2012 to 2016) from facebook insights. However, for our analysis, we have selected page and post data for the year 2015. Since the raw data extracted included a wide number of variables, the team have selected only relevant variables that will be used for our analysis.

Data Cleaning Log

Teamroll log1.png

Data Cleaning Log

Page Data

Teamroll clean1.png

Attribute Selection
We have discussed the scope of our analysis with our project sponsor which resulted in the following selected attributes for analysis:

Teamroll clean2.png

We have added a column titled ‘Engaged Users’ to identify the proportion of users who have engaged with the page over the number of users who have seen content publish from the page. This allow us to identify the day and post that generated higher engagement level.
Other attributes were removed because:
1) Focus on net rather than organic metrics
2) Redundancy of paid posts as no posts were paid for
3) Redundancy of check-in metrics since this was not a hotel or destination services business
4) Omission of video, link and status post types formats for current analysis which is limited to pictorial posts

Merge excel worksheets into one single file
Using Microsoft Excel, we merged the various excel files into a single worksheet.

Post Data

Similarly, the post-level data was exported in batches of excel files. Due to the larger number of observations recorded, data preparation for post level data was more complicated.

Teamroll clean3.png

Post level data extracted from Facebook Insights resulted in a combination of data sheets each recording different aspects of post-performance. These different metrics are identified for individual posts via unique Post IDs.
Attributes Selection
Based on our business and analytical objectives, we have narrowed the large number of tabs and columns and selected the following:

Teamroll clean4.png

Similar to page level attributes selection, other attributes were removed from analysis after discussion with our project sponsor. The main reasons for doing so would also be:
1) Focus on net rather than organic metrics
2) Redundancy of paid posts as no posts were paid for
3) Omission of video formats for current analysis which is limited to pictorial posts
4) Only direct engagement response metrics number of the number of "like" and lifetime negative feedback to identify weak and good performing posts.

Columns Creation
Implementing the tagging framework, we have added the “Tags” column to specify the topics related to the post.
For design attributes, once again in consultation with SGAG, we identified three main areas of design, namely
1) Character used: Animals, Local Characters, Foreign Celebs, Troll faces/Memes, Movie Characters, Politicians
2) Number of frames; indicated with either "1", "2" "3" or ">3", since most of the posts were designed to attempt to fall within three or less frames
3) Number of description lines within the picture; indicated with either "1", "2", "3" or ">3".

Recode Date Posted to Singapore Time
The team noticed that the attributed "Posted" which recorded the date and time of post release contained a large number of posts which were released between 000h-0600h daily. This is strange as these are the sleeping hours of Singaporeans and would thus make no sense for anyone to be releasing the posts so late at night. Through online research, we found that Facebook Insights recorded "Posted" according to Pacific Time rather than local time. As such, there was a need to recode "Posted" forward by 16 hours to match with local SG time. This was done by adding 16 hours ( =+0.667) to the previous recorded time. With a check on the newly calculated local time, majority of posts were released within the expected timings of 0900-2100h in local time.

Check for missing entries
The team examined the data for missing values and found some observations with missing values for performance attributes. The cause for these missing values is not known, though we suspect the cause to be another limitation in Facebook Insights attribute retrieval for specific types of posts, such as linked posts. However, the number of such missing values is very small, comprising around 2% of our dataset. As such, we have decided to omit these missing values during our analysis. With the above processes in place, we now have our completed Post-Level analytical data cube ready to be used for data analysis.

Topic Modelling

We used SAS Enterprise Miner's Text Mining tools to analyse the topics present in SGAG's post throughout 2015. The steps to topic mine can be found in the SAS EM diagram below:

Teamroll clean5.png

First, we parsed the text tags, followed by text filtering and lastly text topic modelling. The text topic nodes were used twice, firstly to identify overarching topics, and the second time to build user-defined sub topics. Some examples of the latter included "Commerce" related topics such as "scoot" and "carousell".

The interactive topic viewer was used to do a brief check on the posts' topic classification. After which, the posts and their identified topics were saved to an excel spreadsheet. Topics which had <50 posts in their category were removed from the list as we felt these could be considered largely insignificant topics. However, we retained some of these topics, namely "National Service", "Police" and "Foreign Talents/Worker" as these were more relevant to SGAG's content creation framework.