Difference between revisions of "ANLY482 AY2017-18T2 Group30 Instagram"

From Analytics Practicum
(Created page with "<!--Team Logo--> center|300px| <!--End of Team Logo--> <br/> <!--Main Navigation--> <center> {|style="background-color:#5A6B96; color:#5A6B96; width=...")
 
 
(9 intermediate revisions by 3 users not shown)
Line 9: Line 9:
 
{|style="background-color:#5A6B96; color:#5A6B96; width="100%" cellspacing="0" cellpadding="10" border="0" |
 
{|style="background-color:#5A6B96; color:#5A6B96; width="100%" cellspacing="0" cellpadding="10" border="0" |
  
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="14%" | [[ANLY482_AY2017-18_T2_Group_30|<font color="#FFFFFF" size=3><b>HOME</b></font>]]
+
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="12.5%" | [[ANLY482_AY2017-18_T2_Group_30|<font color="#FFFFFF" size=3><b>HOME</b></font>]]
  
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="14%" | [[ANLY482_AY2017-18T2_Group30 About Us |<font color="#FFFFFF" size=3><b>ABOUT US</b></font>]]
+
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="12.5%" | [[ANLY482_AY2017-18T2_Group30 About Us |<font color="#FFFFFF" size=3><b>ABOUT US</b></font>]]
  
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="14%" | [[ANLY482_AY2017-18T2_Group30 Project Overview |<font color="#FFFFFF" size=3><b>PROJECT OVERVIEW </b></font>]]
+
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="12.5%" | [[ANLY482_AY2017-18T2_Group30 Project Overview |<font color="#FFFFFF" size=3><b>PROJECT OVERVIEW </b></font>]]
  
|style="font-size:88%; border-left:1px solid #347cc4; border-right:1px solid #347cc4; text-align:center; border-bottom:1px solid #347cc4; border-top:1px solid #347cc4;" width="14%" |[[ANLY482_AY2017-18T2_Group30 Data Analysis |<font color="#347cc4" size=3><b>PROJECT FINDINGS</b></font>]]
+
|style="font-size:88%; border-left:1px solid #347cc4; border-right:1px solid #347cc4; text-align:center; border-bottom:1px solid #347cc4; border-top:1px solid #347cc4;" width="12.5%" |[[ANLY482_AY2017-18T2_Group30 Data Analysis |<font color="#347cc4" size=3><b>EDA</b></font>]]
  
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="14%" | [[ANLY482_AY2017-18T2_Group30 Project Management |<font color="#FFFFFF" size=3><b>PROJECT MANAGEMENT</b></font>]]
+
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="12.5%" | [[ANLY482_AY2017-18T2_Group30 Business Objectives |<font color="#FFFFFF" size=3><b>BUSINESS OBJECTIVES </b></font>]]
  
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="15%" |[[ANLY482_AY2017-18T2_Group30 Documentation | <font color="#FFFFFF" size=3><b>DOCUMENTATION</b></font>]]
+
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="12.5%" | [[ANLY482_AY2017-18T2_Group30 Project Management |<font color="#FFFFFF" size=3><b>PROJECT MANAGEMENT</b></font>]]
  
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="15%" |[[ANLY482_AY2017-18_Term_2 | <font color="#FFFFFF" size=3><b>MAIN PAGE</b></font>]]
+
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="12.5%" |[[ANLY482_AY2017-18T2_Group30 Documentation | <font color="#FFFFFF" size=3><b>DOCUMENTATION</b></font>]]
 +
 
 +
|style="font-size:88%; border-left:1px solid #ffffff; border-right:1px solid #ffffff; text-align:center; background-color:#347cc4; " width="12.5%" |[[ANLY482_AY2017-18_Term_2 | <font color="#FFFFFF" size=3><b>MAIN PAGE</b></font>]]
 
|}  
 
|}  
 
</center>
 
</center>
Line 47: Line 49:
 
<!---------------END of sub menu ---------------------->
 
<!---------------END of sub menu ---------------------->
 
<br>
 
<br>
<div align="center">
 
<div style=" width: 85%; background: #E6EDFA; padding: 12px; font-family: Arimo; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #8c8d94 solid 32px;"><font color="#5A6B96">Data Source</font></div>
 
<div style="width:90%;">
 
<font style="text-align: left">
 
<p>
 
<b>Facebook</b><br>
 
For data files from <i>Facebook Insights Data Export (Post Level)</i>, the sponsor provided exported data from different periods of the year, with different metric tabs in Excel format.  The tabs included are:
 
# Key Metrics
 
# Lifetime: Number of unique people who have created a story about your Page post by interacting with it (unique users)
 
# Lifetime: Number of people who have clicked anywhere in your post, by type (unique users)
 
# Lifetime: Number of people who have given negative feedback on your post, by type (unique users)
 
<br></p>
 
  
<font style="text-align: left">
+
==<div style=" width: 96.5%; background: #E6EDFA; padding: 12px; font-family: Arimo; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #8c8d94 solid 32px;"><font color="#5A6B96">Data Source</font></div>==
<p>
 
For data files from <i>Facebook Insights Data Export (Video Post)</i>, the sponsor provided exported data from different periods of the year, with different metric tabs in Excel format.  The tabs included are:
 
# Lifetime Post Total Impression/Reach/Views
 
# Geographic Views
 
# Demographic Views
 
# Lifetime Post Total Views by (page_owned / Shared)
 
</p></font>
 
  
<div style="width:90%;">
+
<div style="width:96.5%;">
<font style="text-align: left">
+
<font>
<b>YouTube</b><br/>
 
For data files from <i>YouTube(Watch Time)</i>, the sponsor provided exported data for Watch Time, with different metric tabs in Excel format. The tabs included are:
 
# Video
 
#Geography
 
#Date
 
#Subscription Status
 
#Youtube Product
 
#Device Type
 
#Subtitles and CC
 
#Video Information Language
 
<br>
 
For data files from <i>YouTube(Demographics)</i>, the sponsor provided exported data for watch time for different Demographic, with different metric tabs in Excel format. The tabs included are:
 
# Viewer Age
 
# Viewer Gender
 
<br>
 
For data files from <i>YouTube(Traffic Sources)</i>, the sponsor provided exported data for watch time from different traffic source type
 
</font>
 
</div>
 
 
 
<br/><b>Instagram</b><br/>
 
 
To retrieve data from the company's Instagram account, we used a web-scraping script from [https://github.com/timgrossmann/instagram-profilecrawl GitHub]. We modified the script to also capture the timestamp and caption. The data includes:
 
To retrieve data from the company's Instagram account, we used a web-scraping script from [https://github.com/timgrossmann/instagram-profilecrawl GitHub]. We modified the script to also capture the timestamp and caption. The data includes:
 
* Caption
 
* Caption
Line 98: Line 61:
 
* No. of Likes
 
* No. of Likes
 
* No. of Comments
 
* No. of Comments
 +
</font></div>
 +
 +
==<div style=" width: 96.5%; background: #E6EDFA; padding: 12px; font-family: Arimo; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #8c8d94 solid 32px;"><font color="#5A6B96">Data Preparation</font></div>==
 +
 +
<div style="width:96.5%;">
 +
<font>
 +
After scraping the data, we realised that the data needed cleaning. The indexes of the column values were off as seen here:
 +
 +
[[File:Ig bad data.png|1000px|center]]
 +
 +
We also concatenated the "tags" into a single column.
 +
</font>
 +
</div>
 +
 +
 +
<div style=" width: 96.5%; background: #E6EDFA; padding: 12px; font-family: Arimo; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #8c8d94 solid 32px;"><font color="#5A6B96">Exploratory Data Analysis</font></div>
 +
 +
<div style="width:96.5%;">
 +
<font>
 +
The dataset contains Instagram posts collected from August 1, 2016 to February 1, 2018. From the chart below, we can observe a steady decline in the number of posts over time.
 +
 +
[[File:Ig posts over time.png|805x507px|center]]
 +
 +
However, despite the decline in the number of posts, the average number of “Likes” has increased steadily over time. This is possibly due to an emphasis on the quality of posts over quantity.
 +
 +
[[File:Ig_average_likes_over_time.png|800x515px|center]]
  
<br/><b>Blog</b><br/>
+
Analysing the number of posts by day of the week shows that most content is posted on Thursdays. There are considerably fewer posts on weekends, so the company could consider scheduling posts then to garner more likes and reach more people. The average number of likes is spread fairly evenly across the days, which suggests that posts could be spread more evenly across the week as well. However, the average appears to be noticeably higher on Thursdays.
To retrieve data from the company's posts, we used [https://scrapy.org/ Scrapy], a fast and powerful open-source web-scraping framework, to extract data from the blog. We collected data starting from the first blog post, with the following information:
 
* Timestamp
 
* Author(s)
 
* Headline
 
* Category
 
* URL
 
* Tags
 
  
</font></div>
+
[[File:Ig_posts_vs_likes_weekday.png|800x465px|center]]
<br/>
 
  
<div align="center">
+
Further analysis shows that a single outlier post with over 80,000 likes has skewed the average, so we will remove this data point from further analysis. Upon further investigation, we realised that this figure is not the number of “likes” for an image, but rather the number of “views” for a video. As Instagram posts can also be videos, the data contains videos as well, and these cannot be distinguished from images using the current dataset. We will therefore have to modify the crawler script to identify videos and add a metric for “views”.
<div style=" width: 85%; background: #E6EDFA; padding: 12px; font-family: Arimo; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #8c8d94 solid 32px;"><font color="#5A6B96">Data Preparation</font></div>
 
<div style="width:90%;">
 
<font style="text-align: left">
 
<p>
 
To give us an overview of the data throughout the year, we consolidated the various tabs, concatenating the different periods of data for the same columns, into one combined file. This was carried out using the software JMP Pro (from SAS), in the following steps:
 
* With Post ID, Permalink (permanent link of the campaign content), Post Message, Type, Countries and Posted columns as key identifiers among the different tabs for the excel files, we appended desired columns from the other tabs to the end of the Key Metrics. They included the Share, Like, Comment columns from Tab 2; Other Clicks, Link Clicks, Photo View, Video Play columns from Tab 3; Hide_Clicks , Hide_all_clicks, Unlike_page_clicks, report_spam_clicks columns from Tab 4. <br>This was conducted using the <i>Tables > Join </i>function, with “Matching Specification” as the key identifiers and “Output Columns” of the appended desired columns.  
 
  
* Next, for each period of data files (appended with new columns) from multiple tabs, we concatenate the data across different time periods to have a full year collection of data.<br>This was conducted using the <i>Tables > Concatenate </i> function, while adding multiple data tables into “Data Tables to be Concatenated”.  
+
[[File:Ig_likes_distribution.png|678x800px|center]]
  
* Finally, we check for missing data in the different columns. For example, the column Type has five different types, namely: Link, Photo, Shared Video, Status and Video. In instances of missing data, we cross-check with the permalink of the campaign post to see which type of medium was posted, and fill it in accordingly. 
+
Similarly, we observe the number of posts by hour of the day. Surprisingly, there are very few posts between 12pm and 12am compared with the other half of the day. With this, we would like to study whether the time of posting affects the total number of likes and comments received.
</p>
 
</font></div>
 
<br/>
 
  
<div align="center">
+
[[File:Ig_post_over_hour.png|800x530px|center]]
<div style=" width: 85%; background: #E6EDFA; padding: 12px; font-family: Arimo; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #8c8d94 solid 32px;"><font color="#5A6B96">Data Cleaning</font></div>
 
<div style="width:90%;">
 
  
<div align="left">
+
Looking at the average number of likes by hour of the day, there is not much difference. The slightly lower average number of likes from 12pm to 12am could be due to insufficient data points.
<b>Instagram Data</b><br/>
 
After scraping the data, we realised that the data needed cleaning. The indexes of the column values were off as seen here:
 
(image)
 
We also concatenated the "tags" into a single column.
 
</div></div>
 
  
 +
[[File:Ig_average_likes_hourly.png|800x516px|center]]
  
<div style=" width: 85%; background: #E6EDFA; padding: 12px; font-family: Arimo; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #8c8d94 solid 32px;"><font color="#5A6B96">Exploratory Data Analysis</font></div>
+
Plotting a histogram of the number of comments shows a right-skewed distribution. Most posts garner around 0-2 comments, and posts rarely have more than 7 comments. Using this, we created an additional column to categorize the data:
<br/>
+
0-2: Low Comments, 3-6: Medium Comments, 7 and Above: High Comments
  
 +
[[File:Ig_comments_distribution.png|763x600px|center]]
  
<div style=" width: 85%; background: #E6EDFA; padding: 12px; font-family: Arimo; font-size: 18px; font-weight: bold; line-height: 1em; text-indent: 15px; border-left: #8c8d94 solid 32px;"><font color="#5A6B96">Final Application: Learning Dashboard</font></div>
+
In addition, we would like to investigate whether the number of tags in a post affects its performance. From the text column of hashtags, we created a numeric variable containing the number of tags each post has. The chart below shows the distribution of the number of tags, with most posts having 0-3 tags.
<br/>
+
 +
[[File:Ig_tags_distribution.png|800x514px|center]]
  
 +
</font>
 
</div>
 
</div>

Latest revision as of 14:25, 10 April 2018



Data Source

To retrieve data from the company's Instagram account, we used a web-scraping script from GitHub. We modified the script to also capture the timestamp and caption. The data includes:

  • Caption
  • Timestamp
  • Img URL
  • Tags
  • No. of Likes
  • No. of Comments

Data Preparation

After scraping the data, we realised that the data needed cleaning. The indexes of the column values were off as seen here:

Ig bad data.png

We also concatenated the "tags" into a single column.
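
A minimal sketch of the tag concatenation step, assuming each scraped post is a dictionary whose "tags" field holds a list of hashtag strings. The field names, the separator and the sample records are hypothetical, not taken from the crawler's actual output:

```python
def concatenate_tags(post, separator=" "):
    """Return a copy of the post with its list of tags joined into one string."""
    cleaned = dict(post)
    tags = post.get("tags") or []  # treat a missing or None field as "no tags"
    # Strip stray whitespace and skip empty entries before joining
    cleaned["tags"] = separator.join(tag.strip() for tag in tags if tag.strip())
    return cleaned


# Hypothetical scraped records, for illustration only
posts = [
    {"caption": "New launch", "tags": ["#launch", " #new "]},
    {"caption": "No tags here", "tags": []},
]
cleaned_posts = [concatenate_tags(post) for post in posts]
print(cleaned_posts[0]["tags"])
```

Keeping the tags in one text column makes it straightforward to derive further variables from them later, such as a tag count per post.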


Exploratory Data Analysis

The dataset contains Instagram posts collected from August 1, 2016 to February 1, 2018. From the chart below, we can observe a steady decline in the number of posts over time.

Ig posts over time.png

However, despite the decline in the number of posts, the average number of “Likes” has increased steadily over time. This is possibly due to an emphasis on the quality of posts over quantity.

Ig average likes over time.png

Analysing the number of posts by day of the week shows that most content is posted on Thursdays. There are considerably fewer posts on weekends, so the company could consider scheduling posts then to garner more likes and reach more people. The average number of likes is spread fairly evenly across the days, which suggests that posts could be spread more evenly across the week as well. However, the average appears to be noticeably higher on Thursdays.

Ig posts vs likes weekday.png
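
The per-weekday comparison can be sketched as follows. This is an illustrative reconstruction rather than the tool actually used for the charts; the field names, the timestamp format and the sample records are assumptions:

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_summary(posts):
    """Return {weekday: (post_count, average_likes)} for the given posts."""
    likes_by_day = defaultdict(list)
    for post in posts:
        # Assumed timestamp format; adjust to whatever the scraper emits
        posted_at = datetime.strptime(post["timestamp"], "%Y-%m-%d %H:%M:%S")
        likes_by_day[WEEKDAYS[posted_at.weekday()]].append(post["likes"])
    return {day: (len(likes), sum(likes) / len(likes))
            for day, likes in likes_by_day.items()}


# Hypothetical records, for illustration only
posts = [
    {"timestamp": "2018-01-04 09:30:00", "likes": 120},  # a Thursday
    {"timestamp": "2018-01-11 10:00:00", "likes": 80},   # a Thursday
    {"timestamp": "2018-01-06 08:15:00", "likes": 100},  # a Saturday
]
summary = weekday_summary(posts)
print(summary)
```

Reporting the post count next to the average makes it easier to tell when a day's average rests on only a handful of posts.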

Further analysis shows that a single outlier post with over 80,000 likes has skewed the average, so we will remove this data point from further analysis. Upon further investigation, we realised that this figure is not the number of “likes” for an image, but rather the number of “views” for a video. As Instagram posts can also be videos, the data contains videos as well, and these cannot be distinguished from images using the current dataset. We will therefore have to modify the crawler script to identify videos and add a metric for “views”.

Ig likes distribution.png
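
To illustrate why a single extreme value distorts the average, the sketch below compares the mean and median before and after excluding it. The like counts and the cut-off are hypothetical values chosen for illustration, not figures from the dataset:

```python
from statistics import mean, median

# Hypothetical like counts; the last value mimics a video "views" figure
likes = [90, 110, 100, 120, 80000]

# Hypothetical cut-off used only to separate the extreme value in this example
THRESHOLD = 10000
filtered = [count for count in likes if count <= THRESHOLD]

print("with outlier:   ", mean(likes), median(likes))
print("without outlier:", mean(filtered), median(filtered))
```

The median barely moves when the extreme value is dropped, while the mean changes by orders of magnitude, which is why the average alone can mislead here.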

Similarly, we observe the number of posts by hour of the day. Surprisingly, there are very few posts between 12pm and 12am compared with the other half of the day. With this, we would like to study whether the time of posting affects the total number of likes and comments received.

Ig post over hour.png

Looking at the average number of likes by hour of the day, there is not much difference. The slightly lower average number of likes from 12pm to 12am could be due to insufficient data points.

Ig average likes hourly.png

Plotting a histogram of the number of comments shows a right-skewed distribution. Most posts garner around 0-2 comments, and posts rarely have more than 7 comments. Using this, we created an additional column to categorize the data: 0-2: Low Comments, 3-6: Medium Comments, 7 and above: High Comments.

Ig comments distribution.png
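
The comment categories can be expressed as a small function. The bin edges and labels follow the text above; the function name and the sample counts are hypothetical:

```python
def comment_category(n_comments):
    """Map a comment count to the Low / Medium / High label."""
    if n_comments < 0:
        raise ValueError("comment count cannot be negative")
    if n_comments <= 2:
        return "Low Comments"
    if n_comments <= 6:
        return "Medium Comments"
    return "High Comments"


# Hypothetical comment counts, for illustration only
comment_counts = [0, 2, 3, 6, 7, 15]
categories = [comment_category(n) for n in comment_counts]
print(categories)
```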

In addition, we would like to investigate whether the number of tags in a post affects its performance. From the text column of hashtags, we created a numeric variable containing the number of tags each post has. The chart below shows the distribution of the number of tags, with most posts having 0-3 tags.

Ig tags distribution.png
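
A minimal sketch of deriving the tag-count variable, assuming the concatenated tags column separates hashtags with whitespace. The separator and the sample values are assumptions, not details confirmed by the dataset:

```python
from collections import Counter


def count_tags(tags_text):
    """Count the hashtags in a whitespace-separated tags string."""
    # str.split() with no argument ignores repeated and leading/trailing spaces
    return len(tags_text.split())


# Hypothetical values of the concatenated tags column
tags_column = ["#launch #new", "", "#a #b #c", "#solo"]
tag_counts = [count_tags(text) for text in tags_column]

# Frequency of each tag count, i.e. the data behind a distribution chart
distribution = Counter(tag_counts)
print(tag_counts, dict(distribution))
```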