ANLY482 AY2017-18T2 Group30 LDA-Blog Post

Latest revision as of 13:42, 10 April 2018





Data Source

To retrieve data from the company's blog posts, we used Scrapy, a fast and powerful open-source web scraper. We collected data going back to the first blog post, capturing the following fields:

  • Timestamp
  • Author(s)
  • Headline
  • Category
  • URL
  • Tags

To analyse data from multiple sources (Facebook, YouTube, Instagram, Blog), we first need to join them together. One of the methods we will use is Latent Dirichlet Allocation (LDA), a statistical natural language processing model that infers "topics" from a set of documents.

Reference: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

We will be using the following Python packages: pandas, nltk, stop_words and gensim.




Data Preparation

Data Cleaning

Tokenization

Tokenizing splits sentences into words. The regex below matches runs of one or more word characters, so punctuation and whitespace are discarded at token boundaries.

   from nltk.tokenize import RegexpTokenizer
   tokenizer = RegexpTokenizer(r'\w+')
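As a quick illustrative check (the sample sentence is our own, not from the project data), the tokenizer behaves like this:

```python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
# Punctuation is dropped; only runs of word characters survive.
print(tokenizer.tokenize("Topic models, like LDA, find themes."))
# → ['Topic', 'models', 'like', 'LDA', 'find', 'themes']
```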

Stop Words

Certain words such as "for", "as", and "the" carry no meaning for topic generation, so they are removed from the token list. We will use the Python package stop_words, which provides a predefined list of stop words.

   from stop_words import get_stop_words
   en_stop = get_stop_words('en')
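To illustrate the filtering step, here is a tiny hand-rolled stop list standing in for the much longer list that get_stop_words('en') returns:

```python
# Stand-in for get_stop_words('en'); the real list is much longer.
en_stop = ["for", "as", "the", "a", "of"]

tokens = ["tips", "for", "writing", "the", "perfect", "blog", "post"]
stopped_tokens = [t for t in tokens if t not in en_stop]
print(stopped_tokens)  # → ['tips', 'writing', 'perfect', 'blog', 'post']
```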

Stemming

Another NLP technique, stemming, reduces inflected forms such as "eating" and "eats" to the stem "eat". This prevents the model from treating variants of the same word as distinct terms. Several stemming algorithms exist; we will use the most commonly implemented one, Porter's stemming algorithm.

   from nltk.stem.porter import PorterStemmer
   p_stemmer = PorterStemmer()
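For example, the Porter stemmer collapses regular inflections, though an irregular form like "ate" is left unchanged (handling that would require lemmatization rather than stemming):

```python
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
print(p_stemmer.stem("eating"))  # → eat
print(p_stemmer.stem("eats"))    # → eat
print(p_stemmer.stem("ate"))     # irregular past tense is not reduced
```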


Putting it all together, we will have a clean list of text tokens for analysis:

   texts = []  # will hold one cleaned token list per document
   for i in docs:  # docs: the list of raw blog-post strings
       raw = i.lower()  # normalise to lower case
       tokens = tokenizer.tokenize(raw)  # split into word tokens
       stopped_tokens = [i for i in tokens if i not in en_stop]  # drop stop words
       stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]  # stem each token
       texts.append(stemmed_tokens)




Exploratory Data Analysis

Refer to the Report for the analysis.


LDA Model

Refer to the Report for the model evaluation.