Difference between revisions of "ANLY482 AY2017-18T2 Group30 LDA-Blog Post"
Latest revision as of 13:42, 10 April 2018
Data Source
To retrieve data from the company's blog, we used Scrapy, a fast and powerful open-source web scraping framework. We collected posts starting from the very first blog post, capturing the following information for each:
- Timestamp
- Author(s)
- Headline
- Category
- URL
- Tags
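The fields listed above could be held in a simple record type. The following is a minimal sketch rather than the project's actual scraper code: the class name `BlogPost`, the field names and types, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class BlogPost:
    """Hypothetical container for one scraped blog post."""
    timestamp: datetime
    authors: List[str]
    headline: str
    category: str
    url: str
    tags: List[str] = field(default_factory=list)


# Made-up sample record, for illustration only.
sample_post = BlogPost(
    timestamp=datetime(2018, 1, 15, 9, 30),
    authors=["Jane Doe"],
    headline="An example headline",
    category="News",
    url="https://example.com/blog/an-example-headline",
    tags=["example", "demo"],
)
```

Keeping authors and tags as lists reflects that a post may have more than one of each.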
To analyse data from multiple sources (Facebook, YouTube, Instagram and the blog), we first need to join them together. One of the methods we will use is Latent Dirichlet Allocation (LDA), a statistical model used in natural language processing that lets us derive “topics” from a set of documents.
Reference: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html
We will be using the following Python packages: pandas, nltk, stop_words and gensim.
Data Preparation
Data Cleaning
Tokenization
Tokenizing allows us to split sentences into words. The following regular expression matches runs of one or more word characters (letters, digits and underscores), so punctuation and whitespace are dropped.
    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
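To see what this pattern does on its own, the same regular expression can be tried with Python's built-in `re` module. This is only an illustrative sketch; the sample sentence and variable names are made up.

```python
import re

# Same pattern as above: one or more consecutive word characters.
WORD_PATTERN = re.compile(r"\w+")

sentence = "Topic models aren't magic, but they help!"
tokens = WORD_PATTERN.findall(sentence.lower())
print(tokens)
# ['topic', 'models', 'aren', 't', 'magic', 'but', 'they', 'help']
```

Note that the apostrophe is not a word character, so contractions such as “aren't” are split into two tokens.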
Stop Words
Certain words, such as “for”, “as” and “the”, carry little meaning for generating topics and are therefore removed from the token list. We will be using the Python package stop_words, which contains a predefined list of stop words.
    from stop_words import get_stop_words
    en_stop = get_stop_words('en')
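The filtering step itself is a simple membership check. The sketch below uses a small hand-picked set of stop words as a stand-in; this set and the sample tokens are assumptions for illustration, and the list returned by the stop_words package is much longer.

```python
# Hypothetical, hand-picked subset of stop words for illustration;
# the stop_words package supplies a much longer list.
example_stop_words = {"for", "as", "the", "a", "of"}

tokens = ["the", "recipe", "for", "a", "bowl", "of", "noodles"]
kept_tokens = [t for t in tokens if t not in example_stop_words]
print(kept_tokens)
# ['recipe', 'bowl', 'noodles']
```

Using a set rather than a list keeps each membership check fast when there are many tokens.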
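Stemming
Another NLP technique, stemming, allows us to reduce related word forms such as “eating”, “eats” and “eat” to the common stem “eat”. This matters because it stops variants of the same word from being counted as separate terms in the model. Note that a stemmer only strips suffixes, so irregular forms such as “ate” are not mapped to “eat”. There are several algorithms, but we will use the most commonly implemented one: Porter’s Stemming Algorithm.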
    from nltk.stem.porter import PorterStemmer
    p_stemmer = PorterStemmer()
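To illustrate the idea of suffix stripping, here is a deliberately naive stemmer. It is not Porter's algorithm, which applies a much more careful sequence of rules; the function name `naive_stem` and its suffix list are assumptions made purely for this sketch.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


variants = ["eating", "eats", "eat", "ate"]
stems = [naive_stem(w) for w in variants]
print(stems)
# ['eat', 'eat', 'eat', 'ate']
```

The regular forms collapse to one stem, while the irregular form “ate” is left unchanged because no suffix rule applies to it.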
Putting it all together, we will have a clean list of text tokens for analysis:
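    # docs is assumed to be a list of raw blog post strings
    texts = []  # one list of cleaned tokens per document
    for doc in docs:
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)
        # drop stop words, then reduce each remaining token to its stem
        stopped_tokens = [t for t in tokens if t not in en_stop]
        stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]
        texts.append(stemmed_tokens)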
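Topic models such as LDA usually take their input as a bag-of-words corpus, in which each document is a list of (token id, count) pairs. In gensim this conversion is normally handled by `corpora.Dictionary` and its `doc2bow` method. The sketch below reproduces the idea with the standard library only; the sample token lists are made up, and the ids it assigns will not necessarily match those gensim would assign.

```python
from collections import Counter

# Made-up token lists standing in for the cleaned `texts` above.
example_texts = [
    ["noodl", "recip", "noodl"],
    ["recip", "video"],
]

# Assign each distinct token an integer id, in order of first appearance.
token_to_id = {}
for doc_tokens in example_texts:
    for token in doc_tokens:
        token_to_id.setdefault(token, len(token_to_id))

# Represent each document as sorted (token_id, count) pairs.
bow_corpus = [
    sorted(Counter(token_to_id[t] for t in doc_tokens).items())
    for doc_tokens in example_texts
]
print(bow_corpus)
# [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; only how often each token occurs in each document is kept.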
Exploratory Data Analysis
Refer to the report for the analysis.
LDA Model
Refer to the report for the model evaluation.