ANLY482 AY2017-18T2 Group30 LDA-Blog Post

From Analytics Practicum
Revision as of 11:44, 26 February 2018 by Eric.yeo.2014 (talk | contribs)


Data Source

To retrieve data from the company's blog posts, we used Scrapy, a fast and powerful open-source web scraper. We collected data starting from the first blog post, capturing the following fields:

  • Timestamp
  • Author(s)
  • Headline
  • Category
  • URL
  • Tags

In order to analyse data from multiple sources (Facebook, YouTube, Instagram, Blog), we first need to join them together. One of the methods we will use is Latent Dirichlet Allocation (LDA), a statistical model from natural language processing that allows us to discover "topics" from a set of documents.

Reference: https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

We will be using the following Python packages: pandas, nltk, stop_words and gensim.



Data Preparation

Data Cleaning

Tokenization

Tokenizing allows us to split sentences into words. The following regex matches runs of one or more word characters, so punctuation and whitespace are discarded:

   from nltk.tokenize import RegexpTokenizer
   tokenizer = RegexpTokenizer(r'\w+')
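As a quick sanity check, a minimal sketch of what this tokenizer produces on a sample sentence (the sentence is our own, not taken from the blog data):

```python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
# r'\w+' keeps runs of word characters and drops punctuation/whitespace
tokens = tokenizer.tokenize("Topic modelling, explained simply!")
print(tokens)  # → ['Topic', 'modelling', 'explained', 'simply']
```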

Stop Words

Certain words such as "for", "as" and "the" are meaningless for generating topics, and as such will be removed from the token list. We will use the Python package stop_words, which contains a predefined list of stop words.

   from stop_words import get_stop_words
   en_stop = get_stop_words('en')

Stemming

Another NLP technique, stemming, reduces inflected forms of a word, such as "eating" and "eats", to the stem "eat". This prevents variants of the same word from being counted as separate terms and diluting their weight in the model. Several algorithms exist, but we will use the most commonly implemented one: Porter's stemming algorithm.

   from nltk.stem.porter import PorterStemmer
   p_stemmer = PorterStemmer()
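A minimal sketch of the stemmer's behaviour on a few inflected forms:

```python
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
# regular inflections are stripped back to a common stem
stems = [p_stemmer.stem(w) for w in ['eating', 'eats', 'running']]
print(stems)  # → ['eat', 'eat', 'run']
```

Note that an irregular form such as "ate" is not reduced by stemming; mapping it to "eat" would require lemmatization instead.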


Putting it all together, we will have a clean list of text tokens for analysis:

   texts = []
   for i in docs:
       raw = i.lower()                   # lowercase the raw text
       tokens = tokenizer.tokenize(raw)  # split into word tokens
       stopped_tokens = [i for i in tokens if i not in en_stop]      # remove stop words
       stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]  # stem each token
       texts.append(stemmed_tokens)



Exploratory Data Analysis


LDA Model