ANLY482 AY2016-17 T2 Group 2 Project Overview Methodology

From Analytics Practicum
Revision as of 15:06, 29 December 2016 by Jonathanlow.2013



Tools Used

Programming Language

  • Python

Programming Libraries

  • Python Natural Language Toolkit (NLTK)
  • Scikit-learn
  • TensorFlow
  • pandas

IDE

  • Jupyter Notebook

Methodology

Data Collection

We have been provided with the job descriptions and requirements posted on Jobsbank.gov.sg in January 2016, which we will use for our initial analysis. Subsequently, we will build a program to scrape more data from Jobsbank.gov.sg. Apart from collecting job posting data, we will also use the program to scrape the web for words that relate to job skills. These words will serve as a whitelist to help us construct the labels required in the subsequent steps.

Data Preparation

The job posting text contains punctuation, HTML tags and other elements that are irrelevant to our text analysis. We will therefore clean the text by removing the punctuation and tags, and then tokenize it into meaningful words and phrases.
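The cleaning and tokenizing step could look roughly like the sketch below. It uses only the Python standard library so that it runs without extra downloads; the function name `tokenize` and the two regular expressions are illustrative assumptions, not the project's actual code, and NLTK's tokenizers could be substituted for the final `findall` call.

```python
import html
import re

# Assumption: anything between "<" and ">" is markup we want to discard.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a token is a run of letters/digits, optionally followed by
# "+" or "#" so that skill names such as "c++" and "c#" survive.
TOKEN_RE = re.compile(r"[a-z0-9]+[+#]*")


def tokenize(raw_text):
    """Strip HTML tags and punctuation, then split into lowercase tokens."""
    text = TAG_RE.sub(" ", raw_text)   # remove tags first
    text = html.unescape(text)         # then decode entities such as &amp;
    return TOKEN_RE.findall(text.lower())


if __name__ == "__main__":
    sample = "<p>Strong Python, SQL &amp; C++ skills.</p>"
    print(tokenize(sample))
```

Tags are removed before entities are decoded so that escaped text such as `&lt;b&gt;` is not mistaken for markup.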

Exploratory Data Analysis (EDA)

After data preparation, we will perform exploratory data analysis on the words. We will use pandas, an open source Python data analysis library, to help us visualize the words. The analysis will include word frequencies and the words that can be used to describe skills. This EDA will be useful in the next step, when we have to select good features for skills.
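As a rough illustration of the frequency analysis, the sketch below counts tokens across a few made-up postings. It uses `collections.Counter` from the standard library to stay self-contained; with pandas, `pd.Series(tokens).value_counts()` would give an equivalent table. The sample postings and the function name `word_frequencies` are invented for the example.

```python
from collections import Counter


def word_frequencies(postings):
    """Count how often each token occurs across all tokenized postings."""
    counts = Counter()
    for tokens in postings:
        counts.update(tokens)
    return counts


# Hypothetical tokenized postings, not real Jobsbank data.
postings = [
    ["python", "sql", "communication"],
    ["python", "excel"],
    ["sql", "python"],
]

freq = word_frequencies(postings)
print(freq.most_common(2))  # [('python', 3), ('sql', 2)]
```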

Generating Labels to Train the Classifier

With the tokens generated previously, we will attempt to detect an initial set of skills in the job postings. To do this, we select features using the whitelist obtained in the ‘Data Collection’ step. These features will be used to label each token in the job description as true or false.
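One simple way to read this step is a membership test against the whitelist, as sketched below. The whitelist contents and the function name `label_tokens` are placeholders for illustration; the real whitelist would come from the scraping described under Data Collection.

```python
def label_tokens(tokens, whitelist):
    """Pair each token with True if it is a whitelisted skill word, else False."""
    return [(token, token in whitelist) for token in tokens]


# Hypothetical whitelist; a set gives constant-time lookups.
skill_whitelist = {"python", "sql", "excel"}

labelled = label_tokens(["strong", "python", "and", "sql", "skills"], skill_whitelist)
print(labelled)
```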

Training the Classifier

We will use Bayes and Bayes Linear, explore other models such as Neural Networks, and train each model on our labelled data.
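To make the training step concrete, the sketch below fits a small naive Bayes classifier on labelled tokens. It assumes that "Bayes" here means a naive Bayes classifier, which the text does not confirm. The class `NaiveBayes`, the helper `token_features` and the training tokens are all hypothetical, and the classifier is written from scratch only to keep the example self-contained; in practice scikit-learn's `MultinomialNB` or NLTK's `NaiveBayesClassifier` would be the more likely choice.

```python
import math
from collections import Counter, defaultdict


def token_features(token):
    """Hypothetical features for one token: the word itself and its suffix."""
    return {"word": token, "suffix": token[-3:]}


class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples, labels):
        self.label_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values_seen = defaultdict(set)       # feature -> distinct values
        for feats, label in zip(samples, labels):
            for name, value in feats.items():
                self.value_counts[(label, name)][value] += 1
                self.values_seen[name].add(value)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            for name, value in feats.items():
                count = self.value_counts[(label, name)][value]
                vocab = len(self.values_seen[name]) + 1  # one slot for unseen values
                score += math.log((count + 1) / (n + vocab))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled tokens: True marks a skill word.
training = [("python", True), ("java", True), ("sql", True),
            ("the", False), ("and", False), ("team", False)]
model = NaiveBayes().fit([token_features(t) for t, _ in training],
                         [label for _, label in training])
print(model.predict(token_features("python")))
```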

Bayes

Bayes Linear

Neural Networks

Model Validation

For each model, we will validate it by attempting to detect the skills in the text of the test data and measuring how well the model performs. We will then repeat from the ‘Generating Labels to Train the Classifier’ step to improve the features and annotation until we get a good enough model.
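The text does not say which measure will be used to judge a model. One common option for this kind of true/false token labelling is precision, recall and F1, sketched below as an assumption. The function name `precision_recall_f1` and the sample predictions are illustrative only.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall and F1 for boolean skill/non-skill labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # skills correctly detected
    fp = sum(1 for p, a in pairs if p and not a)    # non-skills flagged as skills
    fn = sum(1 for p, a in pairs if not p and a)    # skills that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical model output compared with hand-checked labels on test tokens.
predicted = [True, True, False, False]
actual = [True, False, True, False]
print(precision_recall_f1(predicted, actual))  # (0.5, 0.5, 0.5)
```

Tracking these numbers after each round of feature or annotation changes gives a concrete stopping rule for the repeat loop described above.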

Cluster Analysis