ANLY482 AY2016-17 T2 Group 2 Project Overview Methodology



Tools Used

Programming Language

  • Python

Programming Libraries

  • Python Natural Language Toolkit (NLTK)
  • Scikit-learn
  • TensorFlow
  • pandas

IDE

  • Jupyter Notebook

Methodology

Data Collection

We are provided with the job descriptions and requirements for January 2016 from Jobsbank.gov.sg, which we will use for our initial analysis. Subsequently, we will build a program to scrape more data from Jobsbank.gov.sg.
Apart from collecting job posting data, we will also use the program to scrape the web for words that relate to job skills. These words will form a whitelist to help us construct the labels required in subsequent steps.
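As an illustration, a minimal scraping sketch using the requests and BeautifulSoup libraries is shown below. The search URL and CSS selectors are hypothetical placeholders, since the actual page structure of Jobsbank.gov.sg is not described here.

  # Minimal scraping sketch; "div.job-posting" and the field selectors are
  # placeholders, not the real Jobsbank.gov.sg page structure.
  import requests
  from bs4 import BeautifulSoup

  def scrape_postings(search_url):
      response = requests.get(search_url)
      soup = BeautifulSoup(response.text, "html.parser")
      postings = []
      for card in soup.select("div.job-posting"):      # one block per posting
          postings.append({
              "title": card.select_one("h2").get_text(strip=True),
              "description": card.select_one("div.description").get_text(strip=True),
          })
      return postings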

Data Preparation

The job posting text contains punctuation, HTML tags and other elements that are irrelevant to our text analysis. We will therefore tokenize the text by removing punctuation and HTML code and breaking it down into meaningful words. In addition, we will normalize each word to its base form.
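A minimal sketch of this cleaning pipeline using NLTK is shown below, assuming the punkt and wordnet corpora have already been downloaded.

  import re
  import string
  from nltk.tokenize import word_tokenize
  from nltk.stem import WordNetLemmatizer

  lemmatizer = WordNetLemmatizer()

  def prepare(text):
      text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML tags
      tokens = word_tokenize(text.lower())                  # break into word tokens
      tokens = [t for t in tokens if t not in string.punctuation]
      return [lemmatizer.lemmatize(t) for t in tokens]      # normalize to base form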

Exploratory Data Analysis (EDA)

After data preparation, we will perform exploratory data analysis on the words, using pandas, an open-source Python data analysis library, to assist with visualization.
Among other things, we will look at word frequencies and words that can be used to describe skills. We will use this information to expand the whitelist obtained earlier. This EDA will help us identify good features for skills in the next step.
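For example, word frequencies can be computed with pandas as sketched below; the tokens and whitelist here are toy values for illustration only.

  import pandas as pd

  # toy tokens; in practice these come from the prepared job postings
  tokens = ["python", "sql", "communication", "python", "excel", "sql", "python"]
  whitelist = {"python", "sql", "excel"}

  freq = pd.Series(tokens).value_counts()
  print(freq.head(20))                          # most frequent words overall
  print(freq[freq.index.isin(whitelist)])       # counts for whitelisted skill words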

Generating Labels to Train the Classifier

With the tokens generated previously and the insights from EDA, we will attempt to detect an initial set of skills in the job postings. To do this, we select features using the whitelist obtained from the data collection and EDA processes. These features help us label each token in the job description as true (a skill) or false (not a skill).
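A simple illustration of this labelling step, assuming the whitelist is a plain set of skill words, might look as follows.

  def label_tokens(tokens, whitelist):
      # mark each token True if it appears in the skill whitelist
      return [(token, token in whitelist) for token in tokens]

  labelled = label_tokens(["python", "team", "sql", "meeting"], {"python", "sql"})
  # [('python', True), ('team', False), ('sql', True), ('meeting', False)]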

Training the Classifier

We will train our models on the labelled data using Rule-based classification, Support Vector Machines and Neural Networks, and explore the use of other models as well.

Rule Based
The Rule-based classifier uses a set of IF-THEN rules for classification. Using the whitelist that we obtained, we will derive rule antecedents and their corresponding rule consequents. When the condition holds true for a given tuple, the antecedent is satisfied and the consequent is applied.
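An illustrative sketch of such IF-THEN rules in Python is shown below; the whitelist and context rule are hypothetical examples, not the project's actual rule set.

  SKILL_WHITELIST = {"python", "sql"}                            # example whitelist

  # each rule is an (antecedent, consequent) pair; the antecedent is a
  # predicate over a token and its surrounding context
  rules = [
      (lambda tok, ctx: tok in SKILL_WHITELIST, True),           # whitelist hit -> skill
      (lambda tok, ctx: ctx.get("prev") in {"in", "with"}, True),  # e.g. "experience in X"
  ]

  def rule_classify(token, context):
      for antecedent, consequent in rules:
          if antecedent(token, context):                         # antecedent satisfied
              return consequent                                  # apply rule consequent
      return False                                               # default: not a skill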

Support Vector Machines
The Support Vector Machine (SVM) is a supervised learning model with associated learning algorithms that analyse data and recognize patterns for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given our training dataset, an SVM training algorithm will build a model that assigns new examples to one category or the other based on our whitelist-derived features.
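A hedged sketch of training a linear SVM with scikit-learn is shown below; the bag-of-words features and toy labelled examples are purely illustrative.

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.pipeline import make_pipeline
  from sklearn.svm import LinearSVC

  # toy labelled examples: token context -> is the token a skill?
  contexts = ["proficient in python", "attend weekly meeting", "strong sql knowledge"]
  labels = [True, False, True]

  model = make_pipeline(CountVectorizer(), LinearSVC())
  model.fit(contexts, labels)
  print(model.predict(["experience with python"]))               # expect [ True]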

Neural Networks
A Neural Network is a supervised learning model made up of interconnected components called artificial neurons (ANs). Each AN acts like a logistic regression unit that accepts a number of inputs and produces an output; together, the network predicts the class an item belongs to based on our whitelist-derived features.
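A minimal sketch using the Keras API bundled with recent TensorFlow releases is shown below; the feature vectors, layer sizes and training settings are illustrative assumptions, not tuned values from the project.

  import numpy as np
  import tensorflow as tf

  X = np.random.rand(100, 20)                 # 100 tokens, 20 features each (toy data)
  y = np.random.randint(0, 2, size=100)       # 1 = skill, 0 = not a skill

  model = tf.keras.Sequential([
      tf.keras.layers.Dense(16, activation="relu", input_shape=(20,)),
      tf.keras.layers.Dense(1, activation="sigmoid"),           # binary skill output
  ])
  model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
  model.fit(X, y, epochs=5, verbose=0)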

Model Validation

We will validate each model by attempting to detect the skills in the text of the held-out test data and evaluating how well the model performs. We will then repeat from step 4 (generating labels) to improve the features and annotations until we obtain a sufficiently good model.
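A sketch of this validation loop using scikit-learn is shown below, assuming labelled token contexts as in the earlier examples; the data and split are illustrative only.

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.metrics import classification_report
  from sklearn.model_selection import train_test_split
  from sklearn.pipeline import make_pipeline
  from sklearn.svm import LinearSVC

  contexts = ["proficient in python", "attend weekly meeting",
              "strong sql knowledge", "prepare monthly report"]   # toy data
  labels = [True, False, True, False]

  X_train, X_test, y_train, y_test = train_test_split(
      contexts, labels, test_size=0.5, random_state=0, stratify=labels)
  model = make_pipeline(CountVectorizer(), LinearSVC()).fit(X_train, y_train)
  print(classification_report(y_test, model.predict(X_test)))    # precision/recall per class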