ANLY482 AY2016-17 T2 Group 2 Project Overview Methodology



Tools Used

Programming Language

  • Python

Programming Libraries

  • Python Natural Language Toolkit (NLTK)
  • Scikit-learn
  • TensorFlow
  • pandas

IDE

  • Jupyter Notebook

Methodology

Data Collection

We have been provided with the job descriptions and requirements for January 2016 from Jobsbank.gov.sg, which we will use for our initial analysis. Subsequently, we will build a program to scrape more data from Jobsbank.gov.sg. Apart from collecting job posting data, we will also use the program to scrape the web for words related to job skills. These words will serve as a whitelist to help us construct the labels required in the subsequent steps.
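As an illustration, such a scraper could be built with the requests and BeautifulSoup libraries. This is only a minimal sketch: the CSS selector is a hypothetical placeholder, since the actual page structure of Jobsbank.gov.sg would have to be inspected first.

  # Minimal scraping sketch; the selector below is a hypothetical placeholder.
  import requests
  from bs4 import BeautifulSoup

  def scrape_postings(page_url):
      response = requests.get(page_url, timeout=30)
      response.raise_for_status()
      soup = BeautifulSoup(response.text, "html.parser")
      # Assumes each posting's text sits in a <div class="job-description">
      return [div.get_text(" ", strip=True)
              for div in soup.select("div.job-description")]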

Data Preparation

The job posting text contains punctuation, HTML tags, and other artifacts that are irrelevant to our text analysis. We will therefore clean and tokenize the text, removing punctuation and markup and breaking it down into meaningful words and phrases.
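A minimal tokenization sketch using NLTK (assuming the punkt tokenizer data has been downloaded); the exact cleaning rules will evolve as we inspect the data:

  import re
  import string
  from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

  def tokenize(raw_text):
      text = re.sub(r"<[^>]+>", " ", raw_text)  # strip HTML tags
      tokens = word_tokenize(text.lower())      # split into word tokens
      return [t for t in tokens if t not in string.punctuation]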

Exploratory Data Analysis (EDA)

After data preparation, we will perform exploratory data analysis on the tokens. We will use pandas, an open-source Python data analysis library, to help us visualize the words. Among other things, we will look at word frequencies and at words that can be used to describe skills. This EDA will be useful in the next step, when we have to select good features for skills.
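For example, a quick word-frequency view can be produced with pandas (the cut-off of 30 words is arbitrary, and matplotlib is assumed to be installed for plotting):

  import pandas as pd
  import matplotlib.pyplot as plt

  def plot_word_frequencies(tokens_per_posting, top_n=30):
      # Flatten the per-posting token lists and count occurrences
      all_tokens = [t for tokens in tokens_per_posting for t in tokens]
      counts = pd.Series(all_tokens).value_counts()
      counts.head(top_n).plot(kind="bar")
      plt.show()
      return counts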

Generating Labels to Train the Classifier

With the tokens generated previously, we will attempt to detect an initial set of skills in the job postings. To do this, we select features using the whitelist obtained in the 'Data Collection' step. The features will help label each token in the job description as true (a skill) or false (not a skill).
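A sketch of this labelling step, assuming the whitelist is held as a Python set:

  # skill_whitelist: set of skill words gathered during Data Collection
  def label_tokens(tokens, skill_whitelist):
      # Returns (token, label) pairs; True marks a token found in the whitelist
      return [(token, token in skill_whitelist) for token in tokens]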

Model Training

We will utilize rule-based methods, Support Vector Machines, and Neural Networks, and explore the use of other models, training them on our labelled data.

Rule Based
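A rule-based detector might combine the whitelist lookup with simple contextual patterns; the trigger phrases below are illustrative only:

  # Flag a token as a skill if it is whitelisted or follows a trigger phrase
  TRIGGERS = {("knowledge", "of"), ("proficient", "in"), ("experience", "in")}

  def rule_based_skills(tokens, skill_whitelist):
      skills = set()
      for i, token in enumerate(tokens):
          if token in skill_whitelist:
              skills.add(token)
          elif i >= 2 and (tokens[i - 2], tokens[i - 1]) in TRIGGERS:
              skills.add(token)
      return skills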

Support Vector Machines
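One possible setup with scikit-learn represents each token by a few surface features via DictVectorizer; the feature set here is illustrative, not final:

  from sklearn.feature_extraction import DictVectorizer
  from sklearn.pipeline import make_pipeline
  from sklearn.svm import LinearSVC

  def token_features(token):
      return {"word": token, "suffix": token[-3:], "length": len(token)}

  def train_svm(tokens, labels):
      # labels: True/False per token, from the labelling step above
      features = [token_features(t) for t in tokens]
      model = make_pipeline(DictVectorizer(), LinearSVC())
      model.fit(features, labels)
      return model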

Neural Networks
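A small feed-forward network in TensorFlow could consume the same vectorized features; this sketch uses the tf.keras API, and the layer sizes are illustrative:

  import tensorflow as tf

  def build_network(num_features):
      model = tf.keras.Sequential([
          tf.keras.layers.Dense(64, activation="relu",
                                input_shape=(num_features,)),
          tf.keras.layers.Dense(1, activation="sigmoid"),  # P(token is a skill)
      ])
      model.compile(optimizer="adam", loss="binary_crossentropy",
                    metrics=["accuracy"])
      return model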

Model Validation

We will validate each model by attempting to detect the skills in the text of the test data and evaluating how effective the model is. We will repeat from step 4 (Generating Labels) to improve the features and annotations until we obtain a good enough model.
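For instance, validation on held-out test tokens could use scikit-learn's metrics; precision and recall are likely more informative than raw accuracy here, since skill tokens are probably a small fraction of all tokens:

  from sklearn.metrics import classification_report

  def validate(model, test_features, true_labels):
      # Compare predictions on unseen test data against the gold labels
      predictions = model.predict(test_features)
      print(classification_report(true_labels, predictions,
                                  target_names=["not skill", "skill"]))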