ANLY482 AY2016-17 T2 Group 2 Project Overview Methodology
Tools Used

Programming Language

  • Python

Programming Libraries

  • Python Natural Language Toolkit (NLTK)
  • Scikit-learn
  • TensorFlow
  • pandas

IDE

  • Jupyter Notebook

Methodology

Data Collection

We are provided with the job descriptions and requirements for January 2016 from Jobsbank.gov.sg, which we will use for our initial analysis. Subsequently, we will build a program to scrape more data from Jobsbank.gov.sg, as sketched below.
Apart from collecting job posting data, we will also use the program to scrape the web for words that relate to job skills. These words will be used as a whitelist to help us construct the labels required in subsequent steps.
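The sketch below shows one way such a scraper could look. The requests and BeautifulSoup libraries are not in our stated toolset, and the URL and CSS selector are hypothetical; the real selectors depend on Jobsbank.gov.sg's actual page structure.

```python
# A minimal scraping sketch. The URL pattern and the "job-description"
# CSS class below are hypothetical placeholders, not the real site layout.
import requests
from bs4 import BeautifulSoup

def scrape_posting(url):
    """Fetch one job posting page and extract its description text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Inspect the actual page to find the correct container element.
    description = soup.find("div", class_="job-description")
    return description.get_text(separator=" ") if description else ""

if __name__ == "__main__":
    text = scrape_posting("https://www.jobsbank.gov.sg/posting/example-id")
    print(text[:200])
```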

Data Preparation

The job posting text contains commas, HTML tags, and other elements that are irrelevant to our text analysis. We will therefore strip the punctuation and HTML markup and tokenize the text, breaking it down into meaningful words. In addition, we will normalize each word to its base form.
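A minimal sketch of this preparation step, using NLTK (one of our listed libraries) for tokenization and WordNet lemmatization as the normalization technique; the exact cleaning rules shown are illustrative.

```python
# Strip HTML tags and punctuation, tokenize, and lemmatize each token
# to its base form using NLTK.
import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

def prepare(raw_text):
    text = re.sub(r"<[^>]+>", " ", raw_text)   # drop HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # drop punctuation and digits
    tokens = nltk.word_tokenize(text.lower())
    return [lemmatizer.lemmatize(tok) for tok in tokens]

print(prepare("<p>Experienced in Python, SQL; strong analytical skills.</p>"))
# ['experienced', 'in', 'python', 'sql', 'strong', 'analytical', 'skill']
```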

Exploratory Data Analysis (EDA)

After data preparation, we will perform exploratory data analysis on the words. We will use pandas, an open-source Python data analysis library, to assist us in visualising them.
Some of the analyses we will look at are word frequencies and words that can be used to describe skills. We will use this information to expand the whitelist we previously obtained. This EDA will be useful in the next step, when we identify good features for skills.
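A minimal sketch of the frequency analysis with pandas; the token list and whitelist here are placeholder values standing in for the output of the preparation step.

```python
# Count token frequencies across all postings and inspect the most
# common terms; `all_tokens` stands in for the tokenized postings.
import pandas as pd

all_tokens = ["python", "sql", "communication", "python", "excel", "python"]

freq = pd.Series(all_tokens).value_counts()
print(freq.head(10))   # most frequent words overall

# Terms already on the whitelist can be inspected separately to help
# decide which related terms to add.
whitelist = {"python", "sql", "excel"}
print(freq[freq.index.isin(whitelist)])
```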

Generating Labels to Train the Classifier

With the tokens generated previously and the EDA conducted, we will attempt to detect the initial set of skills in the job postings. To do this, we select features using the whitelist obtained from the data collection and EDA processes. These features will help us label each token in the job description as true (a skill) or false (not a skill).
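A minimal sketch of this labelling step, assuming a simple whitelist-membership check; the whitelist contents are illustrative.

```python
# Mark each token True if it appears in the skill whitelist, else False.
# This illustrative whitelist stands in for the one built earlier.
whitelist = {"python", "sql", "tableau", "excel"}

def label_tokens(tokens):
    """Return (token, is_skill) pairs for one job description."""
    return [(tok, tok in whitelist) for tok in tokens]

print(label_tokens(["proficient", "in", "python", "and", "sql"]))
# [('proficient', False), ('in', False), ('python', True),
#  ('and', False), ('sql', True)]
```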

Training the Classifier

We will look into utilizing algorithms including rule-based classification, Support Vector Machines, and Neural Networks to train our models on the labelled data, and we will explore the use of other models as well.

Rule Based
The rule-based classifier makes use of a set of IF-THEN rules for classification. Using the whitelist we obtained, we will come up with rule antecedents and their corresponding rule consequents. When the antecedent condition holds true for a given tuple, the rule fires and its consequent assigns the label.
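A minimal sketch of such a rule-based classifier; both rules shown are illustrative stand-ins for the rule set we would derive from the whitelist.

```python
# Each rule pairs an antecedent (a predicate over a token and its
# context) with a consequent (the skill/non-skill label).
RULES = [
    # IF the token is on the whitelist THEN it is a skill.
    (lambda tok, prev: tok in {"python", "sql", "excel"}, True),
    # IF the previous word is "in" or "with" THEN label it a skill
    # (a hypothetical context heuristic, for illustration only).
    (lambda tok, prev: prev in {"in", "with"}, True),
]

def classify(tok, prev=None):
    for antecedent, consequent in RULES:
        if antecedent(tok, prev):   # antecedent satisfied: rule fires
            return consequent
    return False                    # default: not a skill

print(classify("python"))               # True  (whitelist rule)
print(classify("teamwork", prev="in"))  # True  (context rule)
print(classify("motivated"))            # False (no rule fires)
```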

Support Vector Machines
The Support Vector Machine (SVM) is a supervised learning model with associated learning algorithms that analyse data and recognise patterns for classification and regression analysis. A basic SVM takes a set of input data and predicts, for each given input, which of two possible classes it belongs to, making it a non-probabilistic binary linear classifier. Given our training dataset, in which each example is marked as belonging to one of the two categories, an SVM training algorithm will build a model that assigns new examples to one category or the other based on our whitelist-derived labels.
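A minimal training sketch with scikit-learn's LinearSVC; the character n-gram feature encoding is an assumption for illustration, not our final feature design.

```python
# Train a linear SVM on whitelist-labelled tokens. Character n-grams
# let the model generalise to unseen but similar-looking tokens.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

tokens = ["python", "sql", "motivated", "excel", "teamwork", "tableau"]
labels = [True, True, False, True, False, True]  # from the labelling step

vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(tokens)

clf = LinearSVC()
clf.fit(X, labels)

# Predict whether unseen tokens are skills.
print(clf.predict(vectorizer.transform(["mysql", "hardworking"])))
```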

Neural Networks
A neural network is a supervised learning model. It consists of several interconnected components called artificial neurons (ANs). Each AN is a logistic regression unit that accepts a number of inputs and outputs the class it believes the item belongs to, based on our whitelist-derived labels.
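A minimal sketch using TensorFlow's Keras API with one hidden layer; the architecture and the placeholder features and labels are assumptions for illustration.

```python
# A small feed-forward network for binary (skill / not skill) output.
# X and y are random placeholders standing in for the vectorised tokens
# and whitelist-derived labels from the previous steps.
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 50).astype("float32")  # placeholder features
y = np.random.randint(0, 2, size=(100,))       # placeholder 0/1 labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(50,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # skill probability
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)

print(model.predict(X[:3]))  # probability that each token is a skill
```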

Model Validation

For each model, we will validate it by attempting to detect the skills in the text of held-out test data and measuring how well the model performs. We will repeat from the label-generation step, improving the features and annotations, until we obtain a sufficiently accurate model.
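A minimal sketch of this validation loop with scikit-learn, using synthetic placeholder data in place of our real features and labels.

```python
# Hold out part of the labelled data, train on the rest, and measure
# how well the model recovers the skill labels on the test split.
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LinearSVC().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
# If precision/recall are poor, return to the label-generation step,
# refine the features and whitelist, and retrain.
```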