ANLY482 AY2016-17 T2 Group 2 Project Overview Methodology
Methodology
Data Collection |
We have been provided with job descriptions and requirements for January 2016 from Jobsbank.gov.sg, which we will use for our initial analysis. Subsequently, we will build a program to scrape more data from Jobsbank.gov.sg. Apart from collecting job posting data, we will also use the program to scrape the web for words related to job skills. These words will serve as a whitelist to help us construct the labels required in the subsequent steps.
Data Preparation |
The job posting text contains commas, HTML tags, and other elements that are irrelevant to our text analysis. We will therefore clean the text by removing punctuation and other irrelevant text, then tokenize it into meaningful words and phrases.
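The cleaning and tokenizing step could be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `tokenize` and the two regular expressions are assumptions made for this example, not the project's actual code, and a real pipeline would likely use a proper HTML parser and also handle multi-word phrases.

```python
import html
import re

# Crude matcher for HTML tags such as <p> or <br/> (an assumption for this sketch;
# a real HTML parser is more robust on malformed markup).
TAG_RE = re.compile(r"<[^>]+>")

# A token starts with a letter or digit and may contain '+', '#' or '.',
# so that skill names like "c++", "c#" or "node.js" survive tokenization.
TOKEN_RE = re.compile(r"[a-z0-9][a-z0-9+#.]*")


def tokenize(raw_text):
    """Strip HTML and punctuation from a job posting and return lowercase tokens."""
    text = html.unescape(raw_text)   # turn entities such as &amp; into characters
    text = TAG_RE.sub(" ", text)     # replace tags with spaces
    tokens = TOKEN_RE.findall(text.lower())
    # Drop sentence-final full stops that were captured with the token.
    return [token.rstrip(".") for token in tokens]
```

For example, `tokenize("<p>Proficient in Python, SQL &amp; C++.</p>")` yields the tokens `proficient`, `in`, `python`, `sql` and `c++`.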
Exploratory Data Analysis (EDA) |
After data preparation, we will perform exploratory data analysis on the words. We will use pandas, an open-source Python data analysis library, to help us visualize the words. The analysis will look at word frequencies and at words that can be used to describe skills. This EDA will be useful in the next step, when we have to select good features for skills.
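The word-frequency part of the EDA can be illustrated with a short sketch. To keep the example self-contained it uses `collections.Counter` from the standard library rather than pandas; the function name `word_frequencies` and the sample postings are hypothetical. The resulting counters behave like dictionaries, so they could be passed to `pandas.Series` for sorting and plotting.

```python
from collections import Counter


def word_frequencies(tokenized_postings):
    """Count how often each token occurs, and in how many postings it appears."""
    term_counts = Counter()   # total occurrences across all postings
    doc_counts = Counter()    # number of postings containing the token
    for tokens in tokenized_postings:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))  # count each token once per posting
    return term_counts, doc_counts


# Hypothetical tokenized postings, for illustration only.
postings = [
    ["python", "sql", "python", "communication"],
    ["sql", "excel"],
]
term_counts, doc_counts = word_frequencies(postings)
```

Comparing the two counts helps separate words that are frequent because they are repeated within a few postings from words that appear across many postings.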
Generating Labels to Train the Classifier |
With the tokens generated previously, we will attempt to detect the initial set of skills in the job postings. To do this, we will select features using the whitelist obtained in the ‘Data Collection’ step. These features will be used to label each token in the job description as true or false.
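As a rough sketch of the labelling idea, each token can be marked true if it appears in the whitelist and false otherwise. The function name `label_tokens` and the sample whitelist are assumptions for illustration; the project's actual features may be richer than a plain membership test.

```python
def label_tokens(tokens, whitelist):
    """Pair each token with True if it is a whitelisted skill word, else False."""
    return [(token, token in whitelist) for token in tokens]


# Hypothetical whitelist of skill words, for illustration only.
whitelist = {"python", "sql"}
labels = label_tokens(["experience", "in", "python", "and", "sql"], whitelist)
```

Storing the whitelist as a set keeps each membership check fast even when the whitelist grows large.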
Training the Classifier |
We will use Bayes and Bayes Linear models, explore other models such as Neural Networks, and train each of them on our labelled data.
Bayes
Bayes Linear
Neural Networks
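To make the training step concrete, here is a minimal sketch of a Bayes-style token classifier. It assumes that "Bayes" refers to a naive Bayes classifier, which the text does not confirm, and the features (the token itself plus its neighbours), the class name `NaiveBayes` and the training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def token_features(tokens, i):
    """Hypothetical features for the token at position i: the word and its neighbours."""
    return {
        "word": tokens[i],
        "prev": tokens[i - 1] if i > 0 else "<start>",
        "next": tokens[i + 1] if i + 1 < len(tokens) else "<end>",
    }


class NaiveBayes:
    """Categorical naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (label, feature name) -> value counts
        self.values = defaultdict(set)              # feature name -> values seen in training

    def fit(self, feature_dicts, labels):
        for feats, label in zip(feature_dicts, labels):
            self.class_counts[label] += 1
            for name, value in feats.items():
                self.feature_counts[(label, name)][value] += 1
                self.values[name].add(value)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for name, value in feats.items():
                seen = self.feature_counts[(label, name)][value]
                vocab = len(self.values[name]) + 1  # one extra slot for unseen values
                score += math.log((seen + self.alpha) / (count + self.alpha * vocab))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled sentences: True marks a skill token.
training = [
    (["experience", "in", "python"], [False, False, True]),
    (["knowledge", "of", "sql"], [False, False, True]),
    (["experience", "in", "sql"], [False, False, True]),
]
X, y = [], []
for tokens, token_labels in training:
    for i, label in enumerate(token_labels):
        X.append(token_features(tokens, i))
        y.append(label)
model = NaiveBayes().fit(X, y)
```

Because the classifier also looks at neighbouring tokens, it can flag a skill word in a context it has not seen before, for instance `model.predict(token_features(["knowledge", "of", "python"], 2))`.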
Model Validation |
For each model, we will validate it by detecting the skills in the text of the test data and checking how effective the model is. We will then repeat from step 4 (Generating Labels to Train the Classifier) to improve the features and annotations until we get a good enough model.
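One way to quantify how well a model detects skills is to compare its predictions with the test labels using precision, recall and F1. The text does not name specific metrics, so these are an assumption for this sketch, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall and F1 for the positive (skill) class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for truth, pred in pairs if truth and pred)        # skills correctly found
    fp = sum(1 for truth, pred in pairs if not truth and pred)    # non-skills flagged as skills
    fn = sum(1 for truth, pred in pairs if truth and not pred)    # skills that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Tracking these values across iterations gives a simple stopping criterion for the feature and annotation loop.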
Cluster Analysis |