APA Project Management

From Analytics Practicum
Revision as of 20:35, 23 April 2017 by Prekshaapu.2013 (added classification info)


Defining Target Labels
The Work Network data contains a record for each employee relationship in the company, giving the strength of that work-based relationship. To add business value, the following rules are applied to bin the target variable:

Survey Values | Target Labels | Count | Percentage
0, 1 | No Relation | 490 | 58.33%
1.5, 2, 2.5 | Weak Relation | 140 | 16.67%
3, 3.5, 4 | Moderate Relation | 137 | 16.31%
4.5, 5 | Strong Relation | 73 | 8.69%

Many employees treated 1 as the minimum value when filling in the survey (instead of leaving the item blank to indicate no response). Therefore, both 0 and 1 are treated as No Relation. If a classification model can predict these four categories well, email data can be considered a strong representation of the work network. The table above shows the distribution of the target label. The distribution is heavily skewed, with 58.33% of the data instances belonging to the ‘No Relation’ category. Because of this skew, a validation column is created to split the data into training and test sets using stratified sampling. This ensures that the distribution of the target label remains the same in both the training and test sets.
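The binning rule and the stratified split described above can be sketched in plain Python. This is only an illustration, not the JMP workflow used in the project: the function names, the 30% test fraction, the seed and the toy scores are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def four_class_label(score):
    """Map a survey score (0 to 5 in 0.5 steps) to one of the four target labels."""
    if score <= 1:
        return "No Relation"      # 0 and 1 are both treated as no relation
    if score <= 2.5:
        return "Weak Relation"
    if score <= 4:
        return "Moderate Relation"
    return "Strong Relation"

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), splitting every label in the same proportion.

    test_fraction and seed are assumed values; the source does not state them.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    train_idx, test_idx = [], []
    for idx in by_label.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy scores (hypothetical, not the survey data).
scores = [0] * 10 + [1] * 10 + [2] * 10 + [3.5] * 10 + [5] * 10
labels = [four_class_label(s) for s in scores]
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Because the split is made within each label, the label proportions in the training and test sets match those of the full data up to rounding.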

Prediction Screening
Predscreen.png
Using SAS JMP’s built-in predictor screening, it is observed that some features are stronger predictors of the target variable than the other calculated features. This observation matters when choosing the penalty type under model fitting for classification algorithms such as Neural Network.

Naive Bayes Algorithm
Naibay.png
With misclassification rates of 37.3% and 34.9% on the training and validation data respectively, the algorithm appears moderately representative. However, the confusion matrix shows that the Weak and Moderate Relation classes are predicted very poorly: most data points in these categories are predicted as No Relation. Many Strong Relation points are also predicted as Moderate Relation. Therefore, this algorithm is unsuitable for the dataset.
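An overall misclassification rate can look acceptable while individual classes are predicted badly, which is why the confusion matrix is read alongside it. The sketch below shows both calculations. The matrix is made up for illustration and does not reproduce the project's results; the function names are assumptions.

```python
def misclassification_rate(matrix):
    """matrix[i][j] = number of rows whose actual class is i and predicted class is j."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

def per_class_recall(matrix):
    """Share of each actual class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Hypothetical counts, in the order: No, Weak, Moderate, Strong Relation.
toy = [
    [90, 2, 3, 5],
    [25, 3, 1, 1],
    [20, 2, 5, 3],
    [5, 1, 4, 10],
]
print(misclassification_rate(toy))
print(per_class_recall(toy))
```

In this toy matrix the majority class is recovered well while the two middle classes are mostly assigned to it, which is the kind of pattern described above.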

K-Nearest Neighbors
Knear.png
Like Naïve Bayes, KNN produces a model of moderate strength with K=9. However, most Weak and Moderate Relation points are still predicted as No Relation. Therefore, this algorithm is also unsuitable for the dataset.
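As a reminder of how the K=9 vote works, here is a minimal K-nearest-neighbours sketch using Euclidean distance. It is not JMP's implementation. The two-dimensional points are invented, and the example only illustrates one possible way a majority class can outvote a small cluster; it is not a diagnosis of the model above.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=9):
    """Predict the most common label among the k training points closest to query."""
    nearest = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data: 12 majority-class points on a line, 3 minority points in a tight cluster.
train_X = [(0.1 * i, 0.0) for i in range(12)] + [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
train_y = ["No Relation"] * 12 + ["Weak Relation"] * 3

query = (1.0, 1.0)  # sits inside the minority cluster
print(knn_predict(train_X, train_y, query, k=3))
print(knn_predict(train_X, train_y, query, k=9))
```

With k=3 only the minority cluster votes. With k=9 six majority-class points join the vote and outnumber the three minority points.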

Neural Networks
For the Hidden Layer Settings, the algorithm is run with 3 different settings, changing the first hidden layer to 3 TanH, 3 Linear and 3 Gaussian nodes respectively. The Penalty Method ‘Absolute’ is chosen under the Fitting Options section. This is considered a good option when a few features are stronger predictors of the target than the others. Running the 3 hidden layer settings gives the following results:
TanH
Neuralnet.png
Linear
Nlinearneural.png
Gaussian
Gaussianneural.png
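The three hidden-layer settings and the absolute penalty described above can be sketched as follows. This is a simplified illustration rather than JMP's fitting code: the Gaussian form exp(-x^2), the penalty weight `lam`, and all weights and inputs are assumptions chosen for the example.

```python
import math

def tanh_act(x):
    return math.tanh(x)

def linear_act(x):
    return x

def gaussian_act(x):
    # Assumed form of the Gaussian activation: exp(-x^2).
    return math.exp(-x * x)

def hidden_layer(inputs, weights, biases, activation):
    """Each hidden node applies `activation` to its bias plus a weighted sum of the inputs."""
    return [
        activation(b + sum(w * x for w, x in zip(node_weights, inputs)))
        for node_weights, b in zip(weights, biases)
    ]

def absolute_penalty(weights, lam):
    """Absolute (L1-style) penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for node_weights in weights for w in node_weights)

# Hypothetical inputs and weights for a layer with 3 nodes.
inputs = [1.0, 2.0]
weights = [[0.5, -0.25], [0.0, 0.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]

for name, act in [("TanH", tanh_act), ("Linear", linear_act), ("Gaussian", gaussian_act)]:
    print(name, hidden_layer(inputs, weights, biases, act))
print("penalty", absolute_penalty(weights, lam=0.1))
```

An absolute penalty charges every weight in proportion to its size, so it tends to shrink the weights of weak predictors towards zero while leaving a few strong ones in place.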
Neural Network (TanH) gives the best model, with misclassification rates of 32.9% and 32.4% for the training and validation data respectively. Even though the misclassification rate improves slightly compared to the other algorithms, Weak and Moderate Relations are still predicted wrongly for most data points. Therefore, this is still not a suitable model. Since none of the models produced satisfactory results, it can be concluded that email data is not a strong representation of the Work Network when it is divided into multiple relationship-strength segments. Email data may still be a strong representation for binary bins (strong and weak) in the work network. To test this, new target variables are defined.

Defining New Target Variables
The following rules are applied to bin the target variable:

Survey Values | Target Labels | Count | Percentage
0, 1, 1.5, 2, 2.5 | Weak Relation | 630 | 75.00%
3, 3.5, 4, 4.5, 5 | Strong Relation | 210 | 25.00%
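The binary rule in the table above amounts to a single threshold at 3. A minimal sketch, with the function name being an assumption, that also re-derives the percentages from the counts in the table:

```python
def binary_label(score):
    """Scores of 3 and above are significant (Strong); anything below 3 is Weak."""
    return "Strong Relation" if score >= 3 else "Weak Relation"

# Counts taken from the table above.
counts = {"Weak Relation": 630, "Strong Relation": 210}
total = sum(counts.values())
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}
print(shares)
```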

The new bins also carry business value at a high level: they differentiate between significant (>= 3) and insignificant (< 3) work relationships at the company. However, unlike the previous target response, they cannot differentiate between employees with very strong and moderately strong relationships. According to the table above, 75% of the data is categorized as Weak Relation. This is within expectations, as most employees share weak or negligible relationships with each other and form relatively few significant work relationships. Stratified sampling is used again to divide the dataset into training and test data. As the Neural Network (TanH) was the most successful algorithm so far, it is used as the final model for predictions on the target label: