APA Project Management
Defining Target Labels
The Work Network dataset contains a record for each employee relationship in the company, with a survey value that describes the strength of the work-based relationship. To add business value, the survey values are binned into target labels using the following rules:
Survey Values | Target Labels | Count | Percentage |
0, 1 | No Relation | 490 | 58.33% |
1.5, 2, 2.5 | Weak Relation | 140 | 16.67% |
3, 3.5, 4 | Moderate Relation | 137 | 16.31% |
4.5, 5 | Strong Relation | 73 | 8.69% |
While filling in the survey, many employees treated 1 as the minimum value (instead of leaving the response blank). Therefore, both 0 and 1 are treated as No Relation. If a classification model can predict these four categories well, email data can be considered a strong representation of the work network.
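The binning rules above can be sketched as a small Python function. This is only an illustration of the rules in the table, not the JMP formula used in the project; the function name `bin_relation` is hypothetical, and the handling of a score of 0.5 (which the table does not list) is an assumption.

```python
def bin_relation(score: float) -> str:
    """Map a survey score between 0 and 5 to one of the four target labels."""
    if score < 0 or score > 5:
        raise ValueError(f"survey score out of range: {score}")
    if score <= 1:
        # 0 and 1 are both treated as "no relation"; a score of 0.5 is not
        # listed in the table and is assumed to fall in this bin as well.
        return "No Relation"
    if score <= 2.5:
        return "Weak Relation"
    if score <= 4:
        return "Moderate Relation"
    return "Strong Relation"


# Example: label a handful of survey scores.
labels = [bin_relation(s) for s in (0, 1, 2.5, 3, 4.5)]
```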
The table above shows the distribution of the target labels. The distribution is highly skewed, with 58.33% of the data instances belonging to the ‘No Relation’ category. Because of this skew, a validation column is created to split the data into training and test sets using stratified sampling. This ensures that the distribution of the target labels remains the same in both the training and test data.
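The idea behind a stratified split can be sketched in plain Python as follows. This is a minimal illustration, not JMP's validation-column implementation; the function name, the 25% test fraction, and the toy label counts are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that each label keeps roughly the same share."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction from every label group.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, made-up label counts (not the project data) with a skewed distribution.
toy_labels = ["No"] * 60 + ["Weak"] * 20 + ["Moderate"] * 12 + ["Strong"] * 8
train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
```

Because the sampling is done within each label group, the test set keeps the 60/20/12/8 proportions of the toy data instead of depending on the luck of a single random draw.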
Predictor Screening
Using SAS JMP’s built-in Predictor Screening, it is observed that some features are much stronger predictors of the target variable than the other calculated features. This observation matters when choosing the penalty method during model fitting for classification algorithms such as Neural Networks.
Naive Bayes Algorithm
With misclassification rates of 37.3% and 34.9% on the training and validation data respectively, the algorithm seems moderately representative. However, the confusion matrix shows that the weak and moderate relations are predicted very poorly: most of the data points in these categories have been predicted as No Relation. Many Strong Relations have also been predicted as Moderate Relations. Therefore, this algorithm is unsuitable for the dataset.
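To see why a moderate overall misclassification rate can hide poor performance on the minority classes, it helps to compute per-class recall from a confusion matrix. The sketch below uses invented counts chosen only to show the pattern; they are not the JMP output for this project, and the function names are hypothetical.

```python
def misclassification_rate(matrix):
    """Share of all instances that fall off the diagonal (rows = actual, columns = predicted)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


def per_class_recall(matrix):
    """For each actual class, the share of its instances that were predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Made-up illustrative counts, NOT the project's results.
# Class order: No Relation, Weak, Moderate, Strong.
example = [
    [55, 2, 2, 1],  # actual No Relation
    [14, 3, 2, 1],  # actual Weak Relation
    [8, 1, 2, 1],   # actual Moderate Relation
    [2, 0, 3, 3],   # actual Strong Relation
]

overall_error = misclassification_rate(example)
recall = per_class_recall(example)
```

In this toy matrix the overall error is moderate, yet recall for the weak and moderate classes is very low, because the dominant No Relation class absorbs most of their instances.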
K-Nearest Neighbors
Like Naive Bayes, KNN produces a model of moderate strength with K=9. However, most of the weak and moderate relations are still predicted as No Relation. Therefore, this algorithm is also unsuitable for the dataset.
Neural Networks
For the Hidden Layer Settings, the algorithm is run with 3 different settings, changing the first hidden layer to 3 TanH, 3 Linear and 3 Gaussian nodes respectively. The Penalty Method ‘Absolute’ is chosen under the Fitting Options section. This is considered a good option when a few features are much stronger predictors of the target than the other features. Running the 3 different hidden layer settings gives the following results:
TanH
Linear
Gaussian
Neural Network (TanH) gives the best model, with misclassification rates of 32.9% and 32.4% for training and validation data respectively. Even though this is a slight improvement in the misclassification rate compared to the other algorithms, weak and moderate relations are still predicted incorrectly for most data points. Therefore, this is still not a suitable model.
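For reference, the three hidden-node activation types and the absolute penalty mentioned above can be written out in a few lines of Python. These are the standard textbook forms; JMP's internal implementation may differ in detail, and the penalty weight `lam` is a hypothetical parameter introduced only for this sketch.

```python
import math


def tanh_activation(x: float) -> float:
    """Hyperbolic tangent: squashes any input into the range (-1, 1)."""
    return math.tanh(x)


def linear_activation(x: float) -> float:
    """Identity: passes the input through unchanged."""
    return x


def gaussian_activation(x: float) -> float:
    """Gaussian bump: 1 at x = 0, decaying towards 0 as |x| grows."""
    return math.exp(-x * x)


def absolute_penalty(weights, lam: float) -> float:
    """Absolute (L1-style) penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


# Example: penalty for a small, made-up weight vector.
penalty = absolute_penalty([1.0, -2.0, 0.5], lam=0.1)
```

An absolute penalty tends to push the weights of weak predictors towards zero, which is why it suits a setting where a few features carry most of the predictive power.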
Since none of the models has produced satisfactory results, it can be concluded that email data is not a strong representation of the Work Network when it is divided into multiple relationship strength segments.
Email data may still be a strong representation of the work network when it is divided into binary bins (strong and weak). To test this, a new target variable is defined.
Defining new Target Variables
The following rules are applied to bin the target variable:
Survey Values | Target Labels | Count | Percentage |
0,1,1.5,2,2.5 | Weak Relation | 630 | 75.00% |
3,3.5,4,4.5,5 | Strong Relation | 210 | 25.00% |
The new bins also represent business value at a high level: they differentiate between significant (>= 3) and insignificant (< 3) work relationships at the company. However, unlike the previous target labels, they cannot distinguish very strong relationships from moderately strong ones.
According to the table above, 75% of the data is categorized as Weak Relation. This value is within expectations, as most employees share weak or negligible relationships with each other and form relatively few significant work relationships.
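The binary rule and the class shares from the table can be sketched as follows. The function name is hypothetical, and the majority-class baseline at the end is not part of the original analysis; it is added here only as a point of reference for judging misclassification rates on a 75/25 split.

```python
def to_binary_label(score: float) -> str:
    """Survey scores of 3 or more count as strong, everything below as weak."""
    return "Strong Relation" if score >= 3 else "Weak Relation"


# Counts taken from the table above.
counts = {"Weak Relation": 630, "Strong Relation": 210}
total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A model that always predicts the majority class would be wrong
# for every instance of the minority class.
majority_baseline_error = 1 - max(shares.values())
```

With these counts, always predicting Weak Relation would misclassify 25% of the instances, so a useful model needs to do clearly better than that.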
Stratified sampling is used again to divide the total dataset into training and test data. As Neural Network (TanH) was the most successful algorithm so far, it is used as the final model for predictions on the new target label:
The result is a strong model, with misclassification rates of 15.7% and 12.8% on the training and validation data respectively. The following is a sample of the row prediction data:
Therefore, email data is a strong representation of the Work Network when it is divided into only two relationship strength segments.
Social Network Analysis Comparison - Actual Vs Predicted
Gephi is a social network analysis tool that can be used to easily calculate betweenness centrality and eigenvector centrality. As explained in the data exploration, these centralities are good metrics for observing the local and global influence of all employees at the firm. Therefore, the predicted relationship strength should produce betweenness and eigenvector centrality scores for each employee that are similar to those calculated from the actual work relationship strength.
The relationship data, along with the actual and predicted work strength categories, is loaded into Gephi to create social network graphs as shown in the figure below.
The graph can be filtered with the filter query tool to include only ties that are ‘strong’ according to either the actual or the predicted data. Many ties are removed when this filter is applied, as shown in the figure below.
As described in the data exploration, betweenness and eigenvector centralities are good measures for differentiating employees by their control over information flow and their overall influence in the network. The tool can be used directly to calculate undirected eigenvector and betweenness centrality scores as shown below.
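For readers without Gephi, the two centralities can be approximated in plain Python. The sketch below filters a made-up tie list down to strong ties, then computes unnormalized betweenness (Brandes' algorithm) and eigenvector centrality (power iteration, scaled so the largest score is 1) on the resulting undirected graph. The employee names and function names are hypothetical, and Gephi's own scaling and iteration settings may differ.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalized)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair is visited from both endpoints, so halve the totals.
    return {v: c / 2 for v, c in score.items()}


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration on (A + I), which avoids oscillation on bipartite graphs."""
    if not adj:
        return {}
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        nxt = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        top = max(nxt.values())
        nxt = {v: val / top for v, val in nxt.items()}
        converged = max(abs(nxt[v] - x[v]) for v in adj) < tol
        x = nxt
        if converged:
            break
    return x


# Made-up ties (not project data): keep only those labelled strong.
ties = [("A", "B", "Strong Relation"), ("B", "C", "Strong Relation"), ("C", "D", "Weak Relation")]
strong_edges = [(u, v) for u, v, label in ties if label == "Strong Relation"]
graph = build_adjacency(strong_edges)
betweenness = betweenness_centrality(graph)
eigenvector = eigenvector_centrality(graph)
```

In this toy graph, B sits between A and C, so it receives the only non-zero betweenness score and the highest eigenvector score.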
The centralities for the actual and predicted work networks should be similar for email data to be a good representation of the work relationships at the company. The screenshots below show the correlation scores for both centralities.
With good correlation scores of 0.823 and 0.7185, it is further confirmed that email data is a good representation of the Work Network when it is divided into only two relationship strength segments.
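A correlation score of this kind can be computed from two lists of per-employee centrality scores. The report does not state which correlation coefficient was used, so the sketch below assumes Pearson's r; the function name and the sample scores are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (spread_x * spread_y)


# Made-up centrality scores for four employees (not project data).
actual_scores = [0.10, 0.40, 0.35, 0.80]
predicted_scores = [0.15, 0.30, 0.45, 0.70]
r = pearson(actual_scores, predicted_scores)
```

A value close to 1 means that employees who are central in the actual network are also central in the predicted one.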