APA Project Management
Defining Target Labels
The Work Network dataset contains one record for each employee relationship in the company, with a survey value that captures the strength of the work-based relationship. To add business value, the target variable is binned using the following rules:
Survey Values | Target Labels | Count | Percentage |
0, 1 | No Relation | 490 | 58.33% |
1.5, 2, 2.5 | Weak Relation | 140 | 16.67% |
3, 3.5, 4 | Moderate Relation | 137 | 16.31% |
4.5, 5 | Strong Relation | 73 | 8.69% |
While filling out the survey, many employees treated 1 as the minimum value (instead of leaving the response blank). Therefore, both 0 and 1 are treated as no relation. With four bins defined, a classification model that predicts these categories well would indicate that email data is a strong representation of the work network.
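The binning rules above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the project (which was carried out in SAS JMP); the function names `bin_survey_value` and `label_distribution` are hypothetical, and the sketch assumes survey values lie between 0 and 5.

```python
from collections import Counter


def bin_survey_value(value: float) -> str:
    """Map a survey value (assumed range 0 to 5) to a target label."""
    if not 0 <= value <= 5:
        raise ValueError(f"survey value out of range: {value}")
    if value <= 1:        # 0 and 1 are both treated as no relation
        return "No Relation"
    if value <= 2.5:      # 1.5, 2, 2.5
        return "Weak Relation"
    if value <= 4:        # 3, 3.5, 4
        return "Moderate Relation"
    return "Strong Relation"  # 4.5, 5


def label_distribution(values):
    """Return {label: (count, percentage)} for a list of survey values."""
    counts = Counter(bin_survey_value(v) for v in values)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 2)) for label, n in counts.items()}
```

Calling `label_distribution` on the full column of survey values would produce the count and percentage of each target label, which is the kind of summary reported in the table.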
The table above shows the distribution of the target label. The distribution is heavily skewed, with 58.33% of the data instances belonging to the ‘No Relation’ category. Because of this skew, a validation column is created to split the data into training and test sets using stratified sampling. This ensures that the distribution of the target label remains the same in both the training and test sets.
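The idea behind the stratified split can be illustrated with a minimal Python sketch. It is not the JMP validation column itself: the function name `stratified_split`, the 20% test fraction, the seed, and the toy label counts are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split row indices into train/test so each label keeps its share.

    test_fraction and seed are illustrative defaults, not project settings.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                      # random draw within each label
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy, deliberately skewed labels (made-up counts, not the project data).
labels = (["No Relation"] * 60 + ["Weak Relation"] * 20
          + ["Moderate Relation"] * 10 + ["Strong Relation"] * 10)
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
train_counts = Counter(labels[i] for i in train_idx)
```

Because sampling happens within each label separately, the majority category cannot crowd the rarer categories out of the test set, which a simple random split on skewed data might do.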
Predictor Screening
Using the built-in predictor screening in SAS JMP, it is observed that some features are stronger predictors of the target variable than the other calculated features. This observation informs the choice of penalty type when fitting classification models such as neural networks.