Group14 proposal
Revision as of 04:04, 26 April 2020
Motivation and Objectives
Because of today’s competitive job market, high employee attrition is one of the most pressing issues businesses face. By 2023, voluntary employee turnover is expected to rise to nearly 30%. The major impacts of employee attrition, including a loss of employee continuity and the potentially high costs of inducting and training new staff, reduce organizational productivity.
Hence, by building different models, we want to find out what caused employees to leave IBM. With this understanding, executives and managers will be able to assess the current condition of their employees and act on the controllable factors to prevent attrition.
We want to identify the main reasons employees leave the company by finding the attributes responsible for attrition. Once these attributes are identified, the company can study the corresponding retention factors and apply them to each employee to improve retention.
Data Description
The data set is from IBM Community. There are 1471 entity instances in total with 30 attributes. Some fields in the data set are recorded as numbers but need to be treated as categorical variables; these are explained in detail below.
Data Fields | Description | Datatype |
---|---|---|
Attrition | Turnover status of employee, stored as "Yes" or "No" | Binary |
BusinessTravel | Whether the employee travels frequently or not | Categorical |
DailyRate | The daily rate of employee | Numerical |
Education | Education level of employee, 1 as 'Below College', 2 as 'College', 3 as 'Bachelor', 4 as 'Master', 5 as 'Doctor' | Categorical |
EnvironmentSatisfaction | The employee's satisfaction level with the working environment, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High' | Categorical |
JobInvolvement | Job involvement of employee, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High' | Categorical |
JobLevel | Job level of employee | Numerical |
JobSatisfaction | Job satisfaction of employee, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High' | Categorical |
NumCompaniesWorked | Total number of companies the employee worked for | Numerical |
OverTime | Whether the employee works overtime or not | Binary |
PercentSalaryHike | The rate of increase in income from last year to this year | Numerical |
PerformanceRating | Rate of performance for employee, 1 as 'Low', 2 as 'Good', 3 as 'Excellent', 4 as 'Outstanding' | Categorical |
RelationshipSatisfaction | Relationship satisfaction of employee, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High' | Categorical |
StockOptionLevel | The amount of stock the employee holds | Numerical |
TrainingTimesLastYear | Total training time for employee last year | Numerical |
WorkLifeBalance | Whether the employee feels balanced between work and life, 1 as 'Bad', 2 as 'Good', 3 as 'Better', 4 as 'Best' | Categorical |
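As a minimal sketch of how the numeric codes in the table could be turned into categorical labels, the snippet below maps a few of the coded fields to their labels. It is written in plain Python for illustration only (the project itself proposes R packages); the label maps are copied from the table, while the `recode` helper and the sample row are hypothetical.

```python
# Label maps copied from the data description table (a subset of the coded fields).
FOUR_LEVEL = {1: "Low", 2: "Medium", 3: "High", 4: "Very High"}
LABELS = {
    "Education": {1: "Below College", 2: "College", 3: "Bachelor",
                  4: "Master", 5: "Doctor"},
    "JobInvolvement": FOUR_LEVEL,
    "JobSatisfaction": FOUR_LEVEL,
    "RelationshipSatisfaction": FOUR_LEVEL,
    "PerformanceRating": {1: "Low", 2: "Good", 3: "Excellent", 4: "Outstanding"},
    "WorkLifeBalance": {1: "Bad", 2: "Good", 3: "Better", 4: "Best"},
}


def recode(record):
    """Return a copy of one row with integer codes replaced by their labels.

    Fields without a label map (for example DailyRate) pass through unchanged.
    """
    return {
        field: LABELS.get(field, {}).get(value, value)
        for field, value in record.items()
    }


# Hypothetical row for illustration; not taken from the real data set.
sample = {"Attrition": "Yes", "Education": 3, "JobSatisfaction": 4, "DailyRate": 1102}
recoded = recode(sample)
print(recoded)
```

Keeping the label maps in one place means that charts and model summaries can show 'Bachelor' or 'Very High' instead of bare integers.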
Critique of Existing Visualization
This radar chart presents all the variables from the dataset in a single graph. The two series (“0” for “No”, “1” for “Yes”) show the number of customers for each variable. The chart is reasonably clear and reveals plenty of valuable information, such as the unbalanced data in several variables, but it also has some problems. For instance, too many variables are presented on the chart at the same time. The best way to solve this problem is to apply feature engineering and then draw a new graph that better visualizes the influence of different factors on whether a customer churns.
Visualization
EDA
Dashboard Sketch
Methodology and Approach
In feature engineering, we will generate new variables from the existing ones and combine multiple features. We will then separate the employees who churned from those who did not, and separate the categorical columns from the numerical ones. The main aim is to select the variables that most effectively explain attrition. In the last step of this stage, we may also transform multi-valued variables into dummy variables and produce a variable summary.
Secondly, a correlation matrix will be visualized to present the relationships among the variables and give a preliminary understanding of the influence of each variable.
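The dummy-variable step and the split by attrition status might look roughly like the following. This is an illustrative plain-Python sketch rather than the project's R implementation; the `one_hot` helper, the three sample rows, and the `BusinessTravel` level names are assumptions made for the example.

```python
def one_hot(rows, categorical):
    """Expand each categorical field into 0/1 dummy columns, one per observed level."""
    levels = {f: sorted({row[f] for row in rows}) for f in categorical}
    encoded = []
    for row in rows:
        # Keep every field that is not being encoded as it is.
        out = {k: v for k, v in row.items() if k not in categorical}
        for f in categorical:
            for level in levels[f]:
                out[f"{f}_{level}"] = int(row[f] == level)
        encoded.append(out)
    return encoded


# Hypothetical rows; the level names are assumed, not taken from the data set.
rows = [
    {"Attrition": "Yes", "BusinessTravel": "Travel_Rarely", "OverTime": "Yes", "DailyRate": 1102},
    {"Attrition": "No", "BusinessTravel": "Non-Travel", "OverTime": "No", "DailyRate": 279},
    {"Attrition": "No", "BusinessTravel": "Travel_Frequently", "OverTime": "No", "DailyRate": 1373},
]

dummies = one_hot(rows, ["BusinessTravel", "OverTime"])

# Separate the two groups so their distributions can be compared side by side.
leavers = [r for r in dummies if r["Attrition"] == "Yes"]
stayers = [r for r in dummies if r["Attrition"] == "No"]
print(len(leavers), len(stayers))
```

This version keeps one column per level. For a regression model, one level per field is usually dropped to avoid perfectly collinear columns.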
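For reference, the values behind such a correlation matrix can be computed as Pearson coefficients between every pair of numerical columns. The sketch below uses plain Python with made-up column values, purely for illustration; in the proposed R stack, packages such as corrgram or ggcorrplot would handle the plotting.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (a constant column has zero
    variance and would cause a division by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values chosen so the expected correlations are easy to see.
columns = {
    "JobLevel": [1, 2, 3, 4, 5],
    "DailyRate": [200, 400, 600, 800, 1000],   # rises linearly with JobLevel
    "NumCompaniesWorked": [5, 4, 3, 2, 1],     # falls linearly with JobLevel
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Coefficients close to 1 or -1 point to strongly related variables, which helps when deciding which features to keep or combine.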
Models
We intend to build four different models and visualize them and their performance:
(1) For the logistic regression model, the confusion matrix and the receiver operating characteristic (ROC) curve will be visualized to evaluate performance. We will also show the variable importance in a bar chart that compares all variables. Furthermore, the selection of an appropriate classification threshold will be visualized to help users understand what lies beneath the model.
(2) The Synthetic Minority Oversampling Technique (SMOTE) will be applied to build an advanced logistic regression model, and we will try to form a user portrait from the most important features selected by the model.
(3) The decision tree will be generated with reference to the feature scores, using the Gini index to measure the impurity of each split. We also intend to use the high-scoring categorical features to make a ternary plot that visualizes the distribution of the two groups.
(4) To control over-fitting and improve predictive accuracy, we will also build and visualize a random forest classifier and compare its different trees.
Lastly, we will visualize a comparison of the four models in terms of recall, accuracy, precision and F1 score.
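To make the evaluation measures concrete, the following sketch derives a confusion matrix from predicted probabilities at a chosen threshold and then computes recall, accuracy, precision and F1 score. It is plain Python with invented labels and scores, offered as an illustration rather than the project's R code; the function names are hypothetical.

```python
def confusion(y_true, scores, threshold=0.5):
    """Count (tp, fp, tn, fn) when scores >= threshold are predicted as 1."""
    tp = fp = tn = fn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Invented data: 1 = employee left, 0 = employee stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]

for threshold in (0.5, 0.35):
    counts = confusion(y_true, scores, threshold)
    print(threshold, counts, metrics(*counts))
```

Lowering the threshold from 0.5 to 0.35 catches the one leaver that was missed, raising recall from 0.75 to 1.0. This is the kind of trade-off the threshold visualization is meant to expose, and sweeping the threshold across its full range gives the points of the ROC curve.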
Proposed R Packages
Package Name | Description |
---|---|
shiny | Build interactive web applications for data visualization |
reshape | Restructure and aggregate data by melting and casting |
plotly | Create interactive bar graphs and scatter plots |
tidyverse | A set of packages to plot out various visualizations and EDA |
readr | To read rectangular data |
recharts | Create interactive radar chart |
DT | Create data table |
ggraph | An extension of ggplot2 for graph and network visualization |
corrgram | Create correlation matrix |
ggthemes | Extra themes, scales and geoms for ggplot2 plots |
ggcorrplot | Visualize correlation matrix using ggplot2 |
plotrix | Create plot with two ordinates |