Difference between revisions of "Group14 proposal"

From ISSS608-Visual Analytics and Applications
 
(53 intermediate revisions by 2 users not shown)
Line 1: Line 1:
[[File:customer_churn.jpg|center|500px|Home - PicSource: https://medium.com/@timenalls/how-to-predict-customer-churn-with-pyspark-fb0d30f55253]]
+
[[File:HeaderPictureTeam14.jpg|center|800px]]
 
<div>
 
<div>
 
{|style="background-color:#000000;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0"  |
 
{|style="background-color:#000000;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0"  |
| style="font-family:Gungsuh; font-size:100%; solid #FFFFFF; background:#000000; text-align:center;" width="25%" |  
+
| style="font-family:Gungsuh; font-size:100%; solid #FFFFFF; background:#A3ACBF; text-align:center;" width="20%" |  
 
;
 
;
[[Group14_Proposal| <font color="##FFFFFF">Proposal</font>]]
+
[[Group14_Proposal| <font color="white">Proposal</font>]]
 
+
| style="font-family:Gungsuh; font-size:100%; solid #FFFFFF; background:#A3ACBF; text-align:center;" width="20%" |
| style="font-family:Gungsuh; font-size:100%; solid #FFFFFF; background:#000000; text-align:center;" width="25%" |  
+
;
 +
[[Group14_Poster| <font color="white">Poster</font>]]
 +
| style="font-family:Gungsuh; font-size:100%; solid #FFFFFF; background:#A3ACBF; text-align:center;" width="20%" |  
 
;
 
;
[[Group14_Poster| <font color="#FFFFFF">Poster</font>]]
+
[[Group14_User_Guide &_Application| <font color="white">User Guide & Application</font>]]
 
+
| style="font-family:Gungsuh; font-size:100%; solid #FFFFFF; background:#A3ACBF; text-align:center;" width="20%" |  
| style="font-family:Gungsuh; font-size:100%; solid #FFFFFF; background:#000000; text-align:center;" width="25%" |  
 
 
;
 
;
[[Group14_Application| <font color="#FFFFFF">Application</font>]]
+
[[Group14_Research_paper| <font color="white">Research Paper</font>]]
 +
| style="font-family:Gungsuh; font-size:100%; solid #FFFFFF; background:#A3ACBF; text-align:center;" width="20%" |
  
| style="font-family:Gungsuh; font-size:100%; solid #FFFFFF; background:#000000; text-align:center;" width="25%" |  
+
[[Project Groups| <font color="white">Back to Main Page</font>]]
;
+
| style="font-family:Gungsuh; font-size:100%; solid #FFFFFF; background:#A3ACBF; text-align:center;" width="20%" |  
[[Group14_Research_paper| <font color="#FFFFFF">Research Paper</font>]]
 
  
 
|}
 
|}
Line 23: Line 24:
  
 
== <big>Motivation and Objectives</big> ==
 
== <big>Motivation and Objectives</big> ==
<p> Nowadays, all industries in the world are facing fierce competition. With the development of telecom technology and social media, the Telco companies play more and more important role in the society. There are growing number of wireless carriers in the world. The U.S has four main wireless carriers and lots of little wireless carriers. It is no surprise that the companies in this industry face very fierce competition. Since this condition, the most significant problem for these organizations are customer remaining. As we know, companies from these industries often have customer service department. Their target is that winning back clients who is churn. Because it is generally acknowledged that recovering long-term customers can be worth much more to a company than acquiring new customers.
+
<p> The challenge for HR departments is how well they can retain talent and control employee turnover. Among the employee-related problems that businesses around the world face, attrition remains significant regardless of changes in the external working environment, and in today's competitive job market it is one of the most pressing. By 2023, voluntary employee turnover is expected to rise to nearly 30%. High attrition disrupts employee continuity and can incur high costs for the induction and training of new staff, both of which hurt organizational productivity.
<p> In order to understand more directly the main factors that affect customer churn and better maintain the relationship with customers, relevant models will be built so as to select and visualize important variables. Lastly, we will present the comparison among models towards Recall, Accuracy, Precision and F1 score and evaluate the performances of different models.
+
<p> Hence, it matters to understand what affects employees and to analyze what can be done to reduce attrition and adjust the personnel structure for higher work efficiency. By building different models, we aim to find out what caused employees at IBM to leave. With this insight, executives and managers can understand the current condition of their employees and act on controllable factors to prevent attrition. We therefore intend to design an application that helps the Human Resources department understand the profile of employees who leave versus those who stay, and the attrition patterns across all features. Such an application can help not only to predict unwanted attrition but also to inform action plans for reducing it, based on the organization's unique attributes.
 
 
== <big>Critique of Existing Visualization</big> ==
 
 
 
[[File:radar chart.png|400px|none]]
 
 
 
This radar chart has all the variables from the dataset to be presented in a single graph. We can acknowledge that by different series (“0” as “No”, “1” as “Yes”), each variable their number of customers. However, even though this visualization has revealed plenty of valuable information, such as unbalanced data in several variables with a fair enough graph, some questions can be spotted from the graph. For instance, too many variables have presented on the chart at the same time. The best way to solve this problem is to use feature engineering, then to draw a new graph to visualize better the influence of different factors to churn or not churn.
 
  
 
== <big>Data Description</big> ==
 
== <big>Data Description</big> ==
We collect the dataset from IBM Community. This dataset contains five spreadsheets.
+
<p>The data set is from the IBM Community. It contains 1471 entity instances with 30 attributes.
They contain the information about the demographics, location, population, services and status about customers.
+
Some attributes are recorded as numbers but must be treated as categorical variables; these are explained in detail below.
Demographic is the information about customers’ gender, age range, and if they have partners and dependents.
 
Location is the information about customers’ detail location such as country, city.
 
Status is the information about customers’ status of churn and the reason about churn.
 
There are 7043 entity instances in the dataset.
 
Each customer is identified by Customer_ID column.
 
There are 42 columns with 40 attributes.  
 
Customers who left within the last month is the column named Churn_Value. The churn customers are recorded as 1 and the non-churn customers are recorded as 0.
 
  
{| class="wikitable" style="width: 100%; height: 14em;"
+
{| class="wikitable" style="width: 100%; height: 50em;"
 
|-
 
|-
! Data Fields !! Description !! Example !! Datatype
+
! Data Fields !! Description !! Datatype
 
|-
 
|-
| Customer ID || Customer ID || 7590-VHVEG || Numeric
+
| Attrition || Turnover status of employee, stored as "Yes" or "No" ||  Binary
 
|-
 
|-
| gender || Whether the customer is a male or a female || Female || Binary
+
| BusinessTravel || How often the employee travels for business || Categorical
 
|-
 
|-
| SeniorCitizen || Whether the customer is a senior citizen or not (1, 0) || 0 || Binary
+
| DailyRate || The daily rate of employee || Numerical
 
|-
 
|-
| Partner || Whether the customer has a partner or not (Yes, No)  || Yes || Binary
+
| Education || Education level of employee, 1 as 'Below College', 2 as 'College', 3 as 'Bachelor', 4 as 'Master', 5 as 'Doctor'  || Categorical
 
|-
 
|-
| tenure || Number of months the customer has stayed with the company|| 1|| Numeric
+
| EnvironmentSatisfaction || Satisfaction level of the employee with the working environment, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High'  || Categorical
 
|-
 
|-
| PhoneService || Whether the customer has a phone service or not (Yes, No) || No || Binary
+
| JobInvolvement  || Job involvement of employee, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High'  || Categorical
 
|-
 
|-
| MultipleLines  || Whether the customer has multiple lines or not (Yes, No, No phone service) || No phone service || Categorical
+
| JobLevel  || Job level of employee || Numerical
 
|-
 
|-
| InternetService  || Customer’s internet service provider (DSL, Fiber optic, No)  || DSL  || Categorical
+
| JobSatisfaction || Job satisfaction of employee, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High' || Categorical
 
|-
 
|-
| OnlineSecurity || Whether the customer has online security or not (Yes, No, No internet service)  || No || Categorical
+
| NumCompaniesWorked || Total number of companies the employee worked for || Numerical
 
|-
 
|-
| OnlineBackup || Whether the customer has online backup or not (Yes, No, No internet service)  || No  || Categorical
+
| OverTime || Work overtime or not || Binary
 
|-
 
|-
| DeviceProtection ||Whether the customer has device protection or not (Yes, No, No internet service) || No || Categorical
+
| PercentSalaryHike || Percentage increase in salary from last year to this year ||  Numerical
 
|-
 
|-
| TechSupport  || Whether the customer has tech support or not (Yes, No, No internet service)  || No  || Categorical
+
| PerformanceRating || Rate of performance for employee, 1 as 'Low', 2 as 'Good', 3 as 'Excellent', 4 as 'Outstanding' || Categorical
 
|-
 
|-
| StreamingTV  || Whether the customer has streaming TV or not (Yes, No, No internet service)  || No  || Categorical
+
| RelationshipSatisfaction || Relationship satisfaction of employee, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High'|| Categorical
 
|-
 
|-
| StreamingMovies || Whether the customer has streaming movies or not (Yes, No, No internet service) || No  || Categorical
+
| StockOptionLevel  || Stock option level of the employee || Numerical
 
|-
 
|-
| Contract || The contract term of the customer (Month-to-month, One year, Two year)  || Month-to-month  || Categorical
+
| TrainingTimesLastYear || Number of times the employee was trained last year || Numerical
|-
 
| PaperlessBilling || Whether the customer has paperless billing or not (Yes, No)  ||Yes || Binary
 
|-
 
| aymentMethod || The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))  || Electronic check || Categorical
 
|-
 
| MonthlyCharges  || The amount charged to the customer monthly || 29.85  || Numeric
 
|-
 
| TotalCharges  || The total amount charged to the customer || 29.85  || Numeric
 
|-
 
| Churn  || Whether the customer churned or not (Yes or No)  || No  || Binary
 
 
|-
 
|-
 +
| WorkLifeBalance  || How the employee rates the balance between work and life, 1 as 'Bad', 2 as 'Good', 3 as 'Better', 4 as 'Best'|| Categorical
 +
 
|}
 
|}
  
== <big>Methodology and Approach</big> ==
+
== <big>Critique of Existing Visualization</big> ==
 +
 
 +
[[File:Group 14 critique pic.png|400px|none]]
 +
 
 +
<p> The radar chart above shows the performance and satisfaction levels of employees who left versus those who stayed. It reveals plenty of valuable information and is an effective way to visualize the data; however, only five features, each ranging from 1 to 4, are evaluated. Although such a chart is easier to prepare, it leaves users wondering how the other features compare with these five. A more principled way of selecting the features shown in the chart is therefore needed. We propose to use feature engineering to select the features and then draw a new graph that better visualizes the influence of different factors on attrition.
  
In Feature engineering, we will generate variables from the previous ones and compose multiple features together, after which we will separate churn and not churn customer and separate categorical and numerical columns. The main point of this is selecting effective variables which would result in customer attrition. Apart from that, we may transform multi value variables to the dummy variable in the last step of this stage, we can obtain the variable summary like below:
+
== <big>Visualization</big> ==
 +
<p> Basic exploratory data analysis (EDA) will be used for a preliminary understanding of the relationships among two or three features and their effects on attrition. Based on the overall relations among all features, a network graph will be applied to understand the structural characteristics of the features as a whole and to select the features fed into the subsequent models. We also plan to use three methods, Decision Tree, Random Forest and XGBoost, to build the attrition analytics model. Lastly, User Portrait Analysis will be applied to understand the differences among groups more intuitively, based on the most important features identified by the models. We hope users will gain a clearer understanding of how employee attrition is distributed across different dimensions and get hints on how to improve their attrition management.
  
[[File:variablesummary.jpg|400px|frame|none]]
+
=== Dashboard Sketch ===
 +
[[File:Pre-sketch_1.jpg|300px|Sketch_1]]
 +
[[File:Pre_sketch_2.jpg|300px|Sketch_2]]
 +
[[File:Pre_sketch_3.jpg|300px|Sketch_3]]
  
Secondly, a correlation matrix of this model will be visualized to present the relationship among different variables and primarily understand the influence of each variables.
+
== <big>Methodology and Approach</big> ==
 +
<p>Basic EDA is used for a preliminary understanding of the relationships among two or three features and their effects on attrition. Based on the overall relations among all features, a network graph is applied to understand the structural characteristics of the features as a whole and to select the features fed into the subsequent models. Three methods, Decision Tree, Random Forest and XGBoost, are used to build the attrition analytics model and predict the probability of employee attrition. Lastly, User Portrait Analysis is applied to understand the differences among groups more intuitively, based on the most important features identified by the models. In short, users will gain a clearer understanding of how employee attrition is distributed across different dimensions and get hints on how to improve their attrition management.
  
 
== <big>Models</big> ==
 
== <big>Models</big> ==
 +
<p> Three different models will be applied for attrition analysis, and their results and performance will be compared.
 +
<p> The decision tree algorithm works by repeatedly partitioning the data into sub-spaces so that the outcomes in each final sub-space are as homogeneous as possible. ROC curve plots will help display the model's performance intuitively. Finally, we can obtain the most important features from the model's result: the more often a feature is chosen to split the tree, the more important it is.
 +
<p> Random Forest is an ensemble model built from a collection of decision trees. Each decision tree in the forest considers a random subset of features when forming questions and only has access to a random sample of the training data points. Like a single decision tree, a Random Forest can still overfit, although averaging many trees generally reduces this risk. Random Forest also has a notable drawback: impurity-based importance scores are biased toward attributes with more distinct values, so the importance scores it produces on such data should be interpreted with caution.
 +
XGBoost is a gradient boosting model, whereas Random Forest is a bagging model. It is an implementation of gradient-boosted decision trees designed for speed and performance, and its sparsity-aware split finding handles missing or sparse values efficiently, which can improve model performance.
  
We intend to build four different models and visualize them and their performances:
+
== <big>Proposed R Packages</big> ==
(1) In Logistic regression model performance, confusion matrix and receiver operating characteristic will be visualized to evaluate the performance of the Logistic regression model. Besides, we can get the variable importance while comparing all variables by bar chart. Furthermore, obtaining the appropriate threshold for logistic regression will also be visualized to understand what is beneath the model. (2) Synthetic minority oversampling Technique will be applied to build the advanced logistic regression, and we will try to form user portrait using the most important features which are selected from the model. (3) The decision tree will be generated refer to the results of the feature score and compare the GINI coefficient to measure the degree of inequality of the distribution. And we intend to use high-score categorical features to make a ternary plot to visualize the distribution of two groups. (4) In order to control over-fitting and improve the predictive accuracy, we will also build and visualize Random Forest Classifier and compare different trees.
+
 
Lastly, we will visualize the comparison among four models towards Recall, Accuracy, Precision and F1 score and compare the performances of different models.
+
{| class="wikitable"
 +
|-
 +
! Package Name !! Description
 +
|-
 +
| shinydashboard || Enable the usage and design of shiny dashboard
 +
|-
 +
| shiny || Make the interactive web applications for data visualization
 +
|-
 +
| reshape || Restructure and aggregate data flexibly
 +
|-
 +
| plotly || Create interactive bar graphs and scatter plots
 +
|-
 +
| tidyverse || A set of packages to plot out various visualizations and EDA
 +
|-
 +
| readr || To read rectangular data
 +
|-
 +
| recharts || Create interactive radar chart
 +
|-
 +
| DT || Create data table
 +
|-
 +
| ggraph || An extension of ggplot2 for plotting graph and network data layer by layer
 +
|-
 +
| corrgram || Create correlation matrix
 +
|-
 +
| ggthemes || Extra themes, scales and geoms for ggplot2
 +
|-
 +
| ggcorrplot || Visualize correlation matrix using ggplot2
 +
|-
 +
| plotrix || Create plot with two ordinates
 +
|}
  
 
== <big>Team Members</big> ==
 
== <big>Team Members</big> ==
Line 107: Line 129:
 
* [https://www.linkedin.com/in/yawen-shi-0046/ Shi Yawen]
 
* [https://www.linkedin.com/in/yawen-shi-0046/ Shi Yawen]
 
* [https://www.linkedin.com/in/bridgitzhu/ Zhu Keyu]
 
* [https://www.linkedin.com/in/bridgitzhu/ Zhu Keyu]
 +
 +
== <big>References</big> ==
 +
*[https://info.workinstitute.com/hubfs/2019%20Retention%20Report/Work%20Institute%202019%20Retention%20Report%20final-1.pdf#page=7 Work Institute 2019 Retention Report final]
 +
*[https://www.tinypulse.com/blog/7-common-causes-of-high-employee-turnover 7 Common (but Fixable) Causes of Employee Turnover]
 +
*[http://thecontextofthings.com/2017/01/06/employee-attrition/ Employee attrition has a “quick fix”]
 +
*[https://www.data-imaginist.com/2017/ggraph-introduction-layouts/ Introduction to ggraph: Layouts]

Latest revision as of 05:34, 27 April 2020


Motivation and Objectives

The challenge for HR departments is how well they can retain talent and control employee turnover. Among the employee-related problems that businesses around the world face, attrition remains significant regardless of changes in the external working environment, and in today's competitive job market it is one of the most pressing. By 2023, voluntary employee turnover is expected to rise to nearly 30%. High attrition disrupts employee continuity and can incur high costs for the induction and training of new staff, both of which hurt organizational productivity.

Hence, it matters to understand what affects employees and to analyze what can be done to reduce attrition and adjust the personnel structure for higher work efficiency. By building different models, we aim to find out what caused employees at IBM to leave. With this insight, executives and managers can understand the current condition of their employees and act on controllable factors to prevent attrition. We therefore intend to design an application that helps the Human Resources department understand the profile of employees who leave versus those who stay, and the attrition patterns across all features. Such an application can help not only to predict unwanted attrition but also to inform action plans for reducing it, based on the organization's unique attributes.

Data Description

The data set is from the IBM Community. It contains 1471 entity instances with 30 attributes. Some attributes are recorded as numbers but must be treated as categorical variables; these are explained in detail below.

{| class="wikitable"
|-
! Data Fields !! Description !! Datatype
|-
| Attrition || Turnover status of the employee, stored as "Yes" or "No" || Binary
|-
| BusinessTravel || How often the employee travels for business || Categorical
|-
| DailyRate || Daily pay rate of the employee || Numerical
|-
| Education || Education level of the employee, 1 as 'Below College', 2 as 'College', 3 as 'Bachelor', 4 as 'Master', 5 as 'Doctor' || Categorical
|-
| EnvironmentSatisfaction || Satisfaction level of the employee with the working environment, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High' || Categorical
|-
| JobInvolvement || Job involvement of the employee, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High' || Categorical
|-
| JobLevel || Job level of the employee || Numerical
|-
| JobSatisfaction || Job satisfaction of the employee, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High' || Categorical
|-
| NumCompaniesWorked || Total number of companies the employee has worked for || Numerical
|-
| OverTime || Whether the employee works overtime || Binary
|-
| PercentSalaryHike || Percentage increase in salary from last year to this year || Numerical
|-
| PerformanceRating || Performance rating of the employee, 1 as 'Low', 2 as 'Good', 3 as 'Excellent', 4 as 'Outstanding' || Categorical
|-
| RelationshipSatisfaction || Relationship satisfaction of the employee, 1 as 'Low', 2 as 'Medium', 3 as 'High', 4 as 'Very High' || Categorical
|-
| StockOptionLevel || Stock option level of the employee || Numerical
|-
| TrainingTimesLastYear || Number of times the employee was trained last year || Numerical
|-
| WorkLifeBalance || How the employee rates the balance between work and life, 1 as 'Bad', 2 as 'Good', 3 as 'Better', 4 as 'Best' || Categorical
|}
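
Several of the fields above store an ordinal code rather than a label. The following is a minimal sketch of decoding such codes before plotting, written in Python for illustration only (the project itself proposes R). The label maps are copied from the descriptions above; the helper name `decode_row` and the example record are hypothetical and not taken from the data set.

```python
# Label maps copied from the field descriptions; only three fields shown.
LABELS = {
    "Education": {1: "Below College", 2: "College", 3: "Bachelor",
                  4: "Master", 5: "Doctor"},
    "JobSatisfaction": {1: "Low", 2: "Medium", 3: "High", 4: "Very High"},
    "WorkLifeBalance": {1: "Bad", 2: "Good", 3: "Better", 4: "Best"},
}


def decode_row(row):
    """Return a copy of `row` with coded fields replaced by their labels."""
    decoded = dict(row)
    for field, mapping in LABELS.items():
        if field in decoded:
            # Codes may arrive as strings when read from a CSV file.
            decoded[field] = mapping[int(decoded[field])]
    return decoded


# Made-up record for illustration only.
example = {"Attrition": "Yes", "Education": 3,
           "JobSatisfaction": "1", "WorkLifeBalance": 2}
print(decode_row(example))
```

Fields without a label map, such as Attrition, pass through unchanged, so the same helper can be applied to every record.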

Critique of Existing Visualization

Group 14 critique pic.png

The radar chart above shows the performance and satisfaction levels of employees who left versus those who stayed. It reveals plenty of valuable information and is an effective way to visualize the data; however, only five features, each ranging from 1 to 4, are evaluated. Although such a chart is easier to prepare, it leaves users wondering how the other features compare with these five. A more principled way of selecting the features shown in the chart is therefore needed. We propose to use feature engineering to select the features and then draw a new graph that better visualizes the influence of different factors on attrition.

Visualization

Basic exploratory data analysis (EDA) will be used for a preliminary understanding of the relationships among two or three features and their effects on attrition. Based on the overall relations among all features, a network graph will be applied to understand the structural characteristics of the features as a whole and to select the features fed into the subsequent models. We also plan to use three methods, Decision Tree, Random Forest and XGBoost, to build the attrition analytics model. Lastly, User Portrait Analysis will be applied to understand the differences among groups more intuitively, based on the most important features identified by the models. We hope users will gain a clearer understanding of how employee attrition is distributed across different dimensions and get hints on how to improve their attrition management.

Dashboard Sketch

Sketch_1 Sketch_2 Sketch_3

Methodology and Approach

Basic EDA is used for a preliminary understanding of the relationships among two or three features and their effects on attrition. Based on the overall relations among all features, a network graph is applied to understand the structural characteristics of the features as a whole and to select the features fed into the subsequent models. Three methods, Decision Tree, Random Forest and XGBoost, are used to build the attrition analytics model and predict the probability of employee attrition. Lastly, User Portrait Analysis is applied to understand the differences among groups more intuitively, based on the most important features identified by the models. In short, users will gain a clearer understanding of how employee attrition is distributed across different dimensions and get hints on how to improve their attrition management.
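
As one possible form of the pairwise EDA step, the attrition rate can be tabulated per level of a single feature. The sketch below is in Python for illustration (the project proposes R); the function name `attrition_rate_by` and the six records are made up and do not come from the data set.

```python
from collections import defaultdict


def attrition_rate_by(rows, feature):
    """Share of employees with Attrition == "Yes" for each value of `feature`."""
    counts = defaultdict(lambda: [0, 0])  # value -> [leavers, total]
    for row in rows:
        counts[row[feature]][1] += 1
        if row["Attrition"] == "Yes":
            counts[row[feature]][0] += 1
    return {value: leavers / total
            for value, (leavers, total) in counts.items()}


# Made-up records for illustration only.
rows = [
    {"OverTime": "Yes", "Attrition": "Yes"},
    {"OverTime": "Yes", "Attrition": "No"},
    {"OverTime": "No", "Attrition": "No"},
    {"OverTime": "No", "Attrition": "No"},
    {"OverTime": "No", "Attrition": "Yes"},
    {"OverTime": "No", "Attrition": "No"},
]
rates = attrition_rate_by(rows, "OverTime")
print(rates)
```

A large gap between the rates of two levels is a hint that the feature is worth carrying forward to the models; it is not evidence of causation.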

Models

Three different models will be applied for attrition analysis, and their results and performance will be compared.
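
One way to put the models on a common scale is the area under the ROC curve (AUC). This is an assumption for illustration, since the proposal does not fix a comparison metric. The Python sketch below computes AUC as the share of (leaver, stayer) pairs that a model ranks correctly; the scores for `model_a` and `model_b` are made up.

```python
def roc_auc(scores, labels):
    """AUC as the share of (leaver, stayer) pairs ranked correctly.

    `labels` uses 1 for employees who left and 0 for those who stayed.
    Tied scores count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up predicted attrition probabilities from two hypothetical models.
labels = [1, 0, 1, 0]
model_a = [0.9, 0.8, 0.4, 0.3]
model_b = [0.9, 0.2, 0.8, 0.1]
print(roc_auc(model_a, labels), roc_auc(model_b, labels))
```

The pairwise loop is quadratic in the number of records, which is acceptable for a data set of this size but not for much larger ones.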

The decision tree algorithm works by repeatedly partitioning the data into sub-spaces so that the outcomes in each final sub-space are as homogeneous as possible. ROC curve plots will help display the model's performance intuitively. Finally, we can obtain the most important features from the model's result: the more often a feature is chosen to split the tree, the more important it is.
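
To make "as homogeneous as possible" concrete, the sketch below scores candidate splits with Gini impurity. Gini is one common homogeneity measure and is assumed here; the proposal does not say which criterion will be used. The code is Python for illustration, and the six records are made up.

```python
def gini(labels):
    """Binary Gini impurity: 0.0 for a pure group, 0.5 for an even mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_yes = sum(1 for y in labels if y == "Yes") / n
    return 2 * p_yes * (1 - p_yes)


def best_threshold(values, labels):
    """Find the split `value <= t` with the lowest weighted impurity.

    Returns (threshold, weighted_impurity); the threshold is None when
    no split improves on the unsplit group.
    """
    n = len(labels)
    best = (None, gini(labels))
    for t in sorted(set(values))[:-1]:  # the largest value cannot split
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (t, weighted)
    return best


# Made-up JobSatisfaction codes and attrition outcomes.
satisfaction = [1, 1, 2, 3, 4, 4]
attrition = ["Yes", "Yes", "Yes", "No", "No", "No"]
print(gini(attrition), best_threshold(satisfaction, attrition))
```

A full tree repeats this search over every feature and then recurses into each side of the chosen split.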

Random Forest is an ensemble model built from a collection of decision trees. Each decision tree in the forest considers a random subset of features when forming questions and only has access to a random sample of the training data points. Like a single decision tree, a Random Forest can still overfit, although averaging many trees generally reduces this risk. Random Forest also has a notable drawback: impurity-based importance scores are biased toward attributes with more distinct values, so the importance scores it produces on such data should be interpreted with caution.

XGBoost is a gradient boosting model, whereas Random Forest is a bagging model. It is an implementation of gradient-boosted decision trees designed for speed and performance, and its sparsity-aware split finding handles missing or sparse values efficiently, which can improve model performance.
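
The two sources of randomness in a Random Forest, and the vote that combines the trees, can be sketched as follows. This is a Python illustration under simplifying assumptions, not the project's R implementation: the helper names are hypothetical, no trees are actually fitted, and the feature subset is drawn once per tree here, whereas a real forest redraws it at every split.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement (the 'bagging' step)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(features, k, rng):
    """Pick k distinct features that a tree may use for its questions."""
    return rng.sample(features, k)


def majority_vote(predictions):
    """Combine per-tree class predictions into the forest's prediction."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(608)  # fixed seed so the sketch is repeatable
features = ["OverTime", "JobLevel", "JobSatisfaction", "WorkLifeBalance"]
for tree in range(3):
    rows = bootstrap_indices(6, rng)
    subset = random_feature_subset(features, 2, rng)
    print(tree, rows, subset)

print(majority_vote(["Yes", "No", "No"]))
```

Because each tree sees different rows and features, the trees make partly independent errors, which is why averaging them tends to be more stable than a single tree.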

Proposed R Packages

{| class="wikitable"
|-
! Package Name !! Description
|-
| shinydashboard || Build dashboards with Shiny
|-
| shiny || Build interactive web applications for data visualization
|-
| reshape || Restructure and aggregate data flexibly
|-
| plotly || Create interactive bar graphs and scatter plots
|-
| tidyverse || A set of packages for data manipulation, visualization and EDA
|-
| readr || Read rectangular data
|-
| recharts || Create interactive radar charts
|-
| DT || Create interactive data tables
|-
| ggraph || An extension of ggplot2 for plotting graph and network data layer by layer
|-
| corrgram || Plot a correlogram of the correlation matrix
|-
| ggthemes || Extra themes, scales and geoms for ggplot2
|-
| ggcorrplot || Visualize a correlation matrix using ggplot2
|-
| plotrix || Create plots with two ordinates (y-axes)
|}

Team Members

References