Difference between revisions of "ANLY482 AY2016-17 T2 Group23 Silver Daisies Analysis and Findings"

Data Quality

The data were provided in two forms. The daily operation schedules are Excel files or hard copies, with one sheet per day, laid out as tables rather than in a database structure. The vehicle data are collected automatically by the system but stored in a summary format, so obtaining daily data requires selecting each date and extracting the records manually.
  
The drivers' logged driving hours do not match the vehicles' moving durations, because drivers occasionally change vehicles. The data also lack variables that directly reflect driving cost, such as the amount of fuel consumed.

In the daily operations scheduling data, some vehicle numbers are missing, so the vehicle used cannot be identified. Many of the written entries for driver names and appointment purposes are inconsistent and ambiguous. Data for February are missing because the file was corrupted and has been deleted. Two months of data are also missing for one vehicle because its new tracking system was installed late.

Latest revision as of 02:16, 9 April 2018

Univariate Analysis

Appointment Duration

We performed univariate analysis on appointment duration to find out the current distribution and summary statistics and obtained the following results:

Duration.png
  • Mean duration: 1:22
  • Median duration: 1:12
  • Standard deviation: 0:47
  • Distribution: not normal, so non-parametric methods are used
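
As a rough illustration of how summary statistics like these can be computed, here is a minimal Python sketch. It assumes the durations are recorded as h:mm strings. The sample values and the helper names `to_minutes` and `to_hmm` are hypothetical and are not taken from the project data.

```python
from statistics import mean, median, stdev

def to_minutes(hmm: str) -> int:
    """Convert an 'h:mm' string such as '1:22' to whole minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)

def to_hmm(total_minutes: float) -> str:
    """Format a number of minutes as 'h:mm', rounded to the nearest minute."""
    rounded = round(total_minutes)
    return f"{rounded // 60}:{rounded % 60:02d}"

# Hypothetical appointment durations (assumed h:mm), not the project data.
durations = ["0:45", "1:00", "1:10", "1:15", "2:40"]
minutes = [to_minutes(d) for d in durations]

summary = {
    "mean": to_hmm(mean(minutes)),
    "median": to_hmm(median(minutes)),
    "sd": to_hmm(stdev(minutes)),  # sample standard deviation
}
print(summary)
```

A mean that is larger than the median, as in the figures above, is typical of a right-skewed distribution, which is one reason to prefer medians and non-parametric tests.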

Bivariate Analysis

We performed bivariate analysis on each independent variable to find its individual effect on appointment duration. The main findings are:

Appointment Clinic

The median duration differs from clinic to clinic. Clinics with longer appointment durations include SNEC (Singapore National Eye Centre), NUH (National University Hospital), and NCC (National Cancer Centre). The shorter durations come mainly from neighborhood polyclinics, such as Geylang Polyclinic, BMP (Bukit Merah Polyclinic), and CWC (Community Wellness Centre).

Appt clinic.png

Appointment Purpose

From the distribution analysis, we can see that Ophthalmologist appointments take the longest, while Tests appointments take the shortest.

Newmed.png

Using a median test of significance, we observed that the duration of a non-fasting blood test is significantly shorter than that of a fasting blood test.

BT.jpg
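
The report does not state which median test was used. As one possible illustration, the sketch below implements Mood's median test for two groups in plain Python. The function name `moods_median_test` and the sample durations are hypothetical, and the chi-square statistic is computed without a continuity correction.

```python
from math import erfc, sqrt
from statistics import median

def moods_median_test(group_a, group_b):
    """Mood's median test for two groups (chi-square, 1 degree of freedom).

    Returns (chi_square, p_value). Assumes every cell of the 2x2 table
    has a non-zero expected count.
    """
    groups = (list(group_a), list(group_b))
    grand_median = median(groups[0] + groups[1])
    # 2x2 table: values above / not above the grand median in each group.
    above = [sum(x > grand_median for x in g) for g in groups]
    not_above = [len(g) - a for g, a in zip(groups, above)]
    n = len(groups[0]) + len(groups[1])
    col_totals = (len(groups[0]), len(groups[1]))
    chi_square = 0.0
    for observed_row in (above, not_above):
        row_total = sum(observed_row)
        for observed, col_total in zip(observed_row, col_totals):
            expected = row_total * col_total / n
            chi_square += (observed - expected) ** 2 / expected
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi_square / 2))
    return chi_square, p_value

# Hypothetical durations in minutes, not the project data.
fasting = [70, 75, 80, 85, 90, 95]
non_fasting = [30, 35, 40, 45, 50, 55]
chi_square, p_value = moods_median_test(fasting, non_fasting)
print(chi_square, p_value)
```

A small p-value indicates that the two groups are unlikely to share the same median.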

Escort

We compared appointment durations under three types of accompaniment: next-of-kin (NOK), family (more than one NOK), and medical escort. The median appointment durations differ. Based on a non-parametric pairwise significance test, we conclude that the median appointment duration is lower with a medical escort than with a NOK or family.

Escort.PNG

TestEscort.PNG

Multivariate Regression

Stepwise Regression

As there are many independent variables and sub-variables, we performed a stepwise multiple linear regression to select the relevant variables. The results also show how the partition splits are made.

Boxcox.png

Based on the Box-Cox transformation, the y variable (appointment duration) was transformed using the suggested lambda value of 0.263. It was then transformed again using a lambda of 0.986, after which the suggested lambda was 1.0, which signifies that no further transformation is needed. The Residual by Predicted plots are shown below.

Residualbef.png
Residualaft.png


Residuals before transformation (left) and after transformation (right)
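
For reference, the textbook form of the Box-Cox transformation can be sketched as follows. This is the standard definition, and the statistical package used for the report may scale it differently. The function name `box_cox` and the example value are illustrative only.

```python
from math import log

def box_cox(y: float, lam: float) -> float:
    """Textbook Box-Cox transform of a positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        # The limit of (y**lam - 1) / lam as lam approaches 0 is log(y).
        return log(y)
    return (y ** lam - 1) / lam

# Hypothetical duration of 82 minutes.
print(box_cox(82, 0.263))  # strongly compresses large values
print(box_cox(82, 1.0))    # lambda = 1 only shifts y by 1, so the shape is unchanged
```

A suggested lambda of 1.0 leaves the shape of the response unchanged, which is why it signals that no further transformation is needed.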

Durbin-Watson Autocorrelation Test

Autotest.PNG

Based on the Durbin-Watson test, there is no significant evidence that the residuals of the model are autocorrelated.
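
The Durbin-Watson statistic itself is straightforward to compute from the ordered residuals. The sketch below uses made-up residuals. The function name `durbin_watson` is illustrative, and the significance test reported above additionally relies on the critical values or p-value supplied by the statistical package.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Made-up residuals that alternate in sign (negative autocorrelation).
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0
```

The statistic ranges from 0 to 4. Values near 2 indicate no first-order autocorrelation, values near 0 indicate positive autocorrelation, and values near 4 indicate negative autocorrelation.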

Model Fit

Modelfit.PNG

The Summary of Fit shows an R-square of 0.318, meaning that the model explains 31.8% of the variation in the response. The adjusted R-square is 0.311, only slightly lower than the R-square, which gives little evidence that the model is overfitted.

Lackoffit.PNG

This is further supported by the high p-value (0.437) in the Lack of Fit analysis, which means there is no significant evidence of lack of fit when the model is compared with a saturated model. The Lack of Fit report also gives a maximum R-square of 0.526, the most that can be attained with these variables, so the model can still be improved. This usually means that combinations of variables should be explored.

Predplot.PNG

Because many of the variables are categorical, the points are widely spread vertically. Although the predictions are not very reliable, there is no visible evidence that the data depart from a linear regression model.
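
The adjusted R-square in the Summary of Fit penalises the R-square for the number of parameters. The sketch below shows the usual formula. The sample size `n = 500` and parameter count `p = 8` are hypothetical, because the report does not state them.

```python
def adjusted_r_square(r_square: float, n: int, p: int) -> float:
    """Adjusted R-square for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r_square) * (n - 1) / (n - p - 1)

# R-square from the report; n and p are assumed values for illustration only.
print(adjusted_r_square(0.318, n=500, p=8))
```

The gap between the two values grows as more predictors are added relative to the number of observations, so a small gap suggests the model is not carrying many redundant terms.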

Recursive Partitioning

Recursive partitioning was performed on the independent variables purpose and clinic, as each has many sub-variables.

Decisiontree.PNG

Based on the above results, which show the number of splits required, the variables were recoded into ordinal variables based on their mean durations. Other independent variables can also be recoded into ordinal variables based on their relationship with the response variable. Reducing the number of factor levels allows combinations of x variables to be explored to produce a better predictive model.
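
A simplified sketch of this kind of recoding is shown below. It ranks each category by its mean duration and ignores the grouping of categories by partition splits described above, so it is only an approximation of the approach. The function name `ordinal_codes_by_mean` and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def ordinal_codes_by_mean(records):
    """Map each category to a rank, where 1 is the shortest mean duration.

    records: iterable of (category, duration_in_minutes) pairs.
    """
    by_category = defaultdict(list)
    for category, minutes in records:
        by_category[category].append(minutes)
    ordered = sorted(by_category, key=lambda c: mean(by_category[c]))
    return {category: rank for rank, category in enumerate(ordered, start=1)}

# Hypothetical (clinic, minutes) records, not the project data.
records = [
    ("Geylang Polyclinic", 40),
    ("Geylang Polyclinic", 50),
    ("SNEC", 120),
    ("SNEC", 100),
    ("NUH", 90),
]
codes = ordinal_codes_by_mean(records)
print(codes)
```

Replacing a categorical variable with many levels by a single ordinal code reduces the number of parameters, which makes interaction terms between variables practical to fit.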

Interaction Analysis

A few variables were combined and permuted, and the combinations were added one at a time to test whether the adjusted R-square increased.


Varimp1.PNG

Based on the variable importance analysis, Ordinal Clinic*Ordinal Purpose explains 96.6% of the relationship. In practical terms, the user can simply look at the LS Means Plot of this variable for a quick estimate of the appointment duration.

However, the LS Means Plot shows that some levels have a rather high standard error. Predictions will be less accurate for levels 16, 25, 32 and 40 and more accurate for the other levels.

Varimp2.png

The LS Means Plot of Ordinal Escort*Ordinal Walkability shows that level 3 stands out as having the greatest impact on duration. Level 3 occurs only when the escort is Family and the client is in a wheelchair. This suggests that family escorts tend to take longer to move with their wheelchair-bound family member.

Varimp3.png