Difference between revisions of "ANLY482 AY2016-17 T2 Group23 Silver Daisies Analysis and Findings"

From Analytics Practicum
 
(43 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<!------- Logo ---->
 
<center>
 
[[File:THKMC.png|300px|frameless|center]]
 
</center>
 
  
 
<!------- Main Navigation Bar---->
 
<!------- Main Navigation Bar---->
Line 29: Line 25:
 
<!------- End of Main Navigation Bar---->
 
<!------- End of Main Navigation Bar---->
  
== Data Quality ==
+
== Univariate Analysis ==
The data come in two forms. The daily operation schedule is kept in Excel or hard-copy format, with one sheet per day, and is laid out as tables rather than in a database structure. The vehicle data are collected automatically by the system in a summary format, so obtaining daily data requires manually selecting each date and extracting the records.
+
=== Appointment Duration ===
 +
We performed univariate analysis on appointment duration to find out the current distribution and summary statistics and obtained the following results:
 +
[[File:Duration.png|700px|frameless|none]]
 +
 
 +
* Mean duration: 1:22
 +
* Median duration: 1:12
 +
* Standard deviation: 0:47
 +
* Distribution: not normal (the mean exceeds the median, suggesting right skew), so non-parametric methods are used
 +
 
 +
== Bivariate Analysis ==
 +
We performed bivariate analysis on each independent variable to find out their individual effect on appointment duration and obtained the following main findings:
 +
 
 +
===Appointment Clinic ===
 +
The median duration differs from clinic to clinic. Clinics with longer appointment durations include SNEC (Singapore National Eye Centre), NUH (National University Hospital), and NCC (National Cancer Centre). The shorter durations mainly come from neighborhood polyclinics, such as Geylang Polyclinic, BMP (Bukit Merah Polyclinic), and CWC (Community Wellness Centre).
 +
[[File:Appt clinic.png|400px|frameless|none]]
 +
===Appointment Purpose===
 +
From the distribution analysis, we can see that Ophthalmologist appointments take the longest, while Tests appointments take the shortest.
 +
[[File:newmed.png|400px|frameless|none]]
 +
Through our median test of significance, we observed that blood test (non-fasting) is significantly shorter than blood test (fasting).
 +
[[File:BT.jpg|300px|frameless|none]]
 +
 
 +
===Escort===
 +
Comparing appointment durations across next-of-kin (NOK) accompaniment, family accompaniment (more than one NOK), and medical escort accompaniment, we see a difference in median appointment durations. We conducted non-parametric pairwise significance tests and conclude that the median appointment duration is lower with a medical escort than with NOK or family accompaniment.
 +
 
 +
[[File:Escort.PNG|400px|frameless]]
 +
 
 +
[[File:TestEscort.PNG|700px|frameless]]
 +
 
 +
==Multivariate Regression==
 +
===Stepwise Regression===
 +
As we have many independent variables and sub-variables, we performed a stepwise multiple linear regression to select the relevant variables. The results also reveal how the partition splits are done.
 +
[[File:Boxcox.png|400px|frameless|none]]
 +
Based on the Box-Cox transformation, the y variable (appointment duration) was transformed using the suggested lambda value of 0.263. It was then transformed again using a suggested lambda of 0.986, after which the suggested lambda value was 1.0, which signifies that no further transformation is needed. The Residual by Predicted Plot is shown below.
 +
[[File:residualbef.png|400px|frameless|left]][[File:residualaft.png|400px|frameless|none]]
 +
 
  
There are discrepancies between the drivers' logged driving hours and the vehicles' moving durations, as drivers occasionally change vehicles. There is also a lack of variables that directly reflect driving cost, such as the amount of fuel consumed.
+
''<center><small><small>Residuals before transformation (left) and after transformation (right)</small></small></center>
In the daily operations scheduling data, some vehicle numbers are missing, so the vehicle used cannot be identified. Many of the entries for driver names and purpose of appointment are also inconsistent and ambiguous. Some data for February are missing because the file was corrupted and has been deleted. Vehicle data for one vehicle are also missing because its new tracking system was installed two months late.
+
''
  
== Data Exploratory Analysis ==
+
===Durbin-Watson Autocorrelation Test ===
=== APPOINTMENT DEMAND ACROSS WEEKS ===
+
[[File:Autotest.PNG|400px|frameless|none]]
 +
Based on the Durbin-Watson test, there is no significant evidence that the model's residuals are autocorrelated.
  
[[File:Week_Distribution.PNG|500px|frameless|center]]
+
=== Model Fit ===
From the chart above, we can see that the start and end of the 3-month period have fewer appointments. Levels 6 to 10 represent the weeks in February, 11 to 14 the weeks in March, 15 to 18 the weeks in April, and 19 to 23 the weeks in May.
 
  
The team believes that the lower number of appointments in week 6 and week 23 could be due to the way the number of appointments is counted in JMP. As 1 Feb 2017 (start of week 6) falls in the middle of a week and 31 May 2017 (week 23) also falls in the middle of a week, the lower number of appointments is most likely due to fewer days being taken into account.
+
[[File:Modelfit.PNG|400px|frameless|none]]
  
Based on the above reasoning, the team considers only the data for week 7 to week 22. Across these 16 weeks, the number of appointments per week is relatively constant, with no sharp spikes or drops. This suggests that demand for the MET service does not fluctuate over a short time span, so capacity planning and utilization may not be a concern for THK (i.e. the current fleet capacity is likely to be able to meet demand over the next 3 months).
+
The Summary of Fit shows an R-square of 0.318, meaning that the model explains 31.8% of the variation in the response (appointment duration). The adjusted R-square is 0.311, only slightly lower than the R-square, which gives little evidence that the model is overfitted.
  
Also, there are (2257 - 71 - 82)/16 = 131.5 appointments per week on average.  
+
[[File:Lackoffit.PNG|400px|frameless|none]]
 +
The Lack of Fit analysis is consistent with this. Its p-value of 0.437 is high, so there is no significant evidence of lack of fit for the variables in the model. The analysis also reports a maximum R-square of 0.526, the most that a model using only these variables could achieve, which is well above the current 0.318. This usually means that combinations of variables need to be explored.
  
=== APPOINTMENT DAYS OF THE WEEK ===
+
[[File:Predplot.PNG|400px|frameless|none]] Because the model has many categorical variables, the points are widely spread vertically. Although the predictions are not very reliable, the plot shows no obvious departure from a linear regression model.
[[File:Days_distribution.PNG|500px|frameless|center]]
 
Based on the dataset, we can see that Mondays to Thursdays have approximately the same number of appointments, while Fridays have fewer appointments on average. This could be due to the shorter operating hours on Fridays. We can also see that MET does not operate on weekends.
 
  
=== DISTRIBUTION BY APPOINTMENT CLINIC AND PURPOSE ===
+
=== Recursive Partitioning ===
[[File:Clinic_distribution.PNG|500px|frameless|center]]
+
Recursive partitioning was performed on the independent variables purpose and clinic, as each has many sub-variables. [[File:Decisiontree.PNG|200px|frameless|right]]
Based on the chart above, we can see that SGH has a significantly higher number of appointments compared to all other clinics. SGH accounts for approximately 40% of the total number of appointments.
 
  
[[File:appt_time.PNG|500px|frameless|center]]
+
Based on the above results, which show the number of splits required, the variables were recoded into ordinal variables based on their mean durations. Other independent variables can also be coded into ordinal variables based on their relationship with the response variable. Reducing the number of factor levels allows combinations of x variables to be explored to produce a better predictive model.
From the graph, we can estimate that the lunch period falls between 12:30pm and 1:30pm, when there are the fewest appointments. The two highest peaks occur at 9am and 2pm, which means that the drivers should prepare to pick up PNC clients at 8:15am and 1:15pm. We can also see that 60% of appointments are held in the morning, which suggests that demand for the MET service is greater in the morning.
 
  
=== Analysis of Vehicle Data and its behaviour ===
+
=== Interaction Analysis ===
We are currently seeking permission to release our analysis on their vehicle data.
+
A few variables were combined into interaction terms and added one at a time to test whether the adjusted R-square increases.
  
== Geo-spatial Analysis ==
+
== Client Cluster Analysis ==
+
[[File:Varimp1.PNG|400px|frameless|none]]
== Time-series Analysis ==
+
Based on the variable importance analysis, Ordinal Clinic*Ordinal Purpose accounts for 96.6% of the relationship explained by the model. In practice, the user can simply look at the LS Means Plot of this variable to get a quick gauge of the appointment duration.
 +
 +
However, the LS Means plot shows that some levels have a rather high standard error. Predictions will be less accurate for levels 16, 25, 32 and 40 and more accurate for the other levels.
 +
[[File:Varimp2.png|400px|frameless|none]]
 +
 +
The LS Means Plot of Ordinal Escort*Ordinal Walkability shows that level 3 has the greatest impact on duration. Level 3 can only be reached when a family escort accompanies a client in a wheelchair. This suggests that family escorts tend to take longer to move with a wheelchair-bound family member.
 +
[[File:Varimp3.png|400px|frameless|none]]

Latest revision as of 02:16, 9 April 2018

Univariate Analysis

Appointment Duration

We performed univariate analysis on appointment duration to find out the current distribution and summary statistics and obtained the following results:

Duration.png
  • Mean duration: 1:22
  • Median duration: 1:12
  • Standard deviation: 0:47
  • Distribution: not normal (the mean exceeds the median, suggesting right skew), so non-parametric methods are used
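
As a rough illustration of how summary statistics like these can be computed from durations recorded as H:MM strings, here is a minimal Python sketch. The function names and the sample values are hypothetical and are not taken from the project's dataset.

```python
import statistics

def to_minutes(hmm: str) -> int:
    """Convert an 'H:MM' string to a whole number of minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)

def to_hmm(minutes: float) -> str:
    """Format a number of minutes as 'H:MM', rounded to the nearest minute."""
    total = round(minutes)
    return f"{total // 60}:{total % 60:02d}"

def summarize(durations):
    """Return mean, median and sample standard deviation as 'H:MM' strings."""
    mins = [to_minutes(d) for d in durations]
    return {
        "mean": to_hmm(statistics.mean(mins)),
        "median": to_hmm(statistics.median(mins)),
        "stdev": to_hmm(statistics.stdev(mins)),
    }

# Hypothetical sample, for illustration only.
print(summarize(["1:00", "1:30", "2:00"]))
```

A mean noticeably above the median, as in the figures reported above, is a common sign of a right-skewed distribution.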

Bivariate Analysis

We performed bivariate analysis on each independent variable to find out their individual effect on appointment duration and obtained the following main findings:

Appointment Clinic

The median duration differs from clinic to clinic. Clinics with longer appointment durations include SNEC (Singapore National Eye Centre), NUH (National University Hospital), and NCC (National Cancer Centre). The shorter durations mainly come from neighborhood polyclinics, such as Geylang Polyclinic, BMP (Bukit Merah Polyclinic), and CWC (Community Wellness Centre).

Appt clinic.png

Appointment Purpose

From the distribution analysis, we can see that Ophthalmologist appointments take the longest, while Tests appointments take the shortest.

Newmed.png

Through our median test of significance, we observed that blood test (non-fasting) is significantly shorter than blood test (fasting).

BT.jpg
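
The page does not say which median test was used. Assuming it was something like Mood's median test for two groups, the idea can be sketched in Python as follows. The function name is hypothetical, no continuity correction is applied, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math
import statistics

def median_test_two_groups(a, b):
    """Mood's median test for two samples (sketch, no continuity correction).

    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    grand_median = statistics.median(list(a) + list(b))
    above_a = sum(x > grand_median for x in a)
    above_b = sum(x > grand_median for x in b)
    # 2x2 table: rows = above / not above the grand median, columns = groups.
    table = [[above_a, above_b], [len(a) - above_a, len(b) - above_b]]
    row_totals = [sum(row) for row in table]
    col_totals = [len(a), len(b)]
    n = len(a) + len(b)
    if 0 in row_totals or 0 in col_totals:
        return 0.0, 1.0  # the test is uninformative in this degenerate case
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical durations in minutes, for illustration only.
non_fasting = [30, 35, 40, 45, 50]
fasting = [60, 65, 70, 75, 80]
print(median_test_two_groups(non_fasting, fasting))
```

A small p-value indicates that the two groups are unlikely to share the same median.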

Escort

Comparing appointment durations across next-of-kin (NOK) accompaniment, family accompaniment (more than one NOK), and medical escort accompaniment, we see a difference in median appointment durations. We conducted non-parametric pairwise significance tests and conclude that the median appointment duration is lower with a medical escort than with NOK or family accompaniment.

Escort.PNG

TestEscort.PNG
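
The exact test behind the escort comparison is not named on this page. Assuming the groups were compared two at a time with a rank-based test such as Mann-Whitney U, a minimal Python sketch looks like this. The function names and sample values are hypothetical, and the p-value uses a normal approximation without tie correction, so it is only indicative for small samples.

```python
import math
from itertools import combinations

def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    # U counts the pairs in which x exceeds y; ties count as one half.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value

def pairwise_tests(groups):
    """Run the test on every pair of groups in a {name: values} mapping."""
    return {
        (a, b): mann_whitney_u(groups[a], groups[b])
        for a, b in combinations(groups, 2)
    }

# Hypothetical durations in minutes, for illustration only.
sample = {
    "Medical escort": [55, 60, 62, 70],
    "NOK": [70, 75, 85, 90],
    "Family": [80, 88, 95, 100],
}
for pair, (u, p) in pairwise_tests(sample).items():
    print(pair, u, round(p, 3))
```

With three groups this produces three comparisons, so the significance threshold may need adjusting for multiple testing.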

Multivariate Regression

Stepwise Regression

As we have many independent variables and sub-variables, we performed a stepwise multiple linear regression to select the relevant variables. The results also reveal how the partition splits are done.

Boxcox.png

Based on the Box-Cox transformation, the y variable (appointment duration) was transformed using the suggested lambda value of 0.263. It was then transformed again using a suggested lambda of 0.986, after which the suggested lambda value was 1.0, which signifies that no further transformation is needed. The Residual by Predicted Plot is shown below.

Residualbef.png
Residualaft.png


Residuals before transformation (left) and after transformation (right)
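
For reference, the textbook Box-Cox transformation can be sketched in Python as below. This is the standard form, and statistical packages may apply a rescaled variant, so the values will not necessarily match the project's output. The function name and sample values are hypothetical, and 0.263 is the first suggested lambda quoted above.

```python
import math

def box_cox(values, lam):
    """Apply the textbook Box-Cox transform to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The transform reduces to the natural log when lambda is zero.
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

# Hypothetical durations in minutes, for illustration only.
durations = [45.0, 72.0, 82.0, 150.0]
print(box_cox(durations, 0.263))
```

A lambda of 1.0 only shifts the data by a constant, which is why a suggested lambda of 1.0 signals that no further transformation is needed.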

Durbin-Watson Autocorrelation Test

Autotest.PNG

Based on the Durbin-Watson test, there is no significant evidence that the model's residuals are autocorrelated.
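
The Durbin-Watson statistic itself is simple to compute from the ordered residuals. A minimal Python sketch follows, with a hypothetical function name and made-up residuals. Values near 2 indicate no first-order autocorrelation, values towards 0 indicate positive autocorrelation, and values towards 4 indicate negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator

# Hypothetical residuals, for illustration only.
print(durbin_watson([0.5, -0.3, 0.2, -0.4, 0.1]))
```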

Model Fit

Modelfit.PNG

The Summary of Fit shows an R-square of 0.318, meaning that the model explains 31.8% of the variation in the response (appointment duration). The adjusted R-square is 0.311, only slightly lower than the R-square, which gives little evidence that the model is overfitted.
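
The relationship between R-square and adjusted R-square can be sketched in Python as follows. The function name is hypothetical, and the sample size and number of parameters in the usage line are placeholders, since the page does not report them.

```python
def adjusted_r_square(r_square, n_obs, n_params):
    """Adjusted R-square for a model with n_params predictors (excluding
    the intercept) fitted to n_obs observations."""
    dof = n_obs - n_params - 1
    if dof <= 0:
        raise ValueError("not enough observations for this many parameters")
    return 1 - (1 - r_square) * (n_obs - 1) / dof

# Placeholder n_obs and n_params, for illustration only.
print(adjusted_r_square(0.318, n_obs=1000, n_params=10))
```

The penalty grows with the number of parameters relative to the sample size, so a small gap between the two values means the extra terms are not inflating the fit much.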

Lackoffit.PNG

The Lack of Fit analysis is consistent with this. Its p-value of 0.437 is high, so there is no significant evidence of lack of fit for the variables in the model. The analysis also reports a maximum R-square of 0.526, the most that a model using only these variables could achieve, which is well above the current 0.318. This usually means that combinations of variables need to be explored.

Predplot.PNG

Because the model has many categorical variables, the points are widely spread vertically. Although the predictions are not very reliable, the plot shows no obvious departure from a linear regression model.

Recursive Partitioning

Recursive partitioning was performed on the independent variables purpose and clinic, as each has many sub-variables.

Decisiontree.PNG

Based on the above results, which show the number of splits required, the variables were recoded into ordinal variables based on their mean durations. Other independent variables can also be coded into ordinal variables based on their relationship with the response variable. Reducing the number of factor levels allows combinations of x variables to be explored to produce a better predictive model.
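
One simple way to picture the ordinal recoding is to rank each category by its mean duration. The Python sketch below does only that, which is a simplification: it gives every category its own level instead of grouping categories according to the partition splits. The function name and sample records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def ordinal_codes(records):
    """Map each category to a rank, where 1 is the shortest mean duration.

    records is an iterable of (category, duration_in_minutes) pairs.
    """
    by_category = defaultdict(list)
    for category, minutes in records:
        by_category[category].append(minutes)
    ranked = sorted(by_category, key=lambda c: mean(by_category[c]))
    return {category: rank for rank, category in enumerate(ranked, start=1)}

# Hypothetical (clinic, minutes) records, for illustration only.
sample = [("Clinic A", 60), ("Clinic B", 120), ("Clinic A", 80), ("Clinic C", 30)]
print(ordinal_codes(sample))
```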

Interaction Analysis

A few variables were combined into interaction terms and added one at a time to test whether the adjusted R-square increases.


Varimp1.PNG

Based on the variable importance analysis, Ordinal Clinic*Ordinal Purpose accounts for 96.6% of the relationship explained by the model. In practice, the user can simply look at the LS Means Plot of this variable to get a quick gauge of the appointment duration.

However, the LS Means plot shows that some levels have a rather high standard error. Predictions will be less accurate for levels 16, 25, 32 and 40 and more accurate for the other levels.

Varimp2.png

The LS Means Plot of Ordinal Escort*Ordinal Walkability shows that level 3 has the greatest impact on duration. Level 3 can only be reached when a family escort accompanies a client in a wheelchair. This suggests that family escorts tend to take longer to move with a wheelchair-bound family member.

Varimp3.png