Difference between revisions of "ANLY482 AY2017-18T2 Group19 Methodology"

 
Latest revision as of 21:34, 15 April 2018


DATA COLLECTION

SMU libraries provided us with datasets extracted from their system. Figure 1 shows the fields provided for each dataset.

[Figure 1: Fields provided for each dataset (G19 Datasets.png)]

As shown in the figure above, the transaction records cover 2 time periods: 12 months of data from 2016 and 12 months of data from 2017. In the 2016 dataset, the loan policies are 2 hours and 3 days, while in the 2017 dataset they are 3 hours and 3 days. The transaction data amounts to 48,832 records in total, while the master data has 528 records.

Informal primary research was also conducted. It identified 2 distinct library user profiles. Undergraduate students who find the loan policy insufficient respond in one of the following 2 ways:

  1. They keep the book past its due time and return it only when they are done with it. The loan period is considered insufficient in this case because the users cannot finish using the book within the loan period.
  2. They borrow in succession, borrowing the same title from the course reserves collection again immediately after returning it. The loan period is considered insufficient in this case because the users cannot finish using the book within a single loan.

This observation will be taken into account when cleaning and preparing the data for analysis.
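
As a rough illustration of how the two profiles could be flagged in transaction data, the sketch below marks loans returned after their due time and loans that start shortly after the same user returned the same title. The record layout, the sample rows, and the 30-minute gap used to define "immediately" are all assumptions made for illustration; they are not taken from the library's dataset.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

def parse(ts):
    return datetime.strptime(ts, FMT)

# Hypothetical records: (user_id, title_id, checkout, due, returned).
loans = [
    ("u1", "t1", parse("2017-03-01 09:00"), parse("2017-03-01 12:00"), parse("2017-03-01 13:30")),
    ("u2", "t2", parse("2017-03-01 09:00"), parse("2017-03-01 12:00"), parse("2017-03-01 11:50")),
    ("u2", "t2", parse("2017-03-01 12:00"), parse("2017-03-01 15:00"), parse("2017-03-01 14:00")),
    ("u3", "t1", parse("2017-03-02 10:00"), parse("2017-03-02 13:00"), parse("2017-03-02 12:00")),
]

def is_overdue(loan):
    """A loan is overdue when it is returned after its due time."""
    _, _, _, due, returned = loan
    return returned > due

def successive_loans(records, max_gap=timedelta(minutes=30)):
    """Return loans that begin within max_gap after the same user
    returned the same title (max_gap is an assumed threshold)."""
    flagged = []
    last_return = {}
    for loan in sorted(records, key=lambda rec: rec[2]):
        user, title, checkout, _, returned = loan
        previous = last_return.get((user, title))
        if previous is not None and timedelta(0) <= checkout - previous <= max_gap:
            flagged.append(loan)
        last_return[(user, title)] = returned
    return flagged

overdue = [loan for loan in loans if is_overdue(loan)]
successive = successive_loans(loans)
```

The gap threshold controls how strictly "immediately after returning" is interpreted, so it would need to be agreed with the library before being applied to real data.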


DATA CLEANING AND PREPARATION

Data cleaning and preparation involves:

  1. Removal of missing data values and outliers
  2. Standardization of duplicates
  3. Redefinition of scope to targeted groups
  4. Addition of calculated variables
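
The four steps above can be sketched as a single pass over the records. This is only an illustration under stated assumptions: the field names, the outlier cut-off, the reading of "duplicates" as variant spellings of the same value, and the 3-hour loan period used for the calculated variable are all hypothetical.

```python
# Hypothetical raw rows; keys and values are illustrative only.
raw = [
    {"user_group": "Undergraduate", "title": "Intro to Stats ", "hours": 2.5},
    {"user_group": "undergraduate", "title": "intro to stats", "hours": 4.5},
    {"user_group": "Undergraduate", "title": "Finance 101", "hours": None},    # missing value
    {"user_group": "Staff", "title": "Finance 101", "hours": 1.0},            # outside target group
    {"user_group": "Undergraduate", "title": "Finance 101", "hours": 900.0},  # implausible duration
]

LOAN_HOURS = 3.0     # assumed loan period
MAX_HOURS = 24 * 14  # assumed cut-off for treating a duration as an outlier

def clean(rows):
    cleaned_rows = []
    for row in rows:
        # Step 1: remove missing values and outliers.
        if row["hours"] is None or row["hours"] > MAX_HOURS:
            continue
        # Step 2: standardize variant spellings so duplicates match.
        group = row["user_group"].strip().lower()
        title = row["title"].strip().lower()
        # Step 3: restrict the scope to the targeted group.
        if group != "undergraduate":
            continue
        # Step 4: add a calculated variable (hours past the loan period).
        overdue_hours = max(0.0, row["hours"] - LOAN_HOURS)
        cleaned_rows.append({
            "user_group": group,
            "title": title,
            "hours": row["hours"],
            "overdue_hours": overdue_hours,
        })
    return cleaned_rows

cleaned = clean(raw)
```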


ANALYSIS AND TESTING

The 2 user profiles (i.e. users who return books overdue and users who borrow in succession) will be analysed separately. Whether the loan policy is sufficient for these 2 groups will be assessed through measures of sufficiency. For users who return books overdue, sufficiency across 2016 and 2017 will be measured through the frequency of overdue transactions and the distribution of the overdue period. For users who borrow in succession, sufficiency across 2016 and 2017 will be measured through the frequency of successive borrowing and the distribution of hours borrowed in succession.

To confirm whether the difference in frequencies observed between 2016 and 2017 is statistically significant, contingency analyses will be performed, because the data are nominal. Fisher's exact test will be conducted when appropriate.
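
For a 2x2 contingency table, Fisher's exact test can be computed directly from the hypergeometric distribution. The sketch below is a minimal standard-library version, not the team's actual tooling; the function name and the example counts (overdue versus on-time loans in each year) are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table that is no more likely
    # than the observed one (small tolerance for float comparison).
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )

# Hypothetical counts: rows are years, columns are overdue / on time.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

A small p-value would suggest that the proportion of overdue loans differs between the two years. For tables with large counts, a chi-square test on the same table is the more common choice.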

To confirm whether the difference in the distributions across the years is statistically significant, a test of means or medians will be conducted. The choice depends on whether the continuous data follow a normal distribution, which a goodness-of-fit test will check. If the data follow a normal distribution, the Tukey-Kramer test will be used. Otherwise, the Wilcoxon Signed Rank test will be used.
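
The non-parametric branch can be illustrated with a short standard-library sketch. It assumes the 2016 and 2017 loans are independent samples, in which case the rank-sum (Mann-Whitney U) form of the Wilcoxon test applies rather than the paired signed-rank form. It uses a normal approximation with a tie correction and no continuity correction. The function name and the sample values are hypothetical, and the goodness-of-fit step is not shown.

```python
from statistics import NormalDist

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.
    Returns (U statistic for x, z score, p-value)."""
    combined = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    ranks = [0.0] * len(combined)
    tie_term = 0
    i = 0
    while i < len(combined):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        run = j - i + 1
        tie_term += run ** 3 - run
        i = j + 1

    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Variance of U with the standard correction for ties.
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / var_u ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, z, p

# Hypothetical overdue periods (hours) for each year.
overdue_2016 = [1.0, 2.0, 3.0, 4.0, 5.0]
overdue_2017 = [6.0, 7.0, 8.0, 9.0, 10.0]
u_stat, z_score, p_value = rank_sum_test(overdue_2016, overdue_2017)
```

The normal approximation is reasonable for samples of the size described here; for very small samples an exact version of the test would be preferable.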