Difference between revisions of "ANLY482 AY2017-18T2 Group08 : Project Findings / Final"

 
| style="padding: 0.25em; font-size: 90%; border-top: 1px solid #cccccc; border-left: 1px solid #cccccc; border-right: 1px solid #cccccc; border-bottom: 1px solid #cccccc; text-align:center; background-color: #404040; width:50%" | [[ANLY482 AY2017-18T2 Group08 : Findings / Final | <font color="#ffffff">'''Final'''</font>]]
 
 
|}
 
|}
 
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >0.0 Abstract</font></div>==
 
Dockless bike-sharing is an increasingly common phenomenon in today’s transportation industry. Not only does it provide a cost-efficient and convenient mode of transportation in urban cities, it also helps to ease the carbon footprint by reducing reliance on traditional modes of transport such as buses, trains and cars. Unfortunately, this business model has hit a major snag – parking. Since the introduction of bike-sharing, illegal parking has been on the rise in many countries such as China, Japan and Singapore. Despite the growing prevalence of illegal bike-parking, existing research on the bike-sharing industry focuses mainly on examining business profitability and understanding bicycle route data. To fill this research gap, a practice research study has been conducted to demonstrate the use of L-function, bw.diggle and Kernel Density Estimation in analysing spatial point patterns of illegal bike-parking in the real world. 
 
 
To begin, an overview of the bike-sharing industry and the research motivations will be shared. Next, a review of the relevant literature on the L-function and Kernel Density Estimation will be presented. Following this, the application of these tools to a case study with a bike-sharing company in Singapore will be illustrated and, finally, relevant insights will be documented and explained. The case study focuses on two regions in Singapore that have a high rate of illegal parking cases, namely Bedok and Jurong-West. It was observed that indiscriminate bike-parking shows signs of significant clustering in these regions, with “hotspots” concentrated specifically at landmarks such as HDBs and MRT stations. In addition, upon further analysis, it was noticed that areas with yellow boxes (i.e. designated parking areas) generally have a lower intensity of illegal bike-parking. Further, time period was observed to have an effect on the intensity of clusters at various landmarks across these two regions.
 
  
 
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >1.0 Introduction and Project Background</font></div>==
 
 
'''<big><font color="#fcb706">1.1 Motivations and Objectives</font></big>'''<br>

The majority of existing research surrounding the bike-sharing movement consists of studies conducted with two goals in mind: <br>
1. Understanding business profitability and sustainability concerns <br>
2. Gathering insights on bicycle routes taken by individuals to offer guidance to urban planners, policy makers and transportation practitioners
  
  
Some of the literature also highlights certain inherent limitations, one of which is that KDE is unable to show the distance at which spatial patterns become significant. The paper, “Identification of hazardous road locations of traffic accidents by means of KDE and cluster significance evaluation”, explored the use of KDE in determining areas with a high potential of road traffic accidents. More importantly, it also introduced the ‘Monte-Carlo Simulation’, a statistical technique that uses repeated random simulations to determine the properties of events and their significance level. Combining both techniques allowed the researchers to identify the clusters of traffic accidents that are statistically significant. Thus, to ensure the accuracy of our study, the L-function and ‘bw.diggle’, a function in the ‘spatstat’ package for R, will be introduced to determine an appropriate kernel size for the KDE analysis. In addition, the Monte-Carlo Simulation will be adopted to ensure that the kernel size is statistically significant. More will be discussed in the next section.
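
To make the Monte-Carlo idea concrete, the following is a minimal Python sketch rather than the R ‘spatstat’ workflow used in the study. It swaps the L-function for a simpler statistic, the mean nearest-neighbour distance, and compares the observed value against repeated simulations of complete spatial randomness. The function names, the unit-square study window, the 99 simulations and the sample points are all assumptions made for illustration.

```python
import math
import random


def mean_nn_distance(points):
    """Mean distance from each point to its nearest neighbour."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        nearest = min(
            math.hypot(x1 - x2, y1 - y2)
            for j, (x2, y2) in enumerate(points)
            if j != i
        )
        total += nearest
    return total / len(points)


def monte_carlo_cluster_test(points, n_sim=99, seed=0):
    """One-sided Monte Carlo test for clustering in the unit square.

    Returns a pseudo p-value: the share of patterns (simulated plus observed)
    whose mean nearest-neighbour distance is at most the observed one.
    """
    rng = random.Random(seed)
    observed = mean_nn_distance(points)
    as_small = 0
    for _ in range(n_sim):
        # Simulate complete spatial randomness with the same number of points.
        simulated = [(rng.random(), rng.random()) for _ in points]
        if mean_nn_distance(simulated) <= observed:
            as_small += 1
    return (as_small + 1) / (n_sim + 1)


# Hypothetical clustered pattern: 40 points packed around (0.5, 0.5).
_rng = random.Random(1)
clustered = [
    (0.5 + _rng.uniform(-0.02, 0.02), 0.5 + _rng.uniform(-0.02, 0.02))
    for _ in range(40)
]
p_value = monte_carlo_cluster_test(clustered)
print(p_value)
```

A small pseudo p-value indicates that the observed points sit closer together than would be expected under random placement, which is the same logic the simulation envelope applies to the L-function.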
  
<INSERT TABLE HERE>
 
 
'''<big><font color="#fcb706">2.2 Summary Statistics for '1.0 Original Data'</font></big>'''
 
[[File:Group08 obike company Summary Statistics 1.png|300 px|centre|]]
 
<div align="center">Figure 2: Summary Statistics for Original Data (JMP PRO)</div>
 
 
As shown in Figure 2 above, although there is a total of 3,014 rows present, a significant number of rows had missing fields. In particular, four columns had a strikingly high number of blanks – ‘Completed Time’, ‘Duration’, ‘# of Bikes’ and ‘Arrange To’. Of these, the ‘# of Bikes’ column had the highest percentage of blanks, i.e. 99.6%. The reasons for and consequences of such blanks will be further discussed in section 3.0 below.
 
  
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >3.0 Data Quality Issues & Consequences</font></div>==
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >3.0 Spatial Point Pattern Analysis Methods</font></div>==
  
This section will elaborate in detail the data quality issues faced by the team, as well as the consequences and limitations resulting from these issues. In sum, the data quality issues faced by the team can be broken down into five broad categories – Addresses, No. of Bikes, Authority, Status and General/Miscellaneous.
'''<big><font color="#fcb706">3.1 Kernel Density Estimation </font></big>'''<br>
  
<b>3.1.1 Origin of Kernel Density Estimation</b> <br>
  
'''<big><font color="#fcb706">3.1 Original Address / Location</font></big>'''
<b>3.1.2 The Kernel Density Estimation Function</b> <br>
  
Although the ‘Original Address’ column had zero blanks, it was plagued with numerous data quality issues which severely hindered our data cleaning process.  
<b>3.1.3 Hotspot Mapping Using Kernel Density Estimation Function</b> <br>
  
'''<big><font color="#fcb706">3.2 Ripley's K Function</font></big>'''
  
<b>3.2.1 Interpretation of Ripley's K Function</b> <br>

<b>3.1.1 Vague Descriptions </b> <br>
Many entries contained extremely vague descriptions of locations in Singapore. Such vague locations render it impossible to plot an accurate location of the illegal parking cases on a map. Vague descriptions inevitably mean that Google API will return pre-pinned longitude and latitude coordinates for such cases. In other words, Google API returns the exact same geographical coordinates even if the actual locations of the bicycles were different. Consequently, when translating these coordinates onto a map, these points will overlap/combine into one distinct point, which is visually misleading to the user.
 
  
 
[[File:Group08 ABC bike-sharing company Google Map.png|500 px|centre|]]
 
<div align="center">Figure 3: Visual Representation of East Coast Park Singapore</div>
  
'''<big><font color="#fcb706">3.3 L-Function: Derivative of Ripley's K Function </font></big>'''<br>
  
For instance, ‘East Coast Park’ is a beach park stretching from Marina East to Bedok planning areas in Singapore and covers 185 hectares of land. Therefore, tickets could have been issued anywhere in this area between Marina East Drive and Water Venture Coast (see Figure 3 above for visual representation). However, feeding the location ‘East Coast Park Singapore’ to Google API will only generate longitude and latitude coordinates of 103.9121866 and 1.3007842 respectively. This means that vague descriptions have an inherently high margin of error. Therefore, vague descriptions restrict our ability to derive meaningful insights from analysis. Other vague descriptions include but are not limited to ‘Bedok Reservoir’, ‘Coney Island’, ‘Bishan-Ang Mo Kio Park’, and ‘Whole stretch of Admiralty Street’.
'''<big><font color="#fcb706">3.4 'bw.diggle' Function, An Alternative Method To Approximate A Kernel Bandwidth </font></big>'''<br>
Further, from an operational perspective, the more specific the location, the easier and faster it will be for ABC bike-sharing company to track down the bicycles for redeployment.
 
 
 
 
 
 
 
 
<b>3.1.2 Overly Specific Descriptions </b> <br>
 
 
 
In stark contrast to the above, there are also descriptions of locations which are overly specific. Overly specific descriptions hinder the ability of Google API to return an accurate geographical coordinate. For example, although descriptions such as ‘Left at rubbish chute 4 and behind washing bay of Blk 227 Serangoon Ave 4’ may be useful in helping ABC bike-sharing company find illegally parked bicycles, feeding such lengthy descriptions to Google API interferes with its ability to pick up keywords, so it may return zero results. In other cases, Google API may pick up the wrong keywords and therefore return inaccurate geographical coordinates.
 
 
 
 
 
<b>3.1.3 Multiple Locations </b> <br>
 
 
 
Some of the ‘location’ fields contain not one, but multiple addresses. In most cases, the addresses are in proximity to one another. For instance, although there are three distinct locations in the description ‘Blk 713, 714 and 716 Pasir Ris Drive 3’, these locations are adjacent to one another. In other instances such as ‘Swettenham Road / Ridout Road / Peel Road / Peirce Road’, the locations are situated on different roads that are further apart.
 
Most times, if Google API receives more than one location, it will simply return the geographical coordinates of the first. In the case of the former, where multiple locations are adjacent to one another, this does not pose much of a problem as the longitude and latitude coordinates are still somewhat accurate. Put simply, Blk 713 Pasir Ris Drive 3’s geographical coordinates are similar to those of Blk 714 and 716 due to the close proximity of the locations. However, for the latter case, if Google API were to return only the coordinates of the first location, i.e. Swettenham Road, these coordinates may differ significantly from those of the remaining locations (e.g. Peirce Road), thereby returning unrepresentative results.
 
 
 
 
 
<b>3.1.4 Junctions </b> <br>
 
 
 
Some of the addresses provided refer to junctions between roads, with no particular landmark given. Such locations are recorded in the format ‘Junction between Road X and Road Y’. For a handful of these entries, Google API is able to recognise the term ‘junction’ and therefore returns the coordinates of the junction between the roads. However, its ability to do so is largely inconsistent. Most times, Google API picks up only the first road name, and returns the pre-pinned coordinates for that particular road. Thus, the outputs of Google API will contain erroneous results if fed such entries without first cleaning them.
 
 
 
 
 
<b>3.1.5 Spelling Errors & Acronyms </b> <br>
 
 
 
It was also apparent that some of the entries contained spelling errors due to human error. For instance, ‘Bedok Reservoir’ was misspelled as ‘Bedok Resevoir’ and ‘Buona Vista’ was spelt as ‘Buana Vista’. Although Google API was still able to detect the locations for some of these descriptions, the output was usually inconsistent, thereby raising the occurrences of inaccurate outputs.

Additionally, several entries contained acronyms such as ‘ECP’ and ‘CCK’, which stand for East Coast Park and Choa Chu Kang respectively. Although these terms are commonly used by Singaporeans to identify locations, Google API is unable to recognise the majority of such terms and therefore will not return any geographical coordinates. In addition, less commonly used acronyms were also used, such as ‘BNA’, ‘MSCP’ and ‘PRP’, which stand for Bedok North Avenue, Multi-storey Carpark and Pasir Ris Park respectively. For such terms, data cleaning was absolutely necessary in order to retrieve the latitude and longitude coordinates.
 
 
 
 
 
'''<big><font color="#fcb706">3.2 Number of Bikes</font></big>'''
 
 
 
As seen in Figure 2, the ‘# of bikes’ column had the highest absolute number and percentage of blanks. Upon closer look, it was observed that for some entries, the number of bikes was given, but was recorded in the ‘location’ column together with the addresses instead. Nevertheless, for the majority of the entries, information regarding the number of bicycles was indeed missing. The rationale behind this is that authorities issuing the fines almost never notify bicycle-sharing companies such as ABC bike-sharing company about the number of bicycles present in a location. Accordingly, on ABC bike-sharing company’s end, they are unable to record information which they do not possess. Authorities claim there are frequently too many bicycles in a given location from all three operators – Ofo, Mobike and ABC bike-sharing company – so it becomes too time-consuming to manually count the number of bicycles belonging to each company.
 
 
 
Admittedly, however, this information is crucial from a business standpoint. As the bicycles are currently fined on a per bike basis, the number of bicycles determines the monetary value of the fines and therefore, would govern the operations pertaining to ABC bike-sharing company’s bicycle redeployment strategy. On our part too, it is impossible to accurately predict ‘hotspots’ in terms of the dollar amount of fines ABC bike-sharing company is receiving. This lack of information is considered a severe limitation of this project.
 
For instance, theoretically, if ABC bike-sharing company receives two fines simultaneously – Clementi (30 bikes) and Ang Mo Kio (1 bike), it makes more business sense for them to assign their limited resources to address the ticket containing more bicycles first i.e. Clementi, as the value of fines would be $15,000 and $500 respectively. Without knowing the number of bikes in each ticket, ABC bike-sharing company would be lacking key information. In other words, they are currently assigning their limited resources blindly.
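
The prioritisation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ticket records and field names are hypothetical, and the flat $500 per bike rate is the one implied by the example (30 bikes giving $15,000).

```python
# Assumed flat rate in dollars per bike, implied by the example above.
FINE_PER_BIKE = 500

# Hypothetical tickets received at the same time.
tickets = [
    {"location": "Ang Mo Kio", "bikes": 1},
    {"location": "Clementi", "bikes": 30},
]


def potential_fine(ticket, fine_per_bike=FINE_PER_BIKE):
    """Potential fine for one ticket, charged on a per-bike basis."""
    return ticket["bikes"] * fine_per_bike


# Attend to the ticket with the largest potential fine first.
prioritised = sorted(tickets, key=potential_fine, reverse=True)
for ticket in prioritised:
    print(ticket["location"], potential_fine(ticket))
```

Without the ‘# of bikes’ field, the sort key cannot be computed, which is why the missing values limit any ranking of tickets by dollar value.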
 
 
 
 
 
'''<big><font color="#fcb706">3.3 Authority</font></big>'''
 
 
 
Although this column was mostly filled in, three entries remained blank. One of them had a high number of bicycles (80) in the ticket issued. Interestingly, it was also noted that apart from ‘LTA’, ‘NParks’ and ‘TC’, there was also another category labelled ‘Others’ that contained only one entry. As the value of fines varies depending on the authority that issues them, e.g. LTA fines $500 per bike while NParks does not yet issue fines, the authority field is also important as it determines the urgency of responding to fines. The higher the potential amount of fines, the more urgent the ticket becomes.
 
 
 
 
 
'''<big><font color="#fcb706">3.4 Status</font></big>'''
 
 
 
Status typically falls in one of the following categories – ‘ignore’, ‘arranging’ and ‘completed’. Despite the low percentage of blanks for this column, it is observed that 995 out of 3039 entries (32.74%) are classified as ‘arranging’. This suggests that ABC bike-sharing company does not consistently update the status of the ticket response. Accordingly, it is difficult to draw insights on the efficiency of their third party contractors.
 
 
 
 
 
'''<big><font color="#fcb706">3.5 Codes</font></big>'''
 
 
 
Codes are meant to enable ABC bike-sharing company to uniquely identify each ticket received. However, in this dataset the codes are not unique: as shown in Figure 2 above, there are 3013 instances of codes, but only 2320 categories. A unique identifier helps to trace the progress and response of each ticket. For LTA tickets, the codes do represent a unique identifier. However, for tickets issued by NParks and Town Councils, the codes are not unique. For one, codes for tickets by TC only represent an area which is governed by a particular town council, e.g. ‘Clementi’, ‘Woodland’, ‘AMKTC’. These codes have multiple entries under them, making it difficult to track individual ticket cases due to the lack of a unique identifier.
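
A check of this kind can be sketched as follows. The snippet is a minimal illustration, not the tool used by the team (the summary statistics were produced in JMP Pro); the function name and the sample codes, including the ‘LTA-0001’ format, are hypothetical.

```python
from collections import Counter


def non_unique_codes(codes):
    """Return each code that appears more than once, with its count."""
    counts = Counter(code.strip() for code in codes if code and code.strip())
    return {code: n for code, n in counts.items() if n > 1}


# Hypothetical sample: LTA-style codes are unique, town council codes repeat.
sample_codes = [
    "LTA-0001", "Clementi", "AMKTC", "Clementi",
    "LTA-0002", "AMKTC", "Clementi",
]
print(non_unique_codes(sample_codes))
```

Any code returned by such a check covers several tickets and therefore cannot be used on its own to trace an individual case.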
 
 
 
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >4.0 Data Cleaning & Preparation</font></div>==
 
 
 
In light of the aforementioned data quality issues, data cleaning had to be undertaken to minimise the negative consequences that may arise. This section elaborates further on the steps taken during the data cleaning and preparation process.
 
 
 
 
 
'''<big><font color="#fcb706">4.1 Manual Cleaning of Addresses</font></big>'''
 
 
 
Considering that the location field posed numerous data quality issues, coupled with the fact that Google API requires the location field to be in a relatively neat and standardised format, several measures had to be taken to tidy up the locations.
 
 
 
 
 
<b>4.1.1 Adding Singapore </b> <br>
 
 
 
Some road names and landmarks in Singapore have similar names to other locations in other countries, e.g. Arthur Road can not only be found in Singapore but also in the UK and India. That said, it is important to add the term ‘Singapore’ behind each location to ensure that the output received relates to a distinct place in Singapore, and not elsewhere.
 
 
 
 
 
<b>4.1.2 Correcting Acronyms and Spelling Errors </b> <br>
 
 
 
For descriptions containing acronyms and spelling errors, we had to correct them accordingly. For instance, ‘Blk 628 HG Ave 8’ would have to be changed to ‘Blk 628 Hougang Ave 8’. Using Excel’s find-and-replace function, this could be done for multiple entries at once, thereby speeding up the process.
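
The same find-and-replace step could also be scripted. The sketch below is an assumption-laden Python equivalent and not the Excel procedure actually used: the lookup table is built only from the acronyms and misspellings mentioned in this report, matching is case-sensitive and limited to whole words, and the function name is hypothetical.

```python
import re

# Corrections drawn from the examples in this report; extend as needed.
REPLACEMENTS = {
    "HG": "Hougang",
    "ECP": "East Coast Park",
    "CCK": "Choa Chu Kang",
    "BNA": "Bedok North Avenue",
    "MSCP": "Multi-storey Carpark",
    "PRP": "Pasir Ris Park",
    "Resevoir": "Reservoir",
    "Buana Vista": "Buona Vista",
}

# Longest keys first so multi-word entries are tried before shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True))
    + r")\b"
)


def clean_address(raw):
    """Expand known acronyms and fix known misspellings in one pass."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(1)], raw)


print(clean_address("Blk 628 HG Ave 8"))
```

Whole-word matching avoids rewriting longer tokens that merely contain an acronym, which is a risk with an unrestricted find-and-replace.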
 
 
 
 
 
<b>4.1.3 Splitting Multiple Locations </b> <br>
 
 
 
For entries with multiple addresses, it was necessary to split them into individual locations depending on how far apart these locations are. For instance, with reference to Figure 4 below, there are three locations captured in a single entry – Springwood Crescent, Springwood Avenue and Jalan Mat Jambol. However, as Springwood Crescent and Springwood Avenue are streets adjacent to each other, they are kept together as a single entry. On the other hand, Jalan Mat Jambol is further away and hence requires us to split the entry into two. Consequently, one entry with three distinct locations is split into two.
 
 
 
To ensure consistent splitting of entries, an arbitrary benchmark was set at three decimal places in the longitude or latitude coordinates. In other words, if the third decimal place of the longitude and latitude coordinates differ, then the entries are split up. A change of 0.001 degrees in latitude or longitude corresponds to roughly 110 m in Singapore, so locations that differ at this decimal place are treated as distinct. However, the data currently still requires further refinement as not all points are yet consistent with this benchmark.
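
The benchmark can be expressed as a short Python sketch. It is an illustration only: the function names and sample coordinates are hypothetical, it assumes positive coordinates (as in Singapore), and it assumes that a difference in either latitude or longitude is enough to trigger a split. The haversine distance is included only to show the scale involved.

```python
import math


def third_decimal(value):
    """Coordinate truncated to three decimal places, as an integer.

    Rounds to six decimals first to avoid floating-point artefacts.
    Assumes a positive coordinate.
    """
    return int(round(value * 1_000_000)) // 1000


def should_split(loc_a, loc_b):
    """True when latitude or longitude differs at the third decimal place."""
    (lat_a, lon_a), (lat_b, lon_b) = loc_a, loc_b
    return (third_decimal(lat_a) != third_decimal(lat_b)
            or third_decimal(lon_a) != third_decimal(lon_b))


def haversine_m(loc_a, loc_b):
    """Great-circle distance in metres between two (lat, lon) pairs."""
    (lat_a, lon_a), (lat_b, lon_b) = loc_a, loc_b
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = phi_b - phi_a
    d_lam = math.radians(lon_b - lon_a)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lam / 2) ** 2)
    return 2 * 6_371_000 * math.asin(math.sqrt(h))


# Hypothetical coordinates for three addresses in one entry.
near_a = (1.3257, 103.8497)
near_b = (1.3259, 103.8499)
far_c = (1.3290, 103.8497)

print(should_split(near_a, near_b), should_split(near_a, far_c))
print(round(haversine_m(near_a, far_c)))
```

Because the rule truncates rather than measures distance, two points on either side of a third-decimal boundary can be flagged even when they are close together, which may explain some of the remaining inconsistencies.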
 
 
 
 
 
<b>4.1.4 Removing Excessive Descriptions </b> <br>
 
 
 
Since excessive descriptions interfere with Google API’s ability to produce accurate outputs, they have to be manually removed from the location fields. For instance, ‘Bus stop 11531. Alexandra. Near IKEA region. Pls remove these asap and report back to me when it is done’ will simply be changed to ‘Bus stop 11531 Singapore’. Otherwise, Google API may pick up the keyword ‘IKEA’ and return the coordinates for ‘IKEA Tampines’ instead. Removing excessive descriptions helps to improve both the efficiency and accuracy of Google API.
 
In addition, it was noticed that in some entries, authorities did provide information regarding the number of bikes, but this information was recorded in the ‘location’ column instead – e.g. ‘2 ABC bike-sharing company infront 368 Thomson Condo along Jalan Raja Udang’. For such rows, we would delete the excess descriptions, and insert the number of bikes into the ‘# of bikes’ column instead. Put simply, we will manually fill in the ‘# of bikes’ in the event where information is provided.
 
 
 
 
 
<b>4.1.5 Pinning Junctions </b> <br>
 
 
 
[[File:Group08_ABC bike-sharing company_Pinning_Junctions.png|300 px|centre|]]
 
<div align="center">Figure 5: Manually Pinning Junctions</div>
 
 
 
 
 
To resolve the inherent problems involved with feeding junctions between two roads to Google API as discussed in section 3.1.4, we first had to filter out addresses containing the keyword ‘junction’. One such location is ‘Junction of Balestier Rd and Jln Kemaman’. Using the road names, we would then use Google Maps in a web browser to produce an image as shown in Figure 5 above.
 
 
 
 
 
From the image, it is then possible to visualise the junction between the roads. Following which, we will manually pin the location as shown in figure 5. Doing so will return the latitude and longitude coordinates; in this case they are 1.325717 and 103.849777 respectively.
 
Having these coordinates, we will then input them into the csv file accordingly. Although there are still flaws present with this method i.e. pinning in slightly to the left or right will produce different coordinates, it is nonetheless still more accurate than simply allowing Google API to generate the geographical coordinates. Hence, the accuracy is still improved.
 
 
 
 
 
<b>4.1.6 No Landmark Locations </b> <br>
 
 
 
Due to the nature of point-to-point transport, some bicycles are parked in areas with no specific landmarks, or are located between landmarks. For instance, ‘between Blk 126 and Blk 124 Simei Street’ implies that the bicycle was parked at neither Blk 126 nor Blk 124, but was instead found somewhere between these blocks. For such cases, we follow the method described above (section 4.1.5) to check if there is indeed no landmark present. If so, we will then simply take the first address recorded, i.e. Blk 126 Simei Street.
 
 
 
 
 
<b>4.1.7 Vague Descriptions </b> <br>
 
 
 
As mentioned, vague descriptions result in inaccurate and sometimes even misleading results. However, due to the lack of information and the nature of the data issue, it is impossible to improve the current descriptions unless authorities decide to be more specific with their descriptions. Ergo, there is nothing that can be done short of asking ABC bike-sharing company to request better quality descriptions. Such locations are simply fed to Google API to obtain the pre-pinned geographical coordinates.
 
 
 
  
'''<big><font color="#fcb706">4.2 Bus Stop Codes</font></big>'''
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >4.0 Case Study of Singapore and a Bike-sharing Firm</font></div>==
  
As bus stop codes represent unique identifiers for bus stops around Singapore, it is possible to get very specific longitude and latitude coordinates directly from LTA Data Mall. The rationale for using LTA’s Data Mall over Google API is that the former contains the most updated and relevant information in Singapore, and should therefore produce better quality outputs. An R script is then used to speed up the process of retrieving the geographical coordinates for all bus stop locations. Please refer to Appendix Section 8.1 for the R script used.
'''<big><font color="#fcb706">4.1 Dataset Description </font></big>'''<br>
  
<b>4.1.1 Geocoding Process</b> <br>
  
'''<big><font color="#fcb706">4.3 Geocoding with Google API</font></big>'''
<b>4.1.2 Classification of Locations Based on Certainty Level</b> <br>
  
For the remaining data points with no geographical coordinates obtained, an R script was run in RStudio to retrieve the longitude and latitude values. However, it was noticed that even after cleaning up the data, many of the entries still returned ‘NA’ as output. Moreover, it was noted that even specific addresses with postal codes could result in an ‘NA’ output (refer to Figure 6 for a screenshot of the raw output by Google API). This was in fact due to a problem inherent with Google API. Without a commercial license, Google API has a query limit that prevents it from retrieving the coordinates. Therefore, the R script was revised to include a loop such that if the output is ‘NA’, it will re-run the same entry three more times. This revision greatly improved the efficiency of the code. Refer to Appendix Section 8.2 and 8.3 for the original and revised geocode R scripts.
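
The retry logic described above is shown here as a minimal Python sketch; the actual implementation is the R script in the appendix, which is not reproduced here. The names `geocode_with_retry` and `make_flaky_geocoder` are hypothetical, the geocoder is passed in as a plain function, and a stub stands in for Google API so that the example runs without network access. `None` plays the role of the ‘NA’ output.

```python
import time


def geocode_with_retry(address, geocode, max_retries=3, pause_seconds=0.0):
    """Call geocode(address); on None, re-run up to max_retries more times."""
    result = geocode(address)
    attempts = 1
    while result is None and attempts <= max_retries:
        if pause_seconds:
            time.sleep(pause_seconds)  # optional back-off between queries
        result = geocode(address)
        attempts += 1
    return result


def make_flaky_geocoder(failures, coords):
    """Stub geocoder that returns None for the first `failures` calls."""
    state = {"calls": 0}

    def geocode(address):
        state["calls"] += 1
        return None if state["calls"] <= failures else coords

    return geocode, state


# The stub fails twice before returning (latitude, longitude).
stub, state = make_flaky_geocoder(2, (1.3007842, 103.9121866))
result = geocode_with_retry("East Coast Park Singapore", stub)
print(result, state["calls"])
```

Capping the number of retries keeps the loop from running indefinitely on addresses that genuinely cannot be resolved, as opposed to those rejected by the query limit.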
 
  
'''<big><font color="#fcb706">4.4 Reverse Geocoding</font></big>'''
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >5.0 Application of Geo-spatial Point Pattern Analytical Methods on A Case Study of Singapore</font></div>==
  
Lastly, as a checking step, an R script containing a reverse geocode was used. The reverse geocode works by taking the longitude and latitude coordinates generated by Google API to produce an address. Since each longitude and latitude is certain to have a corresponding address, the R script was written to loop until an address is returned. This address was then checked against the original location to ensure that it matches. This step is intended to reduce the inconsistencies present in Google API. Refer to Appendix Section 8.4 for the reverse geocode R script.
'''<big><font color="#fcb706">5.1 Narrowing of Study Area Using QGIS Choropleth Map</font></big>'''
  
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >5.0 Exploratory Data Analysis and Interim Findings</font></div>==
'''<big><font color="#fcb706">5.2 Determining Spatial Patterns Using Spatstat's Modified L-Function on R Studio</font></big>'''
  
'''<big><font color="#fcb706">5.1 Revised Summary Statistics </font></big>'''
<b>5.2.1 Plotting The Modified L-Function Graph on 'R Studio' </b> <br>
  
After data cleaning and preparation, summary statistics were once again run to get an overview of the cleaned data. This is shown in Figure 7 below.
'''<big><font color="#fcb706">5.3 Obtaining the Modified L-Function Plot and Kernel Radius</font></big>'''
  
<b>5.3.1 Optimal Kernel Density Radius Obtained Using Spatstat's 'bw.diggle' function</b> <br>
  
[[File:Group08_ABC bike-sharing company_Summary_Statistics_2.jpg|300 px|centre|]]
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >6.0 Findings and Analysis</font></div>==
<div align="center">Figure 7: Revised Summary Statistics for Cleaned Data (JMP Pro)</div>
 
  
'''<big><font color="#fcb706">6.1 Comparison of Bedok and Jurong West Using KDE on QGIS</font></big>'''
  
In comparison to the summary statistics performed on the original data set given by ABC bike-sharing company (see Figure 2), it should be noted that the cleaned data set now has 17 columns instead of the original 13 columns. Some column titles have been edited, e.g. ‘Location’ is now referred to as ‘Original Address’. Columns that have been omitted include ‘remarks’, while new columns inserted include ‘Original ID’, ‘New ID’, ‘Updated Address’, ‘Time Period’, ‘Longitude’ and ‘Latitude’.
<b>6.1.1 Bedok Subzone Heatmap Analysis</b> <br>
  
  
More specifically, however, there are additional things to note. Firstly, the number of entries has now increased to 3041 due to the splitting of rows with multiple addresses. As such, there are more data points now than before. Secondly, ‘Reported Time’, ‘Completed Time’ and ‘Due Time’ have been changed to time format in order to perform appropriate analysis. Thirdly, the number of filled entries in the ‘# of bikes’ column has increased from 12 to 266. This increase comes from the data cleaning process (section 4.1.4), whereby the number of bikes had been recorded in the ‘location’ column instead. It is observed that although the increase is significant, a large proportion of the fields are still blank. Therefore, this column is not used in the current analysis. Lastly, in the original data file, there were some dashes, ‘NA’s and ‘Nulls’. These were all standardised to blanks instead.
<b>6.1.2 Jurong-West Subzone Heatmap Analysis </b> <br>
  
<br><center> [[Image:ANLY482 AY2017-18 T2 Confidential Image.png|500px|center]]</center><br>
 
  
The remaining Exploratory Data Analysis findings will not be presented on this wiki page due to the confidential nature of the findings and data. Thank you for your kind understanding.
'''<big><font color="#fcb706">6.2 Evaluating Placement of Yellow-boxes</font></big>'''
  
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >6.0 Going Forward</font></div>==
'''<big><font color="#fcb706">6.3 Analysis of Illegal Bike-Parking Patterns in Bedok by Time Period</font></big>'''
  
Although a significant amount of work has been done to date, there remains a lot still to be done. This section will cover the revised scope of work and methodology going forward.
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >7.0 Conclusion | Key Takeaways and Considerations</font></div>==
  
'''<big><font color="#fcb706">6.1 New Data Sets</font></big>'''
+
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >8.0 References</font></div>==
  
It was revealed that authorities issue tickets via a few methods including Whatsapp and E-mails. Additionally, the ‘OneService’ mobile application allows residents to directly report illegal parking cases that occur in their neighbourhood; these fall under tickets issued by the Municipal Services Office, MSO for short.  
+
Anderson, T. (2009). Kernel density estimation and K-means clustering to profile road accident hotspots. Accident Analysis and Prevention, 41(3), 359-364.
  
 +
Bw.diggle function. (n.d.). Retrieved April 8, 2018, from https://www.rdocumentation.org/packages/spatstat/versions/1.55-0/topics/bw.diggle
  
<b>6.1.1 Additional Data Points </b> <br>
+
Bíl, Andrášik, & Janoška. (2013). Identification of hazardous road locations of traffic accidents by means of kernel density estimation and cluster significance evaluation. Accident Analysis and Prevention, 55, 265-273
  
ABC bike-sharing company records the tickets it receives via Whatsapp in one csv file, and those received via e-mail and MSO in another csv file. It came to our attention that, due to an oversight by ABC bike-sharing company, we had only received the former and not the latter. We have also requested more recent data for the month of January. As such, we have since received two additional csv files from ABC bike-sharing company, and are in the midst of cleaning them. Analysis derived from these additional entries will be presented during the final phase.
+
Chainey, S., Tompson, L., & Uhlig, S. (n.d.). The Utility of Hotspot Mapping for Predicting Spatial Patterns of Crime. Retrieved April 8, 2018, from https://www.e-education.psu.edu/geog884/sites/www.e-education.psu.edu.geog884/files/image/lesson2/Chainey et al. (2008).pdf
  
 +
China's ‘Uber for bikes’ model is going global. Retrieved from https://www.weforum.org/agenda/2017/06/china-leads-the-world-in-bike-sharing-and-now-its-uber-for-bikes-model-is-going-global/
  
 +
Chapter 11 Point Pattern Analysis / Github https://mgimond.github.io/Spatial/point-pattern-analysis.html
 +
Diggle, Peter. (1985). A Kernel Method for Smoothing Point Process Data. Applied Statistics. 34. 138-147. 10.2307/2347366.
  
<b>6.1.2 Yellow Boxes Coordinates </b> <br>
+
Dixon, Philip M., "Ripley’s K function" (2001). Statistics Preprints. 52. http://lib.dr.iastate.edu/stat_las_preprints/52
  
To help us going forward, ABC bike-sharing company has shared with us a file containing the coordinates of all yellow boxes in Singapore. However, as new yellow boxes are being painted each day, this file is subject to changes. Nevertheless, with the yellow boxes coordinates, it is possible to derive more in-depth analysis of greater scope. This will also enable us to answer objective 3 i.e. where should ABC bike-sharing company place yellow boxes around Singapore.  
+
Gesler, W. (1986). The uses of spatial analysis in medical geography: A review. Social Science & Medicine, 23(10), 963-973.
  
 +
Hashimoto, Yoshiki, Saeki, Mimura, Ando, & Nanba. (2016). Development and application of traffic accident density estimation models using kernel density estimation. Journal of Traffic and Transportation Engineering (English Edition), 3(3), 262-270.
  
'''<big><font color="#fcb706">6.2 New & Better Codes</font></big>'''
+
Kiskowski, Hancock, & Kenworthy. (2009, May). On the Use of Ripley's K-Function and Its Derivatives to Analyze Domain Size. Retrieved from http://www.cell.com/biophysj/abstract/S0006-3495(09)01048-0
  
With the additional data points given to us, the data cleaning process will have to be repeated. However, for the upcoming iteration, we hope to improve the data cleaning process in two ways: improving efficiency and improving accuracy.
+
Kobylińska, K., Cellmer, R., Źróbek, S., & Lepkova, N. (2017). Using Kernel density estimation for modelling and simulating transaction location. International Journal of Strategic Property Management, 21(1), 29-40.
 +
Li, Wei, Huang & Ye. (2008). Spatial patterns and interspecific associations of three canopy species at different life stages in a subtropical forest, China. Retrieved from, http://www.jipb.net/tupian/2008/3/18/163001.pdf
  
 +
Lim, Kenneth. (2017). Bike-sharing in Singapore: A look at the road ahead. The Channel News Asia. Retrieved from
 +
https://www.channelnewsasia.com/news/singapore/bike-sharing-in-singapore-a-look-at-the-road-ahead-8867898
  
<b>6.2.1 Improving Efficiency </b> <br>
+
Minoiu, C., & Reddy, S. (2008). Kernel density estimation based on grouped data : The case of poverty assessment , Washington, District of Columbia : International Monetary Fund (IMF working paper ; WP/08/183).
  
Although the R script for the geocoding process has already been revised, we are in the midst of creating a new R script that exits the existing session and begins a new one each time an ‘NA’ result is obtained. The rationale behind this is to avoid the query limit and reduce the number of ‘NA’s returned.
+
Ripley’s K function Philip M. Dixon Volume 3, pp 1796–1803 in Encyclopedia of Environmetrics (ISBN 0471 899976) https://www3.nd.edu/~mhaenggi/ee87021/Dixon-K-Function.pdf
  
 +
Silverman, B. (1978). Weak and Strong Uniform Consistency of the Kernel Estimate of a Density and its Derivatives. The Annals of Statistics, 6(1), 177-184.
  
<b>6.2.2 Improving Accuracy </b> <br>
+
Shaheen, S., Guzman, S., & Zhang, H. (2010). Bikesharing in Europe, the Americas, and Asia: Past, Present, and Future.
 +
Spencer, J., & Angeles, G. (2007). Kernel density estimation as a technique for assessing availability of health services in Nicaragua. Health Services and Outcomes Research Methodology, 7(3), 145-157.
  
It has been observed that certain entries contain lamppost numbers in Singapore. However, our current code and the Google API do not enable us to look up the lampposts and retrieve their longitude and latitude values. As such, we are currently looking for alternative ways to clean addresses with lamppost numbers. If this can be done, the results obtained will be more accurate.
+
Silverman, B. W. (2012, Mar). Density Estimation for Statistics and Data Analysis. Retrieved from https://ned.ipac.caltech.edu/level5/March02/Silverman/paper.pdf
  
 +
Tania L King, Lukar E Thornton, Rebecca J Bentley, & Anne M Kavanagh. (n.d.). The Use of Kernel Density Estimation to Examine Associations between Neighborhood Destination Intensity and Walking and Physical Activity. PLoS ONE, 10(9), E0137402.
  
 +
The Economist. (2017, Dec 19). How bike-sharing conquered the world. Retrieved from https://www.economist.com/news/christmas-specials/21732701-two-wheeled-journey-anarchist-provocation-high-stakes-capitalism-how
  
'''<big><font color="#fcb706">6.3 Revised Scope of Work</font></big>'''
+
Turlach, Berwin. (1999). Bandwidth Selection in Kernel Density Estimation: A Review. Technical Report.
  
At the beginning of this project, ABC bike-sharing company had requested that we focus on one particular area and analyse the user routes (i.e. start points and end points) of bicycles in the chosen area. This would help us to determine where the yellow boxes should be placed. However, it has recently been mentioned that the data set containing bicycle routes is highly private and confidential. Consequently, there is a chance that we will not be given access to this data set. Hence, the scope of work may be revised once more. We are currently in the midst of discussions.
+
Xun Shi (2010) Selection of bandwidth type and adjustment side in kernel density estimation over inhomogeneous backgrounds, International Journal of Geographical Information Science, 24:5, 643-660, DOI: 10.1080/13658810902950625
  
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >7.0 Conclusion</font></div>==
+
Zambom, A., & Dias, R. (2012). A Review of Kernel Density Estimation with Applications to Econometrics.
  
In conclusion, keeping in mind that the current analysis has not been run on the complete data set, we have to take the interim findings with a pinch (or rather, a handful) of salt. To date, the greatest problem faced has been the vague location descriptions, as there is nothing we can do on our part to improve them. In light of this, it is highly recommended that, operationally, ABC bike-sharing company request that the authorities be more specific with their descriptions or, better yet, provide the coordinates directly. Otherwise, it is difficult for ABC bike-sharing company to cooperate even if it wants to. That said, this also highlights the importance of data quality. As with many start-up companies, the data collection process is not yet standardised and is still being improved. This results in a lot of data cleaning and manual work on the back end. Thus, going forward, ABC bike-sharing company should think of better and more innovative solutions to record and store such important business information. Despite the cost involved, there are definitely rewards to be reaped.
+
==<div style="background: #404040; padding: 15px; font-weight: bold; line-height: 0.3em; text-indent: 15px; font-size: 16px"><font color=#ffffff >9.0 Acknowledgement</font></div>==
 +
We would like to thank Professor Kam Tin Seong (Associate Professor of Information Systems; Senior Advisor, SIS) and Instructor Meenakshi, who provided our team with great insights and guidance throughout this entire project. We would also like to thank our sponsor for graciously providing us with the dataset and assistance.

Revision as of 17:21, 16 April 2018

1.0 Introduction and Project Background

In today’s world, the convenient ad-hoc access provided by digital systems is taking the place of the assured access once offered by personal ownership (The Economist, 2017). For instance, streaming beats records, cloud storage beats hard disks, and credit beats cash. A similar phenomenon is occurring in the transportation industry with the introduction of bike-sharing. Bike-sharing programs have existed for almost 50 years, but in the last decade there has been a sharp increase in both their prevalence and popularity worldwide (Fishman, Elliot, Washington, Simon, & Haworth, 2013). Bike-sharing is a sustainable mobility strategy developed in response to concerns regarding global climate change, energy security and unstable fuel prices (Shaheen, Guzman & Zhang, 2010). Although China is currently the world leader in bike-sharing schemes, many other countries, including France, other parts of Europe and the USA, have begun adopting this model as well (Gray, 2017).

However, despite the benefits and convenience that bike-sharing has introduced, there have also been downsides. For instance, complaints of reckless riding and bad parking have thrown a wrench into the bike-sharing movement (Lim, 2017). Authorities have had little choice but to step in and issue new regulations to minimise the “bad behaviour” common among bike-sharing users.


1.1 Motivations and Objectives
The majority of existing research surrounding the bike-sharing movement consists of studies conducted with two goals in mind:
1. Understanding business profitability and sustainability concerns
2. Gathering insights on bicycle routes taken by individuals to offer guidance to urban planners, policy makers and transportation practitioners


Little or no research has yet been done to shed light on the increasingly prominent issue of illegal parking patterns. Hence, this paper seeks to explore this further with the following objectives in mind:
1. Fill existing research gap by exploring the use of ‘Spatial Point Pattern Analysis’ in analyzing clustering patterns of illegal bike-parking
2. Specifically demonstrate the use of KDE and modified-L-function
3. Apply the tools to a real-life case study of Singapore
4. Discuss key learning points and considerations in using the methods

2.0 Literature Review

Literature on Kernel Density Estimation (KDE) and the L-function was explored and reviewed in preparation for this research. The existing literature showed that KDE is well suited to analyzing spatial patterns, especially when there is a need to examine the intensity of a particular phenomenon. In the paper “Spatial distribution of diagnosed chronic kidney disease (CKD) in Edo State, Nigeria”, KDE was used to investigate the spatial distribution of CKD across regions in Nigeria. The study was important because health outcomes generally involve people, so the population at risk of CKD had to be determined. Studying the spatial patterns reflects the spatial distribution of the underlying population (Carlos et al. 2010), thus allowing the team to zero in on the identified regions through the use of KDE. In this paper, KDE will likewise be adopted to identify locations with a high intensity of clustering.

The second paper, “The Utility of Hotspot Mapping for Predicting Spatial Patterns of Crime”, presented the usefulness and accuracy of KDE in predictive SPPA. It compared various mapping techniques such as point-mapping, thematic-mapping of geographic areas (e.g. census areas), spatial ellipses and KDE, and identified the one that most accurately predicts future crime occurrences (Chainey et al., 2008a). It split ‘crimes’ into four categories, after which the different techniques were applied to each category to identify the technique that best predicted future crime occurrence. It was found that KDE consistently outperformed all other techniques in its predictive capabilities for all the different crime types studied. Also, the data used in this paper were geocoded crime data-points whose coordinates were rounded off to the nearest 10m. This supposedly reduces the accuracy of the data-points, as a crime could have been displaced by up to 5m in any direction from the actual location. However, it was concluded that small differences in the locations of crime occurrence would not negatively impact the study’s findings. This is useful for our paper as it illustrates that research of this nature should not be sensitive to small inaccuracies in the geographical coordinates used.

Some of the literature also highlighted certain inherent limitations, one of which is that KDE is unable to show the distance at which spatial patterns become significant. The paper “Identification of hazardous road locations of traffic accidents by means of KDE and cluster significance evaluation” explored the use of KDE in determining areas with a high potential for road traffic accidents. More importantly, it also introduced Monte Carlo simulation, a statistical technique that uses repeated random simulations to determine the properties of events and their significance level. Combining both techniques allowed the researchers to identify clusters of traffic accidents that are statistically significant. Thus, to ensure the accuracy of our study, the L-function and ‘bw.diggle’, a function in the ‘spatstat’ package for R, will be used to determine an appropriate kernel size for the KDE analysis. In addition, Monte Carlo simulation will be adopted to ensure that the clustering observed at the chosen kernel size is statistically significant. More will be discussed in the next section.
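
As a rough illustration of how the L-function and Monte Carlo simulation fit together, the sketch below estimates L(r) − r for a point pattern and compares it against an envelope simulated under complete spatial randomness. It is a minimal Python sketch and not the spatstat workflow used in this study: the function names `l_function` and `csr_envelope`, the synthetic clustered points, the 100 × 100 window and the distance r = 5 are all assumptions made for illustration, and no edge correction is applied. The centred value L(r) − r is used here for convenience and may differ from the modified L-function applied in the case study.

```python
import math
import random


def l_function(points, r, area):
    """Naive estimate of Ripley's L at distance r (no edge correction)."""
    n = len(points)
    pairs = 0
    for i in range(n):
        xi, yi = points[i]
        for j in range(i + 1, n):
            xj, yj = points[j]
            if math.hypot(xi - xj, yi - yj) <= r:
                pairs += 2  # count both ordered pairs (i, j) and (j, i)
    k = area * pairs / (n * (n - 1))  # Ripley's K estimate
    return math.sqrt(k / math.pi)     # L(r) = sqrt(K(r) / pi)


def csr_envelope(n, r, width, height, n_sim=99, seed=0):
    """Min and max of L(r) - r over n_sim uniformly random patterns."""
    rng = random.Random(seed)
    area = width * height
    values = []
    for _ in range(n_sim):
        sim = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]
        values.append(l_function(sim, r, area) - r)
    return min(values), max(values)


# Hypothetical data: two tight clusters inside a 100 x 100 window.
rng = random.Random(42)
centres = [(25.0, 25.0), (75.0, 70.0)]
observed = [(rng.gauss(cx, 2.0), rng.gauss(cy, 2.0))
            for cx, cy in centres for _ in range(30)]

r = 5.0
obs = l_function(observed, r, 100.0 * 100.0) - r
lo, hi = csr_envelope(len(observed), r, 100.0, 100.0)

print("observed L(r) - r:", round(obs, 2))
print("simulated envelope:", round(lo, 2), "to", round(hi, 2))
print("clustered at r:", obs > hi)
```

An observed value above the upper envelope suggests clustering at that distance, while a value below the lower envelope suggests dispersion. In practice the calculation would be repeated over a range of distances, and an edge-corrected estimator such as those provided by spatstat would be preferable.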


3.0 Spatial Point Pattern Analysis Methods

3.1 Kernel Density Estimation

3.1.1 Origin of Kernel Density Estimation

3.1.2 The Kernel Density Estimation Function

3.1.3 Hotspot Mapping Using Kernel Density Estimation Function

3.2 Ripley's K Function

3.2.1 Interpretation of Ripley's K Function

Figure 3: Visual Representation of East Coast Park Singapore

3.3 L-Function: Derivative of Ripley's K Function

3.4 'Bw.Diggle' Function, An Alternative Method To Approximate A Kernel Bandwidth

4.0 Case Study of Singapore and a Bike-sharing Firm

4.1 Dataset Description

4.1.1 Geocoding Process

4.1.2 Classification of Locations Based on Certainty Level


5.0 Application of Geo-spatial Point Pattern Analytical Methods on A Case Study of Singapore

5.1 Narrowing of Study Area Using QGIS Choropleth Map

5.2 Determining Spatial Patterns Using Spatstat's Modified L-Function on R Studio

5.2.1 Plotting The Modified L-Function Graph on 'R Studio'

5.3 Obtaining the Modified L-Function Plot and Kernel Radius

5.3.1 Optimal Kernel Density Radius Obtained Using Spatstat's 'bw.diggle' function

6.0 Findings and Analysis

6.1 Comparison of Bedok and Jurong West Using KDE on QGIS

6.1.1 Bedok Subzone Heatmap Analysis


6.1.2 Jurong-West Subzone Heatmap Analysis


6.2 Evaluating Placement of Yellow-boxes

6.3 Analysis of Illegal Bike-Parking Patterns in Bedok by Time Period

7.0 Conclusion | Key Takeaways and Considerations

8.0 References

Anderson, T. (2009). Kernel density estimation and K-means clustering to profile road accident hotspots. Accident Analysis and Prevention, 41(3), 359-364.

Bw.diggle function. (n.d.). Retrieved April 8, 2018, from https://www.rdocumentation.org/packages/spatstat/versions/1.55-0/topics/bw.diggle

Bíl, Andrášik, & Janoška. (2013). Identification of hazardous road locations of traffic accidents by means of kernel density estimation and cluster significance evaluation. Accident Analysis and Prevention, 55, 265-273.

Chainey, S., Tompson, L., & Uhlig, S. (n.d.). The Utility of Hotspot Mapping for Predicting Spatial Patterns of Crime. Retrieved April 8, 2018, from https://www.e-education.psu.edu/geog884/sites/www.e-education.psu.edu.geog884/files/image/lesson2/Chainey et al. (2008).pdf

China's ‘Uber for bikes’ model is going global. Retrieved from https://www.weforum.org/agenda/2017/06/china-leads-the-world-in-bike-sharing-and-now-its-uber-for-bikes-model-is-going-global/

Chapter 11 Point Pattern Analysis / Github https://mgimond.github.io/Spatial/point-pattern-analysis.html

Diggle, Peter. (1985). A Kernel Method for Smoothing Point Process Data. Applied Statistics. 34. 138-147. 10.2307/2347366.

Dixon, Philip M., "Ripley’s K function" (2001). Statistics Preprints. 52. http://lib.dr.iastate.edu/stat_las_preprints/52

Gesler, W. (1986). The uses of spatial analysis in medical geography: A review. Social Science & Medicine, 23(10), 963-973.

Hashimoto, Yoshiki, Saeki, Mimura, Ando, & Nanba. (2016). Development and application of traffic accident density estimation models using kernel density estimation. Journal of Traffic and Transportation Engineering (English Edition), 3(3), 262-270.

Kiskowski, Hancock, & Kenworthy. (2009, May). On the Use of Ripley's K-Function and Its Derivatives to Analyze Domain Size. Retrieved from http://www.cell.com/biophysj/abstract/S0006-3495(09)01048-0

Kobylińska, K., Cellmer, R., Źróbek, S., & Lepkova, N. (2017). Using Kernel density estimation for modelling and simulating transaction location. International Journal of Strategic Property Management, 21(1), 29-40.

Li, Wei, Huang & Ye. (2008). Spatial patterns and interspecific associations of three canopy species at different life stages in a subtropical forest, China. Retrieved from http://www.jipb.net/tupian/2008/3/18/163001.pdf

Lim, Kenneth. (2017). Bike-sharing in Singapore: A look at the road ahead. The Channel News Asia. Retrieved from https://www.channelnewsasia.com/news/singapore/bike-sharing-in-singapore-a-look-at-the-road-ahead-8867898

Minoiu, C., & Reddy, S. (2008). Kernel density estimation based on grouped data : The case of poverty assessment , Washington, District of Columbia : International Monetary Fund (IMF working paper ; WP/08/183).

Dixon, Philip M. Ripley’s K function. In Encyclopedia of Environmetrics, Volume 3, pp 1796–1803 (ISBN 0471 899976). https://www3.nd.edu/~mhaenggi/ee87021/Dixon-K-Function.pdf

Silverman, B. (1978). Weak and Strong Uniform Consistency of the Kernel Estimate of a Density and its Derivatives. The Annals of Statistics, 6(1), 177-184.

Shaheen, S., Guzman, S., & Zhang, H. (2010). Bikesharing in Europe, the Americas, and Asia: Past, Present, and Future.

Spencer, J., & Angeles, G. (2007). Kernel density estimation as a technique for assessing availability of health services in Nicaragua. Health Services and Outcomes Research Methodology, 7(3), 145-157.

Silverman, B. W. (2012, Mar). Density Estimation for Statistics and Data Analysis. Retrieved from https://ned.ipac.caltech.edu/level5/March02/Silverman/paper.pdf

Tania L King, Lukar E Thornton, Rebecca J Bentley, & Anne M Kavanagh. (n.d.). The Use of Kernel Density Estimation to Examine Associations between Neighborhood Destination Intensity and Walking and Physical Activity. PLoS ONE, 10(9), E0137402.

The Economist. (2017, Dec 19). How bike-sharing conquered the world. Retrieved from https://www.economist.com/news/christmas-specials/21732701-two-wheeled-journey-anarchist-provocation-high-stakes-capitalism-how

Turlach, Berwin. (1999). Bandwidth Selection in Kernel Density Estimation: A Review. Technical Report.

Xun Shi (2010) Selection of bandwidth type and adjustment side in kernel density estimation over inhomogeneous backgrounds, International Journal of Geographical Information Science, 24:5, 643-660, DOI: 10.1080/13658810902950625

Zambom, A., & Dias, R. (2012). A Review of Kernel Density Estimation with Applications to Econometrics.

9.0 Acknowledgement

We would like to thank Professor Kam Tin Seong (Associate Professor of Information Systems; Senior Advisor, SIS) and Instructor Meenakshi, who provided our team with great insights and guidance throughout this entire project. We would also like to thank our sponsor for graciously providing us with the dataset and assistance.