ANLY482 AY2017-18T2 Group08 : Project Findings

From Analytics Practicum
Revision as of 01:24, 26 February 2018 by Cl.heng.2014 (talk | contribs)
Jump to navigation Jump to search

Homepage

Our Team

Project Overview

Project Findings

Project Management

Documentation

Other AY2017-18 T2 Projects

Interim Final


1.0 Project Recap

oBike, Singapore’s first home-grown stationless bicycle sharing company, began their operations in January 2017. However, in recent months, Singapore’s Land and Transport Authority (LTA) issued new rules and regulations that require bicycles to be parked in designated yellow boxes around the island. LTA enforcers, together with authorities from Town Council and NParks, survey the island, and issue tickets to bike-sharing companies in the event where bicycles are found to be outside of these yellow boxes. From the time a ticket is issued, oBike has a mere four hours to move their illegally-parked bicycles. Failure to do so will incur hefty fines.

As such, this practicum seeks to achieve the following objectives:- (i) Identify hotspots for illegal parking cases (ii) Project the illegal parking patterns by analysing historical data (iii) Determine suitable areas for yellow boxes to be painted To achieve the above objectives however, we had to first clean the data given and perform exploratory data analysis (EDA). That said, this interim report seeks to document the data cleaning process as well as EDA performed thus far. In addition, any key insights derived till date will also be shared.

2.0 About the Data

The csv files titled ‘Group08_oBike_InterimData’ contains four sheets with descriptions as follows:-

(i) 1.0 Cleaned Data Cleaned data refers to data that has already been cleaned via our data cleaning process, which will be described further in Section 4. The format for ‘1.0 Cleaned Data’ is similar to the original data given by oBike, except there are five newly inserted columns – ‘Original ID’, ‘New ID’, ‘Day’, ‘Updated Addresses’ and ‘Time Period’. This sheet will be used for analysis purposes. Please refer to Figure 1 below for the revised metadata.

(ii) 2.0 Original Data This sheet contains the original, raw data given, with the exception of the row ‘Original ID’ that was inserted for tracking purposes. There is a total of 14 columns in this sheet, inclusive of ‘Original ID.’ Please refer to Figure 1 below for the revised metadata.   (iii) 3.0 Appendix & Notes The purpose of this sheet is to highlight to any reader on the changes made to the original data set to allow for better comprehension of the data cleaning process. It contains notes relating to data points that were duplicated or removed.

(iv) 3.1 Cross Checking This sheet is used internally for our cross checking between ‘1.0 Cleaned Data’ and ‘2.0 Original Data’ to ensure that no error occurred when duplicating the data. Using the ‘LOCATION’ column which contains all unique entries of addresses, we cross checked to ensure that all the entries in the ‘1.0 Cleaned Data’ are found in the ‘2.0 Original Data’ and vice versa.


2.1 Revised Metadata


2.2 Summary Statistics for '1.0 Original Data'

Group08 oBike Summary Statistics 1.png
Figure 2: Summary Statistics for Original Data (JMP PRO)

As shown in figure 2 above, although there is a total of 3,014 rows present, a significant number of rows had missing fields. In particular, four columns had a strikingly high number of blanks – ‘Completed Time’, ‘Duration’, ‘# of Bikes’ and ‘Arrange To. Of these, the ‘# of Bikes’ column had the highest percentage of blanks i.e. 99.6%. Reasons and consequences of such blanks will be further discussed in section 3.0 below.

3.0 Data Quality Issues & Consequences

This section will elaborate in detail the data quality issues faced by the team, as well as the consequences and limitations resulting from these issues. In sum, the data quality issues faced by the team can be broken down into five broad categories – Addresses, No. of Bikes, Authority, Status and General/Miscellaneous.


3.1 Original Address / Location

Although the ‘Original Address’ column had zero blanks, it was plagued with numerous data quality issues which severely hindered our data cleaning process.


3.1.1 Vague Descriptions
Many entries contained extremely vague descriptions of locations located in Singapore. Such vague locations would render it impossible to plot an accurate location of the illegal parking cases on a map. Vague descriptions inevitably mean that Google API will return pre- pinned longitude and latitude coordinates for such cases. In other words, Google API returns the exact same geographical coordinates even if the actual location of the bicycles were different. Consequently, when translating these coordinates onto a map, these points will overlap/combine into one distinct point, which is visually misleading to the user.

Group08 oBike Google Map.png
Figure 3: Visual Representation of East Coast Park Singapore

For instance, ‘East Coast Park’ is a beach park stretching from Marina East to Bedok planning areas in Singapore and covers 185 hectares of land. Therefore, tickets could have been issued anywhere in this area between Marina East Drive and Water Venture Coast (see Figure 3 above for visual representation). However, feeding the location ‘East Coast Park Singapore’ to Google API will only generate longitude and latitude coordinates of 103.9121866 and 1.3007842 respectively. This means that vague descriptions have an inherently high margin of error. Therefore, vague descriptions restrict our ability to derive meaningful insights from analysis. Other vague descriptions include but are not limited to ‘Bedok Reservoir’, ‘Coney Island’, ‘Bishan-Ang Mo Kio Park’, and ‘Whole stretch of Admiralty Street’. Further, from an operational perspective, the more specific the location, the easier and faster it will be for oBike to track down the bicycles for redeployment.

Further, from an operational perspective, the more specific the location, the easier and faster it will be for oBike to track down the bicycles for redeployment.


3.1.2 Overly Specific Descriptions

In stark contrast to the above, there are also descriptions of locations which are overly specific. Overly specific descriptions hinders the efficiency and ability of Google API to return an accurate geographical coordinate. For example, although descriptions such as ‘Left at rubbish chute 4 and behind washing bay of Blk 227 Serangoon Ave 4’ may be useful in helping oBike find illegally parked bicycles, feeding such lengthy descriptions to Google API interferes with its ability to pick up keywords, so chances are, it may return zero results. In other cases, Google API may pick up the wrong keywords and therefore return inaccurate geographical coordinates.


3.1.3 Multiple Locations

Some of the ‘location’ fields contain not one, but multiple addresses. In most cases, the addresses are in proximity to one another. For instance, although there are three distinct locations in the description ‘Blk 713, 714 and 716 Pasir Ris Drive 3’, these locations are adjacent to one another. In other instances such as ‘Swettenham Road / Ridout Road / Peel Road / Peirce Road’, the locations are situated on different roads that are further apart. Most times, if Google API receives more than one location, it will simply return the geographical coordinates of the first. In the case of the former, where multiple locations are adjacent to one another, this does not pose much of a problem as the longitude and latitude coordinates are still somewhat accurate. Put simply, Blk 713 Pasir Ris Drive 3’s geographical coordinates are similar to that of Blk 714 and 716 due to the close proximity of the locations. However, for the latter case, if Google API were to only return me the coordinates of the first location i.e. Swettenham Road, this coordinates may differ significantly from the remaining locations (Pierce Road), thereby returning unrepresentative results.


3.1.4 Junctions

Some of the addresses provided refer to junction between roads, with no particular landmark given. Such locations are recorded in the following format ‘Junction between Road X and Road Y’. For a handful of these entries, Google API is able to recognise the term ‘junction’ and therefore returns the coordinates of the junction between roads. However, it is noticed that its ability to do so is largely inconsistent. Most times, Google API picks up only the first road name, and returns the pre-pinned coordinates for that particular road. Thus, the outputs of Google API will contain erroneous results if fed such entries without first cleaning them.


3.1.5 Spelling Errors & Acronyms

It was also apparent that some of the entries contained spelling errors due to human error. For instance, ‘Bedok Reservoir’ was misspelled as ‘Bedok Resevoir’ and ‘Buona Vista’ was spelt as ‘Buana Vista’. Although Google API was still able to still detect the locations for some of these descriptions, the output was usually inconsistent, thereby raising the occurrences of inaccurate outputs. Additionally, several entries contained acronyms such as ‘ECP’ and ‘CCK’, which stands for East Coast Park and Choa Chu Kang respectively. Although these terms are commonly used by Singaporeans to identify locations, Google API is unable to recognise majority of such terms and therefore will not return any geographical coordinates. In addition, less commonly used acronyms were also used, such as ‘BNA’, ‘MSCP’, ‘PRP’, which stands for Bedok North Avenue, Multi-storey Carpark, and Pasir Ris Park respectively. For such terms, data cleaning was absolutely necessary in order to retrieve the latitude and longitude coordinates.


3.2 Number of Bikes

As seen in Figure 2, the ‘# of bikes’ column had the highest absolute number and percentage of blanks. Upon closer look, it was observed that for some entries, the number of bikes were given, but was recorded in the ‘location’ column together with the addresses instead. Nevertheless, for majority of the entries, information regarding the number of bicycles were indeed missing. The rationale behind this is that authorities issuing the fines almost never notify bicycle-sharing companies such as oBike about the number of bicycles present in a location. Accordingly, on oBike’s end, they are unable to record information which they do not possess. Authorities claim there are frequently too many bicycles in given location by all three operators – Ofo, Mobike and oBike – so it becomes too time consuming to manually count the number of bicycles belonging to each company.

Admittedly, however, this information is crucial from a business standpoint. As the bicycles are currently fined on a per bike basis, the number of bicycles determines the monetary value of the fines and therefore, would govern the operations pertaining to oBike’s bicycle redeployment strategy. On our part too, it is impossible to accurately predict ‘hotspots’ in terms of the dollar amount of fines oBike is receiving. This lack of information is considered a severe limitation of this project. For instance, theoretically, if oBike receives two fines simultaneously – Clementi (30 bikes) and Ang Mo Kio (1 bike), it makes more business sense for them to assign their limited resources to address the ticket containing more bicycles first i.e. Clementi, as the value of fines would be $15,000 and $500 respectively. Without knowing the number of bikes in each ticket, oBike would be lacking key information. In other words, they are currently assigning their limited resources blindly.


3.3 Authority

Although this column was mostly filled up, three entries remained blanks. One of them had a high number of bicycles (80) in the ticket issued. Interestingly, it was also noted that apart from ‘LTA’, ‘NParks’ and ‘TC’, there was also another category labelled as ‘Others’ that contained only one entry. As the value of fines vary depending on the authority who issues it e.g. LTA fines $500 per bike while NParks does not yet issue fines, the authority field is also important as it determines the urgency of responding to fines. The higher the potential amount of fines, the more urgent the ticket becomes.


3.4 Status

Status typically falls in one of the following categories – ‘ignore’, ‘arranging’ and ‘completed’. Despite the low percentage of blanks for this column, it is observed that 995 out of 3039 entries (32.74%) are classified as ‘arranging’. This suggests that oBike does not consistently update the status of the ticket response. Accordingly, it is difficult to draw insights on the efficiency of their third party contractors.


3.5 Codes

Codes are meant to enable oBike to uniquely identify each ticket received. However, for this dataset, the code is not unique. As shown in figure 2 above, there a 3013 instances of codes, but only 2320 categories. This suggests that some codes are not unique. A unique identifier helps to trace the progress and response of each ticket. For LTA tickets, the codes do represent a unique identifier. However, for tickets issued by NParks and Town Council, the codes are not unique. For one, codes for tickets by TC only represent an area which is governed by a particular town council e.g. ‘Clementi’, ‘Woodland’, ‘AMKTC’. These codes have multiple entries under them, making it difficult to track individual ticket cases due to the lack of a unique identifier.

4.0 Data Cleaning & Preparation

In light of the aforementioned data quality issues, data cleaning had to be undertaken to minimise the negative consequences that may arise. This section elaborates further on the steps taken during the data cleaning and preparation process.


4.1 Manual Cleaning of Addresses

Considering that the location field posed numerous data quality issues, coupled with the fact that Google API requires the location field to be in a relatively neat and standardised format, several measures had to be taken to tidy up the locations.


4.1.1 Adding Singapore

Some road names and landmarks in Singapore have similar names to other locations in other countries, e.g. Arthur Road can not only be found in Singapore but also in the UK and India. That said, it is important to add the term ‘Singapore’ behind each location to ensure that the output received relates to a distinct place in Singapore, and not elsewhere.


4.1.2 Correcting Acronyms and Spelling Errors

For descriptions containing acronyms and spelling errors, we had to correct them accordingly. For instance, ‘Blk 628 HG Ave 8’ would have to be changed to ‘Blk 628 Hougang Ave 8’. Via the excel find and replace function, this could be done to multiple entries at once, thereby speeding up the process.


4.1.3 Splitting Multiple Locations

For entries with multiple addresses, it was necessary to split them into individual locations depending on how far apart these locations are. For instance, with reference to Figure 4 below, there are three locations captured in a single entry – Springwood Crescent, Springwood Avenue and Jalan Mat Jambol. However, as Springwood Crescent and Springwood Avenue and streets adjacent to each other, they are kept together as a single entry. On the other hand, Jalan Mat Jambol is further away and hence, requires us to split the entry into two. Consequently, one entry with three distinct locations are split into two.


Group08 oBike Splitting Multiple Locations.png
Figure 4: Data Cleaning for Multiple Locations


To ensure consistent splitting of entries, an arbitrary benchmark was set at three decimal places in the longitude or latitude coordinates. In other words, if the third decimal place of the longitude and latitude coordinates differ, then the entries are split up. Differences in the third decimal place for longitude and latitude coordinates means that the locations are more than 110m apart in radius. However, the data currently still requires further refinement as not all points are yet consistent with this benchmark.

4.1.4 Removing Excessive Descriptions

Since excessive descriptions interfere with Google API’s ability to produce accurate outputs, they have to be manually removed from the location fields. For instance, ‘Bus stop 11531. Alexandra. Near IKEA region. Pls remove these asap and report back to me when it is done’ will simply be changed to ‘Bus stop 11531 Singapore’. Otherwise, Google API may pick up the keyword ‘IKEA’ and return the coordinates for ‘IKEA Tampines’ instead. Removing excessive descriptions helps to improve both the efficiency and accuracy of Google API. In addition, it was noticed that in some entries, authorities did provide information regarding the number of bikes, but this information was recorded in the ‘location’ column instead – e.g. ‘2 oBike infront 368 Thomson Condo along Jalan Raja Udang’. For such rows, we would delete the excess descriptions, and insert the number of bikes into the ‘# of bikes’ column instead. Put simply, we will manually fill in the ‘# of bikes’ in the event where information is provided.

5.0 Exploratory Data Analysis and Interim Findings



6.0 Going Forward



7.0 Conclusion