ANLY482 AY2017-18T2 Group08 : Project Findings

From Analytics Practicum
Revision as of 13:03, 26 February 2018 by Shawn.pang.2014 (talk | contribs)
Jump to navigation Jump to search

Homepage

Our Team

Project Overview

Project Findings

Project Management

Documentation

Other AY2017-18 T2 Projects

Interim Final

0.0 Introductory Note and Disclaimer

Thank you for your interest in our project and analysis. Do kindly note that the content presented on Wikipedia consists of general information as well as the methodology undertaken. Company specific and confidential information has been omitted. We seek your kind understanding, thank you.

1.0 Project Recap

oBike, Singapore’s first home-grown stationless bicycle sharing company, began their operations in January 2017. However, in recent months, Singapore’s Land and Transport Authority (LTA) issued new rules and regulations that require bicycles to be parked in designated yellow boxes around the island. LTA enforcers, together with authorities from Town Council and NParks, survey the island, and issue tickets to bike-sharing companies in the event where bicycles are found to be outside of these yellow boxes. From the time a ticket is issued, oBike has a mere four hours to move their illegally-parked bicycles. Failure to do so will incur hefty fines.

As such, this practicum seeks to achieve the following objectives:- (i) Identify hotspots for illegal parking cases (ii) Project the illegal parking patterns by analysing historical data (iii) Determine suitable areas for yellow boxes to be painted To achieve the above objectives however, we had to first clean the data given and perform exploratory data analysis (EDA). That said, this interim report seeks to document the data cleaning process as well as EDA performed thus far. In addition, any key insights derived till date will also be shared.

2.0 About the Data

The csv files titled ‘Group08_oBike_InterimData’ contains four sheets with descriptions as follows:-

(i) 1.0 Cleaned Data Cleaned data refers to data that has already been cleaned via our data cleaning process, which will be described further in Section 4. The format for ‘1.0 Cleaned Data’ is similar to the original data given by oBike, except there are five newly inserted columns – ‘Original ID’, ‘New ID’, ‘Day’, ‘Updated Addresses’ and ‘Time Period’. This sheet will be used for analysis purposes. Please refer to Figure 1 below for the revised metadata.

(ii) 2.0 Original Data This sheet contains the original, raw data given, with the exception of the row ‘Original ID’ that was inserted for tracking purposes. There is a total of 14 columns in this sheet, inclusive of ‘Original ID.’ Please refer to Figure 1 below for the revised metadata.   (iii) 3.0 Appendix & Notes The purpose of this sheet is to highlight to any reader on the changes made to the original data set to allow for better comprehension of the data cleaning process. It contains notes relating to data points that were duplicated or removed.

(iv) 3.1 Cross Checking This sheet is used internally for our cross checking between ‘1.0 Cleaned Data’ and ‘2.0 Original Data’ to ensure that no error occurred when duplicating the data. Using the ‘LOCATION’ column which contains all unique entries of addresses, we cross checked to ensure that all the entries in the ‘1.0 Cleaned Data’ are found in the ‘2.0 Original Data’ and vice versa.


2.1 Revised Metadata

<INSERT TABLE HERE>

2.2 Summary Statistics for '1.0 Original Data'

Group08 oBike Summary Statistics 1.png
Figure 2: Summary Statistics for Original Data (JMP PRO)

As shown in figure 2 above, although there is a total of 3,014 rows present, a significant number of rows had missing fields. In particular, four columns had a strikingly high number of blanks – ‘Completed Time’, ‘Duration’, ‘# of Bikes’ and ‘Arrange To. Of these, the ‘# of Bikes’ column had the highest percentage of blanks i.e. 99.6%. Reasons and consequences of such blanks will be further discussed in section 3.0 below.

3.0 Data Quality Issues & Consequences

This section will elaborate in detail the data quality issues faced by the team, as well as the consequences and limitations resulting from these issues. In sum, the data quality issues faced by the team can be broken down into five broad categories – Addresses, No. of Bikes, Authority, Status and General/Miscellaneous.


3.1 Original Address / Location

Although the ‘Original Address’ column had zero blanks, it was plagued with numerous data quality issues which severely hindered our data cleaning process.


3.1.1 Vague Descriptions
Many entries contained extremely vague descriptions of locations located in Singapore. Such vague locations would render it impossible to plot an accurate location of the illegal parking cases on a map. Vague descriptions inevitably mean that Google API will return pre- pinned longitude and latitude coordinates for such cases. In other words, Google API returns the exact same geographical coordinates even if the actual location of the bicycles were different. Consequently, when translating these coordinates onto a map, these points will overlap/combine into one distinct point, which is visually misleading to the user.

Group08 oBike Google Map.png
Figure 3: Visual Representation of East Coast Park Singapore


For instance, ‘East Coast Park’ is a beach park stretching from Marina East to Bedok planning areas in Singapore and covers 185 hectares of land. Therefore, tickets could have been issued anywhere in this area between Marina East Drive and Water Venture Coast (see Figure 3 above for visual representation). However, feeding the location ‘East Coast Park Singapore’ to Google API will only generate longitude and latitude coordinates of 103.9121866 and 1.3007842 respectively. This means that vague descriptions have an inherently high margin of error. Therefore, vague descriptions restrict our ability to derive meaningful insights from analysis. Other vague descriptions include but are not limited to ‘Bedok Reservoir’, ‘Coney Island’, ‘Bishan-Ang Mo Kio Park’, and ‘Whole stretch of Admiralty Street’. Further, from an operational perspective, the more specific the location, the easier and faster it will be for oBike to track down the bicycles for redeployment.

Further, from an operational perspective, the more specific the location, the easier and faster it will be for oBike to track down the bicycles for redeployment.


3.1.2 Overly Specific Descriptions

In stark contrast to the above, there are also descriptions of locations which are overly specific. Overly specific descriptions hinders the efficiency and ability of Google API to return an accurate geographical coordinate. For example, although descriptions such as ‘Left at rubbish chute 4 and behind washing bay of Blk 227 Serangoon Ave 4’ may be useful in helping oBike find illegally parked bicycles, feeding such lengthy descriptions to Google API interferes with its ability to pick up keywords, so chances are, it may return zero results. In other cases, Google API may pick up the wrong keywords and therefore return inaccurate geographical coordinates.


3.1.3 Multiple Locations

Some of the ‘location’ fields contain not one, but multiple addresses. In most cases, the addresses are in proximity to one another. For instance, although there are three distinct locations in the description ‘Blk 713, 714 and 716 Pasir Ris Drive 3’, these locations are adjacent to one another. In other instances such as ‘Swettenham Road / Ridout Road / Peel Road / Peirce Road’, the locations are situated on different roads that are further apart. Most times, if Google API receives more than one location, it will simply return the geographical coordinates of the first. In the case of the former, where multiple locations are adjacent to one another, this does not pose much of a problem as the longitude and latitude coordinates are still somewhat accurate. Put simply, Blk 713 Pasir Ris Drive 3’s geographical coordinates are similar to that of Blk 714 and 716 due to the close proximity of the locations. However, for the latter case, if Google API were to only return me the coordinates of the first location i.e. Swettenham Road, this coordinates may differ significantly from the remaining locations (Pierce Road), thereby returning unrepresentative results.


3.1.4 Junctions

Some of the addresses provided refer to junction between roads, with no particular landmark given. Such locations are recorded in the following format ‘Junction between Road X and Road Y’. For a handful of these entries, Google API is able to recognise the term ‘junction’ and therefore returns the coordinates of the junction between roads. However, it is noticed that its ability to do so is largely inconsistent. Most times, Google API picks up only the first road name, and returns the pre-pinned coordinates for that particular road. Thus, the outputs of Google API will contain erroneous results if fed such entries without first cleaning them.


3.1.5 Spelling Errors & Acronyms

It was also apparent that some of the entries contained spelling errors due to human error. For instance, ‘Bedok Reservoir’ was misspelled as ‘Bedok Resevoir’ and ‘Buona Vista’ was spelt as ‘Buana Vista’. Although Google API was still able to still detect the locations for some of these descriptions, the output was usually inconsistent, thereby raising the occurrences of inaccurate outputs. Additionally, several entries contained acronyms such as ‘ECP’ and ‘CCK’, which stands for East Coast Park and Choa Chu Kang respectively. Although these terms are commonly used by Singaporeans to identify locations, Google API is unable to recognise majority of such terms and therefore will not return any geographical coordinates. In addition, less commonly used acronyms were also used, such as ‘BNA’, ‘MSCP’, ‘PRP’, which stands for Bedok North Avenue, Multi-storey Carpark, and Pasir Ris Park respectively. For such terms, data cleaning was absolutely necessary in order to retrieve the latitude and longitude coordinates.


3.2 Number of Bikes

As seen in Figure 2, the ‘# of bikes’ column had the highest absolute number and percentage of blanks. Upon closer look, it was observed that for some entries, the number of bikes were given, but was recorded in the ‘location’ column together with the addresses instead. Nevertheless, for majority of the entries, information regarding the number of bicycles were indeed missing. The rationale behind this is that authorities issuing the fines almost never notify bicycle-sharing companies such as oBike about the number of bicycles present in a location. Accordingly, on oBike’s end, they are unable to record information which they do not possess. Authorities claim there are frequently too many bicycles in given location by all three operators – Ofo, Mobike and oBike – so it becomes too time consuming to manually count the number of bicycles belonging to each company.

Admittedly, however, this information is crucial from a business standpoint. As the bicycles are currently fined on a per bike basis, the number of bicycles determines the monetary value of the fines and therefore, would govern the operations pertaining to oBike’s bicycle redeployment strategy. On our part too, it is impossible to accurately predict ‘hotspots’ in terms of the dollar amount of fines oBike is receiving. This lack of information is considered a severe limitation of this project. For instance, theoretically, if oBike receives two fines simultaneously – Clementi (30 bikes) and Ang Mo Kio (1 bike), it makes more business sense for them to assign their limited resources to address the ticket containing more bicycles first i.e. Clementi, as the value of fines would be $15,000 and $500 respectively. Without knowing the number of bikes in each ticket, oBike would be lacking key information. In other words, they are currently assigning their limited resources blindly.


3.3 Authority

Although this column was mostly filled up, three entries remained blanks. One of them had a high number of bicycles (80) in the ticket issued. Interestingly, it was also noted that apart from ‘LTA’, ‘NParks’ and ‘TC’, there was also another category labelled as ‘Others’ that contained only one entry. As the value of fines vary depending on the authority who issues it e.g. LTA fines $500 per bike while NParks does not yet issue fines, the authority field is also important as it determines the urgency of responding to fines. The higher the potential amount of fines, the more urgent the ticket becomes.


3.4 Status

Status typically falls in one of the following categories – ‘ignore’, ‘arranging’ and ‘completed’. Despite the low percentage of blanks for this column, it is observed that 995 out of 3039 entries (32.74%) are classified as ‘arranging’. This suggests that oBike does not consistently update the status of the ticket response. Accordingly, it is difficult to draw insights on the efficiency of their third party contractors.


3.5 Codes

Codes are meant to enable oBike to uniquely identify each ticket received. However, for this dataset, the code is not unique. As shown in figure 2 above, there a 3013 instances of codes, but only 2320 categories. This suggests that some codes are not unique. A unique identifier helps to trace the progress and response of each ticket. For LTA tickets, the codes do represent a unique identifier. However, for tickets issued by NParks and Town Council, the codes are not unique. For one, codes for tickets by TC only represent an area which is governed by a particular town council e.g. ‘Clementi’, ‘Woodland’, ‘AMKTC’. These codes have multiple entries under them, making it difficult to track individual ticket cases due to the lack of a unique identifier.

4.0 Data Cleaning & Preparation

In light of the aforementioned data quality issues, data cleaning had to be undertaken to minimise the negative consequences that may arise. This section elaborates further on the steps taken during the data cleaning and preparation process.


4.1 Manual Cleaning of Addresses

Considering that the location field posed numerous data quality issues, coupled with the fact that Google API requires the location field to be in a relatively neat and standardised format, several measures had to be taken to tidy up the locations.


4.1.1 Adding Singapore

Some road names and landmarks in Singapore have similar names to other locations in other countries, e.g. Arthur Road can not only be found in Singapore but also in the UK and India. That said, it is important to add the term ‘Singapore’ behind each location to ensure that the output received relates to a distinct place in Singapore, and not elsewhere.


4.1.2 Correcting Acronyms and Spelling Errors

For descriptions containing acronyms and spelling errors, we had to correct them accordingly. For instance, ‘Blk 628 HG Ave 8’ would have to be changed to ‘Blk 628 Hougang Ave 8’. Via the excel find and replace function, this could be done to multiple entries at once, thereby speeding up the process.


4.1.3 Splitting Multiple Locations

For entries with multiple addresses, it was necessary to split them into individual locations depending on how far apart these locations are. For instance, with reference to Figure 4 below, there are three locations captured in a single entry – Springwood Crescent, Springwood Avenue and Jalan Mat Jambol. However, as Springwood Crescent and Springwood Avenue and streets adjacent to each other, they are kept together as a single entry. On the other hand, Jalan Mat Jambol is further away and hence, requires us to split the entry into two. Consequently, one entry with three distinct locations are split into two.

To ensure consistent splitting of entries, an arbitrary benchmark was set at three decimal places in the longitude or latitude coordinates. In other words, if the third decimal place of the longitude and latitude coordinates differ, then the entries are split up. Differences in the third decimal place for longitude and latitude coordinates means that the locations are more than 110m apart in radius. However, the data currently still requires further refinement as not all points are yet consistent with this benchmark.


4.1.4 Removing Excessive Descriptions

Since excessive descriptions interfere with Google API’s ability to produce accurate outputs, they have to be manually removed from the location fields. For instance, ‘Bus stop 11531. Alexandra. Near IKEA region. Pls remove these asap and report back to me when it is done’ will simply be changed to ‘Bus stop 11531 Singapore’. Otherwise, Google API may pick up the keyword ‘IKEA’ and return the coordinates for ‘IKEA Tampines’ instead. Removing excessive descriptions helps to improve both the efficiency and accuracy of Google API. In addition, it was noticed that in some entries, authorities did provide information regarding the number of bikes, but this information was recorded in the ‘location’ column instead – e.g. ‘2 oBike infront 368 Thomson Condo along Jalan Raja Udang’. For such rows, we would delete the excess descriptions, and insert the number of bikes into the ‘# of bikes’ column instead. Put simply, we will manually fill in the ‘# of bikes’ in the event where information is provided.


4.1.5 Pinning Junctions

Group08 oBike Pinning Junctions.png
Figure 5: Manually Pinning Junctions


To resolve the inherent problems involved with feeding junctions between two roads to Google API as discussed in section 3.1.4, we first had to filter out addresses containing the keyword ‘junction’. One of such locations would be ‘Junction of Balestier Rd and Jln Kemaman’. Using the road names, we would then use google maps on web browser to produce an image as shown by Figure 5 above.


From the image, it is then possible to visualise the junction between the roads. Following which, we will manually pin the location as shown in figure 5. Doing so will return the latitude and longitude coordinates; in this case they are 1.325717 and 103.849777 respectively. Having these coordinates, we will then input them into the csv file accordingly. Although there are still flaws present with this method i.e. pinning in slightly to the left or right will produce different coordinates, it is nonetheless still more accurate than simply allowing Google API to generate the geographical coordinates. Hence, the accuracy is still improved.


4.1.6 No Landmark Locations

Due to the nature of the point to point transport, some bicycles are parked in areas with no specific landmarks, or are located between landmarks. For instance, ‘between Blk 126 and Blk 124 Simei Street’ implies that the bicycle was neither parked at Blk 126 or 124, but instead was found somewhere between these blocks. For such cases, we follow the method described above (section 4.1.5) to check if there is indeed no landmark present. If so, we will then simply take the first address recorded i.e. Blk 126 Simei Street.


4.1.7 Vague Descriptions

As mentioned, vague descriptions result in inaccurate and sometimes even misleading results. However, due to the lack of information and nature of the data issue, it is impossible to have improve the current descriptions unless authorities decide to be more specific with their descriptions. Ergo, there is nothing that can be done short of asking oBike to request for better quality descriptions. Such locations are simply fed to Google API to obtain the pre-pinned geographical coordinates.


4.2 Bus Stop Codes

As bus stops codes represent unique identifiers for bus stops around Singapore, it is possible to get very specific longitude and latitude coordinates directly from LTA Data Mall. The rationale for using LTA’s data mall over Google API is that the former contains the most updated and relevant information in Singapore, and should therefore produce better quality outputs. An R-code is the used to speed up the process of retrieving the geographical coordinates for all bus stop locations. Please refer to Appendix Section 8.1 for the r-script used.


4.3 Geocoding with Google API

For the remaining of the data points with no geographical coordinates obtained, an r-script was ran via rStudio to retrieve the longitude and latitude values. However, it was noticed that even after cleaning up the data, many of the entries still returned ‘NA’ as output. Moreover, it was noted that even specific addresses with postal codes could result in a ‘NA’ output (refer to figure 6 for a screenshot of the raw output by Google API). This was in fact due to a problem inherent with Google API. Without a commercial license, Google API has a query limit that prevents it from retrieving the coordinates. Therefore, the r-script was revised to include a loop such that if the output is ‘NA’, it will re-run the same entry three more times. This revision greatly improved the efficiency of the code. Refer to Appendix Section 8.2 and 8.3 for the original and revised geocode r-scripts.

4.4 Reverse Geocoding

Lastly, as a checking step, an r-script containing a reverse geocode was used. The reverse geocode works by taking the longitude and latitude coordinates generated by Google API to produce an address. Since each longitude and latitude is certain to have a corresponding address, the r-script was written to loop until the an address is returned. This address was then checked with the original location to ensure that it matches. This steps is to try and reduce the inconsistencies present in Google API. Refer to Appendix Section 8.4 for the reverse geocode r-script.

5.0 Exploratory Data Analysis and Interim Findings

5.1 Revised Summary Statistics

After data cleaning and preparation, summary statistics was once again ran to get an overview of the cleaned data. This is shown in Figure 7 below.


Group08 oBike Summary Statistics 2.jpg
Figure 7: Revised Summary Statistics for Cleaned Data (JMP Pro)


In comparison to the summary statistics performed on the original data set given by oBike (see figure 2), it should be noted that the cleaned data set now has 17 columns instead of the original 13 columns. Some column titles have been edited e.g. ‘Location’ is now referred to as ‘Original Address’. Columns that have been omitted include ‘remarks’, while new columns inserted include ‘Original ID’, ‘New ID’, ‘Updated Address’, ‘Time Period’, Longitude’ and ‘Latitude’.


More specifically however, there are additional things to note. Firstly, the number of entries have now increased to 3041 due to the splitting of rows with multiple addresses. As such, there are more data points now than before. Secondly, ‘Reported Time’, ‘Completed Time’ and ‘Due Time’ have been changed to time format in order to perform appropriate analysis. Thirdly, the ‘# of bikes’ columns have now increased from 12 to 266. This increase comes from the data cleaning process (section 3.1.4), whereby number of bikes were included in the ‘location’ column instead. It is observed that although the increase is significant, a large proportion of the fields are still blanks. Therefore, this column is not used in the current analysis. Lastly, in the original data file, there were some dashes, ‘NA’s and ‘Nulls’. These were all standardised to blanks instead.


6.0 Going Forward

Although a significant amount of work has been done till date, there remains a lot still to be done. This section will cover the revised scope of work and methodology going forward.


6.1 New Data Sets

It was revealed that authorities issue tickets via a few methods including Whatsapp and E-mails. Additionally, the ‘OneService’ mobile application allows residents to directly report illegal parking cases that occur in their neighbourhood; these fall under tickets issued by the Municipal Services Office, MSO for short.


6.1.1 Additional Data Points

oBike records their tickets received via Whatsapp in one csv file, and those via e-mails and MSO on another csv file. Thus, it came to our attention that we have only received the former and not the latter file due to oversights from oBike. Also, we have requested for more recent data from the month of January. As such, we have since received two additional csv files from oBike, and are in the midst of cleaning them. Analysis derived from these additional entries will be presented during the final phase.


6.1.2 Yellow Boxes Coordinates

To help us going forward, oBike has shared with us a file containing the coordinates of all yellow boxes in Singapore. However, as new yellow boxes are being painted each day, this file is subject to changes. Nevertheless, with the yellow boxes coordinates, it is possible to derive more in-depth analysis of greater scope. This will also enable us to answer objective 3 i.e. where should oBike place yellow boxes around Singapore.


6.2 New & Better Codes

With the additional data points given to us, the data cleaning process will have to be repeated. However, for the upcoming iteration, we hope to improve the data cleaning process via two ways – improving efficiency and improving accuracy.


6.2.1 Improving Efficiency

Although the r-script for the geocoding process has already been revised, we are in the midst of creating a new r-script such that it will be able to exit the existing session and begin a new one each time a ‘NA’ result is obtained. The rationale behind this is to avoid the query limit and reduce the number of ‘NA’s returned.


6.2.2 Improving Accuracy

It has been observed that certain entries contain lamppost numbers in Singapore. However, our current code and Google API does not enable us to track the lampposts to retrieve their longitude and latitude values. As such, we are currently looking for alternative ways to clean addresses with lamppost numbers. If this can be done, the results obtained will be more accurate.


6.3 Revised Scope of Work

At the beginning on this project, oBike had requested for us to focus on one particular area, and analyse the user routes (i.e. start points and end points) pertaining to bicycles in the chosen area. This would help us to determine where the yellow boxes should be placed. However, it has recently been mentioned that the data set containing bicycle routes is highly private and confidential. Consequently, there is a chance that we cannot be privy to this dataset. Hence, the scope of work may be revised once more. We are currently in the midst of discussions.

7.0 Conclusion

In conclusion, keeping in mind that the current analysis is not run using the complete data set, we have to take the interim findings with a pinch (or rather, handful) of salt. Till date, the greatest problem faced would be regarding the vague descriptions as there is nothing we can do to improve it on our part. In light of this, it is highly recommended that operationally, oBike should approach or request that the authorities be more specific with their descriptions, or better yet, provide them with the coordinates directly. Otherwise, it is also difficult for oBike to cooperate even if they wanted to. That said, this also highlights the importance of data quality. As with many start-up companies, the data collection process is not yet standardised and are still in the midst of improving. This therefore results in a lot of data cleaning and manual work on the back end. Thus going forward, oBike should think of better and more innovative solutions to record and store such important business information. Albeit the cost involved, there is definitely rewards to be reaped.