ANLY482 AY2017-18T2 Group08 : Project Findings
Revision as of 00:59, 26 February 2018
1.0 Project Recap
oBike, Singapore’s first home-grown stationless bicycle sharing company, began operations in January 2017. In recent months, however, Singapore’s Land Transport Authority (LTA) issued new rules and regulations that require bicycles to be parked in designated yellow boxes around the island. LTA enforcers, together with authorities from Town Council and NParks, survey the island and issue tickets to bike-sharing companies when bicycles are found outside these yellow boxes. From the time a ticket is issued, oBike has only four hours to move its illegally parked bicycles. Failure to do so will incur hefty fines.
As such, this practicum seeks to achieve the following objectives:
(i) Identify hotspots for illegal parking cases
(ii) Project illegal parking patterns by analysing historical data
(iii) Determine suitable areas for yellow boxes to be painted
To achieve these objectives, however, we first had to clean the data given and perform exploratory data analysis (EDA). This interim report therefore documents the data cleaning process and the EDA performed thus far, and shares the key insights derived to date.
2.0 About the Data
The csv file titled ‘Group08_oBike_InterimData’ contains four sheets, described as follows:
(i) 1.0 Cleaned Data: Cleaned data refers to data that has already been through our data cleaning process, which is described further in Section 4. The format of ‘1.0 Cleaned Data’ is similar to that of the original data given by oBike, except that there are five newly inserted columns: ‘Original ID’, ‘New ID’, ‘Day’, ‘Updated Addresses’ and ‘Time Period’. This sheet will be used for analysis purposes. Please refer to Figure 1 below for the revised metadata.
(ii) 2.0 Original Data: This sheet contains the original, raw data given, with the exception of the column ‘Original ID’, which was inserted for tracking purposes. There is a total of 14 columns in this sheet, inclusive of ‘Original ID’. Please refer to Figure 1 below for the revised metadata.
(iii) 3.0 Appendix & Notes: The purpose of this sheet is to highlight the changes made to the original data set, so that readers can better follow the data cleaning process. It contains notes relating to data points that were duplicated or removed.
(iv) 3.1 Cross Checking This sheet is used internally for our cross checking between ‘1.0 Cleaned Data’ and ‘2.0 Original Data’ to ensure that no error occurred when duplicating the data. Using the ‘LOCATION’ column which contains all unique entries of addresses, we cross checked to ensure that all the entries in the ‘1.0 Cleaned Data’ are found in the ‘2.0 Original Data’ and vice versa.
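The two-way check described in (iv) can be sketched as a set comparison. This is only an illustration, not the team's actual procedure: the function name `cross_check` and the sample addresses are hypothetical, and it assumes the ‘LOCATION’ values from each sheet have already been read into plain Python lists.

```python
def cross_check(cleaned_locations, original_locations):
    """Return (missing_from_cleaned, missing_from_original) as sorted lists."""
    # Strip surrounding whitespace so trivially different entries still match.
    cleaned = {loc.strip() for loc in cleaned_locations}
    original = {loc.strip() for loc in original_locations}
    # Entries present in one sheet but absent from the other.
    missing_from_cleaned = sorted(original - cleaned)
    missing_from_original = sorted(cleaned - original)
    return missing_from_cleaned, missing_from_original


# Hypothetical sample values, not taken from the real data set.
cleaned_sample = ["East Coast Park", "Coney Island"]
original_sample = ["East Coast Park", "Coney Island ", "Bedok Reservoir"]

missing_from_cleaned, missing_from_original = cross_check(
    cleaned_sample, original_sample
)
```

Both returned lists being empty would indicate that every entry in one sheet is also found in the other.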
2.1 Revised Metadata
2.2 Summary Statistics for '2.0 Original Data'
As shown in Figure 2 above, although there is a total of 3,014 rows present, a significant number of rows had missing fields. In particular, four columns had a strikingly high number of blanks: ‘Completed Time’, ‘Duration’, ‘# of Bikes’ and ‘Arrange To’. Of these, the ‘# of Bikes’ column had the highest percentage of blanks, at 99.6%. The reasons for and consequences of such blanks are discussed further in Section 3.0 below.
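A blank-percentage summary of this kind can be computed per column as in the sketch below. It is a minimal illustration under stated assumptions: each row is a dictionary keyed by column name, a field counts as blank when it is missing or contains only whitespace, and the function name `blank_percentages` and the sample rows are hypothetical.

```python
def blank_percentages(rows, columns):
    """Return {column: percentage of rows in which that column is blank}."""
    total = len(rows)
    result = {}
    for col in columns:
        # A field is blank if it is absent, None, or whitespace only.
        blanks = sum(1 for row in rows if not (row.get(col) or "").strip())
        result[col] = round(100.0 * blanks / total, 1) if total else 0.0
    return result


# Hypothetical sample rows, not taken from the real data set.
sample_rows = [
    {"Completed Time": "10:05", "# of Bikes": ""},
    {"Completed Time": "", "# of Bikes": ""},
    {"Completed Time": "11:30", "# of Bikes": "3"},
    {"Completed Time": "12:00", "# of Bikes": ""},
]

percentages = blank_percentages(sample_rows, ["Completed Time", "# of Bikes"])
```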
3.0 Data Quality Issues & Consequences
This section describes in detail the data quality issues faced by the team, as well as the consequences and limitations resulting from them. In sum, these issues fall into five broad categories: Addresses, No. of Bikes, Authority, Status and General/Miscellaneous.
3.1 Original Address / Location
Although the ‘Original Address’ column had zero blanks, it was plagued with numerous data quality issues which severely hindered our data cleaning process.
3.1.1 Vague Descriptions
Many entries contained extremely vague descriptions of locations in Singapore. Such descriptions make it impossible to plot the illegal parking cases accurately on a map, because Google API inevitably returns pre-pinned longitude and latitude coordinates for them. In other words, Google API returns exactly the same geographical coordinates even if the actual locations of the bicycles were different. Consequently, when these coordinates are plotted on a map, the points overlap and combine into one distinct point, which is visually misleading to the user.
Figure 3: Visual Representation of East Coast Park Singapore
For instance, ‘East Coast Park’ is a beach park that stretches from the Marina East to the Bedok planning areas in Singapore and covers 185 hectares of land. Tickets could therefore have been issued anywhere in this area between Marina East Drive and Water Venture Coast (see Figure 3 above for a visual representation). However, feeding the location ‘East Coast Park Singapore’ to Google API generates only a single pair of coordinates: longitude 103.9121866 and latitude 1.3007842. Vague descriptions therefore carry an inherently high margin of error, which restricts our ability to derive meaningful insights from analysis. Other vague descriptions include, but are not limited to, ‘Bedok Reservoir’, ‘Coney Island’, ‘Bishan-Ang Mo Kio Park’, and ‘Whole stretch of Admiralty Street’. Further, from an operational perspective, the more specific the location, the easier and faster it will be for oBike to track down the bicycles for redeployment.
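One way to spot tickets that collapse onto a single map point is to group the geocoded records by their coordinates and flag any coordinate shared by more than one ticket. The sketch below is an assumption-laden illustration rather than the team's method: the function name `find_collapsed_points`, the ticket IDs and the third coordinate pair are hypothetical, and the geocoding itself is assumed to have been done already.

```python
from collections import defaultdict


def find_collapsed_points(records):
    """Group (ticket_id, latitude, longitude) records by coordinate.

    Returns only the coordinates shared by two or more tickets.
    """
    groups = defaultdict(list)
    for ticket_id, lat, lon in records:
        # Round so that tiny floating-point differences do not split a group.
        groups[(round(lat, 6), round(lon, 6))].append(ticket_id)
    return {coord: ids for coord, ids in groups.items() if len(ids) > 1}


# Two hypothetical tickets both described as 'East Coast Park Singapore'
# receive the same coordinates; the third ticket and its coordinates are
# made up purely for illustration.
geocoded = [
    ("T001", 1.3007842, 103.9121866),
    ("T002", 1.3007842, 103.9121866),
    ("T003", 1.3500000, 103.8500000),
]

collapsed = find_collapsed_points(geocoded)
```

Any coordinate that appears in `collapsed` is a candidate for a vague description, although identical coordinates can also arise from genuinely repeated offences at one precise address.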
3.1.2 Overly Specific Descriptions
3.1.3 Multiple Locations
3.1.4 Junctions
3.1.5 Spelling Errors & Acronyms
3.2 Number of Bikes
3.3 Authority
3.4 Status
3.5 Codes
4.0 Data Cleaning & Preparation
5.0 Exploratory Data Analysis and Interim Findings
6.0 Going Forward
7.0 Conclusion