ANLY482 AY2017-18T2 Group08 : Project Findings
1.0 Project Recap
oBike, Singapore’s first home-grown stationless bicycle sharing company, began operations in January 2017. In recent months, however, Singapore’s Land Transport Authority (LTA) has issued new rules and regulations that require bicycles to be parked in designated yellow boxes around the island. LTA enforcers, together with authorities from Town Council and NParks, survey the island and issue tickets to bike-sharing companies when bicycles are found outside these yellow boxes. From the time a ticket is issued, oBike has only four hours to move its illegally parked bicycles. Failure to do so will incur hefty fines.
As such, this practicum seeks to achieve the following objectives:
(i) Identify hotspots for illegal parking cases
(ii) Project illegal parking patterns by analysing historical data
(iii) Determine suitable areas for yellow boxes to be painted
To achieve these objectives, we first had to clean the data provided and perform exploratory data analysis (EDA). This interim report documents the data cleaning process and the EDA performed thus far, and shares the key insights derived to date.
2.0 About the Data
The data file titled ‘Group08_oBike_InterimData’ contains four sheets, described as follows:
(i) 1.0 Cleaned Data Cleaned data refers to data that has already been cleaned via our data cleaning process, which will be described further in Section 4. The format for ‘1.0 Cleaned Data’ is similar to the original data given by oBike, except there are five newly inserted columns – ‘Original ID’, ‘New ID’, ‘Day’, ‘Updated Addresses’ and ‘Time Period’. This sheet will be used for analysis purposes. Please refer to Figure 1 below for the revised metadata.
(ii) 2.0 Original Data This sheet contains the original, raw data given, with the exception of the column ‘Original ID’, which was inserted for tracking purposes. There are 14 columns in total in this sheet, inclusive of ‘Original ID’. Please refer to Figure 1 below for the revised metadata.
(iii) 3.0 Appendix & Notes The purpose of this sheet is to highlight to readers the changes made to the original data set, to allow for better comprehension of the data cleaning process. It contains notes relating to data points that were duplicated or removed.
(iv) 3.1 Cross Checking This sheet is used internally for our cross checking between ‘1.0 Cleaned Data’ and ‘2.0 Original Data’ to ensure that no error occurred when duplicating the data. Using the ‘LOCATION’ column which contains all unique entries of addresses, we cross checked to ensure that all the entries in the ‘1.0 Cleaned Data’ are found in the ‘2.0 Original Data’ and vice versa.
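The two-way check described in (iv) can be sketched as a set comparison. This is only an illustration under stated assumptions, not the procedure used in the sheet itself: the function name `cross_check` and the sample addresses are hypothetical, and the real check would read the ‘LOCATION’ column from both sheets.

```python
def cross_check(cleaned_locations, original_locations):
    """Compare two collections of LOCATION values in both directions."""
    cleaned = set(cleaned_locations)
    original = set(original_locations)
    return {
        # Entries present in the original sheet but absent from the cleaned one
        "missing_from_cleaned": sorted(original - cleaned),
        # Entries present in the cleaned sheet but absent from the original one
        "missing_from_original": sorted(cleaned - original),
    }


# Hypothetical sample values, not taken from the oBike data.
cleaned_sample = ["Blk 123 Example Ave", "Example MRT Exit A"]
original_sample = ["Blk 123 Example Ave", "Example MRT Exit A", "Example Park"]

result = cross_check(cleaned_sample, original_sample)
print(result)
```

Both lists come back empty when every entry in one sheet is found in the other, which is the outcome the cross check is meant to confirm.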
2.1 Revised Metadata
2.2 Summary Statistics for '2.0 Original Data'
As shown in Figure 2 above, although there is a total of 3,014 rows, a significant number of rows have missing fields. In particular, four columns have a strikingly high number of blanks: ‘Completed Time’, ‘Duration’, ‘# of Bikes’ and ‘Arrange To’. Of these, the ‘# of Bikes’ column has the highest percentage of blanks, at 99.6%. The reasons for and consequences of these blanks are discussed further in Section 3.0 below.
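The per-column blank percentages summarised above could be computed along the following lines. This is a minimal sketch, not the method actually used for Figure 2: the helper `blank_percentages` and the sample rows are hypothetical, and it assumes a cell counts as blank when it is empty, whitespace only, or missing.

```python
def blank_percentages(rows, columns):
    """Return the percentage of blank cells for each named column."""
    total = len(rows)
    result = {}
    for col in columns:
        blanks = sum(
            1
            for row in rows
            # Assumption: None, empty and whitespace-only values are blanks
            if row.get(col) is None or str(row.get(col)).strip() == ""
        )
        result[col] = round(100 * blanks / total, 1) if total else 0.0
    return result


# Hypothetical sample rows, not taken from the oBike data.
sample_rows = [
    {"Completed Time": "10:05", "# of Bikes": ""},
    {"Completed Time": "", "# of Bikes": ""},
    {"Completed Time": "11:30", "# of Bikes": "3"},
    {"Completed Time": "12:00", "# of Bikes": None},
]

pct = blank_percentages(sample_rows, ["Completed Time", "# of Bikes"])
print(pct)
```

Applied to the full sheet, the same calculation would be run over all 14 columns, with the rows loaded from the original data rather than typed in by hand.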
3.0 Data Quality Issues & Consequences
3.1 Original Address / Location
3.1.1 Vague Descriptions
3.1.2 Overly Specific Descriptions
3.1.3 Multiple Locations
3.1.4 Junctions
3.1.5 Spelling Errors & Acronyms
3.2 Number of Bikes
3.3 Authority
3.4 Status
3.5 Codes
4.0 Data Cleaning & Preparation
5.0 Exploratory Data Analysis and Interim Findings
6.0 Going Forward
7.0 Conclusion