ANLY482 AY2017-18T2 Group31 Project Data
Data Quality Analysis
Data quality analysis highlights the following issues present in the dataset provided:
Inconsistent Reporting
We observed inconsistencies in how records were reported. Some records are repeated: they share the same location, date and reported time, and differ only in the CODE column. We also identified a second reporting behaviour, in which a single record represents more than one bike found at a location. Inconsistent reporting limits the analysis because a record that covers several bikes should arguably carry more weight, yet it is counted only once. In our analysis, all records are weighted equally.
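The repeated records described above could be flagged with a short script. The following is a minimal sketch, not the method used in the project: only the CODE column name comes from the dataset description, while the LOCATION, DATE and TIME field names and the sample rows are assumptions made up for illustration.

```python
from collections import defaultdict


def group_repeated_reports(records):
    """Return groups of records that share location, date and time.

    Each key is (location, date, time); each value is the list of CODE
    values reported under that key. Only keys seen more than once are kept.
    """
    groups = defaultdict(list)
    for rec in records:
        # Normalise the location so trivial case/spacing differences do not hide repeats.
        key = (rec["LOCATION"].strip().upper(), rec["DATE"], rec["TIME"])
        groups[key].append(rec["CODE"])
    return {key: codes for key, codes in groups.items() if len(codes) > 1}


# Illustrative rows only; these are not taken from the sponsor's data.
sample = [
    {"LOCATION": "Blk 123 Example Ave", "DATE": "2017-12-01", "TIME": "0930", "CODE": "A1"},
    {"LOCATION": "Blk 123 Example Ave", "DATE": "2017-12-01", "TIME": "0930", "CODE": "A2"},
    {"LOCATION": "Blk 456 Example St", "DATE": "2017-12-01", "TIME": "1015", "CODE": "B7"},
]
repeated = group_repeated_reports(sample)
```

Grouping rather than dropping the repeats keeps the CODE values available, so a later step can decide whether each group is one sighting or several bikes.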
Ambiguous and Meaningless data
A significant number of the records provided did not add value to our analysis at this stage. Such ambiguous records account for 5.3% of the total data.
If data continue to be recorded in this way as more are collected, such records will not be meaningful to our sponsor for further analysis, nor will they aid the removal and clearance of indiscriminately parked bikes.
Too little data
In total, we have only 3014 rows of data covering a period of less than 3 months. The dates range only from 18 November 2017 to 2 January 2018, so December 2017 is the only month with a full month of data. This prevents us from comparing data between months. In addition, the analysis might be skewed because most of the data points come from December.
Short forms
Certain addresses are given in short forms, which made them hard for us to interpret and hard for GeoCode to read. Although this is not a major issue, it creates confusion for those of us who are not familiar with these short forms.
Lack of unique identifier
Unsuitable Time Format
Our sponsor records time in the format hhmm using a general numeric data type. In addition, for bikes that were relocated after the day of the report (e.g. reported on 24 Nov 17 but relocated on 25 Nov 17), our sponsor uses the general format hhmm(dd/mm). These two factors make it hard for us to make better use of the data.
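Both time formats can be normalised into a proper timestamp before analysis. The sketch below is one possible approach under stated assumptions: the function name parse_sponsor_time is hypothetical, numeric values are assumed to have lost their leading zeros (930 meaning 09:30), and because the (dd/mm) suffix carries no year, a date that falls before the report date is assumed to belong to the following year.

```python
import re
from datetime import date, datetime

# Matches "hhmm" or "hhmm(dd/mm)", allowing 1-4 digits for the time part.
_TIME_RE = re.compile(r"^\s*(\d{1,4})\s*(?:\(\s*(\d{1,2})/(\d{1,2})\s*\))?\s*$")


def parse_sponsor_time(raw, report_date):
    """Convert an 'hhmm' or 'hhmm(dd/mm)' value into a datetime.

    raw may be an int (e.g. 930) or a string (e.g. "0015(25/11)").
    report_date supplies the date when no (dd/mm) suffix is present,
    and the year when one is.
    """
    match = _TIME_RE.match(str(raw))
    if match is None:
        raise ValueError(f"unrecognised time value: {raw!r}")
    hhmm, day, month = match.groups()
    hhmm = hhmm.zfill(4)  # restore leading zeros dropped by numeric storage
    hour, minute = int(hhmm[:2]), int(hhmm[2:])
    if day is None:
        when = report_date
    else:
        when = date(report_date.year, int(month), int(day))
        if when < report_date:
            # Assumption: the suffix refers to a later day, so roll into the next year.
            when = date(report_date.year + 1, int(month), int(day))
    return datetime(when.year, when.month, when.day, hour, minute)


same_day = parse_sponsor_time(930, date(2017, 11, 24))
next_day = parse_sponsor_time("0015(25/11)", date(2017, 11, 24))
```

Invalid clock values such as 2560 raise a ValueError from the datetime constructor, which makes malformed entries easy to collect for manual review.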
Data Preparation
The following data preparation steps were carried out:
Location - Added Category, Street, Other details
We categorised the data into groups based on location. The categories are as follows:
HDB, BUS STOP, MRT STATION, LANDED, PARKS & RESERVOIR, CONDO, COMMERCIAL, MALL, HOSPITAL, SCHOOL, COMMUNITY CENTRE
We also created two new features, STREET and OTHER DETAILS, to hold the street name and a description of the location. We extracted this information from the raw location given and recorded it in the new features accordingly.
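A first pass at the categorisation could be automated with keyword rules, leaving unmatched rows for manual labelling. The sketch below is an illustration only: the category names come from the list above, but the keywords, their order and the categorise function are assumptions, and the project's actual categorisation may have been done differently.

```python
import re

# (category, keywords) pairs, checked in order; the first match wins.
# Keywords are hypothetical and would need tuning against the real addresses.
KEYWORD_RULES = [
    ("BUS STOP", ("BUS STOP",)),
    ("MRT STATION", ("MRT",)),
    ("HOSPITAL", ("HOSPITAL",)),
    ("SCHOOL", ("SCHOOL",)),
    ("COMMUNITY CENTRE", ("COMMUNITY CENTRE",)),
    ("MALL", ("MALL",)),
    ("PARKS & RESERVOIR", ("PARK", "PARKS", "RESERVOIR")),
    ("CONDO", ("CONDO", "CONDOMINIUM")),
    ("HDB", ("BLK", "BLOCK")),
]


def categorise(location):
    """Return the first category whose keyword appears as a whole word.

    Returns None when nothing matches, so that the row can be labelled by
    hand (LANDED and COMMERCIAL have no obvious keyword and always end up here).
    """
    text = location.upper()
    for category, keywords in KEYWORD_RULES:
        if any(re.search(rf"\b{re.escape(word)}\b", text) for word in keywords):
            return category
    return None


labels = [categorise(loc) for loc in ("Bus stop outside Blk 123", "Blk 123 Example Ave")]
```

Rule order matters: "Bus stop outside Blk 123" is labelled BUS STOP because that rule is checked before the HDB rule. Whole-word matching still has blind spots (for example, "car park" would match the PARK keyword), so the output should be spot-checked.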
Location - Excel GeoCode
We use the World Geodetic System (WGS84), the reference coordinate system used by the Global Positioning System (GPS), for latitude and longitude coordinates. Using Excel VBA and the Google Maps Geocoding API, we convert each street address into a latitude and longitude that can be plotted on a map.
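The project performed this step in Excel VBA; the same request and response handling is sketched below in Python for illustration. It only builds the request URL and reads the coordinates out of a JSON response, without making a network call. The helper names, the choice of region="sg" and the stub response (including its coordinates) are assumptions, and "YOUR_API_KEY" is a placeholder.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key, region="sg"):
    """Build a Geocoding API request URL for a street address."""
    params = {"address": address, "region": region, "key": api_key}
    return f"{GEOCODE_ENDPOINT}?{urlencode(params)}"


def extract_lat_lng(response_text):
    """Return (lat, lng) from a Geocoding API JSON response, or None if no result."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


url = build_geocode_url("Blk 123 Example Ave", "YOUR_API_KEY")

# Stub response with made-up coordinates, shaped like a successful reply.
stub = '{"status": "OK", "results": [{"geometry": {"location": {"lat": 1.3, "lng": 103.8}}}]}'
coords = extract_lat_lng(stub)
```

Returning None for unmatched addresses, rather than raising, makes it straightforward to list the addresses (such as those written in short forms) that need to be corrected and retried.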
Day of the week
We used JMP to create a new column containing a number from 1 to 7, each representing a day of the week, with 1 being Sunday and 7 being Saturday. The values were then recoded to words to make them easier to understand.
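For readers without JMP, the same numbering and recoding can be reproduced in a few lines of Python. This is an equivalent sketch rather than the JMP formula itself; the function and dictionary names are hypothetical.

```python
from datetime import date

# Recoding table: 1 = Sunday ... 7 = Saturday, as described above.
DAY_NAMES = {
    1: "Sunday",
    2: "Monday",
    3: "Tuesday",
    4: "Wednesday",
    5: "Thursday",
    6: "Friday",
    7: "Saturday",
}


def day_of_week_number(d):
    """Return 1 for Sunday through 7 for Saturday.

    date.isoweekday() uses Monday=1 ... Sunday=7, so the value is shifted.
    """
    return d.isoweekday() % 7 + 1


def day_of_week_name(d):
    """Return the day name corresponding to day_of_week_number."""
    return DAY_NAMES[day_of_week_number(d)]


first_day = day_of_week_name(date(2017, 11, 18))  # first date in the dataset
```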