ANLY482 AY2017-18T2 Group08 : Project Overview / Methodology

From Analytics Practicum
Revision as of 16:35, 14 January 2018 by Desireeseah.2014 (talk | contribs)


6.0 Methodology

6.1 Data Collection

Data will be provided to us directly by the client. oBike extracts the relevant data from its database and sends it to us by e-mail as a zip archive. For privacy reasons, the archive is password-protected. The first half of the password is sent to us by e-mail, whereas the second half is given to us in a phone call from our sponsor. Certain data files may also be encrypted and will require an encryption key to unlock.

Data concerning illegal parking will be given to us early in the project, to enable us to start working on objectives 1 and 2. Only after these objectives are achieved will the client provide us with more granular data on specific bicycle routes. This data will then be used to tackle objective 3.

In addition, our team is aware that the data set consists only of reported cases. In other words, it does not fully represent the actual occurrence of illegal parking. Hence, some on-ground observation is required to supplement the current data.


6.2 Data Preparation & Cleaning

Prior to data analysis, data preparation and cleaning are necessary: we need to handle missing values, deal with outliers and standardise the format of the data for analysis.

In the dataset given, some fields have missing values, such as the case ID, the number of bikes per ticket issued and the duration between ticket issuance and bike collection. We will have to determine whether these missing values can be predicted from the other information provided. If they cannot, we will have to omit such data from our analysis.
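The impute-or-omit decision described above can be sketched as follows. This is only an illustration in Python, not the team's actual procedure: the field names (`area`, `bikes`), the helper `impute_or_drop` and the choice of a per-group median as the "prediction" are all assumptions.

```python
from statistics import median

def impute_or_drop(records, field, group_key):
    """Fill a missing `field` with the median of records sharing the same
    `group_key`; drop the record if that group has no observed value."""
    observed = {}
    for r in records:
        if r.get(field) is not None:
            observed.setdefault(r[group_key], []).append(r[field])

    cleaned = []
    for r in records:
        if r.get(field) is not None:
            cleaned.append(dict(r))  # value present: keep unchanged
        elif r[group_key] in observed:
            # value missing but predictable from other records in the group
            cleaned.append({**r, field: median(observed[r[group_key]])})
        # otherwise there is no basis for a prediction, so the record is omitted
    return cleaned

# Hypothetical tickets: area "A" has observed values, area "B" has none.
tickets = [
    {"area": "A", "bikes": 2},
    {"area": "A", "bikes": 4},
    {"area": "A", "bikes": None},
    {"area": "B", "bikes": None},
]
cleaned = impute_or_drop(tickets, "bikes", "area")
```

In this toy input the third ticket is filled from the other area "A" tickets, while the area "B" ticket is omitted because nothing is available to predict from.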

Following this, we will check for the presence of outliers. We will then examine the significance of any outliers by running our analysis twice: first with the outliers, and then without them. The two sets of results will be compared to understand how they differ. Depending on the results and the number of outliers, we will then decide whether data transformation is necessary to control them.
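A minimal sketch of this with-and-without comparison is given below. It assumes the common 1.5 × IQR rule for flagging outliers and uses the mean as a stand-in for "our analysis"; neither choice is specified in the methodology, and the sample durations are invented.

```python
from statistics import mean, quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences using the k * IQR rule (an assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def compare_with_without_outliers(values):
    """Run the same summary twice: on all values, and with outliers removed."""
    low, high = iqr_bounds(values)
    kept = [v for v in values if low <= v <= high]
    return {
        "mean_all": mean(values),
        "mean_trimmed": mean(kept),
        "n_outliers": len(values) - len(kept),
    }

# Hypothetical collection durations in hours; 48 is an extreme value.
durations = [2, 3, 3, 4, 4, 5, 5, 6, 48]
result = compare_with_without_outliers(durations)
```

A large gap between `mean_all` and `mean_trimmed` caused by only a few points would suggest that a transformation or separate treatment of the outliers is worth considering.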

As an additional step of the data preparation process, the 'location' field has to be standardised. Currently, the location field is a free-text description of where the illegally parked bicycles were found. These descriptions range from detailed addresses with postal codes to simple and vague descriptions of landmarks and areas. As such, this field needs to be standardised before we can perform subsequent analysis. Please refer to section '6.4 Descriptive Analytics – Geospatial Analysis' below for a more in-depth discussion of this process.


6.3 Exploratory Data Analysis (EDA)

To begin, EDA will be performed on the given data set. Because it provides a summarised view of the data, EDA helps us to better understand the data provided, and it can yield meaningful insights beyond those from the formal modelling or hypothesis testing stage. EDA may also help us to formulate hypotheses. For instance, we may be able to hypothesise which areas are more prone to illegal parking, or on which days of the week the authorities are most active. The EDA will include an exploration of variables such as time, location and date. Key findings from this exploratory phase will then be presented in the form of a dashboard for visualisation purposes. Tools used in this stage may include Microsoft Excel, Tableau and SAS.


6.4 Descriptive Analytics – Geospatial Analysis

Given that the ‘location’ field of the data currently comprises lengthy text descriptions that differ in style from one another, the first step will be to standardise the data. To do so, we will allocate area codes, remove extraneous text and re-type locations into descriptions that the Google API will be able to understand. For instance, 'ECP' will be changed to 'East Coast Park Singapore'. In addition, we will attempt to use LTA’s DataMall to obtain specific codes for locations. This standardisation is necessary as we will be using the Google API to geocode the addresses into latitude and longitude coordinates to be presented on a map. RStudio will be used to automate the process of generating coordinates. Descriptions that are too vague (e.g. 'East Coast Park') may be omitted.
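The text clean-up step that precedes geocoding might look like the sketch below. It is written in Python purely for illustration (the project itself plans to use RStudio), and it deliberately stops short of calling any geocoding service. The abbreviation table and the function `standardise_location` are hypothetical; only the 'ECP' mapping comes from the example above.

```python
import re

# Hypothetical abbreviation table; only 'ECP' is taken from the text above.
ABBREVIATIONS = {"ECP": "East Coast Park"}

def standardise_location(raw):
    """Collapse whitespace, expand known abbreviations and append the
    country name so that a geocoder has enough context.
    Returns None for empty descriptions."""
    text = re.sub(r"\s+", " ", raw).strip()
    if not text:
        return None
    words = [ABBREVIATIONS.get(word.upper(), word) for word in text.split(" ")]
    text = " ".join(words)
    if "singapore" not in text.lower():
        text += " Singapore"
    return text

example = standardise_location("  ecp ")
address = standardise_location("Blk 123   Bedok North Singapore 460123")
```

Here `example` becomes 'East Coast Park Singapore', matching the case described above, while an address that already names Singapore only has its spacing normalised.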

Once these coordinates have been obtained, we can perform geospatial analysis to determine which areas in Singapore have the greatest number of illegal bicycle parking cases. The coordinates can also be plotted in Tableau, which may help us to spot any trends present. This helps us to achieve objective 1 and is also important for fulfilling the remaining objectives. Subsequent analysis will only take into account areas that we have determined to be ‘hotspots’ for illegal parking, i.e. where illegal parking is most prevalent.
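One simple way to turn geocoded points into candidate hotspots is to bin them into a grid and count cases per cell, as sketched below. The grid approach, the 0.01-degree cell size (roughly 1.1 km at Singapore's latitude) and the sample coordinates are all assumptions made for illustration; the methodology does not prescribe how hotspots are to be delimited.

```python
import math
from collections import Counter

def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def top_hotspots(points, n=3, cell_deg=0.01):
    """Count cases per grid cell and return the n busiest cells."""
    counts = Counter(grid_cell(lat, lon, cell_deg) for lat, lon in points)
    return counts.most_common(n)

# Illustrative coordinates only: three cases close together, one elsewhere.
points = [
    (1.3012, 103.9123),
    (1.3018, 103.9127),
    (1.3015, 103.9121),
    (1.3521, 103.8198),
]
hotspots = top_hotspots(points, n=1)
```

The cell with three cases is returned as the single top hotspot; in practice the cell size would need tuning against the precision of the geocoded locations.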


6.5 Predictive Analytics – Time Series Analysis & Forecasting

After identifying the areas with a higher risk of illegal parking, we will perform time series analysis on these specific areas. The objective here is to discover a pattern in the historical data and extrapolate that pattern into the future. This analysis is based on the maxim that history tends to repeat itself. Consequently, we will attempt to predict the number of illegal parking cases based on the day of the week and the time of day.

To do so, the first step of time series analysis is to construct two univariate time series plots, each a graphical representation of the relationship between the following:
(a) Day of Week and Number of illegal parking cases
(b) Time of Day and Number of illegal parking cases
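The counts behind plots (a) and (b) can be derived from ticket timestamps as in the sketch below. It assumes each ticket carries an issuance timestamp and that time of day is grouped by hour; both are assumptions, and the sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def cases_by_weekday_and_hour(timestamps):
    """Return two Counters: cases per day of week and cases per hour of day."""
    by_day = Counter(DAY_NAMES[ts.weekday()] for ts in timestamps)
    by_hour = Counter(ts.hour for ts in timestamps)
    return by_day, by_hour

# Hypothetical ticket issuance times.
stamps = [
    datetime(2017, 11, 6, 9, 15),   # a Monday morning
    datetime(2017, 11, 6, 18, 40),  # the same Monday, evening
    datetime(2017, 11, 7, 9, 5),    # the following Tuesday morning
]
by_day, by_hour = cases_by_weekday_and_hour(stamps)
```

Each Counter can then be plotted directly, with the day or hour on the horizontal axis and the number of cases on the vertical axis.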

Other statistical measures to be reported include the mean, median, maximum, minimum and standard deviation of the number of illegal parking cases. Time series patterns that may be present in our data include horizontal patterns, trends, seasonality or cycles. We will also be able to determine whether the series is stationary or non-stationary.

Depending on the above, our forecasting techniques will differ. Forecasting techniques may include moving averages, weighted moving averages and exponential smoothing. The specific technique to be used will only be known after the time series analysis has been conducted.
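For reference, the three candidate techniques can each be written in a few lines. The sketch below is illustrative only: the window, weights, smoothing constant `alpha` and the sample series are arbitrary assumptions, and the exponential smoothing variant shown initialises the first forecast to the first observation.

```python
def moving_average(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    return sum(series[-window:]) / window

def weighted_moving_average(series, weights):
    """Forecast using weights ordered from oldest to newest observation."""
    recent = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, recent)) / sum(weights)

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: F[t+1] = alpha * A[t] + (1 - alpha) * F[t],
    starting from the first observation. Returns the next-period forecast."""
    forecast = series[0]
    for actual in series[1:]:
        forecast = alpha * actual + (1 - alpha) * forecast
    return forecast

# Hypothetical weekly case counts for one hotspot.
weekly_cases = [10, 12, 14, 16]
ma = moving_average(weekly_cases, window=3)
wma = weighted_moving_average(weekly_cases, weights=[1, 2, 3])
es = exponential_smoothing(weekly_cases, alpha=0.5)
```

On this steadily rising series all three forecasts lag behind the trend, which illustrates why the choice of technique has to wait until the time series patterns are known.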


6.6 Prescriptive Analytics – Decision Making

After forecasting which areas have a high tendency for illegal bike parking in the future, we will identify specific locations where bike-sharing companies should paint additional yellow boxes so as to reduce such occurrences. This will be decided based on the tendency for LTA tickets to be issued in the location, the demand for bikes in the location, and the tendency for riders to leave bikes in the location. Other considerations include, but are not limited to, the existing designated bike parking locations available in the area. More details on possible considerations will only be known after further analysis has been done.

7.0 Limitations

oBike is a relatively new company and, as such, does not yet have extensive data collection measures in place. Consequently, we have only been provided with LTA ticket issuance data for the period of November to December 2017. Data from the last quarter of the year often differs from the rest of the year because it coincides with the holiday season. This hinders our analysis, as it becomes difficult to analyse and forecast annual trends. To help overcome this, our team will put forth a request to obtain data from May 2017 onwards, as well as for upcoming months, i.e. January/February 2018.

Nevertheless, given that the bike-sharing industry is relatively new in Singapore, it is highly volatile in nature. Ergo, older data might not be very representative of existing trends. Thus, the use of the most recent data for analysis, forecasting and prescriptive measures might be more suitable for such an industry.

Further, upon reviewing the data, we observe that the ‘# of Bikes’ column has a significant number of missing values. oBike has explained that this occurs because LTA does not always inform them of the exact number of bicycles included in each parking ticket. Given that LTA has only just begun stepping up its enforcement efforts, it is no surprise that there are still variations in its reporting formats. As such, we cannot analyse the exact number of bikes that are illegally parked. However, since one ticket is issued for one or more bikes in the same location at the same time, useful insights can still be obtained with regard to the geographical locations where illegal bike parking problems are most prevalent.