Atom FinalWiki

From Analytics Practicum
Revision as of 13:06, 16 April 2016 by Sh.yan.2012 (talk | contribs)














In Singapore, roads already take up 12% of the total land area. With limited land available, Singapore cannot afford to exhaust its land by building more roads to accommodate vehicles and further expand the road network.

Parking space is simply the provision for the storage of vehicles. Car parks can be provided in a variety of land uses, ranging from residential areas to shopping centers. Car parks can also have a serious impact on aesthetics, whether they are on-street or in multi-storey, above-ground or underground structures. They consume both land and resources that might be put to better use elsewhere, for instance in building another development or private homes.

A strategic approach to parking would connect the separate decisions on parking provision at individual sites with the achievement of wider planning goals, for instance saving land for other uses. Poor car park planning results in traffic jams, poor traffic management and overspill into the surrounding areas. This is avoidable only if an appropriate planning process is in place: it helps to determine future parking arrangements and so prevents unnecessary headaches for drivers. The main concern in planning parking activity is the way land and natural environments are conserved, valued, developed or organized using geographical understanding.

Data mining is the computational process of discovering patterns in large datasets, also known as “big data”. For our project, the data collected are in time-series format. Time-series data is considered multidimensional, as there is one observation per time unit and each time unit represents a dimension.

Parking utilization provides a time series of typical parking demand for a development on a given survey day. Thus, by comparing parking utilization comprehensively, the study will be able to identify patterns and trends among car parks with high and low usage.

Hence, this paper seeks to explore the use of time-series data mining techniques to discover patterns and trends among similar car park sites within 29 retail shopping malls.


Parking requirements are the exclusive domain of local government and are subject to its concerns. At a minimum, parking requirements involve four important elements: (1) the land use served by the parking, (2) the car park ratio relative to the size of the development, (3) the demand for and supply of car park lots, and (4) the surrounding car parks, which also influence the demand (Marsden, 2006).

Cities created off-street parking to ensure that new developments have sufficient space and ample parking (Barter, 2010). A lack of parking generates traffic congestion and causes cars to park in, and spill over to, the surrounding areas. Therefore, car park planners and public officials must be able to estimate accurately the number of parking lots required at an amenity in order to eliminate these parking issues (Barter, 2010).

In Singapore, roads already take up 12% of the total land area. With limited land available, Singapore cannot afford to exhaust its land by building more roads to accommodate vehicles and further expand the road network.

With the increase in Singapore's population in recent years, scarce land must be allocated wisely. As Singapore continues to grow as a city, there is a need to increase the supply of housing, industrial and office estates. Therefore, it is not realistic for every Singaporean household to own a car (LTA, 2012).

Having said that, a car is not a basic necessity in Singapore, since public transportation is well developed and easily accessible. However, Singaporeans seem to think otherwise, as the proportion of households in Singapore that own a car increased from 40% in 2008 to 45% in 2013. In order to curb car ownership and to keep the roads smooth-flowing and congestion-free, the authority affirmed that it would continue to emphasize vehicle ownership and usage restraint measures.

Since 1990, the Certificate of Entitlement (COE) system has enabled Singapore to exercise effective control of vehicle population growth. As Singapore becomes more urbanized, the social cost of car ownership will also increase. This is because land has to be set aside for parking spaces not only where we reside, but also at the places where we work, study and play. Allocating more land for car parks means that there is less land for other developments, such as housing, schools or healthcare facilities. On top of that, illegal parking and congestion in local neighborhoods may also become more prevalent.

With these considerations in mind, the authority would like to understand the car park occupancy situation in Singapore. It has therefore engaged a consultancy firm to find out more through site surveys and observation at these car parks. The information collected is transformed into knowledge with the help of the team, which assists the consultancy firm in producing detailed reports, info-graphics and consolidated data that summarize the findings for each car park site. This information is crucial to the authority, as it will help it to plan ahead and handle car park issues in Singapore.

Apart from assisting the consultancy firm in reporting the situation at each individual car park site, the team will explore and demonstrate the effective use of time-series data mining in analyzing complex data. This research study will help the authority to discover new insights into clusters of shopping malls that are grouped together by similar characteristics in their car park occupancy.

Car Park Sites

The consultancy firm had completed the data collection process and compiled the results. Its primary focus is to analyze and report the findings for the 65 car park sites. Additionally, it had created clusters by grouping nearby car park sites together. For instance, the car parks of Punggol Plaza and Punggol 21 CC are grouped together, as they are located next to each other.

The allocation of the reports required for all of these car park sites is shown below:


As of the initial meeting on 30th December 2015, the consultancy firm had completed and submitted 10 reports to the authority. Hence, 45 outstanding reports, info-graphics and Excel files still need to be worked on. The consultancy firm’s submission deadline for the reports is 31st January 2016. Therefore, the team’s initial job scope is to assist the consultancy firm’s team in meeting its submission deadline.

After the completion of phase 1, the team will further process the data collected to come up with new insights. Additional initiatives will include a comparison of the different car park sites, as well as a national average representation. This will allow the business owner to better understand the situation nationwide rather than looking at each car park site independently. Additionally, the team will conduct a focus study using time-series clustering on shopping mall car park sites. The team would like to explore and demonstrate the use of time-series analysis to group shopping mall car park sites with similar characteristics together based on their car park occupancy. Lastly, the team will share the findings and evaluate the accuracy of the analysis by linking it back to the real world.

Project Objective And Business Problem

The objective of this project is to assist a consultancy firm in understanding the current parking situation at 65 different locations in Singapore. These 65 parking locations comprise 30 retail malls, 15 retail and Food & Beverage (F&B) clusters in landed housing estates, 10 hawker centers, and 10 community clubs.

The study was conducted previously by the consultancy firm through parking occupancy surveys, human traffic counts, and interview surveys at selected locations at stipulated times. The collected data will then be further processed before being submitted to the authority to help it understand the current parking situation at these locations. The team will split the project into two phases.

For phase 1, the team will assist the consultancy firm in analyzing the situation at each car park site. The results collected for each parking site are tabulated in a single Microsoft Excel spreadsheet file, organized by survey type. The Excel spreadsheet is then used to generate charts and graphs for the info-graphics. Finally, a final report will share the findings for each parking site. It includes a write-up of the characteristics and methodology of the entire process, all location maps and captured images (if any). The final report will be structured as per the format shown below:

1. Executive Summary
2. Site Background
3. Site Characteristics
4. Site Assessment
5. Survey Deployment Plan
6. Survey Findings
7. Conclusion
8. Appendices
 8.1 Site Map of the parking locations
 8.2 Car park characteristics
 8.3 Pre Survey Observations & Results
 8.4 Info-graphic to summarize the results collected
 8.5 Survey Questionnaire Template


Parking Policy

One of the most important links between land use and transport is parking policy. The effectiveness of parking policies is often compromised by the perceived tension among three of the objectives that parking supports: regeneration, restraint and revenue. In particular, the belief that parking restraint measures could damage the attractiveness of city centers to both retail and commercial enterprises limits the political acceptability of pricing policies and planning (Strubbs, 2012).

Parking space is simply the provision for the storage of vehicles (Dolnick, 1999). Car parks can be provided in a variety of land uses, ranging from residential areas to shopping centers. Car parks can also have a serious impact on aesthetics, whether they are on-street or in multi-storey, above-ground or underground structures. They consume both land and resources that might be put to better use elsewhere, for instance in building another development or private homes.

A strategic approach to parking would connect the separate decisions on parking provision at individual sites with the achievement of wider planning goals, for instance saving land for other uses (March, 2007). Poor car park planning results in traffic jams, poor traffic management and overspill into the surrounding areas. This is avoidable only if an appropriate planning process is in place: it helps to determine future parking arrangements and so prevents unnecessary headaches for drivers. The main concern in planning parking activity is the way land and natural environments are conserved, valued, developed or organized using geographical understanding (Aldridge et al., 2006).

In order to achieve desirable land-use arrangements for car parks, planning is essential and must be established through reiterated rules, goals, standards, designs and decision systems. In this sense, we need to examine the existing understanding of parking issues as a first step to reconsidering how collective action might be taken on the basis of this knowledge. Usually, this information takes the form of spatio-temporal characteristics; hence, data mining techniques are applied to explore car park overspill patterns.

Time-Series Data mining

Data mining is the computational process of discovering patterns in large datasets, also known as “big data”. Conventional data mining is also known as Knowledge Discovery in Databases (KDD). The objective of the KDD process is to extract information and transform it into knowledge, an understandable structure for future use by business users (Frawley et al., 1992). There are three main types of data mining techniques: association rules, classification and statistical data mining. Association rules are used to discover relations between variables in a large dataset. Classification is a data mining function that generates a set of rules for classifying instances into predefined classes. Lastly, statistical data mining is driven by the data to discover new patterns and build predictive models. Although these conventional data mining techniques are widely used in many industries, they are not appropriate for mining time-series data. Hence, another set of techniques, time-series data mining, was developed to cater for time-series data (Fuller, 1995).

Time-series data mining has four major tasks: clustering, indexing, classification and segmentation (Harvey, 1994). Firstly, clustering finds time series with similar patterns and groups them together. Next, indexing finds the time series that are most similar to a given query series. Thirdly, classification assigns each time series to a known category using a previously trained model. Lastly, segmentation partitions a time series into separate segments. Time-series data is considered multidimensional, as there is one observation per time unit and each time unit represents a dimension. In the real world, time-series data is usually highly dimensional (Lee et al., 2014). For instance, in a stock market setting, prices that change over time can be recorded every second. In other words, the data accumulate to 3,600 records an hour and 86,400 records a day.

Through the routine collection of data, organizations are amassing sequentially ordered data. The observations and records in such datasets have a time element. Accordingly, this information is collected over a period of time (e.g. over a day, a week or even a decade). Examples of such data include sales transactions, delivery orders and traffic information. Over the years, businesses and organizations have increasingly come to realize the importance and value of these data. They seek to analyze time-series data to discover business insights that help them improve and grow.

When presenting the data, the data analyst has to put in extra effort to transform the high-dimensional data from time-stamped transactions into a table suitable for time-series applications. This helps applications to recognize the data as time-series data and to perform further analysis and pattern detection on them. The data analyst has to ensure that the data, which was previously in univariate or multivariate form, is transformed into a set of contiguous time instances.

One common pitfall in analyzing time-series data is that time-stamped data is irregularly recorded, which can result in two different time series being identified as having common trends (Esling & Agon, 2012). When two time series did not occur concurrently, time-series data mining techniques cannot discern the relationship between them, as time is no longer a factor in the comparison.
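As a rough illustration of this transformation, the sketch below buckets time-stamped events into contiguous, equally spaced intervals and keeps empty intervals as zero counts. It is a minimal sketch only: the function name `to_regular_series`, the 15-minute step and the sample timestamps are assumptions made for illustration and are not taken from the project data or tools.

```python
import math
from datetime import datetime, timedelta

def to_regular_series(events, start, end, step_minutes=15):
    """Count time-stamped events per fixed-width interval in [start, end).

    Intervals without any event are kept with a count of 0, so the result
    is a contiguous, equally spaced time series.
    """
    step = timedelta(minutes=step_minutes)
    n_bins = math.ceil((end - start) / step)
    counts = [0] * n_bins
    for ts in events:
        if start <= ts < end:
            counts[int((ts - start) / step)] += 1
    return [(start + i * step, counts[i]) for i in range(n_bins)]

# Hypothetical time-stamped vehicle entries (illustrative values only)
start = datetime(2015, 5, 2, 10, 0)
end = datetime(2015, 5, 2, 11, 0)
events = [
    datetime(2015, 5, 2, 10, 3),
    datetime(2015, 5, 2, 10, 7),
    datetime(2015, 5, 2, 10, 31),
]

series = to_regular_series(events, start, end)
for slot, count in series:
    print(slot.strftime("%H:%M"), count)
```

Keeping the empty 10:15 and 10:45 slots in the output is the point of the exercise: every series then has the same time axis, so later comparisons line up slot by slot.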


For example, in the two figures shown above, Fig 1a represents the traditional data mining similarity measure using Euclidean distance. It is used to compare the two time series Q and C, and the relationship between them is not discerned because they are out of phase. In Fig 1b, however, the Dynamic Time Warping (DTW) technique overcomes this issue by accounting for the time factor when comparing the two time series.

The DTW algorithm helps to identify similar trends that may occur over time across multiple arrays of sequenced data. This mathematical formulation serves well as a data mining technique for algorithmically comparing different sets of time-ordered data. DTW offers a better means of identifying similar trends across sets of sequenced records and observations.

Hence, this paper seeks to explore these time-series data mining techniques to discover patterns and trends among similar car park sites within 29 retail shopping malls.
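The contrast between the two similarity measures can be sketched in a few lines of Python. This is a minimal textbook version of DTW using dynamic programming with a squared point cost, not the implementation used by any of the project's tools; the series `q` and `c` are made-up values chosen so that both have the same peak shape, shifted by one time step.

```python
import math

def euclidean_distance(q, c):
    """Point-by-point distance; both series must have the same length."""
    if len(q) != len(c):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def dtw_distance(q, c):
    """Dynamic Time Warping distance with a squared point-wise cost."""
    n, m = len(q), len(c)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning q[:i] with c[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j],      # repeat c[j-1]
                                 cost[i][j - 1],      # repeat q[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])

# Hypothetical series: identical shape, out of phase by one time step
q = [0, 0, 1, 2, 1, 0, 0]
c = [0, 1, 2, 1, 0, 0, 0]

print(euclidean_distance(q, c))  # 2.0
print(dtw_distance(q, c))        # 0.0
```

Euclidean distance penalizes the phase shift even though the shapes are identical, whereas DTW warps the time axis and reports no difference. In practice a warping window is often added to limit how far the alignment may stretch; that refinement is omitted here.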


Analysis & Visualization Tools

The team will use 4 tools for analysis and visualization: (1) Microsoft Excel, (2) SAS Enterprise Miner, (3) Microsoft SQL Server Integration Services (SSIS) and (4) JMP Pro.

Microsoft Excel is a spreadsheet application designed for calculation, charts and visual aids, and pivot tables. In this case, the team will use it to analyze the data for phase 1, generating charts and calculations for each car park site.

SAS Enterprise Miner is analytical software that helps to streamline and simplify the data mining process. It allows easy retrieval of datasets for analysis. Additionally, SAS Enterprise Miner allows users to perform descriptive, predictive and time-series analysis on large amounts of data. The software also has interactive visualization functions and an easy-to-use interface in which most tasks can be performed by drag and drop.

Microsoft SQL Server Integration Services (SSIS) is a platform for building enterprise-level data integration and data transformation solutions. Integration Services allows users to extract and transform data and load it into a database.

JMP Pro is the advanced version of JMP, created for users who need sophisticated modeling techniques. It is statistical analysis software from SAS that provides a platform for interactive data visualization, exploration, analysis and communication.

For this project, the team will use these analytical tools to gather and discover new insights into car park overspill patterns.

Reporting Tools

Apart from the analysis and visualization tools, the team will use two software packages for reporting, namely (1) Microsoft Word, a word processor, and (2) Microsoft PowerPoint, a slideshow presentation program.

Collaboration tools

Lastly, to collaborate within the team and with external stakeholders, the team is using (1) Dropbox, a file hosting service, (2) Google Drive, a cloud storage web application, and (3) SMU Wikipedia, an SMU encyclopedia web page built for collaboration.


Data Collection

The traffic planners’ objective is to review the current parking situation in 4 different types of developments at 65 different locations (30 RM, 15 F&B, 10 HC and 10 CC). Because each premise is different in nature, different methods are used to gather the count data for vehicles and patrons. Additionally, an intercept survey was carried out to gather sample data. At extremely busy access points, automated counters were deployed to assist the counting process. The entire data collection process took place between May 2015 and October 2015.

Each dataset contains the vehicle count and human count for a particular premise. For instance, in the Retail Mall setting, a handful of enumerators were deployed at the entrance and exit points of the building as well as the car park. These enumerators were deployed in pairs or trios. A pair of enumerators was in charge of counting the number of people entering and exiting the building: one counted the inbound traffic of people entering the building while the other counted the outbound traffic of people exiting it. A trio of enumerators was deployed to count the vehicles (motorcycles included). One was assigned to count the vehicles entering the premises and the passenger(s) on board; another was in charge of counting the vehicles exiting the premises and the passenger(s) on board; and the last was in charge of measuring the overspill demand by counting the vehicles queuing to enter the car park and noting the number of vehicles parked or waiting illegally along the side streets. The same data collection process was used for both the Community Centers and the Retail and F&B clusters.

For the Hawker Centers, however, only the human count differed. The trio of enumerators counting the vehicles entering and exiting followed the same process as described in the previous paragraph, but the pair of enumerators in charge of the human count patrolled the Hawker Centers instead of being stationed at the entrance or exit points. One of them was assigned to count the number of seated patrons while the other counted the number of patrons queuing at the stalls. The enumerators made their rounds every 15 minutes to count the human occupancy of the Hawker Centers.

These data were collected between 10am and 9pm. Each record covers a 15-minute block. In other words, within an hour there would be 4 records (e.g. 12pm, 12.15pm, 12.30pm and 12.45pm). Lastly, data collection at each car park site lasted two days: one weekday (non-peak day) and one weekend day (peak day).

Last but not least, a dedicated team was deployed on the ground to carry out the intercept survey, interviewing patrons and collecting survey results. All this information has been compiled into a single spreadsheet for the core project team members to analyze before they report their findings to the authority.


Pre-Survey Report

Pre-survey reports are used to collate each site’s information, such as its unique characteristics, and to assess its eligibility. This allows us to better understand the surroundings of a particular premise, the nature of its business and its uniqueness, and hence helps us to determine the most appropriate survey methodologies.

Human Count

The enumerators count the total number of patrons and the passenger(s) on board each vehicle. With this information, we are able to determine the total number of people entering the premises in a particular time period.

Vehicle Count

Likewise, for the vehicle count, the results gathered from the enumerators show the number of inbound and outbound vehicles, roadside parking and the overspill count.

Interview Survey

For the interview survey, the demographic profile and travel behavior of patrons are recorded. The demographic profile captures the citizenship, gender, age and ethnicity of the patrons, whereas the travel behavior survey records the number of patrons visiting the amenities that day, their frequency of visit, the main purpose and duration of the visit, their mode of travel to the premise that day and their companions on the trip. Additionally, drivers are asked for more information, such as where they parked, their reason for parking there, and the accessibility of the amenities from where they parked.


Phase 1 (Jan 2016)

In order to gain insights into the current parking situation at these 65 selected locations, we have to gather all the information collected previously and process it further. The 4 key components, as mentioned in the previous section, are the Pre-Survey Report, Human Count, Vehicle Count and Interview Survey.
An illustration of the analytics methodology is as shown below:


With that, we will be able to derive both qualitative and quantitative findings. Qualitative results help us to understand the underlying reasons, motivations and behavior behind visits to the premise. Results from the survey questionnaires are considered qualitative findings. Quantitative results, on the other hand, are facts and figures that quantify data and generalize results from a sample population. As such, the human count, vehicle count, roadside parking count and overspill count are considered quantitative findings.

With the analysis of both the qualitative and quantitative findings, we will be able to draw insights into the current parking situation at a particular car park. A report and info-graphics will therefore be created to present the key findings for each parking site. The team will use Microsoft Excel for the analysis and findings, and other tools such as Microsoft Word and Microsoft PowerPoint to create the final report and info-graphics.

Phase 1 Deliverables

Thus far, the team has completed phase 1 of the project in January, which was to assist the consultancy firm in understanding the car park issues at 6 development sites (AMK Hub, AMK Hawker Centre, Compass Point, Jalan Salang F&B cluster, Rail Mall F&B cluster and Sengkang CC). As mentioned earlier, the consultancy firm has grouped AMK Hub and AMK Hawker Centre together because the two sites are close to each other. Likewise, Compass Point and Sengkang CC are grouped together to form the Sengkang Cluster. For phase 1, the team has analyzed and completed the following deliverables for the consultancy firm:

Excel files

1. AMK Hub.xlsx (7 spreadsheets)
2. AMK Hawker Centre.xlsx (7 spreadsheets)
3. Compass Point.xlsx (7 spreadsheets)
4. Sengkang CC.xlsx (7 spreadsheets)
5. Rail Mall.xlsx (7 spreadsheets)
6. Jalan Salang.xlsx (7 spreadsheets)


Report files

1. AMK Cluster.pdf (82 pages)
2. SengKang.pdf (99 pages)
3. Rail Mall.pdf (87 pages)
4. Jalan Salang.pdf (62 pages)

Info-graphics files

1. AMK Hub.pdf (7 informative poster images)
2. AMK Hawker Centre.pdf (8 informative poster images)
3. Compass Point.pdf (8 informative poster images)
4. Sengkang CC.pdf (9 informative poster images)
5. Rail Mall.pdf (11 informative poster images)
6. Jalan Salang.pdf (11 informative poster images)

The Excel files are processed to assist in analyzing the data and plotting the charts for the info-graphics. The reports are generated to share the insights found at each development. Lastly, the info-graphics documents are prepared to capture and share the main insights from the respective sites.

Phase 1 General Findings

The findings indicate that the development clusters are patronized mainly by residents living in the same region. The human traffic profiles show that these sites are visited most often from late afternoon to evening, which coincides with after-school hours, the start times of classes and activities held at the cluster, and the dinner peak period. In most observations, there are no distinct anomalies to suggest patronage for other purposes, e.g. famous food stalls or the mall being a popular shopping location for out-of-towners.

In general, weekday and weekend parking demand showed a similar traffic flow pattern during the lunch hour, whereas the weekend saw a spike at dinnertime. For the retail malls, the parking occupancy findings show that parking supply for the cluster is sufficient for current demand, and that there is spare capacity in the public car parks within a 5-minute walk of the development cluster to handle overspill parking if it occurs.

For the F&B clusters, however, local residents living in the landed properties dominate roadside parking along the surveyed roads. While it was evident that patrons visiting the development contributed to the illegal parking, the Site Supervisor also observed that residents contributed their fair share of the illegally parked vehicles, as some of the cars were parked throughout the survey hours.

The team felt that surveying a single weekday and a single weekend day provided limited insight into the traffic patterns and trends of a particular development.

Lastly, the team also felt that the survey results are too focused on individual sites, so there is a need to draw more insights into the similarities between car parks. This will be done in phase 2 of the project.


Interim Analysis

Data Cleaning and Explorations

The data we received from MRC was site-based and split into individual Excel files containing a lot of unnecessary data. Exploratory data analysis showed a need to transform the time-based data into appropriately time-stamped time-series data before further analysis could be performed. Our group used SQL Server Integration Services 2010 to go through all the Excel files and extract the relevant data, as we were comfortable with this software from previous projects.

Filtering and extracting data

Many variables in the Excel sheets were not helpful for our phase 2 analysis. We decided to use the 6 variables most relevant to what we would like to analyze: peak_occupancy, non_peak_occupancy, peak_car_in, non_peak_car_in, peak_car_out and non_peak_car_out. We also filtered out 112 Katong, as it was a pilot site with a lot of missing data.

Combining Data

As the data we received from MRC was site-based and split into individual Excel files, we needed to combine all the sites into one file after filtering and extracting the data from the individual files. This file includes the attributes time, car_park, total_lots, peak_occupancy, non_peak_occupancy, peak_car_in, non_peak_car_in, peak_car_out and non_peak_car_out. In total, there are 28 sites on which we plan to carry out our analysis.
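The filtering and combining steps can be illustrated with a small Python sketch. The team performed this work with SSIS, so the code below is only an assumed equivalent for illustration: the function `combine_sites`, the in-memory `site_tables` structure, the `remarks` column and all sample values are hypothetical. Only the attribute names and the exclusion of 112 Katong come from the text above.

```python
# Attributes kept in the combined file, as listed in the text above
FIELDS = [
    "time", "car_park", "total_lots",
    "peak_occupancy", "non_peak_occupancy",
    "peak_car_in", "non_peak_car_in",
    "peak_car_out", "non_peak_car_out",
]

# Pilot site dropped because of missing data
EXCLUDED_SITES = {"112 Katong"}

def combine_sites(site_tables):
    """Stack per-site rows into one table, keeping only FIELDS.

    site_tables maps a car park name to a list of row dictionaries
    (a hypothetical stand-in for the per-site Excel files).
    """
    combined = []
    for car_park, rows in site_tables.items():
        if car_park in EXCLUDED_SITES:
            continue
        for row in rows:
            # Unneeded columns are dropped; absent ones become None
            record = {field: row.get(field) for field in FIELDS}
            record["car_park"] = car_park
            combined.append(record)
    return combined

# Made-up sample rows, including a column that is not needed
site_tables = {
    "AMK Hub": [
        {"time": "10:00AM", "total_lots": 400, "peak_occupancy": 120,
         "non_peak_occupancy": 90, "peak_car_in": 30, "non_peak_car_in": 22,
         "peak_car_out": 12, "non_peak_car_out": 9, "remarks": "n/a"},
    ],
    "112 Katong": [
        {"time": "10:00AM", "total_lots": 300, "peak_occupancy": None},
    ],
}

combined = combine_sites(site_tables)
print(len(combined), sorted(combined[0]) == sorted(FIELDS))
```

Carrying the car park name on every row is what allows a single long table to be split back into one time series per site later on.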

Recoding Time

As the time was given in ##:##AM/PM format, we needed to recode it into numbers in order to run time-series analysis in SAS Enterprise Miner. We used SAS Enterprise Guide to recode the time into a Time ID starting from 1 before loading the cleaned data into SAS Server.
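The recoding itself was done in SAS Enterprise Guide. Purely as an illustration of the idea, the Python sketch below maps a ##:##AM/PM label to a Time ID starting from 1. It assumes 15-minute slots beginning at 10:00AM, in line with the survey hours described under Data Collection; the function name `time_id` and its parameters are hypothetical.

```python
from datetime import datetime

def time_id(label, first_label="10:00AM", step_minutes=15):
    """Map a '##:##AM/PM' label to a 1-based slot number.

    Assumes equally spaced slots of step_minutes starting at first_label.
    """
    t = datetime.strptime(label.strip().upper(), "%I:%M%p")
    t0 = datetime.strptime(first_label, "%I:%M%p")
    minutes = (t - t0).total_seconds() / 60
    if minutes < 0 or minutes % step_minutes != 0:
        raise ValueError(f"{label!r} is not on the {step_minutes}-minute grid")
    return int(minutes // step_minutes) + 1

print(time_id("10:00AM"))  # 1
print(time_id("10:15AM"))  # 2
print(time_id("12:00PM"))  # 9
print(time_id("09:00PM"))  # 45
```

Raising an error for off-grid labels, rather than rounding them, makes irregular records visible during cleaning instead of silently shifting them into a neighboring slot.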




The figure above shows that there are unnecessary rows and columns in the data, as they are empty. Figure 4 below shows the recoded data after cleaning.
