Atom: Analysis

From Analytics Practicum

Interim Analysis

Data Cleaning and Explorations

The data we received from MRC was site-based and split into individual Excel files containing a lot of unnecessary data. After exploratory data analysis, we found that the time-based data had to be transformed into appropriately time-stamped time series data before further analysis could be performed. Our group used SQL Server Integration Services 2010 to loop through all the Excel files and extract the relevant data, as we were comfortable with this software from previous projects.

Filtering and extracting data

Many variables in the Excel sheets were not helpful for our phase 2 analysis. We decided to use the 6 variables most relevant to what we would like to analyze: peak_occupancy, non_peak_occupancy, peak_car_in, non_peak_car_in, peak_car_out and non_peak_car_out. We also filtered out 112 Katong, as it was a pilot site and much of its data was missing.

Combining Data

As the data we received from MRC was site-based and split into individual Excel files, we needed to combine all the sites after filtering and extracting the data from each file. The combined file includes the attributes time, car_park, total_lots, peak_occupancy, non_peak_occupancy, peak_car_in, non_peak_car_in, peak_car_out and non_peak_car_out. In total, we plan to carry out our analysis on 28 sites.
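The filtering and combining steps can be illustrated with a minimal Python sketch. This is not the SSIS package we used; it only mirrors the logic. The function name `combine_sites`, the in-memory row format and the sample rows are assumptions made for illustration, while the column names come from the attribute list above.

```python
# Columns kept in the combined file (taken from the attribute list above).
KEEP = ["time", "car_park", "total_lots",
        "peak_occupancy", "non_peak_occupancy",
        "peak_car_in", "non_peak_car_in",
        "peak_car_out", "non_peak_car_out"]

# Pilot site dropped during filtering.
EXCLUDED_SITES = {"112 Katong"}


def combine_sites(site_tables):
    """Merge per-site row lists into one table, keeping only KEEP columns.

    Each site table is a list of dicts (hypothetical format); columns that
    are missing from a row are filled with None.
    """
    combined = []
    for rows in site_tables:
        for row in rows:
            if row["car_park"] in EXCLUDED_SITES:
                continue
            combined.append({col: row.get(col) for col in KEEP})
    return combined


# Made-up rows for illustration only (not MRC data).
site_a = [{"time": "10:00AM", "car_park": "Mall A", "total_lots": 100,
           "peak_occupancy": 40, "surveyor": "x"}]
site_b = [{"time": "10:00AM", "car_park": "112 Katong", "total_lots": 50,
           "peak_occupancy": 10}]
combined = combine_sites([site_a, site_b])
```

In this sketch the unneeded `surveyor` column is dropped and the excluded pilot site never reaches the combined table.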

Recoding Time

As the time given was in ##:##AM/PM format, we needed to recode it into numbers in order to run time series analysis in SAS Enterprise Miner. We used SAS Enterprise Guide to recode each time into a Time ID starting from 1 before loading the cleaned data into the SAS server.
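As a rough illustration of the recoding (the actual work was done in SAS Enterprise Guide), the sketch below sorts the distinct time labels chronologically and numbers them from 1. The function name `recode_times` and the sample labels are assumptions; the approach does not assume any particular interval between readings.

```python
from datetime import datetime


def recode_times(time_labels):
    """Map '##:##AM/PM' labels to consecutive integer Time IDs starting at 1."""
    # Parse each distinct label into a time-of-day so it can be sorted.
    parsed = {
        label: datetime.strptime(label.strip().upper(), "%I:%M%p").time()
        for label in set(time_labels)
    }
    ordered = sorted(parsed, key=parsed.get)
    return {label: i for i, label in enumerate(ordered, start=1)}


# Made-up labels for illustration only.
time_ids = recode_times(["10:15AM", "10:00AM", "12:00PM", "1:45PM", "10:00AM"])
```

Each row's time label can then be replaced by `time_ids[label]` before the data is loaded for time series analysis.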

AtomI01.png


AtomI02.png


AtomI03.png


The figures above show that the raw files contain unnecessary rows and columns that are empty. Figure 4 below shows the recoded data after cleaning.

AtomI04.png

Initial Approach

Initially, our approach was to manually group the parking establishments by region before doing time series analysis. However, when we consulted Professor Kam on Feb 18, the flaw in our method was highlighted to us: the dataset should tell us what the groups and patterns are, instead of us manually deciding how to segregate the data.

Revised Analysis Approach

Time Series Methodology

We used SAS Enterprise Miner, which simplifies time series data mining for large amounts of data. Enterprise Miner also implements Dynamic Time Warping, an algorithm for measuring the similarity between two time-based sequences that may be shifted or stretched relative to each other. This lets Enterprise Miner identify patterns and similarities by aligning time series against each other.
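To make the idea concrete, here is a minimal sketch of the classic Dynamic Time Warping distance using dynamic programming. It is a textbook formulation with an absolute-difference cost, not Enterprise Miner's implementation, whose exact options we do not reproduce here. The function name `dtw_distance` is an assumption.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences.

    cost[i][j] holds the cheapest cumulative cost of aligning the first
    i points of a with the first j points of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: skip in a, skip in b, or match both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Two series with the same shape but a shifted peak, such as `[0, 0, 1, 0]` and `[0, 1, 0, 0]`, have a DTW distance of 0 even though a point-by-point comparison would treat them as different.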

AtomI05.png


Dataset

First, we uploaded our transformed data set to the SAS server. In Enterprise Miner, we then retrieved this data set and defined its properties so that Enterprise Miner could correctly identify the role of each variable.

AtomI06.png


Time Series Data Preparation (TSDP)

TSDP transforms the dataset into a form that is readable by Enterprise Miner, i.e. time-stamped data.
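Conceptually, this step turns the combined table into one ordered series per car park. The sketch below shows that reshaping in plain Python; it is not what the TSDP node does internally. The function name `to_time_series`, the `time_id` column name and the sample rows are assumptions.

```python
from collections import defaultdict


def to_time_series(rows, value_col):
    """Group rows by car park and order each group by Time ID."""
    series = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["time_id"]):
        series[row["car_park"]].append(row[value_col])
    return dict(series)


# Made-up rows for illustration only.
rows = [
    {"car_park": "Mall A", "time_id": 2, "peak_occupancy": 55},
    {"car_park": "Mall B", "time_id": 1, "peak_occupancy": 20},
    {"car_park": "Mall A", "time_id": 1, "peak_occupancy": 40},
    {"car_park": "Mall B", "time_id": 2, "peak_occupancy": 25},
]
series = to_time_series(rows, "peak_occupancy")
```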

Multiple Time Series Plot

AtomI07.png


TSID Map Table

AtomI08.png


TSID map table shows the original Dataset mapped into different time series and their corresponding Car_Park names.

Reduced Time Series Plot

AtomI09.png


TSID Map Summary Table

AtomI10.png


The TSID Map Summary Table shows the different Car_Parks and their unique time series counts. In total, we have 28 retail mall car parks to analyze.

Time Series Similarity (TSS)

TSS allows us to analyze similarities between different time series, grouped into clusters. Enterprise Miner applies the Dynamic Time Warping algorithm to match similar time series of different lengths within one group.

Cluster Constellation Plot

AtomI11.png


Cluster Dendrogram

AtomI12.png


A tree hierarchy displays the steps and how the clusters are actually formed.
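The merge steps behind such a tree can be sketched with a small agglomerative clustering routine. This sketch assumes average linkage, which may differ from the method Enterprise Miner applies; the function name `agglomerate` and the toy distance matrix are also assumptions.

```python
def agglomerate(dist):
    """Average-linkage agglomerative clustering on a symmetric distance matrix.

    Returns a list of (members_a, members_b, height) tuples in merge order,
    which is the information a dendrogram draws.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        ]
        height, p, q = min(pairs)  # closest pair of clusters
        merges.append((clusters[p], clusters[q], height))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return merges


# Toy distances: series 0 and 1 are close, 2 and 3 are close, the pairs are far apart.
dist = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 2],
    [10, 10, 2, 0],
]
merges = agglomerate(dist)
```

In this toy example the two close pairs merge at low heights, and the final merge joining them happens at a much greater height.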

Distance Map

AtomI13.png


Distance map shows how similar the Time Series data is based on distance between clusters. Red indicates a large distance and dissimilarity and Blue indicates little distance and similarity.


Peak Occupancy Analysis

AtomI14.png


The resulting time series graphs show that there might be some similarities between the car parks. However, we would not be able to identify them without further analysis of the time series.

AtomI15.png



According to the dendrogram generated, at Point 1 Enterprise Miner makes a big jump in order to combine 2 clusters. This is not ideal when trying to interpret and analyze patterns and similarities within the clusters.
At this stage, our group had to decide, based on the dendrogram, which combinations of clusters are reasonable and which are not. After research and advice from our supervisor, we decided that at Point 2 the clusters are still distinct enough to make similarities easier to identify. At Point 2 there are 7 distinct clusters within this set of time series data.
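One simple heuristic for locating such a cut is to stop just before the largest jump in merge height. The sketch below is an illustration of that heuristic only; our actual cut-off was chosen by inspecting the dendrogram with our supervisor. The function name `clusters_before_largest_jump` and the sample heights are assumptions.

```python
def clusters_before_largest_jump(heights):
    """Number of clusters left if merging stops before the largest height jump.

    heights: merge heights in ascending order (n - 1 merges for n series,
    at least two merges).
    """
    n_series = len(heights) + 1
    jumps = [heights[k + 1] - heights[k] for k in range(len(heights) - 1)]
    k = max(range(len(jumps)), key=jumps.__getitem__)
    # Merges 0..k are accepted; each accepted merge removes one cluster.
    return n_series - (k + 1)


# Made-up merge heights: the last merge is a big jump.
n_clusters = clusters_before_largest_jump([1.0, 2.0, 10.0])
```

With these made-up heights, the first two merges are accepted and the final, much larger merge is rejected, leaving 2 clusters.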

Findings

By repeating the steps taken above to load the data into the SAS server, we generated 7 smaller subsets of time series data sets from the main data set. From here our group identified the car parks that belonged to the different clusters, namely:

Cluster 1

AtomI16.png


AtomT01.png


Cluster 1 is an interesting cluster. The retail malls that belong to it are located within or around Ang Mo Kio, Hougang and Novena, which are very close to each other geographically.

Cluster 1 shows high occupancy in the car parks at 1000 hrs, when data collection starts. There is only one large peak in the morning, and the number of cars during peak hours stays almost the same throughout the day, only dropping near 1800 hrs. This pattern suggests either that many parking lots are occupied by workers of these shopping malls, or that demand for parking is high and any car that leaves is quickly replaced by an incoming car. This can be investigated further using different target variables.

Cluster 2

AtomI17.png


AtomT02.png


Similar to cluster 1, cluster 2 shows a steady level of car occupancy during peak hours, which suggests similar trends. The difference is that cluster 2 does not show the drastic drop seen in cluster 1, even as 2100 hrs approaches. This suggests that these retail malls offer many activities late in the evening that attract patrons to continue parking even near closing hours.


Cluster 3

AtomI18.png


AtomT03.png


Cluster 3 shows 2 obvious spikes and 2 dips throughout the day. The first spike occurs at around 1200 hrs and starts to dip at 1345 hrs. The second spike begins at around 1800 hrs and starts to dip at 2015 hrs. This pattern of spikes and dips in occupancy suggests that many patrons visit these retail malls for dining purposes, parking there only for the duration of their meals.

Cluster 4

AtomI19.png


AtomT04.png


Cluster 4 consists of Boon Lay Shopping Centre. It is clustered by itself as the development site does not have a development carpark and the carpark used for analysis was an open carpark shared by multiple HDBs.

Cluster 5

AtomI20.png


AtomT05.png


Cluster 5 consists of Century Square, Clementi Mall, Hougang Mall, Seletar Mall and Thomson Plaza. The purposes of visit for these 5 sites are very evenly spread between shopping, F&B and supermarket. This is characteristic of heartland malls, which patrons visit for supermarket purposes. For the majority of the malls in this cluster, there is a consistent increasing trend from morning until afternoon, followed by a sharp fall during late afternoon.

Cluster 6

AtomI21.png


AtomT06.png


Cluster 6 consists of Changi City Point, Vivo City, West Mall and West Coast Plaza. There are churches located very near to all of these development sites. Occupancy in this cluster peaks at noon and in the evening, when services end. Another possible reason for the evening peak is that patrons visit these developments after working hours and have dinner there before returning home.

Cluster 7

AtomI22.png


AtomT07.png


Cluster 7 consists of Pioneer Mall. It is clustered by itself as the development site does not have a development carpark and the carpark used for analysis was a multi storey carpark which is being shared with Blk 638A.

From the results we identified a few interesting similarities and insights. Boon Lay Shopping Centre and Pioneer Mall appear to be outliers, two retail malls located in the West of Singapore that are significantly different from the rest of the retail malls recorded in the data set but not similar enough to belong in the same cluster.

Moving forward after the Interim Report, our group will discuss how to analyze clusters 4 and 7 with advice from our Supervising Professor.

Additionally we would like to look into different target variables in order to determine if the retail malls exhibit more similarities other than occupancy rate during peak hours.

Lastly, we would like to look into more establishments other than retail malls such as F&B Clusters and Community Centers.


Off Peak Occupancy

Findings

By repeating the steps taken above to load the data into the SAS server, we generated 6 smaller subsets of time series data sets from the main data set. From here our group identified the car parks that belonged to the different clusters, namely:

AtomI23.png


There are a total of 6 clusters for the analysis of occupancy.

AtomI24.png


These 6 clusters were identified using the Cluster Dendrogram, where a very large merge of clusters is deemed unsuitable for similarity analysis. As the diagram shows, there is no large merging of clusters before the cut-off point.

AtomI25.png


The resulting time series graphs show that there might be some similarities between the car parks. However, we would not be able to identify them without further analysis of the time series.

Cluster 1

AtomI26.png


AtomT08.png


Non-peak occupancy in this cluster increases from 10:00am onwards, and the rate of increase becomes larger from 11:30am. At 1:45pm, the non-peak occupancy starts to decrease quite sharply, but it never falls back to the 11:30am levels.
The non-peak occupancy levels continue falling, albeit very slowly, until 5:30pm, when a spike in occupancy can be observed. The spike is most pronounced at the Vivocity development site, while the other 4 developments show a smaller rate of increase.
The interesting thing about this pattern is that the dip after 1:45pm is not extremely pronounced: the largest dip is only around 25% of the maximum occupancy (1:15pm compared to 5:15pm). This pattern is unique to this cluster and is not reflected in other clusters, where the dip in occupancy after meal times is much more pronounced.
This unique cluster behavior is consistent with the main purpose of visit recorded in the site survey data. The majority of patrons (more than 50%) arrive for Food & Beverage and Shopping. Food & Beverage behavior is reflected in the spikes during meal times, while patrons arriving for shopping keep the fall in occupancy after meal times less pronounced than in other clusters.
In addition to patron behavior, these sites are all located very close to or right beside bus interchanges. This common characteristic might be a critical factor that causes the developments to be similar and clustered together.
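The size of a dip relative to the maximum occupancy, as quoted above, can be computed as in the following sketch. The function name `dip_fraction` and the sample series are assumptions made for illustration; the numbers are not taken from our data.

```python
def dip_fraction(series, peak_idx, trough_idx):
    """Drop from a peak reading to a trough reading, as a fraction of the series maximum."""
    return (series[peak_idx] - series[trough_idx]) / max(series)


# Made-up occupancy readings: peak of 100 at index 2, trough of 75 at index 4.
occupancy = [50, 80, 100, 90, 75, 95]
dip = dip_fraction(occupancy, peak_idx=2, trough_idx=4)
```

Here the dip works out to 0.25, i.e. 25% of the maximum occupancy in the made-up series.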


Cluster 2

AtomI27.png


AtomT09.png


It is found that the majority of patrons visit the retail malls in this cluster for shopping and F&B. According to the time series graph, occupancy at all retail malls in the cluster gradually increases from 11:45am to 1:00pm before dropping and remaining at a low level. It starts to increase gradually again from 6pm to 8pm. The highest level of occupancy for all the retail malls in this cluster falls between 6pm and 9pm.

The interesting thing about this pattern is that the dip after 1:00pm is quite significant, at around 30% of the maximum occupancy, and occupancy retains a constant decreasing trend. This pattern is unique to this cluster and is not reflected in other clusters, where the dip in occupancy after meal times is smaller. This shows that visitors frequent the retail malls in this cluster during meal times (i.e. at lunch break and after work).

Cluster 3

AtomI28.png


AtomT10.png


Cluster 3 consists of Boon Lay Shopping Centre. It is clustered by itself as the development site does not have a development carpark and the carpark used for analysis was an open carpark shared by multiple HDBs.

Cluster 4

AtomI29.png


AtomT11.png


It is found that the majority of patrons visit the retail malls in this cluster for F&B and other activities (social activities, enrichment classes, cinema and supermarket). According to the time series graph, the peak hours of the retail malls in this cluster fall between 12 noon and 12:45pm and between 6:45pm and 8:00pm. However, the overall graphs for the majority of the retail malls in this cluster are very inconsistent. This might be because some of the retail malls in this cluster are right next to each other.
Examples are Tampines One and Century Square, and Novena Square, Square2 and United Square. As these retail malls are connected and right next to each other, their car park occupancies might depend on each other.

Cluster 5

AtomI30.png


AtomT12.png


This cluster shows an increasing trend in car occupancy from 10:45am to 12:45pm. At 12:45pm, the non-peak occupancy starts to decrease very sharply, and it keeps falling until 5:45pm before increasing again from 5:45pm to 7:45pm.
The interesting thing about this pattern is that the dip after 12:45pm is very significant, at around 70% of the maximum occupancy, and occupancy does not retain a constant decreasing trend. This pattern, in which occupancy increases only from 10:45am to 12:45pm (breakfast and lunch) and from 5:45pm to 7:45pm (dinner), is unique to this cluster and is not reflected in any other cluster.
This shows that a high percentage of visitors frequent the retail malls in this cluster solely for meals. Both malls are located very near to a university (SUTD and NUS) and a hospital (CGH and NUH). Moreover, shuttle buses to the retail malls in this cluster are provided during meal times.

Cluster 6

AtomI31.png


AtomT13.png


Cluster 6 consists of Pioneer Mall. It is clustered by itself as the development site does not have a development carpark and the carpark used for analysis was a multi storey carpark which is being shared with Blk 638A.


CONCLUSION

Limitations and Assumptions

Although the team managed to conduct a proper time series analysis with the raw data set provided by the consultancy firm, there were many limitations that might cause issues as well as potential avenues of improvement for further studies.

The team believes there are many ways to improve and strengthen the current project analysis and findings. For instance, scaling up the datasets without compromising time and performance would provide a better understanding of, and greater insights into, the car park sites. This would also improve the team effort.

One of the main limitations is the limited period of data collection. Instead of only having one day each for the peak and off-peak periods, from 1000 hrs to 2100 hrs, the data should ideally be collected over a few weeks or months. A larger dataset would allow the identification of more seasonal patterns, such as monthly or quarterly patterns. Additionally, instead of being limited to analyzing 28 retail mall car park sites, we could also include other shopping malls.

The team holds the opinion that the data recorded were en car park and patron demographic information.

Possible Avenue for Future Works

As briefly mentioned earlier, our team's analysis was based on 28 retail mall car park sites. Increasing the number of car park sites and including other developments (HDB estates, etc.) in the analysis would help the team to further improve the analysis and gain deeper insights.

Rather than just counting the raw numbers of cars in each lot, a better form of data collection can be to place video cameras to keep track of the duration each car remains in the parking lot. This opens up more avenues for car park analysis and allows better insights in future studies.

Another recommendation for future studies is to separate development sites based on the results of this study. As already highlighted, certain developments did not include a development car park, which produced the outliers observed above. The presence of these outliers shows that developments with car parks should be studied separately from developments without car parks.

Apart from having more records, another initiative could be to build an interactive dashboard to visualize the data. This can be achieved by using R programming language.

Summary

In conclusion, we hope that the work performed brings much needed insights to the parking allocation policies of developments. Through time series data mining, we hope to have adequately highlighted to the local authority several ways to improve the system, through identifying the trends of parking occupancy based on the activity or location of developments.

Currently, the authority uses a policy based on the number of retail shops, under which a development car park is required to provide a minimum number of parking lots based on the number of retail shops within the development. The authority manually analyses each proposal on a case-by-case basis to find out whether a development requires additional parking lots. For example, having an interchange nearby may require more lots from the developer.

The authority should consider looking into a Varied Pricing system that uses time series to predict the demand for parking lots and varies car park pricing based on the projections. Many locations overseas already use such a system, such as San Francisco in the United States, where SFPark has been successfully implemented.

However, to carry out this study and accurately project the demand and price level of car parks, further analysis and data collection need to be carried out. Our group hopes that despite the limitations mentioned above, there is enough support and evidence to justify an investment in this analysis, as parking issues not only cause congestion on the road due to overspill, but also result in potential accidents.