Time-series Analysis on Singapore Public Transportation Train Network Data Source
Source |
The Land Transport Authority (LTA) provides the data sets through the Living Analytics Research Centre (LARC) research labs. The dataset provided by LARC is stored in a MySQL database that consists of the following tables:
- Bus_service_mapping
- Location_gis_mapping
- Location_mapping
- Lta_ride
The dataset contains approximately one week (1 November 2011 to 6 November 2011) of smart card (EZ-Link) transactions on Singapore’s public transport, covering both bus and MRT trips. As we are only interested in the MRT transactions, we use only two tables: Location_mapping and Lta_ride. We extracted the data by taking a database dump with a conditional statement that filters transport_type by "RTS", so that only the train data is included.
The screenshot below shows the raw data set for both bus and MRT transactions, which contains approximately 33 million rows.
After filtering to include only the train transactions, the data set is reduced to approximately 10 million rows.
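The filtering step described above can be illustrated with a minimal Python sketch. Only the transport_type column and the "RTS" value come from the text; the sample rows, the "Bus" value and the function name `filter_train_rides` are assumptions made for illustration, and the actual extraction was done on the MySQL dump rather than in Python.

```python
# Hypothetical sample rows standing in for the Lta_ride table.
# Only transport_type and the "RTS" value are taken from the text;
# the card numbers and the "Bus" value are made up.
rides = [
    {"card_number_e": "A1", "transport_type": "RTS"},
    {"card_number_e": "B2", "transport_type": "Bus"},
    {"card_number_e": "C3", "transport_type": "RTS"},
]


def filter_train_rides(rows):
    """Keep only the rows whose transport_type is "RTS" (train)."""
    return [row for row in rows if row["transport_type"] == "RTS"]


train_rides = filter_train_rides(rides)
```

On the real data, the same condition reduces roughly 33 million rows to roughly 10 million.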
Exploratory Data Analysis Data Preparation |
The above shows the EDA data preparation process. Several steps need to be completed before performing descriptive analysis or running summary statistics. Here are the steps taken:
- Extract hour of entry_time and exit_time
- Extract the minutes from entry_time and exit_time
- Recode the midnight hour of entry_time and exit_time as 24 instead of 00
- Extract the day of the week from the entry_date
- Map location_id to retrieve location_name from location_mapping table
- Combine all the recoded columns into a single data file
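The steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual procedure: the time format (`HH:MM:SS`), the date format (`YYYY-MM-DD`), the station names, the output field names and the helper `prepare_eda_row` are all assumptions.

```python
from datetime import datetime

# Hypothetical lookup standing in for the location_mapping table.
location_mapping = {1: "Station A", 2: "Station B"}

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def prepare_eda_row(row, mapping):
    """Derive the recoded EDA columns for one ride record."""
    entry = datetime.strptime(row["entry_time"], "%H:%M:%S")
    exit_ = datetime.strptime(row["exit_time"], "%H:%M:%S")
    entry_date = datetime.strptime(row["entry_date"], "%Y-%m-%d")
    return {
        # Hour 0 (midnight) is recoded to 24; other hours are unchanged.
        "entry_hour": entry.hour or 24,
        "entry_minute": entry.minute,
        "exit_hour": exit_.hour or 24,
        "exit_minute": exit_.minute,
        # datetime.weekday() returns 0 for Monday.
        "day_of_week": DAY_NAMES[entry_date.weekday()],
        # Map location IDs to names; unknown IDs become None.
        "origin_name": mapping.get(row["origin_location_id"]),
        "destination_name": mapping.get(row["destination_location_id"]),
    }


sample = {
    "entry_time": "00:05:10",
    "exit_time": "08:40:00",
    "entry_date": "2011-11-01",
    "origin_location_id": 1,
    "destination_location_id": 2,
}
prepared = prepare_eda_row(sample, location_mapping)
```

Applying the function to every row and writing the results out gives the single combined data file described in the last step.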
Time-series Data Preparation |
The above shows the time-series data preparation process. The data needs to be transformed into time-series intervals that the application can read. Here are the steps taken:
- Filter relevant data (card_number_e, Commuter category, entry_time, exit_time, origin_location_id, destination_location_id)
- Create new columns that aggregate the entry and exit timestamps into 15-minute intervals, with an auto-incrementing number for each interval
- Map location_id to retrieve location_name from location_mapping table
- Combine all the data into a single data file and segment it into adult, senior citizen and student
- Load into SAS Server
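The 15-minute aggregation and the segmentation by commuter category can be sketched in Python as below. The interval numbering (1 to 96 per day, starting at midnight), the `HH:MM:SS` time format, the `commuter_category` key, the category values and the helper names `interval_of` and `count_entries_by_category` are assumptions made for illustration; the source does not specify how the auto-increment number is assigned.

```python
from collections import defaultdict
from datetime import datetime

INTERVAL_MINUTES = 15


def interval_of(time_text):
    """Return (interval number, interval start label) for a time of day.

    Intervals are numbered from 1 (00:00 to 00:14) up to 96 (23:45 to 23:59).
    """
    t = datetime.strptime(time_text, "%H:%M:%S")
    minutes = t.hour * 60 + t.minute
    index = minutes // INTERVAL_MINUTES
    start = index * INTERVAL_MINUTES
    label = f"{start // 60:02d}:{start % 60:02d}"
    return index + 1, label


def count_entries_by_category(rides):
    """Count entries per 15-minute interval, segmented by commuter category."""
    counts = defaultdict(lambda: defaultdict(int))
    for ride in rides:
        number, _ = interval_of(ride["entry_time"])
        counts[ride["commuter_category"]][number] += 1
    return counts


# Hypothetical ride records.
rides = [
    {"commuter_category": "Adult", "entry_time": "08:07:30"},
    {"commuter_category": "Adult", "entry_time": "08:14:00"},
    {"commuter_category": "Student", "entry_time": "08:16:00"},
]
counts = count_entries_by_category(rides)
```

The same binning can be applied to exit_time to obtain exit counts per interval before the segmented files are loaded for analysis.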