Time-series Analysis on Singapore Public Transportation Train Network Data Source

From Analytics Practicum

Source

The Land Transport Authority (LTA) provides the data sets through the Learning Analytics Research Centre (LARC) research labs. The dataset provided by LARC is stored in a MySQL database consisting of the following tables:

  • Bus_service_mapping
  • Location_gis_mapping
  • Location_mapping
  • Lta_ride

The dataset is about a week's worth (1st November 2011 – 6th November 2011) of smart card (EZ-Link) transactions from Singapore's public transport system, covering both bus and MRT rides. As we are only interested in the MRT transactions, we look at two tables only: Location_mapping and Lta_ride. We extracted the data by taking a database dump with a conditional statement that keeps only rows where transport_type is "RTS", so that only the train data is included.

The screenshot below shows the raw data set for both bus and MRT transactions, which contains approximately 33 million rows.
CTS Pic1.png

After filtering to include only the train transactions, the data set is reduced to approximately 10 million rows.
CTS Pic2.png
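
The filtering step described above can be sketched as follows. This is a minimal illustration only: it uses an in-memory SQLite table as a stand-in for the MySQL database, the sample rows are made up, and the "BUS" value is an assumption (the source only states that train rows carry transport_type "RTS"). The column card_number_e is taken from the time-series preparation steps.

```python
import sqlite3

def filter_train_rides(conn):
    """Return only the rows of lta_ride whose transport_type is 'RTS'."""
    cur = conn.execute(
        "SELECT card_number_e, transport_type FROM lta_ride "
        "WHERE transport_type = ?",
        ("RTS",),
    )
    return cur.fetchall()

# Tiny in-memory stand-in for the real table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lta_ride (card_number_e TEXT, transport_type TEXT)")
conn.executemany(
    "INSERT INTO lta_ride VALUES (?, ?)",
    [("A1", "RTS"), ("B2", "BUS"), ("C3", "RTS")],  # "BUS" is an assumed value
)
train_rows = filter_train_rides(conn)
```

On the real database the same WHERE clause would be applied while taking the dump, so that bus rows are never written to the extract.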

Exploratory Data Analysis Data Preparation

CTS Pic3.png
The diagram above shows the EDA data preparation process. Several steps need to be completed before performing descriptive analysis or running summary statistics. Here are the steps taken:

  • Extract the hour from entry_time and exit_time
  • Extract the minutes from entry_time and exit_time
  • Recode midnight in the entry and exit hour as 24 instead of 00
  • Extract the day of the week from entry_date
  • Map location_id to retrieve location_name from the location_mapping table
  • Combine all the recoded columns into a single data file
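
The recoding steps above could look roughly like the sketch below. It assumes times are stored as "HH:MM:SS" strings and dates as "YYYY-MM-DD" strings, which the source does not state; the LOCATION_NAMES lookup, the station names and the output field names are hypothetical stand-ins for the location_mapping table and the recoded columns.

```python
from datetime import datetime

# Hypothetical lookup standing in for the location_mapping table.
LOCATION_NAMES = {1: "Station A", 2: "Station B"}

# Fixed list so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def recode_hour(hour):
    """Recode midnight from 0 to 24, leaving other hours unchanged."""
    return 24 if hour == 0 else hour

def prepare_ride(ride):
    """Derive the recoded EDA columns for a single ride record."""
    entry = datetime.strptime(ride["entry_time"], "%H:%M:%S")   # assumed format
    exit_ = datetime.strptime(ride["exit_time"], "%H:%M:%S")    # assumed format
    entry_date = datetime.strptime(ride["entry_date"], "%Y-%m-%d")
    return {
        "entry_hour": recode_hour(entry.hour),
        "entry_minute": entry.minute,
        "exit_hour": recode_hour(exit_.hour),
        "exit_minute": exit_.minute,
        "day_of_week": DAY_NAMES[entry_date.weekday()],
        "origin_name": LOCATION_NAMES.get(ride["origin_location_id"]),
        "destination_name": LOCATION_NAMES.get(ride["destination_location_id"]),
    }

# Made-up sample record for illustration.
sample = {
    "entry_time": "23:45:10",
    "exit_time": "00:05:30",
    "entry_date": "2011-11-01",
    "origin_location_id": 1,
    "destination_location_id": 2,
}
prepared = prepare_ride(sample)
```

Recoding midnight as 24 keeps a late-night exit sorted after a 23:00 entry when hours are compared as plain numbers.
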
Time-series Data Preparation

CTS Pic4.png
The diagram above shows the time-series data preparation process. The data needs to be transformed into time-series intervals that the application can read. Here are the steps taken:

  1. Filter the relevant columns (card_number_e, commuter category, entry_time, exit_time, origin_location_id, destination_location_id)
  2. Create new columns that aggregate the entry and exit timestamps into 15-minute intervals, plus an auto-incrementing number for each time interval
  3. Map location_id to retrieve location_name from the location_mapping table
  4. Combine all the data into a single data file and segment it into adult, senior citizen and student categories
  5. Load the result into the SAS Server
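
Step 2 above, the 15-minute aggregation, could be sketched as below. The function names are hypothetical, and two details are assumptions rather than facts from the source: that interval numbering starts at 1, and that the series begins at midnight on 1st November 2011.

```python
from datetime import datetime

INTERVAL_MINUTES = 15

def interval_start(ts):
    """Round a timestamp down to the start of its 15-minute interval."""
    floored_minute = ts.minute - ts.minute % INTERVAL_MINUTES
    return ts.replace(minute=floored_minute, second=0, microsecond=0)

def interval_number(ts, series_start):
    """1-based sequential number of the interval containing ts."""
    elapsed_seconds = (ts - series_start).total_seconds()
    return int(elapsed_seconds // (INTERVAL_MINUTES * 60)) + 1

# Assumed start of the series: midnight on the first day of the dataset.
series_start = datetime(2011, 11, 1, 0, 0)
tap_in = datetime(2011, 11, 1, 7, 52, 30)  # made-up entry timestamp

bucket = interval_start(tap_in)
number = interval_number(tap_in, series_start)
```

A sequential interval number gives every 15-minute slot across the whole period a unique, ordered key, which is convenient when the time axis must be supplied to a tool as a plain integer.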