ANLY482 AY2017-18T2 Group02 Project Overview

From Analytics Practicum
Revision as of 23:06, 14 January 2018 by Xzho.2014 (talk | contribs)



Arup Singapore Pte Ltd is an engineering consultancy firm. From 2008, Arup was involved in the alignment planning and design of the Downtown Line Phase 3. Prior to construction, Arup put extensive effort into designing the train stations and tunnels, accounting for geological data and data on surrounding buildings.

Arup's design philosophy involves design across multiple phases. At the end of each phase, Arup consults with contractors to determine differences in recorded data and readjusts its design accordingly. However, many such projects are handled on a case-by-case basis, and the lessons learned from each are not documented.

With the recent opening of the Downtown Line to the public, it is timely to review the as-built plans and explain the differences from the prediction stage. This will help Arup close the loop from the planning phase to the end result. Findings from the project will be valuable for Arup when undertaking similar engineering projects in the future, including the upcoming Thomson-East Coast Line.

OBJECTIVES

Key Objectives

There are 2 key objectives:

1. Identifying design features that have a high probability of redesign in rail design

2. Building a predictive model to assist engineers in rail design planning and to serve as an error catching tool

Predictive Analytics

For data modelling to act as a true supplement to, or even a replacement for, engineering design, it must achieve predictive accuracy sufficient to meet stringent safety standards. The model must be nuanced enough to predict design parameters for a wide range of components that function under diverse geological conditions, while retaining an adequate buffer to observe safety standards.

Expected Outcomes

We aim to survey several possible predictive approaches and identify models with a high probability of success at modelling rail engineering design. To keep expectations reasonable, we target a level of prediction that is statistically significant, i.e. distinguishable from random guessing.
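
One way to make "distinguishable from random guessing" concrete for a binary prediction is a one-sided exact binomial test against a chance baseline. The sketch below is illustrative only: the function name, the 0.5 baseline and the example counts are assumptions, not project figures.

```python
from math import comb

def binomial_p_value(correct: int, total: int, chance: float = 0.5) -> float:
    """One-sided p-value: probability of at least `correct` hits out of
    `total` predictions if each hit occurs with probability `chance`."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )

# Hypothetical example: 70 correct out of 100 predictions, 50% chance baseline.
p = binomial_p_value(70, 100)
is_significant = p < 0.05
```

If redesign turns out to be rare in the data, the majority-class rate would be a stricter baseline than 0.5.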

Planned Deliverables

Given the limited data size, our models are unlikely to be operationalized in actual engineering design. However, we aim to lay a good foundation for further work in predictive analysis for rail tunnel engineering. We will submit a technical paper detailing our methodology and results, accompanied by any code written in the process.

METHODOLOGY

While we are still awaiting our data, we have decided on 3 steps for processing it: Cleaning, Training and Cross-validation.

Clean

  1. Convert text-based data into categorical/numerical features
  2. Check all features for missing values, and treat as appropriate in each case
  3. Check features for collinearity - we will only reduce dimensions right before training, because certain models do not require removal of collinear variables, and in certain case
  4. Identify sparse features as potential features to be dropped or specially treated depending on the model used
  5. Perform normalization or binning on skewed data, as appropriate
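
The cleaning steps above can be sketched on a toy data set as follows. Since the real data has not yet arrived, every field name and value here is hypothetical, and median imputation and a log transform are only two of several possible treatments.

```python
import math
from statistics import mean, median

# Hypothetical records; field names and values are illustrative only.
rows = [
    {"soil": "clay", "depth_m": 12.0, "load_kn": 240.0},
    {"soil": "sand", "depth_m": 18.0, "load_kn": 370.0},
    {"soil": "clay", "depth_m": None, "load_kn": 310.0},
    {"soil": "rock", "depth_m": 30.0, "load_kn": 590.0},
]

# Step 1: map text categories to integer codes.
codes = {label: i for i, label in enumerate(sorted({r["soil"] for r in rows}))}
for r in rows:
    r["soil_code"] = codes[r["soil"]]

# Steps 2 and 4: count missing values and record their share as a crude
# sparsity measure, then impute with the median (one option among several).
missing = sum(r["depth_m"] is None for r in rows)
missing_share = missing / len(rows)
fill = median(r["depth_m"] for r in rows if r["depth_m"] is not None)
for r in rows:
    if r["depth_m"] is None:
        r["depth_m"] = fill

# Step 3: Pearson correlation as a simple pairwise collinearity check.
def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r_depth_load = pearson([r["depth_m"] for r in rows], [r["load_kn"] for r in rows])

# Step 5: log-transform a right-skewed, strictly positive feature.
for r in rows:
    r["log_load"] = math.log(r["load_kn"])
```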

 

Train

There are effectively two separate problems we are trying to solve:

i) Regression to determine design parameters

ii) Categorization to predict which features will require redesign

As such, we will discuss the methodology for these two separately.

 

A) Predicting design parameters

Since the feature we are trying to predict is likely to be numerical, we will attempt a regression or curve-fitting approach.
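
As a minimal illustration of the regression setting, the sketch below fits a straight line by ordinary least squares. The choice of a single predictor, the variable names and the data are all assumptions made for the example; the actual model form will depend on the data received.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    return my - slope * mx, slope

# Hypothetical data: a design parameter that grows linearly with depth.
depth = [10.0, 20.0, 30.0, 40.0]
thickness = [0.5, 0.7, 0.9, 1.1]
intercept, slope = fit_line(depth, thickness)
predicted = intercept + slope * 25.0  # prediction at an unseen depth
```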

 

B) Predicting redesign

Since the feature we are trying to predict is categorical, we can attempt a variety of categorization approaches.

 

  1. Logistic regression - Since the dependent variable is binary (a feature either does or does not require redesign), logistic regression is a natural choice.
  2. Association rule mining - Given the mixed nature of the data, normalization and bucketing of features can make association rule mining a relatively direct approach to categorization that does not require much parameter tuning or feature transformation. We may use the Apriori or Eclat algorithm, depending on the size of the data set available.
  3. Gradient boosted trees - A tree-based approach seems reasonable for handling the likely collinearity between several independent variables.
  4. Naive Bayes - Similar to the regression problem, a Bayesian approach to categorization helps us avoid the overfitting that models like logistic regression might face when the data set is small relative to the number of features.
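
To illustrate the first of these approaches, the sketch below trains a logistic regression by plain batch gradient descent. It is a teaching sketch under stated assumptions, not the project's implementation: the features, labels, learning rate and epoch count are all hypothetical, and in practice an established library would likely be used instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample drives the gradient.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (redesign) if the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical standardized features; label 1 = component was redesigned.
X = [[-1.5, 0.2], [-1.0, -0.4], [-0.5, 0.1], [0.5, -0.2], [1.0, 0.3], [1.5, -0.1]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
preds = [predict(w, b, x) for x in X]
```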



Cross-validation & Testing

All models will be trained on the same fixed set of data: we will perform a 60-20-20 split on the available data, train on the 60% set, and reserve one of the 20% sets for cross-validation to select the best-performing model.

Finally, the selected model will be tested on the final 20% data set to determine its predictive capability and to check for overfitting to the validation set.
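
The split and selection procedure might look like the sketch below. The function names, the fixed seed and the validation scores are assumptions for illustration; the scores in particular are made-up placeholders, not results.

```python
import random

def split_60_20_20(items, seed=0):
    """Shuffle a copy of `items` and cut it into train, validation and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

def select_model(val_scores):
    """Pick the model name with the highest validation score."""
    return max(val_scores, key=val_scores.get)

train, val, test = split_60_20_20(range(100))

# Placeholder validation accuracies for the candidate models (not real results).
best = select_model({"logistic": 0.71, "boosted_trees": 0.76, "naive_bayes": 0.68})
```

Only the model chosen on the validation part would then be scored once on `test`, so that the test score stays an unbiased check.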

Since design-and-build happens in batches, we will also examine whether cross-training the model with data from different batches improves or reduces predictive success, to provide another measure of overfitting.
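
One possible way to set up this batch comparison is a leave-one-batch-out scheme, where each batch in turn is held out and the model is trained on the rest. This scheme is an assumption about how the comparison could be run, and the `batch` field and records below are hypothetical.

```python
from collections import defaultdict

def leave_one_batch_out(records, batch_key="batch"):
    """Yield (held_out_batch, train_records, test_records) for each batch."""
    by_batch = defaultdict(list)
    for rec in records:
        by_batch[rec[batch_key]].append(rec)
    for held_out in sorted(by_batch):
        # Train on every batch except the one being held out.
        train = [r for b, recs in by_batch.items() if b != held_out for r in recs]
        yield held_out, train, by_batch[held_out]

# Hypothetical records tagged with their design-and-build batch.
records = [
    {"batch": "A", "id": 1}, {"batch": "A", "id": 2},
    {"batch": "B", "id": 3}, {"batch": "C", "id": 4},
]
folds = list(leave_one_batch_out(records))
```

A large gap between within-batch and held-out-batch performance would suggest the model is fitting batch-specific patterns.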


STAKEHOLDERS

Supervisor: Prof. Kam Tin Seong

Sponsor: Arup Singapore Pte Ltd - Chris Deakin: Rail Leader, Digital Engineering Leader

Students: Friedemann & Josh