ANLY482 AY2017-18T2 Group02 Project Overview

From Analytics Practicum

Revision as of 23:06, 14 January 2018




Arup Singapore Pte Ltd is an engineering consultancy firm. Starting in 2008, Arup was involved in the alignment planning and design of Downtown Line phase 3. Prior to construction, Arup put extensive effort into designing the train stations and tunnelling, accounting for geological data and data on the surrounding buildings.

Arup's design philosophy spans multiple phases. At the end of each phase, Arup consults with contractors to determine differences between the recorded data and the design, and readjusts the design accordingly. However, such projects are handled on a case-by-case basis, and the learning from each experience is not documented.

With the recent opening of the Downtown Line to the public, it is timely to review the as-built plans and explain how they differ from the predictions made at the design stage. This will help Arup close the loop from the planning phase to the end result. Findings from the project will be valuable when Arup engages in similar engineering projects in the future, including the upcoming Thomson-East Coast Line.

OBJECTIVES

Key Objectives

There are two key objectives:

1. Identify design features that have a high probability of requiring redesign in rail projects

2. Build a predictive model that assists engineers in rail design planning and serves as an error-catching tool

Predictive Analytics

For data modelling to act as a true supplement to, or even replacement for, engineering design, it must achieve predictive accuracy sufficient to meet stringent safety standards. The model must be nuanced enough to predict design parameters for a wide range of components meant to function under diverse geological conditions, while retaining adequate buffers to observe those safety standards.

Expected Outcomes

We aim to survey several possible predictive approaches and identify models with a high probability of success at modelling rail engineering design. To keep expectations reasonable, we target a level of prediction that is statistically significant, i.e. distinguishable from random guessing.

Planned Deliverables

Given the limited data size, our expected outcomes are unlikely to be operationalized in actual engineering design. However, we aim to lay a solid foundation for further work in predictive analysis for rail tunnel engineering. We will submit a technical paper detailing our methodology and results, accompanied by any code written in the process.

METHODOLOGY

While we are still awaiting our data, we have decided on three steps for processing it: cleaning, training, and cross-validation.

Clean

  1. Convert text-based data into categorical/numerical features
  2. Check all features for missing values, and treat as appropriate in each case
  3. Check features for collinearity. We will only reduce dimensions right before training, because certain models do not require the removal of collinear variables
  4. Identify sparse features as potential features to be dropped or specially treated depending on the model used
  5. Perform normalization or binning on skewed data, as appropriate
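Since the data has not yet arrived, the cleaning steps above can only be sketched on a stand-in. The following is a minimal Python/pandas sketch; the column names, fill strategies, and values are purely illustrative assumptions, not the actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical records standing in for the awaited design data.
df = pd.DataFrame({
    "soil_type": ["clay", "sand", None, "clay"],   # text-based feature
    "depth_m": [12.0, np.nan, 30.5, 18.2],         # numeric with a gap
    "load_kn": [150.0, 220.0, 310.0, 95.0],        # skewed numeric
})

# Step 1: convert text-based data into a categorical feature.
df["soil_type"] = df["soil_type"].astype("category")

# Step 2: treat missing values case by case - fill the numeric gap with
# the median, and give missing categories an explicit "unknown" level.
df["depth_m"] = df["depth_m"].fillna(df["depth_m"].median())
df["soil_type"] = df["soil_type"].cat.add_categories("unknown").fillna("unknown")

# Step 3: check numeric features for collinearity (inspected, not yet dropped).
corr = df[["depth_m", "load_kn"]].corr()

# Step 5: normalize a skewed feature - a log transform is one option.
df["load_kn_log"] = np.log1p(df["load_kn"])
```

Sparse-feature handling (step 4) is deferred until we see the real data, since whether to drop or specially treat a sparse column depends on the model chosen.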

 

Train

There are effectively two separate problems we are trying to solve:

i) Regression to determine design parameters

ii) Categorization to predict which features will require redesign

As such, we will discuss the methodology for these two separately.

 

A) Predicting design parameters

Since the feature we are trying to predict is likely to be numerical, we have to attempt some kind of regression or curve fitting approach.
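As one regression candidate, ridge regression both fits a curve and tolerates the collinearity we expect between geological features. This sketch uses synthetic data, since the real design parameters are not yet available; the two features are hypothetical stand-ins:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in: predict a numeric design parameter from two
# hypothetical geological features (e.g. depth, soil stiffness).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Ridge's L2 penalty keeps coefficient estimates stable even when
# the independent variables are correlated.
model = Ridge(alpha=1.0).fit(X, y)
pred = model.predict(X)
```

Other curve-fitting approaches (polynomial features, splines) would follow the same fit/predict pattern, so candidates can be swapped in and compared easily.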

 

B) Predicting redesign

Since the feature we are trying to predict is categorical, we can attempt a variety of categorization approaches.

 

  1. Logistic regression - Since the dependent variable is binary (a feature either does or does not require redesign), logistic regression is a natural choice.
  2. Association rule mining - Given the mixed nature of the data, normalizing and bucketing the features can make association rule mining a relatively direct approach to categorization, requiring little parameter tuning or feature transformation. We may use the Apriori or Eclat algorithm, depending on the size of the data set available.
  3. Gradient boosted trees - A tree-based approach seems reasonable for handling the likely collinearity between several independent variables.
  4. Naive Bayes - As in the regression problem, a Bayesian approach to categorization helps us avoid the overfitting that models like logistic regression might face when the data set is small relative to the number of features.
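Three of the four candidates above share the same fit/score interface in scikit-learn, which makes a side-by-side comparison straightforward. The sketch below trains them on a synthetic binary "redesign" label (1 = redesigned, 0 = built as planned); the data is an assumption, since ours is pending, and association rule mining is omitted as it follows a different workflow:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the redesign problem: four numeric features,
# with the label driven by the first two plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Candidate classifiers from the list above.
candidates = {
    "logistic": LogisticRegression(),
    "gbt": GradientBoostingClassifier(n_estimators=50),
    "naive_bayes": GaussianNB(),
}

# Training accuracy only, as a smoke test of the shared interface;
# real comparison happens on the held-out validation set.
scores = {name: clf.fit(X, y).score(X, y) for name, clf in candidates.items()}
```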



Cross-validation & Testing

All models will be trained on a fixed set of data: we will perform a 60-20-20 split on the available data and reserve one of the 20% sets for cross-validation, which we will use to select the best-performing model.

Finally, the selected model will be tested on the final 20% of the data to check for overfitting to the validation set and to determine its predictive capability.

Since design-and-build happens in batches, we will also examine whether cross-training the model with data from different batches improves or reduces predictive success, providing another measure of overfitting.
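The 60-20-20 split can be produced with two successive calls to scikit-learn's `train_test_split`, since it only yields one cut at a time. The array here is a placeholder for the awaited data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 samples, 2 features.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First cut: 60% training, 40% held out.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Second cut: split the held-out 40% evenly into
# validation (model selection) and test (final check), 20% each.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0)
```

Fixing `random_state` keeps the split reproducible, so every candidate model is selected and tested against the same partitions.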


STAKEHOLDERS

Supervisor: Prof. Kam Tin Seong

Sponsor: Arup Singapore Pte. Ltd - Chris Deakin: Rail Leader, Digital Engineering Leader

Students: Friedemann & Josh