|
|
(4 intermediate revisions by the same user not shown) |
Line 21: |
Line 21: |
| | | |
| | | |
− | ==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica'; border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">SPONSOR BACKGROUND</font></div></div>== | + | ==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica'; border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">PROJECT BACKGROUND</font></div></div>== |
| | | |
− | <p>Arup Singapore Pte Ltd is an engineering consultancy firm. Starting from 2008, Arup was involved in the alignment planning and design of the Downtown Line phase 3. Prior to construction, Arup has put in extensive efforts to design the train stations and tunnelling, accounting for geological and surrounding building data. | + | <p>DonorsChoose.org is a non-profit organisation seeking to improve the education system in America. The DonorsChoose.org platform is a civic crowdfunding platform that allows public school teachers across America to reach out to potential donors. Donors can choose the type of projects to fund and can donate any amount to the cause of their liking. Since its founding in 2000, the platform has funded over a million projects and benefited over 27 million students. |
| </p> | | </p> |
| <p> | | <p> |
− | Arup's design philosophy involves design across multiple phases. At the end of each phase, Arup will consult with contractors to determine differences in recorded data and readjust their design accordingly. However, many such projects are on a case-by-case basis and learning from each experience is not documented.
| + | The process starts with teachers submitting their project proposals, detailing the resources and materials they require and how the resources will benefit their students. Upon submission to DonorsChoose.org, volunteers review each project and determine whether it can be approved. As the number of project submissions is expected to exceed 500,000 in 2018, DonorsChoose.org has to scale its project approval process as well. A prediction model will help facilitate the process, but close attention has to be paid to the model so that it can reliably discern deserving projects. |
− | </p>
| |
− | <p>
| |
− | With the recent opening of the Downtown line to public, it is timely to review the as-built plans and explain the differences from the prediction stage. This will help Arup close the loop from the planning phase to the end result. Findings from the project will be valuable for Arup when engaging on similar engineering projects in the future, including the upcoming Thomson-East Coast Line.
| |
− | </p>
| |
− | | |
− | ==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica'; border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">MOTIVATION</font></div></div>==
| |
− | | |
− | <p>A key issue we are trying to address is closing the design loop in engineering projects. Upon completion, engineers often move on to new projects without reconciling the differences observed between the planning phase and the as-built phase. We believe there is value in learning from the differences that have arisen in past projects. Data modelling can assist engineers in identifying parameters that would affect construction significantly and reduce uncertainty in future planning. While data modelling will not replace the need for engineers’ judgment, it can assist engineers as a probability-based error-catching tool. Given our own personal interest and Arup’s stake in the project, we believe this project could help engineers realise the value of utilising data in their work.
| |
| </p> | | </p> |
| | | |
Line 42: |
Line 34: |
| | | |
| <p> | | <p> |
− | There are 2 key objectives:
| + | The objective of our project is to develop a model for DonorsChoose.org that predicts the likely approval status of projects submitted by teachers. Based on past data on the project, the teacher and the school, we seek to build a model that estimates each project’s likelihood of approval. Ideally, the model would achieve both high precision and high recall. |
− | </p>
| |
− | <p>
| |
− | 1. Identifying design features that have high probability for re-design in rail design
| |
| </p> | | </p> |
| <p> | | <p> |
− | 2. Building a predictive model to assist engineers in rail design planning and to serve as an error catching tool
| + | The expected outcome is for DonorsChoose.org to automate part of its project screening process and redirect effort towards examining proposals that need more assistance. The project will also offer insight into how organisations can use prediction models to scale manual processes efficiently while preserving accuracy. |
− | </p>
| |
| | | |
− | <p>
| |
− | <b><span style="color:#800000">Predictive Analytics</span></b>
| |
| </p> | | </p> |
− | <p>
| |
| | | |
− | In order for data modelling to act as a supplement or even replacement for engineering design, it’s necessary to achieve predictive accuracy while meeting the stringent safety standards. The model must be nuanced enough to predict design parameters for a wide range of components meant to function under diverse geological conditions, while having adequate levels of buffer to observe safety standards.
| |
− | </p>
| |
− | <p>
| |
− | <b><span style="color:#800000">Expected Outcomes</span></b>
| |
− | </p>
| |
− | <p>
| |
− | We aim to survey several possible predictive approaches, and identify models with high probability of success at modelling for rail engineering design. In order to achieve this while keeping expectations reasonable, we target a level of prediction that is statistically significant, i.e. distinguishable from random guesses.
| |
− | </p>
| |
− | <p>
| |
− | <b><span style="color:#800000">Planned Deliverables</span></b>
| |
− | </p>
| |
− | <p>
| |
− | Given the limited data size, our expected outcomes are unlikely to be operationalized in actual engineering design. However, we plan to commit to laying a good foundation for further work in predictive analysis for rail tunnel engineering.
| |
− | We will submit a technical paper detailing our methodology, and the results, accompanied by any code written in the process.
| |
− | </p>
| |
− |
| |
− |
| |
− | ==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica'; border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">METHODOLOGY</font></div></div>==
| |
− |
| |
− | <p>While we are still awaiting our data, we have decided on 3 steps to go about processing our data, Cleaning, Training and Cross-validation.</p>
| |
− | <p><b><span style="color:#800000">Clean</span></b></p>
| |
− | <ol>
| |
− | <li style="font-weight: 400;"><span style="font-weight: 400;">Convert text-based data into categorical/numerical features</span></li>
| |
− | <li style="font-weight: 400;"><span style="font-weight: 400;">Check all features for missing values, and treat as appropriate in each case</span></li>
| |
− | <li style="font-weight: 400;"><span style="font-weight: 400;">Check features for collinearity - we will only reduce dimensions right before training, because certain models do not require removal of collinear variables, and in certain case</span></li>
| |
− | <li style="font-weight: 400;"><span style="font-weight: 400;">Identify sparse features as potential features to be dropped or specially treated depending on the model used</span></li>
| |
− | <li style="font-weight: 400;"><span style="font-weight: 400;">Perform normalization or binning on skewed data, as appropriate</span></li>
| |
− | </ol>
| |
− | <p> </p>
| |
− | <p><b><span style="color:#800000">Train</span></b></p>
| |
− | <p><span style="font-weight: 400;">There are effectively two separate problems we are trying to solve: </span></p>
| |
− | <p><span style="font-weight: 400;">i) Regression to determine design parameters</span></p>
| |
− | <p><span style="font-weight: 400;">ii) Categorization to predict which features will require redesign</span></p>
| |
− | <p><span style="font-weight: 400;">As such, we will discuss the methodology for these two separately.</span></p>
| |
− | <p> </p>
| |
− | <p><em><em><span style="font-weight: 400;">A) Predicting design parameters</span></em></em></p>
| |
− | <p><span style="font-weight: 400;">Since the feature we are trying to predict is likely to be numerical, we have to attempt some kind of regression or curve fitting approach. </span></p>
| |
− | <p> </p>
| |
− | <p><em><span style="font-weight: 400;">B) Predicting redesign</span></em></p>
| |
− | <p><span style="font-weight: 400;">Since the feature we are trying to predict is categorical, we can attempt a variety of categorization approaches.</span></p>
| |
− | <p> </p>
| |
− | <ol>
| |
− | <li style="font-weight: 400;"><span style="font-weight: 400;">Logistic regression - Since the dependent variable is binary - either it does or does not require redesign, logistic regression is a natural choice. </span></li>
| |
− | <li style="font-weight: 400;"><span style="font-weight: 400;">Association rule mining - Given the mixed nature of data, normalization and bucketing of features can make association rule mining a relatively direct approach to categorization, not requiring too much parameter tuning or feature transformation. We may attempt to utilize the apriori or eclat algorithm depending on the size of data set available</span></li>
| |
− | <li style="font-weight: 400;"><span style="font-weight: 400;">Gradient boosted trees - In order to handle the likely collinearity between several independent variables, a tree-based approach seems like a reasonable approach. </span></li>
| |
− | <li style="font-weight: 400;"><span style="font-weight: 400;">Naive Bayes - Similar to the regression problem, a Bayesian approach in categorization helps us avoid problems in overfitting that models like logistic regression might face due to smaller data sets relative to the number of features being categorized.</span></li>
| |
− | </ol>
| |
− | <p><br /><br /></p>
| |
− | <p><b><span style="color:#800000">Cross-validation & Testing</span></b></p>
| |
− | <p><span style="font-weight: 400;">All the models are to be trained on a fixed set of data, i.e. we will perform a 60-20-20 split on the available data, and reserve one of the 20% sets for cross validation to select the best performing model.</span></p>
| |
− | <p><span style="font-weight: 400;">Finally, the selected model will be tested on the final 20% data set to determine its predictive capability and to check for overfitting to the validation set.</span></p>
| |
− | <p><span style="font-weight: 400;">Since design-and-build happens in batches, we will also examine whether cross-training the model with data from different batches has improved or reduced success in prediction, to provide another measure of overfitting.</span></p>
| |
| | | |
| ==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica'; border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">STAKEHOLDERS</font></div></div>== | | ==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica'; border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">STAKEHOLDERS</font></div></div>== |
Line 114: |
Line 47: |
| | | |
| <p><b><span style="color:#800000">Supervisor:</span></b> [https://sis.smu.edu.sg/faculty/profile/9618 Prof. Kam Tin Seong]</p> | | <p><b><span style="color:#800000">Supervisor:</span></b> [https://sis.smu.edu.sg/faculty/profile/9618 Prof. Kam Tin Seong]</p> |
− |
| |
− | <p><b><span style="color:#800000">Sponsor:</span></b> Arup Singapore Pte. Ltd - [https://www.linkedin.com/in/chris-deakin-50370613/ Chris Deakin]: Rail Leader, Digital Engineering Leader</p>
| |
DonorsChoose.org is a non-profit organisation seeking to improve the education system in America. The DonorsChoose.org platform is a civic crowdfunding platform that allows public school teachers across America to reach out to potential donors. Donors can choose the type of projects to fund and can donate any amount to the cause of their liking. Since its founding in 2000, the platform has funded over a million projects and benefited over 27 million students.
The process starts with teachers submitting their project proposals, detailing the resources and materials they require and how the resources will benefit their students. Upon submission to DonorsChoose.org, volunteers review each project and determine whether it can be approved. As the number of project submissions is expected to exceed 500,000 in 2018, DonorsChoose.org has to scale its project approval process as well. A prediction model will help facilitate the process, but close attention has to be paid to the model so that it can reliably discern deserving projects.
Key Objectives
The objective of our project is to develop a model for DonorsChoose.org that predicts the likely approval status of projects submitted by teachers. Based on past data on the project, the teacher and the school, we seek to build a model that estimates each project’s likelihood of approval. Ideally, the model would achieve both high precision and high recall.
The expected outcome is for DonorsChoose.org to automate part of its project screening process and redirect effort towards examining proposals that need more assistance. The project will also offer insight into how organisations can use prediction models to scale manual processes efficiently while preserving accuracy.
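As an illustration of the precision and recall targets mentioned in the objectives, the sketch below computes both metrics for the "approved" class. The function name `precision_recall` and the label lists are hypothetical and made up for the example; they are not DonorsChoose.org data, and this is not necessarily how the final model will be evaluated.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class, where 1 = approved."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # approved, predicted approved
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # rejected, predicted approved
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # approved, predicted rejected
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for eight submitted projects (illustrative only).
actual = [1, 1, 1, 0, 1, 0, 1, 1]
predicted = [1, 1, 0, 0, 1, 1, 1, 1]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Precision indicates how many of the projects the model marks as approved really are approvable, while recall indicates how many approvable projects the model manages to find. Both matter here, since a missed deserving project and a wrongly approved one carry different costs for the screening process.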
Students: Friedemann & Josh
Supervisor: Prof. Kam Tin Seong