ANLY482 AY2017-18T2 Group02 Project Overview
 
  
==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica';  border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">METHODOLOGY</font></div></div>==
 
 
<p>While we are still awaiting our data, we have decided on three steps for processing it: cleaning, training, and cross-validation.</p>
 
<p><b><span style="color:#800000">Clean</span></b></p>
 
<ol>
 
<li style="font-weight: 400;"><span style="font-weight: 400;">Convert text-based data into categorical/numerical features</span></li>
 
<li style="font-weight: 400;"><span style="font-weight: 400;">Check all features for missing values, and treat as appropriate in each case</span></li>
 
<li style="font-weight: 400;"><span style="font-weight: 400;">Check features for collinearity - we will only reduce dimensions right before training, because certain models do not require the removal of collinear variables, and in certain cases collinearity affects interpretability rather than predictive accuracy</span></li>
 
<li style="font-weight: 400;"><span style="font-weight: 400;">Identify sparse features as potential features to be dropped or specially treated depending on the model used</span></li>
 
<li style="font-weight: 400;"><span style="font-weight: 400;">Perform normalization or binning on skewed data, as appropriate</span></li>
 
</ol>
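<p><span style="font-weight: 400;">A minimal sketch of the cleaning steps above, in Python with pandas; the column names here are hypothetical placeholders, since the actual data has not yet arrived:</span></p>

```python
import numpy as np
import pandas as pd

# Hypothetical raw data -- these column names are placeholders, not the real schema.
df = pd.DataFrame({
    "material": ["steel", "steel", "aluminium", None],   # text-based feature
    "thickness_mm": [2.0, 2.1, np.nan, 1.8],             # numeric, with a missing value
    "load_kn": [10.0, 12.0, 9.5, 11.0],
})

# 1. Convert text-based data into a categorical feature
df["material"] = df["material"].astype("category")

# 2. Treat missing values (here: fill the numeric gap with the column median)
df["thickness_mm"] = df["thickness_mm"].fillna(df["thickness_mm"].median())

# 3. Check numeric features for collinearity before training
corr = df[["thickness_mm", "load_kn"]].corr()

# 4. Flag sparse features (mostly-missing columns) as candidates to drop
sparse_cols = [c for c in df.columns if df[c].isna().mean() > 0.5]

# 5. Normalize skewed data, e.g. with a log transform
df["load_kn_log"] = np.log1p(df["load_kn"])
```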
 
<p>&nbsp;</p>
 
<p><b><span style="color:#800000">Train</span></b></p>
 
<p><span style="font-weight: 400;">There are effectively two separate problems we are trying to solve: </span></p>
 
<p><span style="font-weight: 400;">i) Regression to determine design parameters</span></p>
 
<p><span style="font-weight: 400;">ii) Categorization to predict which features will require redesign</span></p>
 
<p><span style="font-weight: 400;">As such, we will discuss the methodology for these two separately.</span></p>
 
<p>&nbsp;</p>
 
<p><em><span style="font-weight: 400;">A) Predicting design parameters</span></em></p>
 
<p><span style="font-weight: 400;">Since the feature we are trying to predict is likely to be numerical, we will attempt some form of regression or curve-fitting approach.</span></p>
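<p><span style="font-weight: 400;">As an illustration, a minimal regression sketch with scikit-learn; the data is synthetic, standing in for the cleaned features and the numerical design parameter:</span></p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: X plays the role of the cleaned input features,
# y the numerical design parameter to be predicted.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
r_squared = model.score(X, y)  # goodness of fit on the training data
```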
 
<p>&nbsp;</p>
 
<p><em><span style="font-weight: 400;">B) Predicting redesign</span></em></p>
 
<p><span style="font-weight: 400;">Since the feature we are trying to predict is categorical, we can attempt a variety of categorization approaches.</span></p>
 
<p>&nbsp;</p>
 
<ol>
 
<li style="font-weight: 400;"><span style="font-weight: 400;">Logistic regression - Since the dependent variable is binary (a feature either does or does not require redesign), logistic regression is a natural choice.</span></li>
 
<li style="font-weight: 400;"><span style="font-weight: 400;">Association rule mining - Given the mixed nature of the data, normalizing and bucketing the features can make association rule mining a relatively direct approach to categorization, requiring little parameter tuning or feature transformation. We may use the Apriori or Eclat algorithm, depending on the size of the data set available.</span></li>
 
<li style="font-weight: 400;"><span style="font-weight: 400;">Gradient boosted trees - To handle the likely collinearity between several independent variables, a tree-based approach seems reasonable.</span></li>
 
<li style="font-weight: 400;"><span style="font-weight: 400;">Naive Bayes - As in the regression problem, a Bayesian approach to categorization helps us avoid the overfitting that models like logistic regression may suffer when the data set is small relative to the number of features.</span></li>
 
</ol>
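<p><span style="font-weight: 400;">As a sketch of the first approach, logistic regression with scikit-learn on synthetic stand-in data, where y represents the binary &ldquo;requires redesign&rdquo; flag:</span></p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: y = 1 means the feature requires redesign.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
redesign_proba = clf.predict_proba(X)[:, 1]  # P(requires redesign) per row
```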
 
<p><br /><br /></p>
 
<p><b><span style="color:#800000">Cross-validation &amp; Testing</span></b></p>
 
<p><span style="font-weight: 400;">All models will be trained on a fixed set of data: we will perform a 60-20-20 split on the available data and reserve one of the 20% sets for cross-validation, used to select the best-performing model.</span></p>
 
<p><span style="font-weight: 400;">Finally, the selected model will be tested on the remaining 20% of the data to check for overfitting to the validation set and to determine its predictive capability.</span></p>
 
<p><span style="font-weight: 400;">Since design-and-build happens in batches, we will also examine whether cross-training the model with data from different batches improves or degrades prediction accuracy, providing another measure of overfitting.</span></p>
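<p><span style="font-weight: 400;">The 60-20-20 split described above can be obtained with two successive splits; a sketch using scikit-learn on stand-in data:</span></p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # stand-in features
y = np.arange(100)                 # stand-in target

# First hold out 40%, then halve that into validation and test sets: 60/20/20.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
```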
 
  
 
==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica';  border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">STAKEHOLDERS</font></div></div>==
 

Latest revision as of 09:00, 15 April 2018




PROJECT BACKGROUND

DonorsChoose.org is a non-profit organisation seeking to improve the education system in America. The DonorsChoose.org platform is a civic crowdfunding platform that allows public school teachers across America to reach out to potential donors. Donors can choose the type of projects to fund and can donate any amount to the cause of their liking. Since its founding in 2000, the platform has funded over a million projects and benefited over 27 million students.

The process starts with teachers submitting their project proposals, detailing the resources and materials they require and how these will benefit their students. Upon submission to DonorsChoose.org, volunteers review the proposal and determine whether it can be approved. As the number of project submissions is expected to exceed 500,000 in 2018, DonorsChoose.org has to scale its project approval process accordingly. A prediction model will help facilitate the process, but the model must be designed carefully so that it can still reliably identify deserving projects.

OBJECTIVES

Key Objectives

The objective of our project is to develop a model for DonorsChoose.org to predict the likely approval status of projects submitted by teachers. Based on past data on the project, the teacher, and the school, we seek to build a model that can predict each project's probability of approval. Ideally, the model would achieve both high precision and high recall.
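As an illustration of the two metrics, a short Python sketch using scikit-learn on made-up labels (1 = approved, 0 = not approved; these numbers are purely illustrative):

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels purely for illustration: 1 = approved, 0 = not approved.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

precision = precision_score(y_true, y_pred)  # of predicted approvals, the share truly approved
recall = recall_score(y_true, y_pred)        # of true approvals, the share the model caught
```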

The expected outcome is for DonorsChoose.org to automate part of its project screening process and redirect effort into examining proposals that need more assistance. The project will also shed light on how organisations can use prediction models to scale manual processes efficiently while preserving accuracy.


STAKEHOLDERS

Students: Friedemann & Josh

Supervisor: Prof. Kam Tin Seong