ANLY482 AY2017-18T2 Group02 Project Overview

PROJECT BACKGROUND

DonorsChoose.org is a non-profit organisation seeking to improve the education system in America. Its civic crowdfunding platform allows public school teachers across America to reach out to potential donors, who can choose which types of projects to fund and donate any amount to the causes of their liking. Since its founding in 2000, the platform has funded over a million projects and benefited over 27 million students.

The process starts with teachers submitting their project proposals, detailing the resources and materials they require and how these will benefit their students. Upon submission to DonorsChoose.org, volunteers review each proposal and determine whether it can be approved. As the number of project submissions is expected to exceed 500,000 in 2018, DonorsChoose.org has to scale its approval efforts accordingly. A prediction model will help facilitate the process, but the model must be designed carefully so that it can still selectively discern deserving projects.

OBJECTIVES

Key Objectives

The objective of our project is to develop a model for DonorsChoose.org that predicts the likely approval status of projects submitted by teachers. Based on past data on the project, the teacher, and the school, we will seek to build a model that can determine a project's rate of approval. Ideally, the model would achieve both high precision and high recall.

The expected outcome is for DonorsChoose.org to automate part of its project screening process and redirect efforts towards examining proposals that need more assistance. The project will also shed insight into how organisations can utilize prediction models to scale manual processes efficiently while preserving accuracy.

METHODOLOGY

While we are still awaiting our data, we have decided on three steps for processing it: cleaning, training, and cross-validation.

Clean

  1. Convert text-based data into categorical/numerical features
  2. Check all features for missing values, and treat them as appropriate in each case
  3. Check features for collinearity - we will only reduce dimensions right before training, because certain models do not require the removal of collinear variables
  4. Identify sparse features as candidates to be dropped or specially treated, depending on the model used
  5. Perform normalization or binning on skewed data, as appropriate (see the sketch after this list)
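
A minimal sketch of these cleaning steps in Python, assuming the data arrives as a CSV with hypothetical columns such as teacher_prefix and project_cost (the real schema is not yet known):

 import numpy as np
 import pandas as pd
 
 # Hypothetical file and column names; the actual schema is not yet known
 df = pd.read_csv("projects.csv")
 
 # 1. Convert text-based data into categorical/numerical features
 df = pd.get_dummies(df, columns=["teacher_prefix"])  # one-hot encoding
 
 # 2. Check all features for missing values, treating each case appropriately
 print(df.isna().sum())
 df["project_cost"] = df["project_cost"].fillna(df["project_cost"].median())
 
 # 3. Check features for collinearity (dimension reduction deferred to training)
 corr = df.select_dtypes(include=np.number).corr()
 collinear_pairs = (corr.abs() > 0.8) & (corr.abs() < 1.0)
 
 # 4. Identify sparse features as candidates for dropping or special treatment
 sparsity = (df == 0).mean()
 sparse_cols = sparsity[sparsity > 0.95].index.tolist()
 
 # 5. Normalize or bin skewed numeric data
 df["log_cost"] = np.log1p(df["project_cost"])
 df["cost_bin"] = pd.qcut(df["project_cost"], q=5, labels=False, duplicates="drop")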


Train

There are effectively two separate problems we are trying to solve:

i) Regression to estimate a project's rate of approval

ii) Classification to predict a project's approval status

As such, we will discuss the methodology for these two separately.

A) Predicting approval rate

Since the target we are trying to predict is numerical, we have to attempt some kind of regression or curve-fitting approach; a minimal sketch follows.
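
As a placeholder until the real data arrives, a brief regression sketch using scikit-learn, with synthetic data standing in for the project, teacher, and school features:

 import numpy as np
 from sklearn.datasets import make_regression
 from sklearn.linear_model import LinearRegression
 from sklearn.metrics import mean_squared_error
 
 # Synthetic data standing in for past project/teacher/school features
 X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
 
 # Fit a baseline linear model and report the in-sample error
 model = LinearRegression().fit(X, y)
 rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
 print(f"Baseline RMSE: {rmse:.2f}")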


B) Predicting approval status

Since the feature we are trying to predict is categorical, we can attempt a variety of classification approaches; a combined sketch follows the list below.


  1. Logistic regression - Since the dependent variable is binary - a project is either approved or not - logistic regression is a natural choice.
  2. Association rule mining - Given the mixed nature of the data, normalization and bucketing of features can make association rule mining a relatively direct approach to classification, requiring little parameter tuning or feature transformation. We may utilize the Apriori or Eclat algorithm depending on the size of the data set available.
  3. Gradient boosted trees - To handle the likely collinearity between several independent variables, a tree-based approach seems reasonable.
  4. Naive Bayes - As with the regression problem, a Bayesian approach to classification helps us avoid the overfitting that models like logistic regression might face when the data set is small relative to the number of features.
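
A minimal sketch of three of these classifiers using scikit-learn, again on synthetic placeholder data (association rule mining is omitted here, as it would typically rely on a separate library such as mlxtend):

 from sklearn.datasets import make_classification
 from sklearn.ensemble import GradientBoostingClassifier
 from sklearn.linear_model import LogisticRegression
 from sklearn.model_selection import cross_val_score
 from sklearn.naive_bayes import GaussianNB
 
 # Synthetic placeholder data: 1 = approved, 0 = not approved
 X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
 
 models = {
     "logistic regression": LogisticRegression(max_iter=1000),
     "gradient boosted trees": GradientBoostingClassifier(),
     "naive Bayes": GaussianNB(),
 }
 
 for name, model in models.items():
     # Cross-validated F1 balances the precision and recall targets
     # named in the project objectives
     scores = cross_val_score(model, X, y, cv=5, scoring="f1")
     print(f"{name}: mean F1 = {scores.mean():.3f}")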



Cross-validation & Testing

All models will be trained on a fixed set of data: we will perform a 60-20-20 split on the available data and reserve one of the 20% sets for cross-validation, which will be used to select the best-performing model.

Finally, the selected model will be tested on the remaining 20% of the data to check for overfitting to the validation set and to determine its predictive capability.
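
A minimal sketch of the 60-20-20 split, reusing the placeholder X and y from the sketches above (scikit-learn has no three-way split, so train_test_split is applied twice):

 from sklearn.model_selection import train_test_split
 
 # Hold out 40% of the data, then halve it into validation and test sets,
 # giving a 60-20-20 train/validation/test split
 X_train, X_rest, y_train, y_rest = train_test_split(
     X, y, test_size=0.4, random_state=42)
 X_val, X_test, y_val, y_test = train_test_split(
     X_rest, y_rest, test_size=0.5, random_state=42)
 
 # Candidate models are compared on (X_val, y_val); only the final
 # selected model is evaluated on the held-out (X_test, y_test)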

Since project submissions arrive in batches over time, we will also examine whether cross-training the model with data from different batches improves or reduces its predictive success, to provide another measure of overfitting.

STAKEHOLDERS

Students: Friedemann & Josh

Supervisor: Prof. Kam Tin Seong