ANLY482 AY2017-18T2 Group02 Findings & Insights

From Analytics Practicum
Revision as of 09:41, 14 April 2018 by Xzho.2014



DATA PREPARATION

Data cleaning

Of the 182,080 entries in train.csv, 3 columns had missing data. teacher_prefix had 4 missing values, while project_essay_3 and project_essay_4 each had 175,706 missing entries. For teacher_prefix, the empty entries were replaced with Unknown. For the missing project essays, we understood from the project brief that projects submitted after May 17 2016 were only required to include project_essay_1 and project_essay_2. The change in submission format also meant that projects submitted before and after May 17 could not be compared on the same basis. Noting that 175,706 entries were submitted after the change and only 6,374 before it, we removed the 6,374 earlier entries so that our analysis would be uniform. We were left with 175,706 complete project entries.
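The cleaning steps above can be sketched in Python with pandas. This is a minimal illustration, not the procedure we actually ran: the column name project_submitted_datetime, the toy rows, and the choice to treat May 17 itself as part of the new format are all assumptions.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up, not taken from train.csv.
df = pd.DataFrame({
    "teacher_prefix": ["Ms.", None, "Mr.", "Mrs."],
    # Assumed column name for the submission timestamp.
    "project_submitted_datetime": [
        "2016-04-27 00:27:36", "2016-05-17 09:00:00",
        "2016-09-01 14:45:10", "2017-01-10 08:12:03",
    ],
    "project_essay_3": ["extra essay", None, None, None],
    "project_essay_4": ["extra essay", None, None, None],
})

# Count missing values per column.
missing = df.isna().sum()

# Replace missing teacher prefixes with the label "Unknown".
df["teacher_prefix"] = df["teacher_prefix"].fillna("Unknown")

# Keep only projects submitted under the new two-essay format.
# Assumption: 17 May 2016 itself counts as the new format.
submitted = pd.to_datetime(df["project_submitted_datetime"])
cutoff = pd.Timestamp("2016-05-17")
cleaned = df[submitted >= cutoff]
```

On the real file, the length of cleaned would be compared against the 175,706 entries reported above as a sanity check.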

Identification of Response Variable

Our response variable is whether the project was approved. This is represented in the column project_is_approved, which takes the value 0 (not approved) or 1 (approved). We converted the column to a nominal data type. A simple distribution of the response variable shows that 84.69% of projects were approved and 15.31% were not. Our investigation therefore focuses on why certain projects fail and on engineering features that are representative of failed projects.
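A hedged sketch of the same check in pandas, using ten made-up rows rather than the real data. Treating the flag as a pandas category is our stand-in for a nominal data type and is an assumption about how the step would translate.

```python
import pandas as pd

# Ten made-up outcomes: eight approved, two not approved.
df = pd.DataFrame({"project_is_approved": [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]})

# Treat the 0/1 flag as a nominal (categorical) variable, not a number.
df["project_is_approved"] = df["project_is_approved"].astype("category")

# Percentage of projects in each class.
share = df["project_is_approved"].value_counts(normalize=True) * 100
approved_pct = float(share.loc[1])
rejected_pct = float(share.loc[0])
```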


FEATURE ENGINEERING

In our dataset, we were provided with features on the project submission, including information on the school, the teacher, the project and the resources requested for the project. There is a mix of categorical, continuous, time series and text columns.

Date

The date of project submission was recorded in y/m/d h:m:s format. To extract more information, we generated 2 additional columns: the month of project submission and the day of week of project submission. However, when tested against our response variable, project approval status, neither the month nor the day of week showed notable variation.
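The two derived columns can be sketched in pandas as follows. The column names, the hyphen date separator and the toy rows are assumptions. The cross-tabulation at the end is one simple way to compare approval rates across months, not necessarily the test we used.

```python
import pandas as pd

# Made-up rows; column names are assumed for illustration.
df = pd.DataFrame({
    "project_submitted_datetime": [
        "2016-05-17 09:00:00", "2016-05-20 11:15:00",
        "2016-12-25 18:30:00", "2016-12-26 07:45:00",
    ],
    "project_is_approved": [1, 0, 1, 1],
})

# Parse the timestamp (assumed to look like 2016-05-17 09:00:00).
ts = pd.to_datetime(df["project_submitted_datetime"], format="%Y-%m-%d %H:%M:%S")

# Derive month (1-12) and day-of-week name.
df["submit_month"] = ts.dt.month
df["submit_day_of_week"] = ts.dt.day_name()

# Approval rate per month: each row of the table sums to 1.
by_month = pd.crosstab(df["submit_month"], df["project_is_approved"], normalize="index")
```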

Region

We were provided with the state from which each project was submitted. When plotted against our response variable on a mosaic plot, several states, such as Texas, showed higher failure rates. We also recoded the states into 4 US regions: Northeast, Midwest, West and South.
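A minimal sketch of the region recoding in Python. It assumes that states are stored as two-letter codes and that the 4 regions follow the US Census Bureau grouping; the exact grouping used in the project is not stated, so both are assumptions. The helper state_to_region is a hypothetical name.

```python
# US Census Bureau regions keyed by name (assumed grouping), with the
# two-letter codes of the states in each region. DC is placed in South.
REGIONS = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE FL GA MD NC SC VA DC WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}

# Invert to a flat lookup: state code -> region name.
STATE_TO_REGION = {
    state: region
    for region, states in REGIONS.items()
    for state in states.split()
}


def state_to_region(code):
    """Return the region for a state code, or "Unknown" if unrecognised."""
    return STATE_TO_REGION.get(code.strip().upper(), "Unknown")
```

With a pandas column of state codes, the same lookup could be applied through Series.map(STATE_TO_REGION).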

Project Category & Subcategory

Project categories are categorical variables. We noted that entries could contain 2 categories, separated by a comma. For example, some projects were listed under "Health & Sports, Language & Literacy". As the ordering of these categories may carry meaning, we classified the first category mentioned as the primary category and the second as the secondary category. We did this using the ‘text to columns’ function in JMP, which yielded the following primary and secondary project categories:

Applied Learning

Health & Sports

History & Civics

Language & Literacy

Math & Science

Music & The Arts

Special Needs

Warmth, Care & Hunger

We performed the same transformation on Project Subcategory to obtain 27 subcategories.
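The primary/secondary split can be sketched in Python as below. This is an illustration rather than a reproduction of the JMP step. One detail worth noting is that "Warmth, Care & Hunger" itself contains a comma, so a plain comma split would cut it in two; the sketch matches against the 8 category names listed above first. The function name split_categories is hypothetical.

```python
# The 8 project categories listed above.
KNOWN_CATEGORIES = [
    "Applied Learning", "Health & Sports", "History & Civics",
    "Language & Literacy", "Math & Science", "Music & The Arts",
    "Special Needs", "Warmth, Care & Hunger",
]


def split_categories(value):
    """Split a category string into a (primary, secondary) tuple.

    Known category names are matched whole, so the comma inside
    "Warmth, Care & Hunger" is not treated as a separator.
    A missing secondary category is returned as None.
    """
    found = []
    rest = value.strip()
    while rest:
        for name in KNOWN_CATEGORIES:
            if rest.startswith(name):
                found.append(name)
                # Drop the matched name and any separator after it.
                rest = rest[len(name):].lstrip(", ")
                break
        else:
            # Unrecognised label: fall back to a plain comma split.
            head, _, rest = rest.partition(",")
            found.append(head.strip())
            rest = rest.strip()
    primary = found[0] if found else None
    secondary = found[1] if len(found) > 1 else None
    return primary, secondary
```

The same approach would carry over to the subcategories, given the list of 27 subcategory names.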

Text features

MODEL SELECTION

MODEL INTERPRETATION