ANLY482 AY2017-18T2 Group02 Findings & Insights
DATA PREPARATION
Data cleaning
Of the 182,080 entries in train.csv, 3 columns had missing data. teacher_prefix had 4 missing values, while project_essay_3 and project_essay_4 each had 175,706 missing entries. For teacher_prefix, the empty entries were replaced with Unknown. For the missing project essays, we understood from the project brief that submissions after May 17 2016 only required project_essay_1 and project_essay_2. The change in submission format also meant that projects submitted before and after May 17 could not be compared on the same basis. Noting that 175,706 entries were submitted after the change and only 6,374 before it, we removed the 6,374 earlier entries so that our analysis would be uniform. We were left with 175,706 complete project entries.
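The cleaning steps above were carried out in JMP. As a rough equivalent, they could be sketched in pandas as follows. The column name project_submitted_datetime, the timestamp format and the choice to treat May 17 itself as part of the new format are assumptions for illustration, not details confirmed by the project brief.

```python
import pandas as pd

def clean_projects(df, cutoff="2016-05-17"):
    """Fill missing teacher prefixes and keep only new-format submissions."""
    out = df.copy()
    # Replace the few missing teacher prefixes with an explicit label.
    out["teacher_prefix"] = out["teacher_prefix"].fillna("Unknown")
    # Assumption: submissions on or after the cutoff use the two-essay format.
    submitted = pd.to_datetime(out["project_submitted_datetime"])
    return out[submitted >= pd.Timestamp(cutoff)].reset_index(drop=True)

# Toy rows standing in for train.csv.
toy = pd.DataFrame({
    "teacher_prefix": ["Mrs.", None, "Mr."],
    "project_submitted_datetime": [
        "2016-04-30 10:00:00",
        "2016-06-01 09:30:00",
        "2016-05-17 00:00:01",
    ],
})
cleaned = clean_projects(toy)
```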
Identification of Response Variable
In our dataset, our response variable is whether the project is approved or not. This is represented in the column project_is_approved, which can hold values of either 0 (not approved) or 1 (approved). We changed the column data type into a nominal data type. A simple distribution of the response variable shows that 84.69% of projects were approved and 15.31% were not. Hence, our investigative efforts focus on why certain projects fail and on engineering features that are representative of failed projects.
FEATURE ENGINEERING
In our dataset, we were provided with features on the project submission, including information on the school, the teacher, the project and the resources requested for the project. There is a mix of categorical, continuous, time series and text columns.
Date
We had a column for the date of project submission, coded in y/m/d h:m:s format. To extract more information, we generated 2 additional columns: the month and the day of week of project submission. However, when tested against our response variable, project approval status, the month and day of week of project submission showed no notable variation.
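A minimal sketch of the two derived date columns, using only the Python standard library. The exact timestamp layout ("%Y-%m-%d %H:%M:%S") is an assumption based on the y/m/d h:m:s description.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def date_features(stamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the month number and day-of-week name for one submission."""
    parsed = datetime.strptime(stamp, fmt)
    # datetime.weekday() numbers Monday as 0 and Sunday as 6.
    return {"month": parsed.month, "day_of_week": DAY_NAMES[parsed.weekday()]}

example = date_features("2016-11-18 14:45:59")
```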
Region
We were provided with the state from which the project was submitted. When plotted against our response variable on a Mosaic plot, we noted that several states such as Texas had higher failure rates. We also recoded the states into the 4 US regions: Northeast, Midwest, West and South.
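The state-to-region recode can be expressed as a lookup table. The sketch below assumes that school_state holds two-letter postal codes and that the 4 regions follow the US Census Bureau grouping; the source names the regions but not the exact state assignment.

```python
# Assumption: US Census Bureau regions (50 states plus DC).
REGION_STATES = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE FL GA MD NC SC VA DC WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}

# Invert the table so each state code points at its region.
STATE_TO_REGION = {
    state: region
    for region, states in REGION_STATES.items()
    for state in states.split()
}

def region_of(state_code):
    """Map a two-letter state code to its region, or 'Unknown' if unmapped."""
    return STATE_TO_REGION.get(state_code.strip().upper(), "Unknown")
```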
Teacher Characteristics
We had information on the teacher's title, which could be either Mr., Ms., Mrs., Dr. or Teacher. We decided to recode these titles into gender as well, with Dr. and Teacher falling into the Unknown category.
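The title-to-gender recode is a small lookup as well. In this sketch the labels Male and Female are assumed names for the two known categories; only the Unknown category is named in the text.

```python
# Assumed category labels; Dr. and Teacher carry no gender information.
PREFIX_TO_GENDER = {
    "Mr.": "Male",
    "Ms.": "Female",
    "Mrs.": "Female",
    "Dr.": "Unknown",
    "Teacher": "Unknown",
}

def gender_of(prefix):
    """Recode a teacher title into gender, defaulting to 'Unknown'."""
    return PREFIX_TO_GENDER.get(prefix, "Unknown")
```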
Project category & subcategory
Project categories are categorical variables. We noted that the entries contained 2 categories, separated by a comma. For example, we had project categories under "Health & Sports, Language & Literacy". As the ordering of these categories may contain meaning, we decided to classify the first mentioned category as the primary category and the second one as the secondary category. We managed to do this using the ‘text to columns’ function on JMP to obtain the primary and secondary project categories as follows:
- Applied Learning
- Health & Sports
- History & Civics
- Language & Literacy
- Math & Science
- Music & The Arts
- Special Needs
- Warmth, Care & Hunger
We performed the same transformation on Project Subcategory to obtain 27 subcategories.
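One detail worth noting when reproducing the split outside JMP: the category name "Warmth, Care & Hunger" itself contains a comma, so a plain comma split would cut it in two. The sketch below guards that one name before splitting. How the original 'text to columns' step treated this case is not stated, so this handling is an assumption.

```python
PROTECTED = "Warmth, Care & Hunger"  # category name that contains a comma
PLACEHOLDER = "\x00"                 # temporary stand-in, never in real data

def split_categories(raw):
    """Split a category string into (primary, secondary)."""
    parts = raw.replace(PROTECTED, PLACEHOLDER).split(",")
    names = [p.strip().replace(PLACEHOLDER, PROTECTED)
             for p in parts if p.strip()]
    if not names:
        return None, None
    # The first listed category is primary, the second (if any) secondary.
    return names[0], (names[1] if len(names) > 1 else None)
```

For example, split_categories("Health & Sports, Language & Literacy") separates the two names, while an entry holding only "Warmth, Care & Hunger" stays whole.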
Resources
In our resources csv file, we noted that each project submission, tagged by the project ID, can have multiple resources requested. Hence we decided to investigate 4 features: the total price of the resources requested, the total quantity, the average price per quantity and the number of distinct items requested. These 4 features were then joined to our main csv file via the project ID.
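A possible pandas version of this aggregation and join is sketched below. The column names id, description, quantity and price are assumed, and the total price is taken as the plain sum of the price column (matching the Sum(price) feature name) rather than price multiplied by quantity.

```python
import pandas as pd

def resource_features(resources):
    """Aggregate resource rows into one row of features per project ID."""
    grouped = resources.groupby("id")
    feats = pd.DataFrame({
        "sum_price": grouped["price"].sum(),
        "sum_quantity": grouped["quantity"].sum(),
        "n_distinct_resources": grouped["description"].nunique(),
    })
    feats["price_per_qty"] = feats["sum_price"] / feats["sum_quantity"]
    return feats.reset_index()

# Toy data standing in for the resources and main csv files.
toy_resources = pd.DataFrame({
    "id": ["p1", "p1", "p2"],
    "description": ["wobble chair", "balance ball", "book set"],
    "quantity": [2, 1, 4],
    "price": [10.0, 5.0, 2.5],
})
toy_projects = pd.DataFrame({"id": ["p1", "p2"]})

# A left join keeps every project, even one with no resource rows.
joined = toy_projects.merge(resource_features(toy_resources), on="id", how="left")
```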
In addition, the description column in the resources csv file contained text on the type of resource requested. This included specific information on the item name, brand and in certain cases the model as well. The JMP Text Explorer function was performed on the column, giving us the top 8 commonly requested items. Dummy variables for these 8 items were created as well. The 8 items and their frequency count are as below:
Item | Count
wobble chair | 7471
ipad mini | 7318
dry erase | 5820
balance ball | 5727
complete set | 5357
10 subscriptions | 3721
book set | 2501
construction paper | 1955
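The dummy variables for the 8 items can be approximated with a case-insensitive phrase match over a project's resource descriptions. This is only a sketch: JMP Text Explorer tokenises and stems the text before counting phrases, so a plain substring match is an assumption and may not reproduce the counts above.

```python
TOP_ITEMS = [
    "wobble chair", "ipad mini", "dry erase", "balance ball",
    "complete set", "10 subscriptions", "book set", "construction paper",
]

def item_dummies(descriptions, items=TOP_ITEMS):
    """Return a 0/1 flag per item: 1 if any description mentions it."""
    texts = [d.lower() for d in descriptions]
    return {item: int(any(item in text for text in texts)) for item in items}

flags = item_dummies(["Apple iPad Mini 4 (32GB)", "Dry Erase markers, pack of 12"])
```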
Text Features
We had a total of 4 text columns: the Project Title, Project essay 1, Project essay 2 and the Project Resource Summary, all of which require teachers to provide input. Project Essay 1 requires teachers to describe the current state of their students and the school. Project Essay 2 requires teachers to provide details on how the resources requested will benefit their students. Our hypothesis is that Project Essay 2 will yield more representative features, as its requirements are more specific to the project.
No. of characters
We obtained the no. of characters for text data to observe if length of titles and essays would affect approval rate.
Document Term Matrix (DTM)
These were the steps involved in identifying representative phrases from the DTM:
- Stemming and removal of stopwords
- Obtain Document Term Matrix for failed and approved projects separately via JMP Pro Text Explorer
- Identify representative phrases that occur in the top 20 most frequent phrases for failed projects but do not appear in approved projects
- Create dummy variables for these representative phrases
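The steps above can be sketched in plain Python as follows. Several simplifications are assumed: stemming is omitted, the stopword list is a tiny illustrative one, phrases are two-word sequences counted once per document, and "do not appear in approved projects" is read as "not among the top phrases for approved projects". JMP Pro Text Explorer will differ in all of these details.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "my", "our", "we"}

def phrases(text, n=2):
    """Lowercase, drop stopwords, and return the n-word phrases in order."""
    tokens = [t for t in re.findall(r"[a-z']+", text.lower())
              if t not in STOPWORDS]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_phrases(docs, k=20, n=2):
    """Return the k phrases that occur in the most documents."""
    counts = Counter()
    for doc in docs:
        # dict.fromkeys de-duplicates while keeping a stable order.
        counts.update(list(dict.fromkeys(phrases(doc, n))))
    return [phrase for phrase, _ in counts.most_common(k)]

def representative_phrases(failed_docs, approved_docs, k=20, n=2):
    """Top-k phrases of failed projects that are absent from the approved top-k."""
    approved_top = set(top_phrases(approved_docs, k, n))
    return [p for p in top_phrases(failed_docs, k, n) if p not in approved_top]

failed = ["We need school supplies", "School supplies for class",
          "Books for reading time"]
approved = ["Books for reading time", "Reading time every day"]
found = representative_phrases(failed, approved)
```

Each phrase returned this way would then become one binary column, set to 1 when the phrase occurs in a project's text.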
Latent Class Analysis Clustering
We also performed Latent Class Analysis Clustering on the text columns. Using Project essay 2 as an example, the clusters were identified and each was given a label based on the most frequently occurring words in the cluster. The cluster labels for project essay 2 are as follows:
Cluster | Label
Cluster 1 | Technology access projects
Cluster 2 | Creative science projects
Cluster 3 | Projects requesting for supplies
Cluster 4 | Reading projects
Cluster 5 | Seating mobility projects
Cluster 6 | Active play projects
Cluster 7 | Math skill projects
Each project was then assigned to their most likely cluster.
SVD Topic analysis
As Project essay 2 was recognised to be an influential text feature, we decided to perform Singular Value Decomposition (SVD) topic analysis on Project essay 2. We obtained 10 separate topics, and each project was assigned a score for each topic. This generated 10 additional topic columns.
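To illustrate how topic scores arise from an SVD, the sketch below decomposes a small document-term matrix with NumPy and keeps the leading components. It is a plain truncated SVD on raw counts; any term weighting or rotation that JMP applies in its topic analysis is not reproduced here, and the toy matrix is invented for illustration.

```python
import numpy as np

def svd_topic_scores(dtm, n_topics):
    """Return per-document topic scores and per-topic term loadings."""
    u, s, vt = np.linalg.svd(np.asarray(dtm, dtype=float), full_matrices=False)
    # Scale the left singular vectors by the singular values so that
    # scores @ loadings approximates the original matrix.
    scores = u[:, :n_topics] * s[:n_topics]
    loadings = vt[:n_topics]
    return scores, loadings

# Rows are documents, columns are terms (toy counts).
toy_dtm = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
    [1, 0, 0, 1],
], dtype=float)

scores, loadings = svd_topic_scores(toy_dtm, n_topics=2)
```

Each column of scores plays the role of one topic column in the feature list, with one value per project.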
LIST OF FEATURES
No. | Original Features | Classification | Remarks
1 | teacher_prefix | Category |
2 | school_state | Category |
3 | project_grade_category | Category |
4 | teacher_number_of_previously_submitted_projects | Continuous |
No. | Engineered Features | Classification | Remarks
5 | gender | Category |
6 | region | Category |
7 | day of week | Category |
8 | month | Category |
9 | primary category | Category |
10 | secondary category | Category |
11 | primary subcategory | Category |
12 | secondary subcategory | Category |
13 | no. of distinct resources | Continuous |
14 | Sum(price) | Continuous |
15 | Sum(quantity) | Continuous |
16 | Price/Qty | Continuous |
17 | ipad mini | Binary |
18 | wobble chair | Binary |
19 | book set | Binary |
20 | dry erase | Binary |
21 | 10 subscriptions | Binary |
22 | balance ball | Binary |
23 | complete set | Binary |
24 | construction paper | Binary |
25 | Length[project_title] | Continuous |
26 | Length[project_essay_1] | Continuous |
27 | Length[project_essay_2] | Continuous |
28 | Length[project_resource_summary] | Continuous |
29 | Project Title LCA Most Likely Cluster | Category |
30 | Essay 1 LCA Most Likely Cluster | Category |
31 | Essay 2 LCA Most Likely Cluster | Category |
32 | Project Resource LCA Most Likely Cluster | Category |
33 | hands on learning | Binary | Representative phrase in project title for failed projects
34 | school supplies | Binary | Representative phrase in project title for failed projects
35 | learning environment | Binary | Representative phrase in project title for failed projects
36 | materials will help | Binary | Representative phrase in project essay 2 for failed projects
37 | art supplies | Binary | Representative phrase in project resource summary for failed projects
38 | Topic 1- Flexible seating | Continuous | Project essay 2 topic
39 | Topic 2- Creative art crafts | Continuous | Project essay 2 topic
40 | Topic 3- Healthy lifestyle | Continuous | Project essay 2 topic
41 | Topic 4- Book reading | Continuous | Project essay 2 topic
42 | Topic 5- Literacy in words & math | Continuous | Project essay 2 topic
43 | Topic 6- School supplies | Continuous | Project essay 2 topic
44 | Topic 7- Technology access | Continuous | Project essay 2 topic
45 | Topic 8- Academic development | Continuous | Project essay 2 topic
46 | Topic 9- Learning environment | Continuous | Project essay 2 topic
47 | Topic 10- Engineering | Continuous | Project essay 2 topic
MODEL SELECTION
For our prediction problem, we would like to utilise tree-based models. Given Sharma (2009)'s experience comparing ensemble tree-based models against direct discriminative models such as logistic regression, we decided to explore this approach with some added diversity. A random forest grows many trees on bootstrapped random samples of the data and aggregates them, which gives a more stable model and a more reliable ranking of the determinant variables than a single tree.
In addition, we will utilise JMP Pro's boosted trees. Boosting is based on weak learners, i.e. shallow trees rather than the fully grown ones used in random forests, fitted sequentially so that each tree corrects the errors of the previous ones. In this way, we attempt to reduce bias (underfitting), as a counterpart to the random forest approach, which instead averages many low-bias, high-variance trees in order to reduce variance.
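The contrast between the two ensembles can be illustrated with scikit-learn, used here as a stand-in for JMP Pro's bootstrap forest and boosted tree platforms. The data is synthetic (with roughly 85% of cases in the positive class to mimic the approval rate), and all hyperparameters are illustrative assumptions rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the engineered feature table.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.15, 0.85], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    # Fully grown trees on bootstrap samples, averaged to reduce variance.
    "bootstrap forest": RandomForestClassifier(
        n_estimators=100, max_depth=None, random_state=0),
    # Shallow trees fitted sequentially, each correcting earlier errors.
    "boosted trees": GradientBoostingClassifier(
        n_estimators=100, max_depth=2, learning_rate=0.1, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_valid, y_valid)  # validation accuracy
```

With a response that is about 85% positive, accuracy alone can look high for a model that rarely predicts failure, so the per-class counts in a confusion matrix are the more informative comparison.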
MODEL INTERPRETATION
Confusion matrix for bootstrap forest (Training, Validation and Test sets)
Table 14. Confusion matrix for boosted trees (Training, Validation and Test sets)