Difference between revisions of "ANLY482 AY2017-18T2 Group02 Findings & Insights"

From Analytics Practicum
  
 
| style="border-bottom:4px solid #2f2929; border-top:5px solid #2f2929; background:none;" width="1%" |    
 
| style="border-bottom:4px solid #2f2929; border-top:5px solid #2f2929; background:none;" width="1%" |    
| style="padding:0.4em; font-size:90%; background-color:#ffffff;  border-bottom:4px solid #2f2929; border-top:5px solid #2f2929; text-align:center; color:#2f2929" width="10%" |[[ANLY482 AY2017-18T2 Group02 Analytics Reflection|<font color="#3d3d3d" size=2><b>Reflection</b></font>]]
+
| style="padding:0.4em; font-size:90%; background-color:#ffffff;  border-bottom:4px solid #2f2929; border-top:5px solid #2f2929; text-align:center; color:#2f2929" width="10%" |[[ANLY482 AY2017-18T2 Group02 Project Management|<font color="#3d3d3d" size=2><b>Project Management</b></font>]]
 +
 
 +
| style="border-bottom:4px solid #2f2929; border-top:5px solid #2f2929; background:none;" width="1%" | &nbsp;
 +
| style="padding:0.4em; font-size:90%; background-color:#ffffff;  border-bottom:4px solid #2f2929; border-top:5px solid #2f2929; text-align:center; color:#2f2929" width="10%" |[[ANLY482 AY2017-18 Term 2 |<font color="#3d3d3d"><b>Back to project list</b></font>]]
  
 
|} <br>
 
|} <br>
  
<div style="margin:20px; padding: 10px; background: #ffffff; font-family: Trebuchet MS, sans-serif; font-size: 95%;-webkit-border-radius: 15px;-webkit-box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96); -moz-box-shadow:    7px 4px 14px rgba(176, 155, 121, 0.96);box-shadow: 7px 4px 14px rgba(176, 155, 121, 0.96);">
 
<font size =3 face=Georgia >
 
<b><span style="color:#FF91A4">Thesis, data outline, methodology, and results</span></b>
 
</font>
 
==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica';  border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">DATA PREPARATION</font></div></div>==
+
 
 +
<b><span style="color:#800000">Data cleaning</span></b>
 +
<p>
 +
Of the 182,080 entries in train.csv, 3 columns had missing data: teacher_prefix had 4 missing values, while project_essay_3 and project_essay_4 had 175,706 missing entries. For teacher_prefix, the empty entries were replaced with Unknown. For the missing project essays, we understood from the project brief that submissions made after May 17, 2016 only required teachers to submit project_essay_1 and project_essay_2. The change in submission format also meant that projects submitted before and after May 17, 2016 could not be compared on the same basis. Noting that 175,706 entries were submitted after the change and only 6,374 before it, we removed the 6,374 earlier entries so that our analysis would be uniform. We were left with 175,706 complete project entries.
 +
</p>
 +
 
 +
<b><span style="color:#800000">Identification of Response Variable</span></b>
 +
 
 +
<p>
 +
In our dataset, the response variable is whether the project was approved. This is represented in the column project_is_approved, which holds either 0 (not approved) or 1 (approved). We changed the column's data type to nominal. A simple distribution of the response variable shows that 84.69% of projects were approved and 15.31% failed. Hence, our investigative efforts focus on why certain projects fail and on engineering features that are representative of failed projects.
 +
</p>
 +
 
 +
 
 +
==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica';  border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">FEATURE ENGINEERING</font></div></div>==
 +
<p>
 +
In our dataset, we were provided with features on the project submission, including information on the school, the teacher, the project and the resources requested for the project.
 +
There is a mix of categorical, continuous, time series and text columns.
 +
</p>
 +
<b><span style="color:#800000">Date</span></b>
 +
<p>
 +
The date of project submission was recorded in a single column in y/m/d h:m:s format. To extract more information, we generated 2 additional columns: the month of project submission and the day of the week of project submission.
 +
</p>
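<p>The date extraction above was done in JMP; a minimal Python sketch of the same idea follows. The timestamp layout <code>%Y-%m-%d %H:%M:%S</code> and the function name <code>month_and_weekday</code> are assumptions made for illustration, not details taken from the dataset.</p>

```python
from datetime import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Saturday", "Sunday"]


def month_and_weekday(timestamp):
    """Return (month name, day-of-week name) for a submission timestamp.

    Assumes the timestamp looks like '2016-05-17 09:30:00' (hypothetical layout).
    """
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # datetime.weekday() counts Monday as 0, matching the WEEKDAYS list order.
    return MONTHS[parsed.month - 1], WEEKDAYS[parsed.weekday()]


example = month_and_weekday("2016-05-17 09:30:00")
```

<p>Both outputs are returned as names rather than numbers so that they are treated as categories, in line with how month and day of week are classified in the feature list.</p>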
 +
<b><span style="color:#800000">Region</span></b>
 +
<p>
 +
We were provided with the state from which each project was submitted. In addition to the state, we recoded each state into one of 4 US regions (Northeast, Midwest, West and South) based on its location.
 +
</p>
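<p>As an illustration of the recoding, the sketch below maps two-letter state codes to the 4 regions. It assumes the US Census Bureau region definitions and that states are stored as postal abbreviations; the report only says the grouping was based on location, so the exact assignment used may differ.</p>

```python
# Assumed grouping: US Census Bureau regions (not confirmed by the report).
REGION_STATES = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE FL GA MD NC SC VA DC WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}

# Invert the grouping into a state -> region lookup table.
STATE_TO_REGION = {
    state: region
    for region, states in REGION_STATES.items()
    for state in states.split()
}


def region_of(state_code):
    """Return the region for a state code, or 'Unknown' if it is not listed."""
    return STATE_TO_REGION.get(state_code.strip().upper(), "Unknown")
```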
 +
<b><span style="color:#800000">Teacher Characteristics</span></b>
 +
<p>
 +
We had information on the teacher's title, which could be either Mr., Ms., Mrs., Dr. or Teacher. We decided to recode these titles into gender as well, with Dr. and Teacher falling into the Unknown category.
 +
</p>
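<p>A minimal sketch of this recoding is shown below. The output labels <code>Male</code> and <code>Female</code> are assumed names; the report only states that Dr. and Teacher fall into the Unknown category.</p>

```python
# Titles that imply a gender; every other value is treated as Unknown.
PREFIX_TO_GENDER = {
    "Mr.": "Male",
    "Ms.": "Female",
    "Mrs.": "Female",
}


def recode_gender(prefix):
    """Map a teacher title to a gender category.

    'Dr.', 'Teacher', missing values and the 'Unknown' placeholder used during
    data cleaning all map to 'Unknown'.
    """
    return PREFIX_TO_GENDER.get((prefix or "").strip(), "Unknown")
```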
 +
<b><span style="color:#800000">Project category & subcategory</span></b>
 +
<p>
 +
Project categories are categorical variables. We noted that entries could contain 2 categories, separated by a comma. For example, some projects were listed under "Health & Sports, Language & Literacy". As the ordering of these categories may carry meaning, we classified the first-mentioned category as the primary category and the second as the secondary category. We did this using the ‘text to columns’ function in JMP to obtain the primary and secondary project categories, which take the following values:
 +
</p>
 +
 
 +
<table style="width: 1020px; margin-left: auto; margin-right: auto;" border="2">
 +
<tr>
 +
<td style="text-align: center;" width="78">
 +
<p>Applied Learning</p>
 +
</td>
 +
<td style="text-align: center;" width="78">
 +
<p>Health &amp; Sports</p>
 +
</td>
 +
<td style="text-align: center;" width="78">
 +
<p>History &amp; Civics</p>
 +
</td>
 +
<td style="text-align: center;" width="78">
 +
<p>Language &amp; Literacy</p>
 +
</td>
 +
<td style="text-align: center;" width="78">
 +
<p>Math &amp; Science</p>
 +
</td>
 +
<td style="text-align: center;" width="78">
 +
<p>Music &amp; The Arts</p>
 +
</td>
 +
<td style="text-align: center;" width="78">
 +
<p>Special Needs</p>
 +
</td>
 +
<td style="text-align: center;" width="78">
 +
<p>Warmth, Care &amp; Hunger</p>
 +
</td>
 +
</tr>
 +
</table>
 +
 
 +
We performed the same transformation on Project Subcategory to obtain 27 subcategories.
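<p>The split itself was done with JMP's ‘text to columns’ function. The Python sketch below shows one way to reproduce it, with one caveat made explicit: the category "Warmth, Care &amp; Hunger" contains a comma itself, so a plain split on commas would break it apart. The function name <code>split_categories</code> is hypothetical, and matching against the known category names is an assumption about how to handle that case.</p>

```python
KNOWN_CATEGORIES = [
    "Applied Learning",
    "Health & Sports",
    "History & Civics",
    "Language & Literacy",
    "Math & Science",
    "Music & The Arts",
    "Special Needs",
    "Warmth, Care & Hunger",
]


def split_categories(raw):
    """Split a raw category string into (primary, secondary).

    Known category names are matched from the left, so the comma inside
    'Warmth, Care & Hunger' is not mistaken for a separator.
    """
    remaining = raw.strip()
    found = []
    while remaining:
        match = next((c for c in KNOWN_CATEGORIES if remaining.startswith(c)), None)
        if match is None:
            raise ValueError("Unrecognised category text: " + remaining)
        found.append(match)
        # Drop the matched name plus any separating comma and spaces.
        remaining = remaining[len(match):].lstrip(", ")
    primary = found[0] if found else None
    secondary = found[1] if len(found) > 1 else None
    return primary, secondary
```

<p>For example, <code>split_categories("Health & Sports, Language & Literacy")</code> yields "Health & Sports" as the primary category and "Language & Literacy" as the secondary one.</p>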
 +
 
 +
<b><span style="color:#800000">Resources</span></b>
 +
<p>
 +
In our resources csv file, we noted that each project submission, tagged by the project ID, can have multiple resources requested. Hence, we decided to investigate 4 features: the total price of the resources requested, the total quantity, the average price per quantity and the number of distinct items requested. These 4 features were then joined to our main csv file via the project ID.
 +
</p>
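<p>The aggregation can be sketched as follows. The column names <code>id</code>, <code>price</code>, <code>quantity</code> and <code>description</code> are assumed, and the total price is taken as the plain sum of the price column, mirroring the Sum(price) feature name; whether prices should first be multiplied by quantity is not stated in this report.</p>

```python
from collections import defaultdict


def aggregate_resources(rows):
    """Aggregate resource rows into one feature record per project ID.

    Each row is a dict with the assumed keys 'id', 'price', 'quantity' and
    'description'.
    """
    totals = defaultdict(lambda: {"sum_price": 0.0, "sum_quantity": 0, "items": set()})
    for row in rows:
        entry = totals[row["id"]]
        entry["sum_price"] += float(row["price"])
        entry["sum_quantity"] += int(row["quantity"])
        entry["items"].add(row["description"])

    features = {}
    for project_id, entry in totals.items():
        quantity = entry["sum_quantity"]
        features[project_id] = {
            "sum_price": entry["sum_price"],
            "sum_quantity": quantity,
            # Guard against a zero total quantity.
            "price_per_qty": entry["sum_price"] / quantity if quantity else None,
            "distinct_items": len(entry["items"]),
        }
    return features


sample_rows = [
    {"id": "p1", "price": "10.0", "quantity": "2", "description": "wobble chair"},
    {"id": "p1", "price": "5.0", "quantity": "3", "description": "dry erase markers"},
    {"id": "p2", "price": "20.0", "quantity": "1", "description": "book set"},
]
sample_features = aggregate_resources(sample_rows)
```

<p>The resulting dictionary is keyed by project ID, so it can be joined to the main table on that key.</p>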
 +
<p>In addition, the description column in the resources csv file contained text on the type of resource requested. This included specific information on the item name, the brand and, in certain cases, the model. We ran the JMP Text Explorer on this column to obtain the 8 most commonly requested items, and created a dummy variable for each of them. The 8 items and their frequency counts are shown below:</p>
 +
<table style="margin-left: auto; margin-right: auto;" border="2" width="1020">
 +
<tr>
 +
<td width="52">
 +
<p>&nbsp;</p>
 +
</td>
 +
<td width="58">
 +
<p>wobble chair</p>
 +
</td>
 +
<td width="49">
 +
<p>ipad mini</p>
 +
</td>
 +
<td width="52">
 +
<p>dry erase</p>
 +
</td>
 +
<td width="61">
 +
<p>balance ball</p>
 +
</td>
 +
<td width="67">
 +
<p>complete set</p>
 +
</td>
 +
<td width="86">
 +
<p>10 subscriptions</p>
 +
</td>
 +
<td width="55">
 +
<p>book set</p>
 +
</td>
 +
<td width="82">
 +
<p>construction paper</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="52">
 +
<p>Count</p>
 +
</td>
 +
<td width="58">
 +
<p>7471</p>
 +
</td>
 +
<td width="49">
 +
<p>7318</p>
 +
</td>
 +
<td width="52">
 +
<p>5820</p>
 +
</td>
 +
<td width="61">
 +
<p>5727</p>
 +
</td>
 +
<td width="67">
 +
<p>5357</p>
 +
</td>
 +
<td width="86">
 +
<p>3721</p>
 +
</td>
 +
<td width="55">
 +
<p>2501</p>
 +
</td>
 +
<td width="82">
 +
<p>1955</p>
 +
</td>
 +
</tr>
 +
</table>
 +
 
 +
<b><span style="color:#800000">Text Features</span></b>
 +
<p>
 +
We had a total of 4 text columns: the Project Title, Project Essay 1, Project Essay 2 and the Project Resource Summary, all of which require teachers to provide input. Project Essay 1 requires teachers to describe the current state of their students and the school. Project Essay 2 requires teachers to provide details on how the resources requested will benefit their students. Our hypothesis is that Project Essay 2 will provide more representative features, as its requirements are more specific to the project.
 +
</p>
 +
<p><strong>No. of characters</strong></p>
 +
<p>We computed the number of characters in each text column to observe whether the length of titles and essays affects the approval rate.</p>
 +
<p><strong>Document Term Matrix (DTM)</strong></p>
 +
<p>These were the steps involved in identifying representative phrases from the DTM:</p>
 +
<ol>
 +
<li>Stemming and removal of stopwords</li>
 +
<li>Obtain Document Term Matrix for failed and approved projects separately via JMP Pro Text Explorer</li>
 +
<li>Identify representative phrases that occur in the top 20 most frequent phrases for failed projects but do not appear in approved projects</li>
 +
<li>Create dummy variables for these representative phrases</li>
 +
</ol>
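<p>The steps above were carried out in JMP Pro Text Explorer. The sketch below illustrates the same logic in plain Python under several stated assumptions: phrases are taken to be two-word sequences, the stopword list is a small illustrative one, stemming is omitted, and "do not appear in approved projects" is read as "not among the top phrases of approved projects". All function names are hypothetical.</p>

```python
import re
from collections import Counter

# Small illustrative stopword list (assumption; not JMP's list).
STOPWORDS = {"the", "a", "an", "to", "of", "and", "for", "in", "my", "our", "we"}


def bigrams(text):
    """Lowercase the text, drop stopwords and return its two-word phrases."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return list(zip(tokens, tokens[1:]))


def top_phrases(documents, k):
    """Return the k phrases that occur in the most documents."""
    counts = Counter()
    for doc in documents:
        counts.update(set(bigrams(doc)))  # count each phrase once per document
    return [phrase for phrase, _ in counts.most_common(k)]


def representative_phrases(failed_docs, approved_docs, k=20):
    """Top-k phrases of failed projects that are absent from the approved top-k."""
    approved_top = set(top_phrases(approved_docs, k))
    return [p for p in top_phrases(failed_docs, k) if p not in approved_top]


def phrase_dummy(text, phrase):
    """Dummy variable: 1 if the phrase occurs in the text, else 0."""
    return int(phrase in bigrams(text))


failed = ["Need school supplies", "School supplies needed"]
approved = ["Students love reading books", "Reading books daily"]
found = representative_phrases(failed, approved, k=1)
```

<p>Counting each phrase once per document keeps a single long essay from dominating the ranking; counting raw occurrences instead would be an equally reasonable choice.</p>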
 +
<p><strong>Latent Class Analysis Clustering</strong></p>
 +
<p>We also performed Latent Class Analysis Clustering on the text columns. Using Project essay 2 as an example, the clusters were identified and given a label based on the most frequently occurring words in each cluster. The cluster labels for Project essay 2 are as follows:</p>
 +
<table style="margin-left: auto; margin-right: auto;" border="2">
 +
 
 +
<tr>
 +
<td width="71">&nbsp;</td>
 +
<td width="100">
 +
<p>Cluster 1</p>
 +
</td>
 +
<td width="85">
 +
<p>Cluster 2</p>
 +
</td>
 +
<td width="93">
 +
<p>Cluster 3</p>
 +
</td>
 +
<td width="85">
 +
<p>Cluster 4</p>
 +
</td>
 +
<td width="85">
 +
<p>Cluster 5</p>
 +
</td>
 +
<td width="85">
 +
<p>Cluster 6</p>
 +
</td>
 +
<td width="85">
 +
<p>Cluster 7</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="71">
 +
<p>Label</p>
 +
</td>
 +
<td width="90">
 +
<p>Technology access projects</p>
 +
</td>
 +
<td width="75">
 +
<p>Creative science projects</p>
 +
</td>
 +
<td width="83">
 +
<p>Projects requesting supplies</p>
 +
</td>
 +
<td width="75">
 +
<p>Reading projects</p>
 +
</td>
 +
<td width="75">
 +
<p>Seating mobility projects</p>
 +
</td>
 +
<td width="75">
 +
<p>Active play projects</p>
 +
</td>
 +
<td width="75">
 +
<p>Math skill projects</p>
 +
</td>
 +
</tr>
 +
 
 +
</table>
 +
<p>Each project was then assigned to its most likely cluster.</p>
 +
<p><strong>SVD Topic analysis</strong></p>
 +
<p>As Project essay 2 was recognised to be an influential text feature, we decided to perform Singular Value Decomposition (SVD) topic analysis on it. We obtained 10 separate topics, and each project was assigned a score for each topic. This generated 10 additional topic columns.</p>
 +
 
 +
==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica';  border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">LIST OF FEATURES</font></div></div>==
 +
 
 +
<table border="1" width="1020">
 +
<tr>
 +
<td width="54" bgcolor=#A9A9A9>&nbsp;</td>
 +
<td width="276" bgcolor=#A9A9A9>
 +
<p><strong>Original Features</strong></p>
 +
</td>
 +
<td width="84" bgcolor=#A9A9A9>
 +
<p><strong>Classification</strong></p>
 +
</td>
 +
<td width="210" bgcolor=#A9A9A9>
 +
<p><strong>Remarks</strong></p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>1</p>
 +
</td>
 +
<td width="276">
 +
<p>teacher_prefix</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>2</p>
 +
</td>
 +
<td width="276">
 +
<p>school_state</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>3</p>
 +
</td>
 +
<td width="276">
 +
<p>project_grade_category</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>4</p>
 +
</td>
 +
<td width="276">
 +
<p>teacher_number_of_previously_submitted_projects</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54" bgcolor=#A9A9A9>&nbsp;</td>
 +
<td width="276" bgcolor=#A9A9A9>
 +
<p><strong>Engineered Features</strong></p>
 +
</td>
 +
<td width="84" bgcolor=#A9A9A9>
 +
<p><strong>Classification</strong></p>
 +
</td>
 +
<td width="210" bgcolor=#A9A9A9>
 +
<p><strong>Remarks</strong></p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>5</p>
 +
</td>
 +
<td width="276">
 +
<p>gender</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>6</p>
 +
</td>
 +
<td width="276">
 +
<p>region</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>7</p>
 +
</td>
 +
<td width="276">
 +
<p>day of week</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>8</p>
 +
</td>
 +
<td width="276">
 +
<p>month</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>9</p>
 +
</td>
 +
<td width="276">
 +
<p>primary category</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>10</p>
 +
</td>
 +
<td width="276">
 +
<p>secondary category</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>11</p>
 +
</td>
 +
<td width="276">
 +
<p>primary subcategory</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>12</p>
 +
</td>
 +
<td width="276">
 +
<p>secondary subcategory</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>13</p>
 +
</td>
 +
<td width="276">
 +
<p>no. of distinct resources</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>14</p>
 +
</td>
 +
<td width="276">
 +
<p>Sum(price)</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>15</p>
 +
</td>
 +
<td width="276">
 +
<p>Sum(quantity)</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>16</p>
 +
</td>
 +
<td width="276">
 +
<p>Price/Qty</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>17</p>
 +
</td>
 +
<td width="276">
 +
<p>ipad mini</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>18</p>
 +
</td>
 +
<td width="276">
 +
<p>wobble chair</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>19</p>
 +
</td>
 +
<td width="276">
 +
<p>book set</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>20</p>
 +
</td>
 +
<td width="276">
 +
<p>dry erase</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>21</p>
 +
</td>
 +
<td width="276">
 +
<p>10 subscriptions</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>22</p>
 +
</td>
 +
<td width="276">
 +
<p>balance ball</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>23</p>
 +
</td>
 +
<td width="276">
 +
<p>complete set</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>24</p>
 +
</td>
 +
<td width="276">
 +
<p>construction paper</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>25</p>
 +
</td>
 +
<td width="276">
 +
<p>Length[project_title]</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>26</p>
 +
</td>
 +
<td width="276">
 +
<p>Length[project_essay_1]</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>27</p>
 +
</td>
 +
<td width="276">
 +
<p>Length[project_essay_2]</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>28</p>
 +
</td>
 +
<td width="276">
 +
<p>Length[project_resource_summary]</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>29</p>
 +
</td>
 +
<td width="276">
 +
<p>Project Title LCA Most Likely Cluster</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>30</p>
 +
</td>
 +
<td width="276">
 +
<p>Essay 1 LCA Most Likely Cluster</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>31</p>
 +
</td>
 +
<td width="276">
 +
<p>Essay 2 LCA Most Likely Cluster</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>32</p>
 +
</td>
 +
<td width="276">
 +
<p>Project Resource LCA Most Likely Cluster</p>
 +
</td>
 +
<td width="84">
 +
<p>Category</p>
 +
</td>
 +
<td width="210">
 +
<p>&nbsp;</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>33</p>
 +
</td>
 +
<td width="276">
 +
<p>hands on learning</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>Representative phrase in project title for failed projects</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>34</p>
 +
</td>
 +
<td width="276">
 +
<p>school supplies</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>Representative phrase in project title for failed projects</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>35</p>
 +
</td>
 +
<td width="276">
 +
<p>learning environment</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>Representative phrase in project title for failed projects</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>36</p>
 +
</td>
 +
<td width="276">
 +
<p>materials will help</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>Representative phrase in project essay 2 for failed projects</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>37</p>
 +
</td>
 +
<td width="276">
 +
<p>art supplies</p>
 +
</td>
 +
<td width="84">
 +
<p>Binary</p>
 +
</td>
 +
<td width="210">
 +
<p>Representative phrase in project resource summary for failed projects</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>38</p>
 +
</td>
 +
<td width="276">
 +
<p>Topic 1- Flexible seating</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>Project essay 2 topic</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>39</p>
 +
</td>
 +
<td width="276">
 +
<p>Topic 2- Creative art crafts</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>Project essay 2 topic</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>40</p>
 +
</td>
 +
<td width="276">
 +
<p>Topic 3- Healthy lifestyle</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>Project essay 2 topic</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>41</p>
 +
</td>
 +
<td width="276">
 +
<p>Topic 4- Book reading</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>Project essay 2 topic</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>42</p>
 +
</td>
 +
<td width="276">
 +
<p>Topic 5- Literacy in words &amp; math</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>Project essay 2 topic</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>43</p>
 +
</td>
 +
<td width="276">
 +
<p>Topic 6- School supplies</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>Project essay 2 topic</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>44</p>
 +
</td>
 +
<td width="276">
 +
<p>Topic 7- Technology access</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>Project essay 2 topic</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>45</p>
 +
</td>
 +
<td width="276">
 +
<p>Topic 8- Academic development</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>Project essay 2 topic</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>46</p>
 +
</td>
 +
<td width="276">
 +
<p>Topic 9- Learning environment</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>Project essay 2 topic</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td width="54">
 +
<p>47</p>
 +
</td>
 +
<td width="276">
 +
<p>Topic 10- Engineering</p>
 +
</td>
 +
<td width="84">
 +
<p>Continuous</p>
 +
</td>
 +
<td width="210">
 +
<p>Project essay 2 topic</p>
 +
</td>
 +
</tr>
 +
</table>
 +
 
 +
==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica';  border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">MODEL SELECTION</font></div></div>==
 +
<p>
 +
For our prediction problem, we chose to use tree-based models. Given Sharma’s reported experience with ensemble tree-based models compared against direct discriminative models such as logistic regression, we decided to explore this approach with some added diversity. A random forest, in which each tree is grown on a bootstrapped random sample, should allow us to build an improved model, as aggregating over many trees refines the ranking of determinant variables.
 +
</p>
 +
<p>
 +
In addition, we will utilize JMP Pro’s boosted trees. Boosting is based on weak learners, i.e. shallow trees rather than the fully grown ones used in random forests, with each new tree fitted to the errors of the trees before it. In this way, we attempt to reduce bias (underfitting), as a counter-model to the random forest approach, which instead averages many trees mainly to reduce variance (overfitting).
 +
</p>
 +
<p>
 +
The sample was split into a 70% training set, a 20% validation set and a 10% test set, so as to keep as many failed projects as possible in the training set. Within the training, validation and test sets, the distribution of approved and failed projects follows our population distribution, as the validation column was generated with a random formula in JMP Pro.
 +
</p>
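<p>The validation column was generated with a random formula in JMP Pro. A minimal Python sketch of an equivalent random 70/20/10 assignment is given below; the seed value and the function name <code>assign_partition</code> are arbitrary choices made for illustration.</p>

```python
import random


def assign_partition(rng):
    """Randomly assign one record to Training (70%), Validation (20%) or Test (10%)."""
    draw = rng.random()
    if draw < 0.7:
        return "Training"
    if draw < 0.9:
        return "Validation"
    return "Test"


rng = random.Random(482)  # arbitrary seed, used only for reproducibility
partitions = [assign_partition(rng) for _ in range(100_000)]
shares = {
    name: partitions.count(name) / len(partitions)
    for name in ("Training", "Validation", "Test")
}
```

<p>Because the assignment ignores the response variable, each partition is expected to inherit roughly the population share of failed projects, though small deviations between partitions are normal. A stratified split would enforce the shares exactly, at the cost of a slightly more involved procedure.</p>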
 +
<p><font size="1">Reference: <i>Sharma, D. (n.d.). Retrieved from University of Edinburgh Business School: https://www.business-school.ed.ac.uk/crc/wp-content/uploads/sites/55/2017/02/Improving-Credit-Scoring-with-Random-Forests-Dhruv-Sharma.pdf</i></font></p>
 +
 
 +
==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica';  border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">MODEL INTERPRETATION</font></div></div>==
 +
<p>
 +
We evaluate each model on its ability to predict failed projects. An ideal model should identify a high share of failed projects while keeping the misclassification rate low. To compare the models, we use precision (of the projects predicted to fail, how many actually failed?) and recall (of the projects that actually failed, how many were predicted to fail?).
 +
</p>
 +
<b><span style="color:#800000">Bootstrap Forest</span></b>
 +
 
 +
<table style="width: 753px; margin-left: auto; margin-right: auto;">
 +
<tr style="height: 137px;">
 +
<td style="height: 137px; width: 243px;">
 +
<p style="text-align: center;">Training</p>
 +
<table style="width: 243px;" border="1">
 +
<tr style="height: 35px;">
 +
<td style="height: 35px; width: 128px; text-align: center;">
 +
<p><strong>Actual</strong></p>
 +
</td>
 +
<td style="height: 35px; width: 123px;" colspan="2">
 +
<p style="text-align: center;"><strong>Predicted Count</strong></p>
 +
</td>
 +
</tr>
 +
<tr style="height: 35px;">
 +
<td style="height: 35px; width: 128px; text-align: center;">
 +
<p><strong>project_is_approved</strong></p>
 +
</td>
 +
<td style="height: 35px; width: 79px; text-align: center;">
 +
<p><strong>0</strong></p>
 +
</td>
 +
<td style="height: 35px; width: 44px; text-align: center;">
 +
<p><strong>1</strong></p>
 +
</td>
 +
</tr>
 +
<tr style="height: 24px;">
 +
<td style="height: 24px; width: 128px; text-align: center;">
 +
<p>0</p>
 +
</td>
 +
<td style="height: 24px; width: 79px; text-align: center;">
 +
<p>566</p>
 +
</td>
 +
<td style="height: 24px; width: 44px; text-align: center;">
 +
<p>18179</p>
 +
</td>
 +
</tr>
 +
<tr style="height: 35px;">
 +
<td style="height: 35px; width: 128px; text-align: center;">
 +
<p>1</p>
 +
</td>
 +
<td style="height: 35px; width: 79px; text-align: center;">
 +
<p>169</p>
 +
</td>
 +
<td style="height: 35px; width: 44px; text-align: center;">
 +
<p>103884</p>
 +
</td>
 +
</tr>
 +
</table>
 +
</td>
 +
<td style="height: 137px; width: 244px;">
 +
<p style="text-align: center;">Validation</p>
 +
<table style="width: 244px;" border="1">
 +
<tr style="height: 0px;">
 +
<td style="height: 0px; width: 128px; text-align: center;">
 +
<p><strong>Actual</strong></p>
 +
</td>
 +
<td style="height: 0px; width: 100px;" colspan="2">
 +
<p style="text-align: center;"><strong>Predicted Count</strong></p>
 +
</td>
 +
</tr>
 +
<tr style="height: 35px;">
 +
<td style="height: 35px; width: 128px; text-align: center;">
 +
<p><strong>project_is_approved</strong></p>
 +
</td>
 +
<td style="height: 35px; width: 48px; text-align: center;">
 +
<p><strong>0</strong></p>
 +
</td>
 +
<td style="height: 35px; width: 52px; text-align: center;">
 +
<p><strong>1</strong></p>
 +
</td>
 +
</tr>
 +
<tr style="height: 35px;">
 +
<td style="height: 35px; width: 128px; text-align: center;">
 +
<p>0</p>
 +
</td>
 +
<td style="height: 35px; width: 48px; text-align: center;">
 +
<p>122</p>
 +
</td>
 +
<td style="height: 35px; width: 52px; text-align: center;">
 +
<p>5300</p>
 +
</td>
 +
</tr>
 +
<tr style="height: 35px;">
 +
<td style="height: 35px; width: 128px; text-align: center;">
 +
<p>1</p>
 +
</td>
 +
<td style="height: 35px; width: 48px; text-align: center;">
 +
<p>76</p>
 +
</td>
 +
<td style="height: 35px; width: 52px; text-align: center;">
 +
<p>29875</p>
 +
</td>
 +
</tr>
 +
</table>
 +
</td>
 +
<td style="height: 137px; width: 246px;">
 +
<p style="text-align: center;">Test</p>
 +
<table style="width: 245px;" border="1">
 +
<tr>
 +
<td style="width: 128px; text-align: center;">
 +
<p><strong>Actual</strong></p>
 +
</td>
 +
<td style="width: 111px; text-align: center;" colspan="2">
 +
<p><strong>Predicted Count</strong></p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px; text-align: center;">
 +
<p><strong>project_is_approved</strong></p>
 +
</td>
 +
<td style="width: 57px; text-align: center;">
 +
<p style="text-align: center;"><strong>0</strong></p>
 +
</td>
 +
<td style="width: 54px; text-align: center;">
 +
<p style="text-align: center;"><strong>1</strong></p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px; text-align: center;">
 +
<p>0</p>
 +
</td>
 +
<td style="width: 57px; text-align: center;">
 +
<p>66</p>
 +
</td>
 +
<td style="width: 54px; text-align: center;">
 +
<p>2677</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px; text-align: center;">
 +
<p>1</p>
 +
</td>
 +
<td style="width: 57px; text-align: center;">
 +
<p>46</p>
 +
</td>
 +
<td style="width: 54px; text-align: center;">
 +
<p>14746</p>
 +
</td>
 +
</tr>
 +
</table>
 +
</td>
 +
</tr>
 +
</table>
 +
<p>&nbsp;</p>
 +
<p style="text-align: center;"> Precision and recall rates </p>
 +
<table style="width: 252px; margin-left: auto; margin-right: auto;" border="1">
 +
<tr>
 +
<td style="width: 155px;">
 +
<p>&nbsp;</p>
 +
</td>
 +
<td style="width: 67px;">
 +
<p>Training</p>
 +
</td>
 +
<td style="width: 60px;">
 +
<p>Validation</p>
 +
</td>
 +
<td style="width: 10px;">
 +
<p>Test</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 155px;">
 +
<p>Precision</p>
 +
</td>
 +
<td style="width: 67px;">
 +
<p>0.770</p>
 +
</td>
 +
<td style="width: 60px;">
 +
<p>0.616</p>
 +
</td>
 +
<td style="width: 10px;">
 +
<p>0.589</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 155px;">
 +
<p>Recall</p>
 +
</td>
 +
<td style="width: 67px;">
 +
<p>0.030</p>
 +
</td>
 +
<td style="width: 60px;">
 +
<p>0.023</p>
 +
</td>
 +
<td style="width: 10px;">
 +
<p>0.024</p>
 +
</td>
 +
</tr>
 +
</table>
 +
<p>&nbsp;</p>
 +
<p>The bootstrap forest classifier has very low recall (0.02 to 0.03) but moderate to high precision (0.59 to 0.77): it is very selective in rejecting a project, and consequently fails to flag most of the projects that should have been rejected.</p>
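<p>As a worked check, the sketch below recomputes precision and recall for the failed class (project_is_approved = 0) from the bootstrap forest confusion matrices reported in this section. The function and variable names are illustrative only.</p>

```python
def precision_recall_for_failed(failed_as_failed, failed_as_approved, approved_as_failed):
    """Precision and recall with 'failed' (project_is_approved = 0) as the positive class."""
    precision = failed_as_failed / (failed_as_failed + approved_as_failed)
    recall = failed_as_failed / (failed_as_failed + failed_as_approved)
    return precision, recall


# (actual 0 predicted 0, actual 0 predicted 1, actual 1 predicted 0)
BOOTSTRAP_FOREST = {
    "Training": (566, 18179, 169),
    "Validation": (122, 5300, 76),
    "Test": (66, 2677, 46),
}

rates = {
    name: tuple(round(value, 3) for value in precision_recall_for_failed(*counts))
    for name, counts in BOOTSTRAP_FOREST.items()
}
```

<p>Rounded to three decimals, these values agree with the precision and recall rates tabulated for the bootstrap forest.</p>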
 +
 
 +
<!--END OF TABLES-->
 +
<b><span style="color:#800000">Boosted Trees</span></b>
 +
 
 +
<table style="width: 752px; margin-left: auto; margin-right: auto">
 +
<tr>
 +
<td style="text-align: center; width: 244px;">
 +
<p>Training</p>
 +
<table style="width: 245px;" border="1">
 +
<tr>
 +
<td style="width: 128px;">
 +
<p><strong>Actual</strong></p>
 +
</td>
 +
<td style="width: 111px;" colspan="2">
 +
<p><strong>Predicted Count</strong></p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px;">
 +
<p><strong>project_is_approved</strong></p>
 +
</td>
 +
<td style="width: 57px;">
 +
<p><strong>0</strong></p>
 +
</td>
 +
<td style="width: 54px;">
 +
<p><strong>1</strong></p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px;">
 +
<p>0</p>
 +
</td>
 +
<td style="width: 57px;">
 +
<p>437</p>
 +
</td>
 +
<td style="width: 54px;">
 +
<p>18308</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px;">
 +
<p>1</p>
 +
</td>
 +
<td style="width: 57px;">
 +
<p>208</p>
 +
</td>
 +
<td style="width: 54px;">
 +
<p>103845</p>
 +
</td>
 +
</tr>
 +
</table>
 +
</td>
 +
<td style="text-align: center; width: 245px;">
 +
<p>Validation</p>
 +
<table style="width: 245px;" border="1">
 +
<tr>
 +
<td style="width: 128px;">
 +
<p><strong>Actual</strong></p>
 +
</td>
 +
<td style="width: 99px;" colspan="2">
 +
<p><strong>Predicted Count</strong></p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px;">
 +
<p><strong>project_is_approved</strong></p>
 +
</td>
 +
<td style="width: 47px;">
 +
<p><strong>0</strong></p>
 +
</td>
 +
<td style="width: 52px;">
 +
<p><strong>1</strong></p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px;">
 +
<p>0</p>
 +
</td>
 +
<td style="width: 47px;">
 +
<p>128</p>
 +
</td>
 +
<td style="width: 52px;">
 +
<p>5294</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px;">
 +
<p>1</p>
 +
</td>
 +
<td style="width: 47px;">
 +
<p>77</p>
 +
</td>
 +
<td style="width: 52px;">
 +
<p>29874</p>
 +
</td>
 +
</tr>
 +
</table>
 +
</td>
 +
 
 +
<td style="text-align: center; width: 245px;">
 +
<p>Test</p>
 +
<table style="width: 243px;" border="1">
 +
<tr>
 +
<td style="width: 128px;">
 +
<p><strong>Actual</strong></p>
 +
</td>
 +
<td style="width: 97px;" colspan="2">
 +
<p><strong>Predicted Count</strong></p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px;">
 +
<p><strong>project_is_approved</strong></p>
 +
</td>
 +
<td style="width: 47px;">
 +
<p><strong>0</strong></p>
 +
</td>
 +
<td style="width: 50px;">
 +
<p><strong>1</strong></p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px;">
 +
<p>0</p>
 +
</td>
 +
<td style="width: 47px;">
 +
<p>68</p>
 +
</td>
 +
<td style="width: 50px;">
 +
<p>2675</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 128px;">
 +
<p>1</p>
 +
</td>
 +
<td style="width: 47px;">
 +
<p>41</p>
 +
</td>
 +
<td style="width: 50px;">
 +
<p>14751</p>
 +
</td>
 +
</tr>
 +
</table>
 +
</td>
 +
</tr>
 +
</table>
 +
<p>&nbsp;</p>
 +
<p style="text-align: center;">Precision and Recall rates</p>
 +
<table style="width: 252px; margin-left: auto; margin-right: auto;" border="1">
 +
<tr>
 +
<td style="width: 155px;">
 +
</td>
 +
<td style="width: 67px;">
 +
<p>Training</p>
 +
</td>
 +
<td style="width: 60px;">
 +
<p>Validation</p>
 +
</td>
 +
<td style="width: 10px;">
 +
<p>Test</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 155px;">
 +
<p>Precision</p>
 +
</td>
 +
<td style="width: 67px;">
 +
<p>0.678</p>
 +
</td>
 +
<td style="width: 60px;">
 +
<p>0.624</p>
 +
</td>
 +
<td style="width: 10px;">
 +
<p>0.624</p>
 +
</td>
 +
</tr>
 +
<tr>
 +
<td style="width: 155px;">
 +
<p>Recall</p>
 +
</td>
 +
<td style="width: 67px;">
 +
<p>0.023</p>
 +
</td>
 +
<td style="width: 60px;">
 +
<p>0.024</p>
 +
</td>
 +
<td style="width: 10px;">
 +
<p>0.024</p>
 +
</td>
 +
</tr>
 +
</table>
 +
<p>&nbsp;</p>
 +
<p>The boosted trees classifier also has very low recall (about 0.02) but moderate precision (0.62 to 0.68): it is very selective in predicting a rejection, so most of its rejection predictions are correct, but it consequently fails to flag the vast majority of projects that were actually rejected.</p>
  
</div>
+
==<div style="background: #800000; line-height: 0.5em; font-family:'Helvetica';  border-left: #FFB6C1 solid 15px;"><div style="border-left: #F2F1EF solid 5px; padding:15px;font-size:15px;"><font color= "#F2F1EF">CONCLUSION & RECOMMENDATION</font></div></div>==
 +
<p>
 +
Despite the focus on predicting project rejections instead of approvals, we find that the features available have limited direct predictive power. There may be a need to further engineer the available features to expose information that could be useful, or there may be hidden variables affecting project approval.
 +
</p>
 +
<p>
 +
To improve our model beyond what we have achieved, more nuanced text analysis, such as sentiment analysis, seems necessary in addition to the standard text processing methods we attempted. If we can do this and also incorporate correlations between other parameters, such as resources and teachers’ experience, it might be possible to achieve better predictive performance.
 +
</p>

Latest revision as of 16:25, 14 April 2018




DATA PREPARATION

Data cleaning

Of the 182,080 entries in train.csv, 3 columns had missing data: teacher_prefix had 4 missing values, while project_essay_3 and project_essay_4 had 175,706 missing entries. For teacher_prefix, the empty entries were replaced with Unknown. For the missing project essays, we understood from the project brief that projects submitted after May 17, 2016 were only required to include project_essay_1 and project_essay_2. The change in submission format also meant that projects submitted before and after May 17 could not be compared on the same basis. Noting that 175,706 entries were submitted after the change and only 6,374 before it, we removed the 6,374 earlier entries so that our analysis would be uniform. We were left with 175,706 complete project entries.
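
A minimal sketch of this cleaning rule, using only the Python standard library. The column name project_submitted_datetime, the timestamp format, and the choice to keep submissions made on May 17 itself are assumptions for illustration; the original work was carried out in JMP Pro.

```python
from datetime import datetime

# Assumed date of the submission-format change (boundary handling is an assumption).
CUTOFF = datetime(2016, 5, 17)

def clean_rows(rows):
    """Drop projects submitted before the cutoff and fill missing teacher_prefix."""
    cleaned = []
    for row in rows:
        # Assumed column name and timestamp format.
        submitted = datetime.strptime(row["project_submitted_datetime"], "%Y-%m-%d %H:%M:%S")
        if submitted < CUTOFF:
            continue  # old four-essay format, not comparable with later entries
        fixed = dict(row)
        if not fixed.get("teacher_prefix"):
            fixed["teacher_prefix"] = "Unknown"
        cleaned.append(fixed)
    return cleaned
```

Filtering on the date rather than on the presence of project_essay_3 is a design choice here; either rule would be expected to select the same entries if the format change was applied cleanly.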

Identification of Response Variable

In our dataset, the response variable is whether the project is approved. This is represented in the column project_is_approved, which holds a value of either 0 (not approved) or 1 (approved). We changed the column to a nominal data type. A simple distribution of the response variable shows that 84.69% of projects were approved and 15.31% failed. Because failed projects are the minority class, our investigative efforts focus on why certain projects fail and on engineering features that are representative of failed projects.


FEATURE ENGINEERING

In our dataset, we were provided with features on the project submission, including information on the school, the teacher, the project and the resources requested for the project. There is a mix of categorical, continuous, time series and text columns.

Date

The date of project submission was recorded in y/m/d h:m:s format. To extract more information, we generated 2 additional columns: the month of project submission and the day of week of project submission.
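
As an illustration of this step, the sketch below derives both columns from a timestamp string. The exact format string is an assumption, and the function name is hypothetical.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def date_features(timestamp):
    """Return the month (1-12) and day-of-week name for a submission timestamp."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")  # assumed format
    return {"month": parsed.month, "day_of_week": DAY_NAMES[parsed.weekday()]}
```

Both derived values would be treated as categories rather than numbers, in line with how the feature list classifies them.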

Region

We were provided with the state from which the project was submitted. In addition, we recoded each state into one of 4 US regions (Northeast, Midwest, West and South) based on its location.
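
One way such a recoding could look is sketched below. It assumes the four regions follow the US Census Bureau grouping and that states are stored as two-letter codes; the report does not state which grouping was actually used.

```python
# Assumption: US Census Bureau four-region grouping, two-letter state codes.
REGIONS = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE FL GA MD NC SC VA DC WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}

# Invert the grouping into a state -> region lookup table.
STATE_TO_REGION = {
    state: region for region, states in REGIONS.items() for state in states.split()
}

def region_of(state_code):
    """Map a state code to its region; unrecognised codes fall back to 'Unknown'."""
    return STATE_TO_REGION.get(state_code.strip().upper(), "Unknown")
```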

Teacher Characteristics

We had information on the teacher's title, which could be either Mr., Ms., Mrs., Dr. or Teacher. We decided to recode these titles into gender as well, with Dr. and Teacher falling into the Unknown category.
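
The recoding amounts to a small lookup table, sketched here. The labels "Male" and "Female" are assumed names for the gender categories; only the Unknown category is named in the report.

```python
# Assumed category labels; Dr., Teacher and any other value map to "Unknown".
PREFIX_TO_GENDER = {"Mr.": "Male", "Ms.": "Female", "Mrs.": "Female"}

def gender_of(prefix):
    """Recode a teacher title into a gender category."""
    return PREFIX_TO_GENDER.get(prefix, "Unknown")
```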

Project category & subcategory

Project categories are categorical variables. We noted that the entries contained 2 categories, separated by a comma. For example, we had project categories under "Health & Sports, Language & Literacy". As the ordering of these categories may contain meaning, we decided to classify the first mentioned category as the primary category and the second one as the secondary category. We managed to do this using the ‘text to columns’ function on JMP to obtain the primary and secondary project categories as follows:

Applied Learning

Health & Sports

History & Civics

Language & Literacy

Math & Science

Music & The Arts

Special Needs

Warmth, Care & Hunger

We performed the same transformation on Project Subcategory to obtain 27 subcategories
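
The split into primary and secondary category can be sketched as follows. One detail worth noting is that the label "Warmth, Care & Hunger" itself contains a comma, so a naive comma split would break it apart; the placeholder trick below is one hypothetical way to guard against that, not necessarily how it was handled in JMP.

```python
COMMA_LABEL = "Warmth, Care & Hunger"  # the one category label that contains a comma
PLACEHOLDER = "\x00"                   # temporary stand-in, assumed absent from the data

def split_categories(value):
    """Return (primary, secondary); secondary is None when only one category is listed."""
    protected = value.replace(COMMA_LABEL, PLACEHOLDER)
    parts = [p.strip().replace(PLACEHOLDER, COMMA_LABEL) for p in protected.split(",")]
    primary = parts[0]
    secondary = parts[1] if len(parts) > 1 else None
    return primary, secondary
```

For example, "Health & Sports, Language & Literacy" yields "Health & Sports" as the primary category and "Language & Literacy" as the secondary one.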

Resources

In our resources csv file, we noted that each project submission, tagged by the project ID, can have multiple resources requested. Hence we decided to investigate 4 features: the total price of the resources requested, the total quantity, the average price per quantity and the no. of distinct items requested. These 4 features were then joined to our main csv file via the project ID.
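
A minimal sketch of this per-project aggregation is given below. The field names (id, description, price, quantity) are assumptions, and the definitions follow the feature names used later in the report: Sum(price), Sum(quantity) and their ratio.

```python
from collections import defaultdict

def aggregate_resources(resource_rows):
    """Compute the four resource features for each project ID."""
    grouped = defaultdict(list)
    for row in resource_rows:
        grouped[row["id"]].append(row)  # assumed key name for the project ID

    features = {}
    for project_id, rows in grouped.items():
        sum_price = sum(r["price"] for r in rows)
        sum_quantity = sum(r["quantity"] for r in rows)
        features[project_id] = {
            "sum_price": sum_price,
            "sum_quantity": sum_quantity,
            # Assumed definition: total price divided by total quantity.
            "price_per_qty": sum_price / sum_quantity if sum_quantity else 0.0,
            "distinct_items": len({r["description"] for r in rows}),
        }
    return features
```

The resulting dictionary is keyed by project ID, so joining it to the main table is a simple lookup per project.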

In addition, the description column in the resources csv file contained text on the type of resource requested. This included specific information on the item name, brand and in certain cases the model as well. The JMP Text Explorer function was performed on the column, giving us the top 8 commonly requested items. Dummy variables for these 8 items were created as well. The 8 items and their frequency count are as below:

 

Item                Count
wobble chair        7471
ipad mini           7318
dry erase           5820
balance ball        5727
complete set        5357
10 subscriptions    3721
book set            2501
construction paper  1955

Text Features

We had a total of 4 text columns: the Project Title, Project essay 1, Project essay 2 and the Project Resource Summary, all of which require teachers to provide input. Project Essay 1 requires teachers to describe the current state of their students and the school, while Project Essay 2 requires teachers to detail how the resources requested will benefit their students. Our hypothesis is that Project Essay 2 will provide more representative features, as its requirements are more specific to the project.

No. of characters

We obtained the no. of characters for text data to observe if length of titles and essays would affect approval rate.

Document Term Matrix (DTM)

These were the steps involved in identifying representative phrases from the DTM

  1. Stemming and removal of stopwords
  2. Obtain Document Term Matrix for failed and approved projects separately via JMP Pro Text Explorer
  3. Identify representative phrases that occur in the top 20 most frequent phrases for failed projects but do not appear in approved projects
  4. Create dummy variables for these representative phrases
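
The steps above can be sketched in plain Python as follows. This is an illustration only: stemming is omitted, the stopword list is a tiny placeholder, phrases are limited to two-word sequences, and "do not appear in approved projects" is interpreted as "not among the top phrases for approved projects", which is an assumption.

```python
import re
from collections import Counter

# Placeholder stopword list; a real run would use a fuller list plus stemming.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "for", "in", "my", "our", "we"}

def phrases(text):
    """Lowercase, drop stopwords, and return adjacent word pairs."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return list(zip(tokens, tokens[1:]))

def top_phrases(texts, n):
    """Rank phrases by the number of documents they occur in."""
    counts = Counter()
    for text in texts:
        counts.update(set(phrases(text)))
    return [phrase for phrase, _ in counts.most_common(n)]

def representative_phrases(failed_texts, approved_texts, n=20):
    """Top-n phrases of failed projects that are not top-n phrases of approved ones."""
    approved_top = set(top_phrases(approved_texts, n))
    return [p for p in top_phrases(failed_texts, n) if p not in approved_top]

def phrase_dummy(text, phrase):
    """1 if the phrase occurs in the text, else 0."""
    return int(phrase in phrases(text))
```

Each representative phrase then becomes one binary column by applying phrase_dummy to every project's text.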

Latent Class Analysis Clustering

We also performed Latent Class Analysis Clustering on the text columns. Using Project essay 2 as an example, the clusters were identified and given a label based on the most frequently occurring words in each cluster. The cluster labels for project essay 2 are as follows:

 

Cluster    Label
Cluster 1  Technology access projects
Cluster 2  Creative science projects
Cluster 3  Projects requesting for supplies
Cluster 4  Reading projects
Cluster 5  Seating mobility projects
Cluster 6  Active play projects
Cluster 7  Math skill projects

Each project was then assigned to its most likely cluster.

SVD Topic analysis

As Project essay 2 was recognised to be an influential text feature, we decided to perform Singular Value Decomposition (SVD) topic analysis on Project essay 2. We obtained 10 separate topics, and each project was assigned a score for each topic. This generated 10 additional topic columns.
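
For reference, the textbook form of a truncated SVD of a document-term matrix is given below. The symbols are illustrative and not taken from the report, and JMP Pro's topic analysis may additionally rotate the decomposition, so this should be read as a sketch of the idea rather than the exact procedure used.

```latex
X \approx U_k \Sigma_k V_k^{\top},
\qquad X \in \mathbb{R}^{n \times m},\;
U_k \in \mathbb{R}^{n \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{m \times k}
```

Here $n$ is the number of essays, $m$ the number of terms and $k = 10$ the number of topics. Under this formulation, row $i$ of $U_k \Sigma_k$ holds the 10 topic scores of essay $i$, and each column of $V_k$ holds the term loadings that are used to name a topic.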

LIST OF FEATURES

 

No.  Original Features                                Classification  Remarks
1    teacher_prefix                                   Category
2    school_state                                     Category
3    project_grade_category                           Category
4    teacher_number_of_previously_submitted_projects  Continuous

No.  Engineered Features                              Classification  Remarks
5    gender                                           Category
6    region                                           Category
7    day of week                                      Category
8    month                                            Category
9    primary category                                 Category
10   secondary category                               Category
11   primary subcategory                              Category
12   secondary subcategory                            Category
13   no. of distinct resources                        Continuous
14   Sum(price)                                       Continuous
15   Sum(quantity)                                    Continuous
16   Price/Qty                                        Continuous
17   ipad mini                                        Binary
18   wobble chair                                     Binary
19   book set                                         Binary
20   dry erase                                        Binary
21   10 subscriptions                                 Binary
22   balance ball                                     Binary
23   complete set                                     Binary
24   construction paper                               Binary
25   Length[project_title]                            Continuous
26   Length[project_essay_1]                          Continuous
27   Length[project_essay_2]                          Continuous
28   Length[project_resource_summary]                 Continuous
29   Project Title LCA Most Likely Cluster            Category
30   Essay 1 LCA Most Likely Cluster                  Category
31   Essay 2 LCA Most Likely Cluster                  Category
32   Project Resource LCA Most Likely Cluster         Category
33   hands on learning                                Binary          Representative phrase in project title for failed projects
34   school supplies                                  Binary          Representative phrase in project title for failed projects
35   learning environment                             Binary          Representative phrase in project title for failed projects
36   materials will help                              Binary          Representative phrase in project essay 2 for failed projects
37   art supplies                                     Binary          Representative phrase in project resource summary for failed projects
38   Topic 1- Flexible seating                        Continuous      Project essay 2 topic
39   Topic 2- Creative art crafts                     Continuous      Project essay 2 topic
40   Topic 3- Healthy lifestyle                       Continuous      Project essay 2 topic
41   Topic 4- Book reading                            Continuous      Project essay 2 topic
42   Topic 5- Literacy in words & math                Continuous      Project essay 2 topic
43   Topic 6- School supplies                         Continuous      Project essay 2 topic
44   Topic 7- Technology access                       Continuous      Project essay 2 topic
45   Topic 8- Academic development                    Continuous      Project essay 2 topic
46   Topic 9- Learning environment                    Continuous      Project essay 2 topic
47   Topic 10- Engineering                            Continuous      Project essay 2 topic

MODEL SELECTION

For our prediction problem, we would like to utilise tree-based models. Given Sharma’s experience with ensemble tree-based models against direct discriminative models such as logistic regression, we decided to explore this approach with some added diversity. A random forest bootstrapped with random samples will allow us to build an improved model thanks to multiple iterations refining the ranking of determinant variables.

In addition, we will utilize JMP Pro’s boosted trees. Boosting is based on weak learners, i.e. shallow trees rather than the fully grown ones used in random forests, fitted sequentially so that each tree corrects the errors of the previous ones. In this way, we attempt to reduce bias (underfitting), as a counterpart to the random forest approach, which averages many fully grown trees primarily to reduce variance (overfitting).

The sample was split into a 70% training set, a 20% validation set and a 10% test set in order to maximize the number of failed projects in our training set. Within the training, validation and test sets, the distribution of successful and failed projects approximately follows our population distribution, as the validation column was generated with a random formula on JMP Pro.
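
A random validation column of this kind can be sketched as below. The function name, the seed and the labels are hypothetical, and JMP Pro's own random formula may differ in detail.

```python
import random

def assign_split(n_rows, seed=0):
    """Assign each row to training (70%), validation (20%) or test (10%) at random."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    labels = []
    for _ in range(n_rows):
        draw = rng.random()
        if draw < 0.7:
            labels.append("training")
        elif draw < 0.9:
            labels.append("validation")
        else:
            labels.append("test")
    return labels
```

Because each row is assigned independently of its outcome, the share of failed projects in each set is expected to be close to the overall 15.31%, but it is not guaranteed to match exactly as it would under a stratified split. The set sizes implied by the confusion matrices reported in this section (122,798, 35,373 and 17,535 projects) are consistent with a draw of this kind.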

Reference: Sharma, D. (n.d.). Retrieved from University of Edinburgh Business School: https://www.business-school.ed.ac.uk/crc/wp-content/uploads/sites/55/2017/02/Improving-Credit-Scoring-with-Random-Forests-Dhruv-Sharma.pdf

MODEL INTERPRETATION

We evaluate each model on its ability to predict failed projects. An ideal model should identify a high share of failed projects with a low misclassification rate. We compare the models using the precision rate (of the projects predicted to fail, how many actually failed?) and the recall rate (of the projects that actually failed, how many were predicted to fail?).
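
Both rates follow directly from the confusion matrix counts, treating a failed project (project_is_approved = 0) as the positive class. The sketch below uses the Bootstrap Forest test-set counts reported in this section; the function and argument names are illustrative.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall for the 'failed project' class."""
    predicted_failed = true_pos + false_pos
    actual_failed = true_pos + false_neg
    precision = true_pos / predicted_failed if predicted_failed else 0.0
    recall = true_pos / actual_failed if actual_failed else 0.0
    return precision, recall

# Bootstrap Forest, test set:
#   66   failed projects predicted as failed     (true positives)
#   46   approved projects predicted as failed   (false positives)
#   2677 failed projects predicted as approved   (false negatives)
precision, recall = precision_recall(true_pos=66, false_pos=46, false_neg=2677)
print(round(precision, 3), round(recall, 3))  # 0.589 0.024
```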

Bootstrap Forest

Confusion matrices (rows: actual project_is_approved; columns: predicted count)

Set         Actual  Predicted 0  Predicted 1
Training    0       566          18179
Training    1       169          103884
Validation  0       122          5300
Validation  1       76           29875
Test        0       66           2677
Test        1       46           14746

Precision and recall rates

           Training  Validation  Test
Precision  0.770     0.616       0.589
Recall     0.030     0.023       0.024

 

The bootstrap forest classifier has very low recall (0.02 to 0.03) but moderate to high precision (0.59 to 0.77): it is very selective in predicting a rejection, so most of its rejection predictions are correct, but it consequently fails to flag the vast majority of projects that were actually rejected.

Boosted Trees

Confusion matrices (rows: actual project_is_approved; columns: predicted count)

Set         Actual  Predicted 0  Predicted 1
Training    0       437          18308
Training    1       208          103845
Validation  0       128          5294
Validation  1       77           29874
Test        0       68           2675
Test        1       41           14751

Precision and recall rates

           Training  Validation  Test
Precision  0.678     0.624       0.624
Recall     0.023     0.024       0.024

 

The boosted trees classifier also has very low recall (about 0.02) but moderate precision (0.62 to 0.68): it is very selective in predicting a rejection, so most of its rejection predictions are correct, but it consequently fails to flag the vast majority of projects that were actually rejected.

CONCLUSION & RECOMMENDATION

Despite the focus on predicting project rejections instead of approvals, we find that the features available have limited direct predictive power. There may be a need to further engineer the available features to expose information that could be useful, or there may be hidden variables affecting project approval.

To improve our model beyond what we have achieved, more nuanced text analysis, such as sentiment analysis, seems necessary in addition to the standard text processing methods we attempted. If we can do this and also incorporate correlations between other parameters, such as resources and teachers’ experience, it might be possible to achieve better predictive performance.