Difference between revisions of "APA Feature Engineering"

From Analytics Practicum

Revision as of 16:33, 17 February 2017



Subject Line weightage: We will use subject line weightage as one of the components in determining how important and relevant a single email exchange is to the business. Our approach is as follows:

  1. First, run an analysis of all the terms occurring in the entire dataset.
  • This analysis will filter out common words, prepositions and other unimportant words that could skew the results.
  • The analysis will return a list of words along with the frequency of each term's occurrence in the dataset.
  2. Based on the results obtained, calculate the tf-idf of each term.
  • tf (term frequency): how often the term occurs in a document
  • idf (inverse document frequency): how rare the term is across all documents; the fewer documents contain the term, the higher its idf
  • tf-idf, the product of the two, will allow us to find the most important terms in the set of documents
  3. Using the tf-idf value, assign each term a weightage reflecting how strongly it signals importance to the business.
  4. Each email's subject line will then receive an aggregated weightage computed from the terms appearing in it.
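
The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the stop-word list, the function names, the use of the raw tf-idf value as each term's weightage, and summation as the aggregation rule are all hypothetical choices made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real filter would be much larger.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def tokenize(subject):
    """Lowercase a subject line, split it into terms, and drop stop words."""
    terms = re.findall(r"[a-z0-9']+", subject.lower())
    return [t for t in terms if t not in STOP_WORDS]


def term_frequencies(subjects):
    """Step 1: count how often each remaining term occurs in the dataset."""
    return Counter(t for s in subjects for t in tokenize(s))


def inverse_document_frequencies(subjects):
    """Step 2 (idf): log(N / df), where df is the number of subject lines
    that contain the term. Rarer terms get a higher value."""
    n = len(subjects)
    df = Counter(t for s in subjects for t in set(tokenize(s)))
    return {term: math.log(n / count) for term, count in df.items()}


def subject_weightage(subject, idf_table):
    """Steps 2-4: compute tf-idf per term and sum it over the subject line.

    Assumption: the tf-idf value itself serves as the term's weightage,
    and the aggregate is a plain sum. Unknown terms contribute 0.
    """
    tokens = tokenize(subject)
    if not tokens:
        return 0.0
    tf = Counter(tokens)
    return sum(
        (count / len(tokens)) * idf_table.get(term, 0.0)
        for term, count in tf.items()
    )
```

A typical use would be to build the idf table once from all subject lines, then call `subject_weightage(subject, idf_table)` for each email. If the business later supplies its own per-term weights, only the last function would need to change.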


isCC? isForward? isReply?:

  • isCC?: Indicates whether the recipient was CC-ed on the email
  • isForward?: Indicates whether the email was forwarded to the recipient
  • isReply?: Indicates whether the email was a reply
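
As a rough sketch, the three flags could be derived as shown below. The page does not say how forwards and replies are detected, so the subject-prefix heuristic ("Fw:", "Fwd:", "Re:") is purely an assumption, as are the `Email` structure and the function names.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    """Hypothetical minimal email record used only for this example."""
    subject: str
    to: list = field(default_factory=list)
    cc: list = field(default_factory=list)


def is_cc(email, recipient):
    """isCC?: True if the recipient appears in the CC list (case-insensitive)."""
    target = recipient.strip().lower()
    return target in {addr.strip().lower() for addr in email.cc}


def is_forward(email):
    """isForward?: assumed heuristic, the subject starts with 'Fw:' or 'Fwd:'."""
    return email.subject.strip().lower().startswith(("fw:", "fwd:"))


def is_reply(email):
    """isReply?: assumed heuristic, the subject starts with 'Re:'."""
    return email.subject.strip().lower().startswith("re:")
```

Subject prefixes are easy to check but can be edited or localised by mail clients, so a dataset that keeps message headers or thread identifiers may support a more reliable test.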