Difference between revisions of "APA Feature Engineering"
Revision as of 16:33, 17 February 2017
Subject Line weightage:
We will use subject line weightage as one of the components in determining how important and relevant a single email exchange is to the business. Our approach is as follows:
- First, run an analysis of all the terms occurring in the entire dataset.
  - This analysis will filter out common words, prepositions and other unimportant words that could skew the results.
  - The analysis will return a list of words along with the frequency of each term’s occurrence in the dataset.
- Based on the results obtained, we will calculate the tf-idf of each term.
  - tf (term frequency): how often the term occurs in a document
  - idf (inverse document frequency): how rare the term is across the other documents; the fewer documents it occurs in, the higher its idf
  - tf-idf, the product of the two, will allow us to find the most important terms in the set of documents
- Using the tf-idf value, we will assign each term a weightage reflecting how strongly it signals importance to the business.
- Each subject line will then receive an aggregated weightage of the terms appearing in it.
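The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual implementation: each subject line is treated as one document, the stop-word list is a small hypothetical placeholder, idf is taken as log(N / document frequency), and the aggregated weightage of a subject line is assumed to be the sum of the tf-idf values of its terms. All function names are invented for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and"}


def tokenize(subject):
    """Lowercase a subject line, split it into terms and drop stop words."""
    terms = re.findall(r"[a-z0-9]+", subject.lower())
    return [t for t in terms if t not in STOP_WORDS]


def term_frequencies(subjects):
    """Count how often each remaining term occurs in the whole dataset."""
    counts = Counter()
    for subject in subjects:
        counts.update(tokenize(subject))
    return counts


def inverse_document_frequencies(subjects):
    """idf per term, assumed here to be log(N / number of subjects containing it)."""
    n_docs = len(subjects)
    doc_freq = Counter()
    for subject in subjects:
        doc_freq.update(set(tokenize(subject)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def subject_weight(subject, idf):
    """Aggregate weightage of one subject line: sum of tf * idf over its terms.

    Terms never seen in the dataset contribute 0 (an assumption of this sketch).
    """
    tf = Counter(tokenize(subject))
    return sum(count * idf.get(term, 0.0) for term, count in tf.items())


if __name__ == "__main__":
    subjects = [
        "Quarterly budget review",
        "Budget approval needed",
        "Lunch on Friday",
    ]
    idf = inverse_document_frequencies(subjects)
    for s in subjects:
        print(f"{s!r}: {subject_weight(s, idf):.3f}")
```

With this weighting, a term that appears in every subject line gets an idf of zero and contributes nothing, while terms confined to a few subject lines dominate the aggregate. Whether a high tf-idf actually corresponds to business importance still has to be decided separately when the weightage is assigned.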
isCC? isForward? isReply?:
- isCC?: Will indicate whether the email was CC-ed to the recipient
- isForward?: Will indicate whether the email was forwarded to the recipient
- isReply?: Will indicate whether the email was a reply
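A minimal sketch of how these three flags might be derived from a raw email, using Python's standard `email` package. The page does not say how the flags are computed, so the rules here are assumptions: isCC? checks whether the recipient's address appears in the Cc header, isForward? looks for a "Fw:" or "Fwd:" subject prefix, and isReply? looks for an In-Reply-To header or a "Re:" subject prefix. The function name `email_flags` is hypothetical.

```python
import re
from email import message_from_string
from email.utils import getaddresses


def email_flags(raw_message, recipient):
    """Return the isCC / isForward / isReply flags for one recipient.

    The detection rules are heuristics assumed for this sketch.
    """
    msg = message_from_string(raw_message)

    # Collect every address listed in any Cc header, lowercased for comparison.
    cc_addresses = {
        addr.lower() for _, addr in getaddresses(msg.get_all("Cc", []))
    }
    subject = (msg.get("Subject") or "").strip()

    is_forward = bool(re.match(r"(fw|fwd)\s*:", subject, re.IGNORECASE))
    is_reply = (
        msg.get("In-Reply-To") is not None
        or bool(re.match(r"re\s*:", subject, re.IGNORECASE))
    )
    return {
        "isCC": recipient.lower() in cc_addresses,
        "isForward": is_forward,
        "isReply": is_reply,
    }


if __name__ == "__main__":
    raw = (
        "From: alice@example.com\n"
        "To: bob@example.com\n"
        "Cc: Carol <carol@example.com>\n"
        "Subject: Fwd: Budget approval\n"
        "\n"
        "Please see below.\n"
    )
    print(email_flags(raw, "carol@example.com"))
```

Subject prefixes vary between mail clients and languages, so a prefix check alone will miss some forwards and replies; headers such as In-Reply-To are more reliable where the dataset preserves them.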