Difference between revisions of "APA Feature Engineering"

From Analytics Practicum

Revision as of 18:26, 25 February 2017


FeatureEngg.PNG




Subject Line weightage: We will be using subject line weightage as one of the components in determining how important and relevant a single email exchange is to the business. Our approach will be as follows:

  1. First, run an analysis on all the terms occurring in the entire dataset.
  • This analysis will filter out common words, prepositions and other unimportant words that could potentially skew the results.
  • The analysis will return a list of words along with the frequency of each term’s occurrence in the dataset.
  2. Based on the results obtained, calculate the tf-idf of each term.
  • tf (term frequency): how often the term occurs in a document
  • idf (inverse document frequency): how rare the term is across all documents; the fewer documents contain it, the higher its idf
  • tf-idf allows us to find the most important terms in the set of documents
  3. Using the tf-idf value, assign each term a weightage based on how important it is to the business.
  4. Each subject line of an email will then have an aggregated weightage of the terms appearing in it.


Subjectlineweightagescreenshot.jpg
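
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the function names `tokenize` and `subject_weights`, and the choice to use the raw tf-idf value directly as the term weightage are all hypothetical and not taken from the project's actual implementation.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "re", "fw", "the", "to"}


def tokenize(subject):
    """Lowercase a subject line and drop stop words."""
    words = re.findall(r"[a-z0-9']+", subject.lower())
    return [w for w in words if w not in STOP_WORDS]


def subject_weights(subjects):
    """Return one aggregated tf-idf weight per subject line."""
    docs = [tokenize(s) for s in subjects]
    n_docs = len(docs)
    # Document frequency: number of subject lines containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # Assumption: the term weightage is the tf-idf value itself,
        # and a subject line's weight is the sum over its distinct terms.
        score = sum((count / len(doc)) * math.log(n_docs / df[term])
                    for term, count in tf.items())
        weights.append(score)
    return weights
```

With this sketch, a subject line made only of terms that appear in many other subject lines gets a low weight, while one made of rare terms gets a high weight.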

Email Exchange Ratio:
This metric shows the number of emails exchanged between two employees A and B as a ratio of the total number of emails exchanged by these employees. (We also assume that larger emails carry more information, which motivates the Average Email Exchange Size metric below.)

EmailExchangeRatioFormula.PNG
N_ab: Number of emails exchanged between A and B
N_a: Number of emails sent by A
N_b: Number of emails sent by B

EmailExchangeRatioSQL.png

EmailExchangeRatioResults.jpg
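
As a rough illustration of this metric in Python: the formula image is not reproduced here, so the sketch assumes the ratio is N_ab / (N_a + N_b), which matches the symbol definitions above but may differ from the exact formula used. The function name `exchange_ratio` and the `(sender, recipient)` input layout are hypothetical.

```python
def exchange_ratio(emails, a, b):
    """Ratio of emails between a and b to all emails sent by a or b.

    emails: iterable of (sender, recipient) pairs.
    Assumed formula: N_ab / (N_a + N_b).
    """
    emails = list(emails)
    # N_ab: emails going either way between a and b.
    n_ab = sum(1 for s, r in emails if {s, r} == {a, b})
    # N_a and N_b: emails sent by each employee to anyone.
    n_a = sum(1 for s, _ in emails if s == a)
    n_b = sum(1 for s, _ in emails if s == b)
    total = n_a + n_b
    return n_ab / total if total else 0.0
```

A value close to 1 would mean that almost everything A and B send goes to each other; a value close to 0 means they mostly write to other people.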

Average Email Exchange Size:
This metric takes the average of email sizes of all the emails exchanged between two employees A and B.
EmailexSizeFormula.PNG
EmailexSizeSQL.png
EmailexSizeResults.png
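
A minimal Python sketch of this metric, assuming the average is a plain arithmetic mean over emails sent in either direction. The function name `average_exchange_size` and the `(sender, recipient, size)` tuple layout are hypothetical.

```python
def average_exchange_size(emails, a, b):
    """Mean size of all emails exchanged between a and b.

    emails: iterable of (sender, recipient, size) tuples, where size is
    in whatever unit the dataset records (assumed bytes here).
    """
    sizes = [size for s, r, size in emails if {s, r} == {a, b}]
    # Return 0.0 when the pair never exchanged an email.
    return sum(sizes) / len(sizes) if sizes else 0.0
```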

Email Chain Ratio:
This metric reveals the uniqueness of the communications between two employees by considering the number of unique subject lines along with the frequency of emails exchanged. It indicates how many distinct conversations are taking place between the employees.

EmailChainRatioFormula.PNG
EmailChainRatioSQL.PNG
EmailChainRatioResults.PNG
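
One possible reading of this metric, sketched in Python. Since the formula image is not reproduced here, the sketch assumes the ratio is the number of unique subject lines divided by the number of emails exchanged. Stripping "Re:" and "Fwd:" prefixes so that replies count as the same conversation is also an assumption, and the names `normalize_subject` and `chain_ratio` are hypothetical.

```python
import re

# Leading reply/forward markers such as "Re:", "FW:", "Fwd:" (possibly repeated).
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes and normalize case."""
    return _PREFIX.sub("", subject).strip().lower()


def chain_ratio(emails, a, b):
    """Unique subject lines divided by emails exchanged between a and b.

    emails: iterable of (sender, recipient, subject) tuples.
    """
    subjects = [normalize_subject(subj)
                for s, r, subj in emails if {s, r} == {a, b}]
    return len(set(subjects)) / len(subjects) if subjects else 0.0
```

Under this reading, a ratio near 1 means nearly every email starts a new conversation, while a low ratio means a few long chains dominate.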

Rate of exchange of emails:
This metric shows how often emails are exchanged between two employees; in other words, it helps us understand the regularity of email exchange.
RateOfExchangeFormula.PNG
RateOfExchangeSQL.PNG
RateOfExchangeResults.PNG
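
A hedged Python sketch of one way to compute such a rate. The formula image is not reproduced here, so this assumes the rate is the number of emails exchanged divided by the number of days between the first and last email, counted inclusively. The function name `exchange_rate` and the `(sender, recipient, sent_date)` layout are hypothetical.

```python
from datetime import date


def exchange_rate(emails, a, b):
    """Emails per day between a and b over their active period.

    emails: iterable of (sender, recipient, sent_date) tuples, where
    sent_date is a datetime.date.
    """
    days = [d for s, r, d in emails if {s, r} == {a, b}]
    if not days:
        return 0.0
    # Inclusive day count, so a single email yields a rate of 1.0.
    span = (max(days) - min(days)).days + 1
    return len(days) / span
```

For example, three emails spread over a ten-day window give a rate of 0.3 emails per day under this definition.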