Difference between revisions of "APA Feature Engineering"

From Analytics Practicum
Jump to navigation Jump to search
m (changed ECR)
(changed ECR)
Line 85: Line 85:
 
<br><br>
 
<br><br>
 
<div style="border:1px solid rgb(85,198,225); -moz-border-radius:15px;width:250px;padding:10px;text-align: center;">
 
<div style="border:1px solid rgb(85,198,225); -moz-border-radius:15px;width:250px;padding:10px;text-align: center;">
[[Image:EmailExchangeRatioFormula.PNG|200px]] <br>
+
[[File:ECRformula.jpg|200px]] <br>
 
𝑁𝑎𝑏: Number of emails exchanged between A and B <br>
 
𝑁𝑎𝑏: Number of emails exchanged between A and B <br>
 
𝑁𝑢: Number of emails exchanged between A and B with unique subject lines<br>
 
𝑁𝑢: Number of emails exchanged between A and B with unique subject lines<br>
Line 95: Line 95:
 
count( distinct `Domain group`) as uniqueEmails FROM midterm group by least(`Remote`,`Local`), greatest(`Remote`,`Local`)) as t1;</font>
 
count( distinct `Domain group`) as uniqueEmails FROM midterm group by least(`Remote`,`Local`), greatest(`Remote`,`Local`)) as t1;</font>
 
<br><br>
 
<br><br>
In this SQL statement, ‘Local’ is the sender of the email while ‘Remote’ is the receiver of the email. The variable “Domain Group” corresponds to the subject line of the email. A part of the results is shown in the table 7 below:
+
In this SQL statement, ‘Local’ is the sender of the email while ‘Remote’ is the receiver of the email. The variable “Domain Group” corresponds to the subject line of the email. A part of the results is shown in the table below:
 
<br><br>
 
<br><br>
[[Image:EmailChainRatioResults.PNG|400px]]
+
[[File:ECRresults.jpg|400px]]
 
<br><br>
 
<br><br>
 
'''Rate of exchange of emails:'''<br>
 
'''Rate of exchange of emails:'''<br>

Revision as of 14:23, 22 April 2017

APA logo.png

HOME

 

PROJECT OVERVIEW

 

METHODOLOGY

 

PROJECT MANAGEMENT

 

DOCUMENTATION

 

FEATURE ENGINEERING

 

OTHER PROJECTS

 


FeatureEngg.PNG




Subject Line weightage: We will be using subject line weightage as one of the components in determining how important and relevant a single email exchange is to the business. Our approach will be as follows:

  1. First run an analysis on all the terms occurring in the entire dataset
  • This analysis will filter out common words, prepositions and other unimportant words that could potentially skew the results.
  • The analysis will return a listen of words along with the frequency of the term’s occurrence in the dataset.
  1. Based on the results obtained, we would like to calculate the tf-idf of each term.
  • tf: how often does the term occur in the document
  • idf: how often does the term occur in other documents
  • tf-idf will allow us to find the most important terms in the set of documents
  1. Using the value of this tf-idf, we will assign each term a weightage based on how important it is in determining the importance to the business
  2. Each subject line of an email will then have an aggregated weightage of the terms appearing in itself.


Subjectlineweightagescreenshot.jpg

Email Exchange Ratio:
This feature is used to show the frequency of information exchanged between any two employees. It is used to check the number of emails exchanged between two employees to show how much they interact, collaborate, and share information. The formula used for this feature is:

EmailExchangeRatioFormula.PNG
𝑁𝑎𝑏: Number of emails exchanged between A and B
𝑁𝑎: Number of emails sent by A
𝑁𝑏:Number of emails sent by B


The email data is imported into phpmyadmin and SQL is used to apply the above formula onto the data to get proper output. The SQL statement used to apply this formula is given below:

select t1.employee1, t1.employee2, (t1.total/(IFNULL(t2.total,0) + IFNULL(t3.total,0) - t1.total)) as EmailExchangedRatio from (select least(`Remote`,`Local`) as employee1, greatest(`Remote`,`Local`) as employee2, count(*) as total from midterm group by least(`Remote`,`Local`), greatest(`Remote`,`Local`)) as t1

left join (select `Local` as employee, count(*) as total from midterm group by `Local`) as t2 on t2.employee = t1.employee1

left join (select `Local` as employee, count(*) as total from midterm group by `Local`) as t3 on t3.employee = t1.employee2;

In this SQL statement, ‘Local’ is the sender of the email while ‘Remote’ is the receiver of the email. A part of the results is shown in the table below:

EmailExchangeRatioResults.jpg

Average Email Exchange Size:
This feature is used to show the quantity of information exchanged between any two employees. Assuming that a larger email size shows larger amount of information exchange, the following formula is used to calculate the average email exchange size:

Average (Size of all emails exchangedbetween A and B)

The email data is imported into phpmyadmin and SQL is used to apply the above formula onto the data to get proper output. The SQL statement used to apply this formula is given below.

select least(`Remote`,`Local`) as employee1, greatest(`Remote`,`Local`) as employee2, avg(size) from midterm group by least(`Remote`,`Local`), greatest(`Remote`,`Local`);

In this SQL statement, ‘Local’ is the sender of the email while ‘Remote’ is the receiver of the email. The variable ‘size’ corresponds to the size of the email. A part of the results is shown in the table below:

EmailexSizeResults.png

Email Chain Ratio:
This feature is used to show the variety of information exchanged between any two employees. The assumption that unique conversations taking place in emails show variety of information being shared was made. Thus, number of emails with unique subject lines shows number of different conversations taking place between employees through email data. To capture this, the following formula is used:

ECRformula.jpg
𝑁𝑎𝑏: Number of emails exchanged between A and B
𝑁𝑢: Number of emails exchanged between A and B with unique subject lines



The email data is imported into phpmyadmin and SQL is used to apply the above formula onto the data to get proper output. The SQL statement used to apply this formula is given below:

SELECT t1.employee1, t1.employee2, (t1.uniqueEmails/t1.total) as ratio from (SELECT least(`Remote`,`Local`) as employee1, greatest(`Remote`,`Local`) as employee2, count(*) as total, count( distinct `Domain group`) as uniqueEmails FROM midterm group by least(`Remote`,`Local`), greatest(`Remote`,`Local`)) as t1;

In this SQL statement, ‘Local’ is the sender of the email while ‘Remote’ is the receiver of the email. The variable “Domain Group” corresponds to the subject line of the email. A part of the results is shown in the table below:

ECRresults.jpg

Rate of exchange of emails:
This metric shows how often emails are exchanged between two employees, in other words, it helps us understand the regularity of email exchange.
RateOfExchangeFormula.PNG
RateOfExchangeSQL.PNG
RateOfExchangeResults.PNG