APA Final Progress

From Analytics Practicum
Revision as of 15:29, 23 February 2017 by Prekshaapu.2013 (talk | contribs)

Preliminary analysis

Before Cleaning
Our data consists of 14 columns as described below:

Column | Explanation
Date | Timestamp of the email
Remote IP | If the email exchange is external, this column shows the external person's email
Remote | The TrustSphere employee who is receiving or sending the email
Remote Domain | Always TrustSphere
Local | Email address of the person sending the email
Local Domain | Domain of the person who is sending the email
Originator | Inbound, outbound or internal (whether the email is being received, being sent, or is between two TrustSphere employees)
Direction | Always TrustSphere in this case
Domain Group | Email header (subject line)
Subject | Type of message: email/im (instant messaging)/voice/sms
Inbound Count | Number of emails received
Outbound Count | Number of emails sent
Size | Size of the message (number of characters)
Msgid | Encoded message ID

Data Statistics
Number of rows | 121,154
Date range | 11/26/2016 8:00 am to 02/01/2017 12:00 am


After Cleaning
For our analysis, we only need emails sent internally, that is, from one employee in the company to another. Our data also contained instant messaging records, which we will not be using. We therefore filtered out rows with the following values:

  • Originator: Inbound and Outbound (emails to or from external email addresses)
  • Subject: im (instant messaging)
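
A minimal sketch of this filter in Python, assuming each row has been loaded as a dictionary keyed by the column names above. The value encodings ("internal", "inbound", "outbound", "im", "em") and the sample rows are assumptions for illustration, not taken from the actual data set.

```python
def keep_internal_emails(rows):
    """Keep rows that are internal and are not instant messages."""
    return [
        row for row in rows
        if row["Originator"].strip().lower() == "internal"
        and row["Subject"].strip().lower() != "im"
    ]


# Hypothetical sample rows (assumed value encodings).
sample_rows = [
    {"Originator": "internal", "Subject": "em"},
    {"Originator": "inbound", "Subject": "em"},
    {"Originator": "outbound", "Subject": "em"},
    {"Originator": "internal", "Subject": "im"},
]

filtered_rows = keep_internal_emails(sample_rows)
print(len(filtered_rows), "of", len(sample_rows), "rows kept")
```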

We also found several system email addresses that could skew the data because they send or receive mass emails. Hence, we decided to eliminate emails to and from these addresses as well. Below, we list some of these email addresses and the number of times each occurred in the data set.

Email address | Occurrences
analytics@trustsphere.com | 134
heartbeat@trustsphere.com | 1658
jira@trustsphere.com | 1386
sfdc@trustsphere.com | 899
trustsphere.office@trustsphere.com | 15
tv.reports@trustsphere.com | 1394
marketing.team@trustsphere.com | 322
trustvault.selfservice@trustsphere.com | 95

Number of rows after removing:
im + inbound/outbound | 45,855
system email addresses | 29,797
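
The removal of system addresses could be sketched as follows, assuming rows are dictionaries and that the two parties of each email are in the Remote and Local columns. The set below contains only the addresses listed on this page, so it is likely incomplete, and the employee addresses in the sample are hypothetical.

```python
# System addresses listed on this page (a partial list).
SYSTEM_ADDRESSES = frozenset({
    "analytics@trustsphere.com",
    "heartbeat@trustsphere.com",
    "jira@trustsphere.com",
    "sfdc@trustsphere.com",
    "trustsphere.office@trustsphere.com",
    "tv.reports@trustsphere.com",
    "marketing.team@trustsphere.com",
    "trustvault.selfservice@trustsphere.com",
})


def drop_system_emails(rows, system_addresses=SYSTEM_ADDRESSES):
    """Drop rows where either party is a system address."""
    def is_system(address):
        return address.strip().lower() in system_addresses

    return [
        row for row in rows
        if not (is_system(row["Remote"]) or is_system(row["Local"]))
    ]


# Hypothetical sample rows; alice and bob are made-up employees.
sample_emails = [
    {"Remote": "alice@trustsphere.com", "Local": "bob@trustsphere.com"},
    {"Remote": "alice@trustsphere.com", "Local": "jira@trustsphere.com"},
    {"Remote": "Heartbeat@trustsphere.com", "Local": "bob@trustsphere.com"},
]

person_to_person = drop_system_emails(sample_emails)
print(len(person_to_person), "row(s) kept")
```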

Further, we removed columns that we do not need in our analysis. These include:

  • Remote IP: Not needed, because we are only using internal email exchanges.
  • Remote Domain: Since we only have internal data, this will always be TrustSphere's domain, so keeping it would be redundant.
  • Local Domain: Since we only have internal data, this will always be TrustSphere's domain, so keeping it would be redundant.
  • Direction: This will always be within TrustSphere.
  • Inbound Count: At this stage, we are focusing on relationship analysis (between two employees) rather than individual analysis within a network. We may need this column in the second half of our project.
  • Outbound Count: Dropped for the same reason as Inbound Count. We may need this column in the second half of our project.
  • Subject: We are only using email data, so this will always have a value of 'em' and keeping it would be redundant.
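
As an illustration of the column removal, the sketch below drops the seven columns listed above from rows held as dictionaries. The column names come from the column table; the example row uses empty placeholder values rather than real data.

```python
# Columns removed in this step, named as in the column table.
DROPPED_COLUMNS = frozenset({
    "Remote IP", "Remote Domain", "Local Domain", "Direction",
    "Inbound Count", "Outbound Count", "Subject",
})

# All 14 columns of the raw data, in table order.
ALL_COLUMNS = [
    "Date", "Remote IP", "Remote", "Remote Domain", "Local",
    "Local Domain", "Originator", "Direction", "Domain Group",
    "Subject", "Inbound Count", "Outbound Count", "Size", "Msgid",
]


def drop_unused_columns(rows, dropped=DROPPED_COLUMNS):
    """Return copies of the rows without the dropped columns."""
    return [
        {name: value for name, value in row.items() if name not in dropped}
        for row in rows
    ]


# Placeholder row: every column present, values left empty.
example_row = {name: "" for name in ALL_COLUMNS}
trimmed_rows = drop_unused_columns([example_row])
print(list(trimmed_rows[0]))
```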


  • Exploration of the network: filtered for internal employees only.
  • Trends based on message size: no correlation found.
  • Eigenvector centrality analysis: the results were biased. Although the generated network showed certain employees as highly influential, the client told us that those individuals are not actually that influential. We concluded that this was because all ties were given equal weight.
  • Thus, we must weight the ties differently, using factors such as subject line weighting, reply rate, whether the email is a reply, forward or cc, and the hierarchy of email senders and recipients.
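
To illustrate why tie weights matter, here is a minimal sketch of weighted eigenvector centrality computed by power iteration. It is not the tool or weighting scheme used in the project: the three employees, the weight of 3, and the convention that a node scores highly when it receives heavily weighted emails from high-scoring senders are all assumptions made for the example.

```python
import math


def eigenvector_centrality(edges, max_iter=200, tol=1e-9):
    """Power iteration on a directed, weighted graph.

    edges maps (sender, recipient) to a tie weight.
    """
    nodes = sorted({node for pair in edges for node in pair})
    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(max_iter):
        # Start from the old scores (an identity shift) so the
        # iteration does not oscillate on cyclic graphs.
        new = dict(score)
        for (sender, recipient), weight in edges.items():
            new[recipient] += weight * score[sender]
        norm = math.sqrt(sum(value * value for value in new.values()))
        new = {node: value / norm for node, value in new.items()}
        if sum(abs(new[node] - score[node]) for node in nodes) < tol:
            return new
        score = new
    raise RuntimeError("power iteration did not converge")


# Hypothetical network: every employee emails every other employee.
pairs = [("a", "b"), ("b", "a"), ("a", "c"),
         ("c", "a"), ("b", "c"), ("c", "b")]

# Equal weights: every tie counts the same.
equal_scores = eigenvector_centrality({pair: 1.0 for pair in pairs})

# Assumed weighting: emails sent to "b" count three times as much.
weighted_edges = {pair: (3.0 if pair[1] == "b" else 1.0) for pair in pairs}
weighted_scores = eigenvector_centrality(weighted_edges)

for name in ("a", "b", "c"):
    print(name, round(equal_scores[name], 3), round(weighted_scores[name], 3))
```

With equal weights the three employees are indistinguishable, while the weighted version ranks "b" highest. Real weights would come from the factors listed above, such as reply rate or whether an email is a reply, forward or cc.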


Figure (Prel analysis 1.jpg): Blue = high eigenvector centrality; White = mid; Red = low; Size of node = outdegree