APA Final Progress

HOME

 

PROJECT OVERVIEW

 

METHODOLOGY

 

FEATURE ENGINEERING

 

CLASSIFICATION MODELLING

 

DOCUMENTATION

 

OTHER PROJECTS

 


MethodologyAPA.jpg


Data

Email Data
Before Cleaning
The email exchange data from TrustSphere’s Trust Vault contains all emails sent and received by TrustSphere employees. The data used for this study is filtered to internal communication over email only (emails exchanged with external third parties and messages sent via instant messaging, voice or SMS are not included). An important point to note is that the ‘Domain Group’ column in the table below contains only the email subject line, not the full email body. This is a limitation of the study, as analyses such as sentiment analysis or term valence analysis cannot be conducted on short text documents such as email subject lines.

Column Explanations
Column | Column Description | Data Values
Date | Timestamp of the email | 26th November 2016 08:00:00 AM – 1st February 2017 00:00:00 AM
Remote IP | If the email exchange is external then this column shows the external person's email |
Remote | The TrustSphere employee who is receiving or sending the email | TrustSphere employee email addresses
Remote Domain | Always TrustSphere | TrustSphere
Local | Email address of the person sending the email | TrustSphere employee email addresses
Local Domain | Domain of the person who is sending the email | TrustSphere (extracting only internal email exchange data)
Originator | Where is the email coming from or going to | Inbound/Outbound/Internal
Direction | Always TrustSphere in this case | ‘trustsphere’
Domain Group | Email Header (Subject Line) |
Subject | Type of message | email/im (instant messaging)/voice/sms
Inbound Count | Number of emails received | Ranges between 0 – 13423
Outbound Count | Number of emails sent | Ranges between 0 – 12234
Size | Size of the message (number of characters) | Ranges between 0 – 36,516,837 characters
Msgid | Encoded Message ID | Auto – generated


Cleaning Email Data
The extracted email data covers 11/26/2016 8:00 am to 02/01/2017 00:00 am. Before cleaning, there were 14 columns and 121,154 rows of data. The following steps were taken to clean the email data:

1. Removing emails with Subject not equal to ‘email’. Rationale: Analysis only on email data.
2. Removing emails with Originator not equal to ‘internal’. Rationale: Analysis only on internal communication.
3. Removing system emails from Local and Remote. Rationale: Analysis on collaboration between real employees only. The system email addresses found are listed below, with counts in parentheses:

  • List of system emails found in Local:
    • accounting@trustsphere.com (1)
    • amazons3@trustsphere.com (3)
    • analytics@trustsphere.com (134)
    • careers@trustsphere.com (4)
    • customer.care@trustsphere.com (120)
    • heartbeat@trustsphere.com (1658)
    • info@trustsphere.com (1)
    • jira@trustsphere.com (1386)
    • marketing.team@trustsphere.com (322)
    • marketing@trustsphere.com (56)
    • northamericanteamcallactionitems@trustsphere.com (9)
    • peopleanalytics@trustsphere.com (175)
    • postman@trustsphere.com (1097)
    • postmaster@trustsphere.com (39)
    • sfdc@trustsphere.com (899)
    • sg.boardroom@trustsphere.com (22)
    • support@trustsphere.com (604)
    • trustsphere.office@trustsphere.com (15)
    • trustvault.selfservice@trustsphere.com (95)
    • tv.reports@trustsphere.com (1394)
    • wordpress@trustsphere.com (21)
    • zabbix@trustsphere.com (1739)
  • List of system emails found in Remote:
    • alerts.support@trustsphere.com (1)
    • aolia@intradyn.com (1)
    • crm.report@trustsphere.com (6197)
    • customer.care@trustsphere.com (5)
    • dhartzler@intradyn.com (1)
    • marketingteam@trustsphere.com (1)
    • mgillard@intradyn.com (1)
    • postman@trustsphere.com (20)
    • sg.boardroom@trustsphere.com (25)
    • sys.admin@trustsphere.com (12)

4. Removing the following unnecessary columns:

  • Remote IP
  • Remote Domain
  • Local Domain
  • Direction
  • Inbound Count
  • Outbound Count
  • Subject
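
The four cleaning steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the rows are assumed to be dictionaries keyed by the column names from the table above, the `clean_email_rows` helper is hypothetical, only three of the listed system addresses are included for brevity, and the sample employee addresses are made up.

```python
# Subset of the system addresses listed above (illustrative, not exhaustive).
SYSTEM_ADDRESSES = {
    "jira@trustsphere.com",
    "zabbix@trustsphere.com",
    "crm.report@trustsphere.com",
}

# Columns dropped in step 4.
DROP_COLUMNS = {
    "Remote IP", "Remote Domain", "Local Domain", "Direction",
    "Inbound Count", "Outbound Count", "Subject",
}


def clean_email_rows(rows, system_addresses=SYSTEM_ADDRESSES):
    """Apply cleaning steps 1-4 to a list of row dictionaries."""
    cleaned = []
    for row in rows:
        # Step 1: keep only messages whose Subject (message type) is 'email'.
        if row["Subject"].lower() != "email":
            continue
        # Step 2: keep only internal communication.
        if row["Originator"].lower() != "internal":
            continue
        # Step 3: drop rows where either party is a system address.
        if (row["Local"].lower() in system_addresses
                or row["Remote"].lower() in system_addresses):
            continue
        # Step 4: remove the unnecessary columns.
        cleaned.append({k: v for k, v in row.items() if k not in DROP_COLUMNS})
    return cleaned


# Made-up sample rows (hypothetical employee addresses).
sample_rows = [
    {"Local": "alice@trustsphere.com", "Remote": "bob@trustsphere.com",
     "Originator": "Internal", "Subject": "email", "Direction": "trustsphere", "Size": 1200},
    {"Local": "jira@trustsphere.com", "Remote": "bob@trustsphere.com",
     "Originator": "Internal", "Subject": "email", "Direction": "trustsphere", "Size": 300},
    {"Local": "alice@trustsphere.com", "Remote": "bob@trustsphere.com",
     "Originator": "Internal", "Subject": "im", "Direction": "trustsphere", "Size": 40},
    {"Local": "alice@trustsphere.com", "Remote": "bob@trustsphere.com",
     "Originator": "Outbound", "Subject": "email", "Direction": "trustsphere", "Size": 900},
]

cleaned_rows = clean_email_rows(sample_rows)
```

Only the first sample row survives: the second involves a system address, the third is an instant message, and the fourth is not internal.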

After cleaning, there were 29,797 rows of data with no missing data instances.
Staff and Email Data Comparison
Employeemissing.png
The employees highlighted in pink are present in the staff list but do not have any email interaction in the 10 weeks of data being used. There are 8 such employees. These employees may have left the company, and hence the staff data is updated accordingly (removal of employees). The employees highlighted in yellow are present in the email interaction data from the past 10 weeks but are not present in the staff list. There are 5 such employees. These employees may be new hires, and hence the staff data is updated accordingly (addition of employees).
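
A comparison of this kind amounts to two set differences. The sketch below is only an illustration: the `compare_staff_and_email` helper and the identifiers in the example are hypothetical, and it assumes staff members and email participants have already been mapped to a common identifier.

```python
def compare_staff_and_email(staff, email_participants):
    """Return who appears in only one of the two sources."""
    staff = set(staff)
    email_participants = set(email_participants)
    return {
        # In the staff list but with no email interaction (possible leavers).
        "staff_without_email": sorted(staff - email_participants),
        # In the email data but missing from the staff list (possible new hires).
        "email_without_staff": sorted(email_participants - staff),
    }


# Made-up identifiers for illustration.
result = compare_staff_and_email(
    staff=["emp_a", "emp_b", "emp_c"],
    email_participants=["emp_b", "emp_c", "emp_d"],
)
```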
Staff Data
The staff data contains hierarchical, departmental, and locational information of all employees in TrustSphere as shown in the table below.

Column | Column Description | Data Values
Name | Name of employee | TrustSphere employee names
Hierarchy | Designation of employee | Associate/Operational/Upper Management/Senior Management/C Suite
Department | Department of employee | Administrative/C Suite/Development/Operations/Sales/Marketing/Product/Senior Management/Strategy
Location | Location where the employee is based | Australia/Canada/Malaysia/Singapore/US/Philippines/New Zealand/UK


Survey Data
Creating the survey
A survey was sent out to all TrustSphere employees to map the work network within the organization. The work network results are used to build the classification model from our email data, which is explained in later sections. The survey was created using Qualtrics. Basic details collected in the survey include the employee’s name, their immediate supervisor, and the number of years they have worked at the company. The main part of the survey generates the work network by asking employees to rate their colleagues on a scale of 1 (lowest) to 5 (highest) based on the question: Who do you work with on a regular basis?
Note that if an employee does not give any rating to another employee, the answer is taken as 0.

 | Adesh Goel | Alistair Weatherill | Annabel Koh
Adesh Goel | 0 | 3 | 1
Alistair Weatherill | 0 | 0 | 0
Annabel Koh | 1 | 1 | 0

Cleaning survey data
The adjacency matrix above is converted into an edge list. For the Work Network, each unique pair of employees forms an edge, and the average of the two ratings the employees gave each other is taken as the edge weight.
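
The conversion can be sketched as follows, using the three-person excerpt of the survey matrix shown above. The `to_edge_list` helper is hypothetical, and dropping pairs whose average rating is 0 is an assumption made for the example rather than something the source states.

```python
from itertools import combinations

names = ["Adesh Goel", "Alistair Weatherill", "Annabel Koh"]
# Ratings from the excerpt above; a missing rating counts as 0.
matrix = [
    [0, 3, 1],
    [0, 0, 0],
    [1, 1, 0],
]


def to_edge_list(names, matrix):
    """Turn a rating matrix into (person, person, average rating) edges."""
    edges = []
    for i, j in combinations(range(len(names)), 2):
        # Average the two ratings the pair gave each other.
        weight = (matrix[i][j] + matrix[j][i]) / 2
        if weight > 0:  # assumption: pairs with no rating at all are left out
            edges.append((names[i], names[j], weight))
    return edges


edge_list = to_edge_list(names, matrix)
```

For the excerpt this gives three edges: Adesh Goel and Alistair Weatherill with weight 1.5, Adesh Goel and Annabel Koh with weight 1.0, and Alistair Weatherill and Annabel Koh with weight 0.5.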

Data Exploration

Email Data exploration
Email distribution for associates
Every bubble indicates the number of emails sent by an employee on a single day; the size of the bubble represents the number of emails for that day. It can be observed that most associates have regular email communication. It can also be observed that Bersileus, Hansel and Lemuel have hardly any communication. This is because they are new to the company.
Associates.png
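
The quantity behind each bubble is a count of emails per sender per day. A minimal sketch of that aggregation is given below; the `daily_sent_counts` helper, the ISO-style timestamp strings and the sender labels are assumptions for illustration and do not reflect the actual export format.

```python
from collections import Counter


def daily_sent_counts(rows):
    """Count emails per (sender, day) from (sender, timestamp) pairs.

    Timestamps are assumed to be ISO-8601 strings, so the first ten
    characters give the calendar date.
    """
    return Counter((sender, timestamp[:10]) for sender, timestamp in rows)


# Made-up rows for illustration.
counts = daily_sent_counts([
    ("emp_a", "2016-11-28T09:15:00"),
    ("emp_a", "2016-11-28T14:02:00"),
    ("emp_a", "2016-11-29T10:30:00"),
    ("emp_b", "2016-11-28T11:45:00"),
])
```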
Email distribution for the C-Suite
Every bubble indicates the number of emails sent by an employee on a single day; the size of the bubble represents the number of emails for that day. It can be observed that all C-Suite employees have regular email communication. The very large bubbles are possibly mass emails sent to the whole company by C-Suite employees.
Csuite.png
Average size of email by Hierarchy
The size of an email is a potential indicator of the amount of email content. The figure below shows which hierarchy level sends the largest emails on average. Associates have the highest average email size. This can be explained by their relatively more hands-on technical work, which involves a lot of data. Operational employees have the lowest average email size, due to their independent nature and role at the company.
Hier.png
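
An average of this kind can be obtained by looking up each sender's hierarchy level in the staff data and averaging the Size column per level. The sketch below assumes that approach; the `mean_size_by_hierarchy` helper, the sender labels and the sizes are made up for illustration.

```python
from collections import defaultdict


def mean_size_by_hierarchy(emails, hierarchy_of):
    """Average email size per hierarchy level.

    emails: iterable of (sender, size) pairs.
    hierarchy_of: mapping from sender to hierarchy level (from the staff data).
    """
    totals = defaultdict(lambda: [0, 0])  # level -> [sum of sizes, email count]
    for sender, size in emails:
        level = hierarchy_of.get(sender)
        if level is None:
            continue  # sender not found in the staff data
        totals[level][0] += size
        totals[level][1] += 1
    return {level: total / count for level, (total, count) in totals.items()}


# Made-up values for illustration.
averages = mean_size_by_hierarchy(
    emails=[("emp_a", 100), ("emp_b", 300), ("emp_c", 50), ("unknown", 999)],
    hierarchy_of={"emp_a": "Associate", "emp_b": "Associate", "emp_c": "Operational"},
)
```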

Centralities Without Edge Weights
Eigenvector and Betweenness Centrality
A social network was created using email data where the edges had no weights. Eigenvector and betweenness centrality were applied to this network, as shown in the figure below.
Eigbet.png
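
As a rough illustration of how these two measures can be computed on an unweighted network, the sketch below uses power iteration for eigenvector centrality and Brandes' algorithm for betweenness centrality. It is not the tool used in the study (the source does not name one); the toy graph, the node labels and the choice to treat ties as undirected are assumptions made for the example.

```python
from collections import deque


def eigenvector_centrality(adj, iterations=200, tol=1e-9):
    """Power iteration on an undirected, unweighted graph {node: set(neighbours)}."""
    score = {n: 1.0 for n in adj}
    for _ in range(iterations):
        # Iterate with A + I (each node keeps its own score) so that the
        # iteration also converges on bipartite graphs.
        new = {n: score[n] + sum(score[m] for m in adj[n]) for n in adj}
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in adj) < tol:
            return new
        score = new
    return score


def betweenness_centrality(adj):
    """Unnormalised betweenness (Brandes' algorithm) for an undirected graph."""
    bet = {n: 0.0 for n in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack, preds = [], {n: [] for n in adj}
        sigma = {n: 0 for n in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {n: 0.0 for n in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bet[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {n: b / 2 for n, b in bet.items()}


# Toy graph: a is linked to b, c and d; d is also linked to e.
toy = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a", "e"}, "e": {"d"}}
eig = eigenvector_centrality(toy)
bet = betweenness_centrality(toy)
```

In the toy graph, node a scores highest on both measures, while node d has non-zero betweenness only because it is the sole route to e.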
Eigenvector centrality is a measure of the influence of a node in a network. It is a global measure. Betweenness centrality of a node reflects the amount of control that the node exerts over the interactions of other nodes in the network (information flow). It is a relatively localized measure. People with high eigenvector centrality and low betweenness centrality may be connected to highly influential individuals. Arun Sundar (Chief Strategy Officer) and Manish Goel (Chief Executive Officer) have high centralities as they are the most important people at the firm. Furthermore, Associate Hana Owens, who has relatively high global connections (eigenvector centrality), is highly connected to the high-level employees. This is also evident in the survey results.
A social network was created using the survey data, in which employees answered how much they work with each other, i.e. the work network. In the network below, the size of a node represents its betweenness centrality. The filters applied are 1) tie weight >= 3 (strong, on a scale of 1-5), and 2) mutual edges.
Hana.png
The figure above shows that Hana Owens is connected to high-level employees such as Manish Goel and Adesh Goel.
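
The two filters applied to the survey network (tie weight >= 3 and mutual edges) can be sketched as below. The interpretation that both employees must have rated each other at least 3 is an assumption, as are the `strong_mutual_ties` helper and the placeholder labels and ratings.

```python
def strong_mutual_ties(ratings, threshold=3):
    """Keep pairs where both directions are rated at or above the threshold.

    ratings: mapping from (rater, rated) to a rating on the 1-5 scale;
    a missing entry is treated as 0, as in the survey data.
    """
    ties = set()
    for (a, b), rating in ratings.items():
        if a != b and rating >= threshold and ratings.get((b, a), 0) >= threshold:
            ties.add(tuple(sorted((a, b))))
    return sorted(ties)


# Made-up ratings for illustration.
ties = strong_mutual_ties({
    ("A", "B"): 4, ("B", "A"): 3,   # mutual and strong: kept
    ("A", "C"): 5, ("C", "A"): 2,   # one side below the threshold: dropped
    ("B", "C"): 3,                  # not reciprocated: dropped
})
```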
Degree Centrality
Degree centrality was applied to the social network created using email data where the edges had no weights, and the graph below was derived. It can be observed that the degree centrality results are similar to the out-degree and in-degree results. Thus, the study will consider only degree centrality for social network comparison instead of focusing on in-degree and out-degree separately.
Deg.png
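
For reference, in-degree, out-degree and total degree on an unweighted email network can be counted as in the sketch below. It assumes each distinct sender and receiver pair counts once regardless of how many emails were exchanged, and that total degree is the sum of in-degree and out-degree; the `degree_measures` helper and the labels are hypothetical.

```python
from collections import Counter


def degree_measures(edges):
    """In-degree, out-degree and total degree from (sender, receiver) pairs."""
    # Unweighted network: repeated emails between the same pair count once.
    unique = {(s, r) for s, r in edges if s != r}
    out_deg = Counter(s for s, _ in unique)
    in_deg = Counter(r for _, r in unique)
    nodes = set(out_deg) | set(in_deg)
    return {
        n: {"in": in_deg[n], "out": out_deg[n], "degree": in_deg[n] + out_deg[n]}
        for n in nodes
    }


# Made-up edges for illustration.
degrees = degree_measures([("a", "b"), ("a", "b"), ("b", "a"), ("a", "c")])
```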
Closeness Centrality
Closeness centrality was applied to the social network created using email data where the edges had no weights, and the figure below was derived. It can be observed that closeness centrality shows very little variation among the employees. Thus, the study will not consider closeness centrality for social network comparison.
Clo.png
Correlation Analysis
Corr.png
An R-square of 0.895 between eigenvector and degree centrality shows a very high correlation. Since eigenvector centrality gives a global view, it is preferable over degree centrality as a measure of analysis. Therefore, there is no need to consider degree centrality as an individual analysis. An R-square of 0.5336 between eigenvector and betweenness centrality shows a weak correlation. Therefore, we will consider both eigenvector and betweenness measures for separate analysis.
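
For two centrality measures, an R-square of this kind is the squared Pearson correlation between the per-employee scores. The sketch below shows that calculation; the `r_squared` helper and the input values are made up for illustration and are not the study's data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Made-up scores: the second list is exactly twice the first.
perfect_fit = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
# Made-up scores with no linear relationship.
no_fit = r_squared([1, 2, 3, 4], [1, -1, -1, 1])
```

A value close to 1 means one measure is almost a linear function of the other, which is the argument used above for dropping degree centrality in favour of eigenvector centrality.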

Feature Engineering + Survey


Click on FEATURE ENGINEERING for more details.


Mode of data collection: Online survey
Target Sample: All employees in the company (across geographies)
Aim: To use the survey to validate whether an email exchange network is a good tool for calculating an influence score. We define Influence Score as the extent to which an individual sways information flow in the workplace.
Summary: The purpose of the survey is to validate the use of an email exchange network for calculating a combination of collaboration and influence.
Work Network: We define the work network as the network of employees with whom one interacts on a daily basis for work purposes.



You may view the survey here.

Classification Modelling


Click on CLASSIFICATION MODELLING for more details.