Difference between revisions of "APA Final Progress"

From Analytics Practicum
(Added Survey information)
Line 30: Line 30:
 
{|style="width:100%;vertical-align:top;margin-top:20px;"
 
{|style="width:100%;vertical-align:top;margin-top:20px;"
 
|-
 
|-
|style="vertical-align:top;width:30%;" | <div style="background: #10d0e5; padding: 13px; font-weight: bold; text-align:center; line-height: wrap_content; text-indent: 20px;font-size:20px; font-family:helvetica"> <font color= #ffffff>Preliminary analysis</font></div><br/>
+
|style="vertical-align:top;width:30%;" | <div style="background: #10d0e5; padding: 13px; font-weight: bold; text-align:center; line-height: wrap_content; text-indent: 20px;font-size:20px; font-family:helvetica"> <font color= #ffffff>Stage 2: Explore and Clean Data</font></div><br/>
 
<p>
 
<p>
  
Line 178: Line 178:
 
<br>
 
<br>
  
<big>'''Survey Questions'''</big><br>
+
{|style="width:100%;vertical-align:top;margin-top:20px;"
Our aim is to use the survey to validate if an email exchange network is a good tool to calculate influence score. We define Influence Score as the extent to which an individual sways information flow in the workplace.
+
|-
We defined six different types of information flow, as mentioned below:
+
|style="vertical-align:top;width:30%;" | <div style="background: #10d0e5; padding: 13px; font-weight: bold; text-align:center; line-height: wrap_content; text-indent: 20px;font-size:20px; font-family:helvetica"> <font color= #ffffff>Stage 3: Feature Engineering + Survey </font></div><br/>
 +
 
  
[[File:Survey.JPG|700px|center]]
+
<center>
 +
Click on [[APA_Feature Engineering|<font  face ="Century Gothic" color="#00C5CD"><strong><i> FEATURE ENGINEERING</i></strong></font>]] for more details.
 +
</center>
 
<br>
 
<br>
  
 +
'''Mode of data collection:''' Online survey <br>
 +
'''Target Sample:''' All employees in the company (across geographies)<br>
 +
'''Aim:''' To use the survey to validate if an email exchange network is a good tool to calculate influence score. We define Influence Score as the extent to which an individual sways information flow in the workplace. <br>
 +
'''Summary:''' The purpose of the survey is to validate the use of an email exchange network for calculating influence score, where influence score is defined as the extent to which an individual sways information flow in the workplace. <br>
 +
In a work environment, as there can be different kinds of information flow, we divided the term influence into six main categories:
 +
*'''Social:''' defined as any interaction regarding the business with any colleague. This gives a high level view of the kind of interactions and volumes of interactions between employees.
 +
** <i>How many times do you interact with the following colleagues regarding business topics, within a month?</i>
 +
*'''Information sharing:''' defined as an interaction when job related resources or information is transferred between employees.
 +
** <i>How many times do you receive job related information from the following colleagues within a month?</i>
 +
* '''Problem solving:''' defined as an interaction where employees seek help in solving problems. These interactions will be dependent on the kind of work-related problems an employee regularly faces.
 +
** <i>How many times do you seek help from the following colleagues for business/technical related problems within a week?</i>
 +
* '''Decision making:''' defined as an interaction between two employees where one employee consults the other on a specific business related decision to make.
 +
** <i>How many times do you consult the following colleagues if you have a work related decision to make, within a week?</i>
 +
* '''Support:''' defined as an interaction wherein an employee provides career advice to another employee.
 +
** <i>How many times do you discuss your career prospects and progression with the following colleagues in a year?</i>
 +
* '''Idea generation:''' defined as an interaction between two employees that involves the discussion of novel ideas or approaches.
 +
** <i>How many times do you discuss, share or brainstorm novel ideas with the following colleagues, in a quarter?</i>
 +
 +
<br>
 +
<big>'''You may view the'''</big>
 +
[https://smusg.asia.qualtrics.com/jfe/form/SV_6eVxySZKg8NAW2N <font face ="Century Gothic" color="#00C5CD"><strong><i><big>survey here.</big></i></strong></font>]
 
<p>
 
<p>

Revision as of 18:32, 26 February 2017

Stage 2: Explore and Clean Data

Email Data
Before Cleaning
Our data consists of 14 columns as described below:

Column: Explanation
  • Date: Timestamp of the email
  • Remote IP: If the email exchange is external then this column shows the external person's email
  • Remote: The TrustSphere employee who is receiving or sending the email
  • Remote Domain: Always TrustSphere
  • Local: Email address of the person sending the email
  • Local Domain: Domain of the person who is sending the email
  • Originator: Inbound, outbound or internal (if you’re receiving the email, sending it or if the email is between 2 TrustSphere employees)
  • Direction: Always TrustSphere in this case
  • Domain Group: Email Header (Subject Line)
  • Subject: Type of message: email/im (instant messaging)/voice/sms
  • Inbound Count: Number of emails received
  • Outbound Count: Number of emails sent
  • Size: Size of the message (number of characters)
  • Msgid: Encoded Message ID
Data Statistics
  • Number of rows: 121,154
  • Date range: 11/26/2016 8:00 am to 02/01/2017 12:00 am


After Cleaning
For our analysis, we only need emails sent internally, that is, from one employee in the company to another. Our data also contained instant messaging records, which we will not be using either. We therefore filtered out rows with:

  • Originator: Inbound or Outbound (emails to or from external email addresses)
  • Subject: im (instant messaging)

We also found several system email addresses that could skew the data through mass emails, so we removed emails to and from these addresses as well. Some of these addresses, with the number of times each occurred in the data set, are listed below.

  • analytics@trustsphere.com (134)
  • heartbeat@trustsphere.com (1658)
  • jira@trustsphere.com (1386)
  • sfdc@trustsphere.com (899)
  • trustsphere.office@trustsphere.com (15)
  • tv.reports@trustsphere.com (1394)
  • marketing.team@trustsphere.com (322)
  • trustvault.selfservice@trustsphere.com (95)

Number of rows after removing:
  • im + inbound/outbound: 45,855
  • system email addresses: 29,797
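
The filtering steps above could be sketched in Python roughly as follows. This is only an illustration, not the team's actual script: the literal values "internal" and "im", the assumption that both Remote and Local hold email addresses, and the sample rows (alice, bob, carol) are all hypothetical. The system address set contains only the addresses listed above, which the page describes as a partial list.

```python
# Only the system addresses listed on this page; the real list may be longer.
SYSTEM_ADDRESSES = {
    "analytics@trustsphere.com",
    "heartbeat@trustsphere.com",
    "jira@trustsphere.com",
    "sfdc@trustsphere.com",
    "trustsphere.office@trustsphere.com",
    "tv.reports@trustsphere.com",
    "marketing.team@trustsphere.com",
    "trustvault.selfservice@trustsphere.com",
}


def keep_row(row):
    """Return True for internal, non-IM rows that involve no system address."""
    # Assumed literal value: internal rows are marked "internal" in Originator.
    if row["Originator"].lower() != "internal":
        return False
    # Assumed literal value: instant messages are marked "im" in Subject.
    if row["Subject"].lower() == "im":
        return False
    # Drop rows where either party is a system address.
    if row["Remote"].lower() in SYSTEM_ADDRESSES:
        return False
    if row["Local"].lower() in SYSTEM_ADDRESSES:
        return False
    return True


# Hypothetical sample rows using the column names from the table above.
rows = [
    {"Originator": "internal", "Subject": "em",
     "Remote": "alice@trustsphere.com", "Local": "bob@trustsphere.com"},
    {"Originator": "inbound", "Subject": "em",
     "Remote": "alice@trustsphere.com", "Local": "carol@example.com"},
    {"Originator": "internal", "Subject": "im",
     "Remote": "alice@trustsphere.com", "Local": "bob@trustsphere.com"},
    {"Originator": "internal", "Subject": "em",
     "Remote": "jira@trustsphere.com", "Local": "bob@trustsphere.com"},
]

cleaned = [row for row in rows if keep_row(row)]
print(len(cleaned))
```

Only the first sample row survives: the second is external, the third is an instant message, and the fourth involves a system address.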

Further, we removed columns that we do not need in our analysis. These include:

  • Remote IP: We don't need this because we are only using internal email exchange communication
  • Remote Domain: Since we only have internal data, this will always be TrustSphere's domain. Hence keeping it will be redundant.
  • Local Domain: Since we only have internal data, this will always be TrustSphere's domain. Hence keeping it will be redundant.
  • Direction: This will always be within TrustSphere
  • Inbound Count: At this stage, we are focusing on relationship analysis (between two employees) rather than individual analysis within a network. We may need this column in the second half of our project.
  • Outbound Count: At this stage, we are focusing on relationship analysis (between two employees) rather than individual analysis within a network. We may need this column in the second half of our project.
  • Subject: We are only using email data, and hence this will always have a value of 'em'. Hence keeping it will be redundant.
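
As a rough sketch, the column removal described above amounts to keeping every column that is not in the dropped set. The column names are taken from the table on this page; the helper name drop_columns and the sample row values are hypothetical.

```python
ALL_COLUMNS = [
    "Date", "Remote IP", "Remote", "Remote Domain", "Local", "Local Domain",
    "Originator", "Direction", "Domain Group", "Subject",
    "Inbound Count", "Outbound Count", "Size", "Msgid",
]

# Columns listed above as not needed at this stage of the analysis.
DROPPED_COLUMNS = {
    "Remote IP", "Remote Domain", "Local Domain", "Direction",
    "Inbound Count", "Outbound Count", "Subject",
}


def drop_columns(row):
    """Return a copy of the row without the dropped columns."""
    return {name: value for name, value in row.items()
            if name not in DROPPED_COLUMNS}


# Remaining columns, in their original order.
kept = [name for name in ALL_COLUMNS if name not in DROPPED_COLUMNS]
print(kept)

# Hypothetical row: every column filled with a placeholder value.
sample = {name: "x" for name in ALL_COLUMNS}
print(sorted(drop_columns(sample)))
```

Under these assumptions, seven of the original 14 columns remain: Date, Remote, Local, Originator, Domain Group, Size and Msgid.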


Employee Data

  • Name: Name of employee
  • Hierarchy: Designation of employee
  • Department: Department of employee
  • Location: Location where the employee is based

Department.png Hierarchy.jpg Location.png

  • Department: Marketing, Development and Sales have the most employees
  • Hierarchy: Associates are the largest group
  • Location: Singapore, being the headquarters, has the most employees


Employeemissing.png
Highlighted in Pink: Employee is present in the staff list but does not have any email interaction in the past 10 weeks. There are 8 such employees.
Highlighted in Yellow: Employee is present in the email interaction data for the past 10 weeks, but is not present in the staff list. There are 5 such employees.
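
The two mismatch groups above correspond to two set differences. The sketch below assumes that staff list entries and email participants have already been mapped to a common identifier, which the page does not describe; the function name compare_rosters and the sample identifiers are hypothetical.

```python
def compare_rosters(staff_ids, email_ids):
    """Return (staff with no email activity, email users not in staff list)."""
    # Normalise case and whitespace so the same person matches on both sides.
    staff = {s.strip().lower() for s in staff_ids}
    active = {e.strip().lower() for e in email_ids}
    no_email_activity = sorted(staff - active)   # the "pink" group
    not_in_staff_list = sorted(active - staff)   # the "yellow" group
    return no_email_activity, not_in_staff_list


# Hypothetical identifiers for illustration only.
staff_list = ["alice", "bob", "carol"]
email_users = ["Bob", "carol ", "dave"]

silent, unlisted = compare_rosters(staff_list, email_users)
print(silent)    # staff with no email interaction
print(unlisted)  # email users missing from the staff list
```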

Network Exploration
Node: Each employee
Node Color: Hierarchy
Node Size: Eigenvector Centrality

Conclusions
  • No weights for edges: purely based on quantity
  • Many Senior Management and Upper Management employees seem to have a low centrality score
  • Possibly a biased solution
  • Need for feature engineering to add weight that removes the bias
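
For readers unfamiliar with the node size measure, eigenvector centrality on an unweighted network can be approximated by power iteration, as in the sketch below. This is not the implementation used to produce the graph here; it assumes an undirected network in which each pair of employees is a single unweighted edge, and the node names are hypothetical.

```python
def eigenvector_centrality(edges, iterations=200, tol=1e-10):
    """Approximate eigenvector centrality for an undirected, unweighted graph."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    scores = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # Each node keeps its own score and adds its neighbours' scores.
        # Keeping the own score (iterating on A + I) gives the same ranking
        # and avoids oscillation on bipartite graphs such as a star.
        updated = {
            node: scores[node] + sum(scores[m] for m in neighbours[node])
            for node in neighbours
        }
        largest = max(updated.values())
        updated = {node: value / largest for node, value in updated.items()}
        change = max(abs(updated[n] - scores[n]) for n in updated)
        scores = updated
        if change < tol:
            break
    return scores


# Hypothetical star network: "a" exchanges email with "b", "c" and "d".
scores = eigenvector_centrality([("a", "b"), ("a", "c"), ("a", "d")])
print(max(scores, key=scores.get))
```

Because every edge counts equally, a node's score depends only on how many connections it has and how central those connections are, which is consistent with the concern above that an unweighted network may be biased.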


Gephi.png

Stage 3: Feature Engineering + Survey


Click on FEATURE ENGINEERING for more details.


Mode of data collection: Online survey
Target Sample: All employees in the company (across geographies)
Aim: To use the survey to validate whether an email exchange network is a good tool for calculating influence score. We define Influence Score as the extent to which an individual sways information flow in the workplace.
Summary: The purpose of the survey is to validate the use of an email exchange network for calculating influence score, where influence score is defined as the extent to which an individual sways information flow in the workplace.
In a work environment, as there can be different kinds of information flow, we divided the term influence into six main categories:

  • Social: defined as any interaction regarding the business with any colleague. This gives a high-level view of the kinds and volumes of interactions between employees.
    • How many times do you interact with the following colleagues regarding business topics, within a month?
  • Information sharing: defined as an interaction when job related resources or information is transferred between employees.
    • How many times do you receive job related information from the following colleagues within a month?
  • Problem solving: defined as an interaction where employees seek help in solving problems. These interactions will be dependent on the kind of work-related problems an employee regularly faces.
    • How many times do you seek help from the following colleagues for business/technical related problems within a week?
  • Decision making: defined as an interaction between two employees where one employee consults the other on a specific business related decision to make.
    • How many times do you consult the following colleagues if you have a work related decision to make, within a week?
  • Support: defined as an interaction wherein an employee provides career advice to another employee.
    • How many times do you discuss your career prospects and progression with the following colleagues in a year?
  • Idea generation: defined as an interaction between two employees that involves the discussion of novel ideas or approaches.
    • How many times do you discuss, share or brainstorm novel ideas with the following colleagues, in a quarter?
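
The six questions above use different reference periods (week, month, quarter and year). If the answers were to be compared across categories, one option would be to convert them to a common monthly rate first. The sketch below is only an assumption about how that could be done, not a step described on this page; the conversion factors (52 weeks and 12 months per year) and the function name monthly_rate are hypothetical.

```python
# Reference period used by each survey question, as worded above.
QUESTION_PERIOD = {
    "Social": "month",
    "Information sharing": "month",
    "Problem solving": "week",
    "Decision making": "week",
    "Support": "year",
    "Idea generation": "quarter",
}

# Assumed number of each period per month (52 weeks and 12 months per year).
PERIODS_PER_MONTH = {
    "week": 52 / 12,
    "month": 1.0,
    "quarter": 1 / 3,
    "year": 1 / 12,
}


def monthly_rate(category, count):
    """Convert a reported interaction count to interactions per month."""
    return count * PERIODS_PER_MONTH[QUESTION_PERIOD[category]]


# Hypothetical answers from one respondent about one colleague.
answers = {"Social": 8, "Problem solving": 3, "Support": 12}
rates = {category: monthly_rate(category, n) for category, n in answers.items()}
print(rates)
```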


You may view the survey here.