Difference between revisions of "YSR Project Overview"

From Analytics Practicum

Revision as of 23:43, 28 February 2016




Scope

Access to TrustSphere’s datasets will allow the team to build a system from scratch using previously unused raw data to better understand turnover and attrition rules.

The minimum research points we would like to address:

  • Understand the number of relationships an employee will have at different periods of time in his or her working life
  • Measure the rate at which employee relationships grow in a company
  • Identify correlations between the sizes of the internal and external relationship networks employees have
  • Through social network analysis, calculate the likelihood of an employee in an informal group leaving a company upon the exit of another closely tied employee
  • Identification of metrics that can help predict the likelihood of an employee leaving

It is important to note that the scope of this project is fluid and can be extended to address additional questions TrustSphere might have regarding the dataset.

Motivation and Objectives

In addition to our initial scope of research, we expanded our scope to encompass the following:

a) New employee immersion: Onboarding, also known as organizational socialization, refers to the mechanism through which new employees acquire the necessary knowledge, skills, and behaviors to become effective organizational members and insiders. Effective onboarding allows an employee to integrate into a company better, thereby turning them into an asset faster. A key metric for understanding organizational immersion is the number of internal relationships the employee has made at different points in time. By benchmarking the average speed of internal relationship growth, a company can assess the effectiveness of its onboarding programs for particular employees.

b) Influencer identification for adoption of new enterprise-level initiatives: With the launch of new enterprise-level initiatives, one of the key concerns that arises is employee-level adoption of these strategic changes [3]. Identifying key influencers in an organization and enrolling them as champions for enterprise-level change is one way to increase adoption. Through SNA, these nodes of activity can be discovered.

c) Contextual employee performance levels: The number-one predictive element of an individual's success in an organization is the number, the quality, and the depth of social capital, that is, the personal relationships among those that they do business with. By creating metrics (through insights gained from SNA), a geographic and department-level system average can be created to identify employees who are underperforming or overperforming.

d) Levels of collaboration: Organisations that encourage employee productivity through collaboration across networks, rather than simple individual task completion, will need to actively monitor collaboration silos in the organization.


Data

The dataset given to us was pulled from the Outlook (mail server used at TrustSphere) database and records exchanges of emails. It consists of 890,000 email exchanges and 13 variables, of which we found the following 8 to be relevant:

  • Date: includes the date and time of a particular email being sent
  • Originator: identifies the originator of an email thread
  • Direction: indicates the direction of the email sent
  • Domain Group: identifies the company to which an email address belongs
  • Inbound Count: number of emails being received from a particular address
  • Outbound Count: number of emails being sent to a particular address
  • Size: the size of email in bytes
  • MsgID: unique identifier of a particular email thread

Review of Previous Work

Methodology

Over the course of the last few weeks, our client expressed interest in generating actionable insights as opposed to creating dashboards. His rationale was that interactive dashboards have not been the best tool for driving action in the realm of people analytics. Hence, our methodology has shifted: we have started focusing on in-depth Social Network Analysis to identify insights that could help the client develop his product further.

We conducted secondary research to learn more about terms and processes involved in social network analysis and other advancements in the field. Through our research, we identified certain core measures, such as:

  • Degree

Degree is the simplest of the node centrality measures, as it uses only the local structure around a node. In a binary network, the degree is the number of ties a node has. In a directed network, a node may have a different number of outgoing and incoming ties, so degree is split into outdegree and indegree, respectively.
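
As a minimal sketch of the definition above (not the Alteryx workflow itself), outdegree and indegree can be counted directly from a directed edge list. The function name `degree_counts` and the sample ties are hypothetical; repeated emails between the same pair are collapsed into a single tie, matching the binary-network case.

```python
from collections import defaultdict

def degree_counts(edges):
    """Return (outdegree, indegree) dicts for a directed edge list.

    `edges` is an iterable of (sender, receiver) pairs. Duplicate pairs
    are collapsed so that each tie is counted once (binary network).
    """
    outdegree = defaultdict(int)
    indegree = defaultdict(int)
    for sender, receiver in set(edges):
        outdegree[sender] += 1
        indegree[receiver] += 1
        # Touch the opposite dicts so every node appears in both,
        # with a count of 0 if it has no ties in that direction.
        outdegree[receiver] += 0
        indegree[sender] += 0
    return dict(outdegree), dict(indegree)

# Hypothetical ties: A emails B (twice) and C, B emails A.
ties = [("A", "B"), ("A", "C"), ("B", "A"), ("A", "B")]
out_deg, in_deg = degree_counts(ties)
```

Here `out_deg` counts distinct recipients per sender and `in_deg` counts distinct senders per recipient.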

  • Betweenness centrality

Betweenness centrality is an indicator of a node's centrality in a network. It is equal to the number of shortest paths between all pairs of other nodes that pass through that node. A node with high betweenness has great influence over what flows, and what does not, in the network.
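
The sketch below illustrates one common way to compute this measure on an unweighted directed network, using Brandes' accumulation scheme. It is an illustration under stated assumptions, not the routine Alteryx runs: the function name `betweenness` and the sample network are hypothetical, no normalisation is applied, and when a pair of nodes has several shortest paths each path contributes a fraction rather than a full count (the standard formulation).

```python
from collections import deque

def betweenness(adj):
    """Unnormalised betweenness for an unweighted directed graph.

    `adj` maps each node to the set of nodes it links to.
    """
    nodes = set(adj) | {w for targets in adj.values() for w in targets}
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and recording each node's predecessors on those paths.
        order = []
        preds = {v: [] for v in nodes}
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating how much each
        # node is depended on by shortest paths that start at s.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Hypothetical chain A <-> B <-> C: every A-to-C path goes through B.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
scores = betweenness(chain)
```

In this chain only B lies between other nodes, so it is the only node with a non-zero score; it is counted once for A to C and once for C to A because the network is directed.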

  • Closeness

Closeness is defined as the inverse of farness, which in turn, is the sum of distances to all other nodes (Freeman, 1978). The intent behind this measure was to identify the nodes which could reach others quickly. A main limitation of closeness is the lack of applicability to networks with disconnected components: two nodes that belong to different components do not have a finite distance between them. Thus, closeness is generally restricted to nodes within the largest component of a network.
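
A minimal sketch of this definition, assuming an unweighted network stored as a mapping from each node to its neighbours: distances come from a breadth-first search, and the function returns `None` when some node is unreachable, which mirrors the limitation described above. The name `closeness` and the sample networks are hypothetical.

```python
from collections import deque

def closeness(adj, node):
    """Inverse of farness for `node`, or None if any node is unreachable."""
    # Breadth-first search gives shortest-path distances in hops.
    dist = {node: 0}
    queue = deque([node])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    all_nodes = set(adj) | {w for targets in adj.values() for w in targets}
    # Farness is undefined if some node has no finite distance from `node`,
    # and meaningless for a network with fewer than two nodes.
    if len(dist) < len(all_nodes) or len(all_nodes) < 2:
        return None
    return 1.0 / sum(dist.values())

# Hypothetical chain A <-> B <-> C, then the same chain plus an isolated D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
with_isolate = dict(chain, D=set())

middle = closeness(chain, "B")          # farness 1 + 1 = 2
end = closeness(chain, "A")             # farness 1 + 2 = 3
broken = closeness(with_isolate, "B")   # D is unreachable
```

Restricting the input to the largest component before calling the function is one way to avoid the `None` case.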

We have been able to generate these values for all the subjects in our network using Alteryx. Our next step will involve running further analyses in JMP Pro and Tableau to find actionable insights for the client. Our final step will be to display our results using Gephi, to visualise the social network and highlight our findings.

Data Preparation for Social Network Analysis

Following from our methodology, the next logical step is to manipulate the data so that we can visualize and analyze the social networks that exist in these email exchanges.

A quick search online showed us that most of the social network analysis tools available require the data to be in one of these two forms:

  • An adjacency matrix: A square matrix in which the row headers are Sender Names and the column headers are Receiver Names; each cell holds the number of emails sent from the row's sender to the column's receiver. A snapshot of our adjacency matrix can be seen in Figure x. For example, Alistair in row 3 has sent a total of 648 emails to Adesh.

[Figure x: snapshot of the adjacency matrix (YSR adjacency.png)]

  • Nodes and Edges lists: Some tools required us to use nodes and edges.
      - A Nodes list is one that defines all the nodes (Employee Names in our case) in the network.
      - An Edges list is one that defines the connections (emails sent from and to, in our case) in the network.
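
To make the two forms concrete, the sketch below derives a nodes list, a weighted edges list, and an adjacency matrix from raw (sender, receiver) pairs. It is an illustration only: the function name `build_network` and the sample records are hypothetical, and it assumes one pair per email, which may not match how the actual dataset is laid out.

```python
from collections import Counter

def build_network(emails):
    """Return (nodes, edges, matrix) from (sender, receiver) pairs.

    nodes  : sorted list of every name that sent or received an email
    edges  : list of (sender, receiver, email_count) tuples
    matrix : square list of lists, rows = senders, columns = receivers
    """
    weights = Counter(emails)  # (sender, receiver) -> number of emails
    nodes = sorted({name for pair in weights for name in pair})
    edges = [(s, r, n) for (s, r), n in sorted(weights.items())]
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for s, r, n in edges:
        matrix[index[s]][index[r]] = n
    return nodes, edges, matrix

# Hypothetical records: A emails B twice, B emails A once and C once.
emails = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "C")]
nodes, edges, matrix = build_network(emails)
```

Reading `matrix` row by row gives the emails each sender sent to each receiver, in the same orientation as the adjacency matrix described above, while `nodes` and `edges` correspond to the two tables that tools such as Alteryx and Gephi expect.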

To analyze social networks, we considered using the following tools:

  • R (library: igraph): A collection of routines for simple graphs and network analysis. It can handle large graphs very well and provides functions for generating random and regular graphs, graph visualization, centrality methods and much more. We considered using this because our analysis requirements fit well with what igraph has to offer.
  • Alteryx (Predictive Analysis using R): The network analysis tool provides a way to visually interact with all kinds of data. It requires the Nodes and Edges tables discussed above and performs the analysis through a workflow UI. Alteryx also gives access to key network data such as betweenness, degree and closeness. This data can then be used in software such as JMP Pro and Tableau to run regressions that predict how employees behave, or will behave, based on their network statistics.
  • Gephi: We will use Gephi towards the end of our project to visualize our social networks. Like Alteryx, this tool requires nodes and edges tables.

References

http://www.mckinsey.com/insights/organization/power_to_the_new_people_analyticsa

https://en.wikipedia.org/wiki/People_analytics

https://www.crunchbase.com/organization/trustsphere#/entity