Difference between revisions of "YSR Final Progress"

From Analytics Practicum

Revision as of 22:04, 17 April 2016





Introduction

TrustSphere is the widely recognized market leader in Relationship Analytics. TrustSphere enables forward-thinking organizations to unlock the inherent value of their own networks using its proprietary next-generation technology. TrustSphere’s unique selling point lies in its ability to calculate relationship strengths between two people based simply on their communication flows. The solutions provide real-time intelligence and insights which help clients across the globe improve salesforce effectiveness, enterprise-wide sales insights and corporate governance. Its offerings are currently limited to Sales and Risk. TrustSphere is now developing new technology that delivers insights into an organization's most valuable asset, its people. To set the ball rolling, TrustSphere engaged our team, YSR, to help deliver preliminary insights on its own employees. With our insights and the methodologies used to derive them, TrustSphere can move forward in the construction of its own product offering.

The complexity of this project lay in two areas. First, the insights we were looking to deliver had to be matched with contextual social science in order for them to be relevant and significant. Second, we had severe limitations in the availability of data: we were provided with only communication metadata.

Motivation: The Importance of People Analytics

Understanding and managing the core of an organization, its employees, has become imperative to business decision-makers across the world. In a global survey conducted by PwC, 34% of CEOs reported being “extremely concerned” about the management of talent. Workforce diversity has stalled, high turnover trends are continuing, labor productivity has fallen since last year and high-potential promotion numbers have dropped.

Key decision makers have begun to increase investment into HR technologies to help combat these worrying trends in the workplace. HR costs per employee have seen an increase from $1,610 per employee in 2011 to $2,015 in 2015.

Our Approach

Our understanding of employee structures and the data provided helped define our approach to this project. The relationships and networks created and maintained by employees are a rich source of data which can be used to support better HR decision-making. Until now, this data has been invisible to the stakeholders who could benefit most.

By performing Social Network Analysis (SNA) and other exploratory data analysis, we seek to support decision-making in TrustSphere’s Human Resources department through insights generated from employee communications, relationships and networks.


Literature Review

As People Analytics is a relatively new space, few organisations are delving into understanding relationships through data-generated insights. Additionally, those firms are limited in terms of the three V’s (Volume, Variety and Velocity) of the data and the insights it offers. A report published by PwC on trends in People Analytics states that People Analytics is advancing, with the emergence of a refined data governance approach, the building of comparison data into analytical tools and the use of predictive data to take action. What is lacking is integrated and comprehensive data collection methods in organisations to provide for a robust analysis. Few organisations are acting upon their findings to close the gaps by addressing root causes, dealing with employees on an individual level, integrating the results into existing technologies and updating the data frequently. In the pursuit of flatter organisational structures that promote efficiency and flexibility, informal employee networks have grown relative to formal structures. Hence, the authors conclude that there is a greater need for employee social network analysis, which reveals the level of engagement of key employees who can affect the morale and productivity of the environment, either positively or negatively.

An article published by McKinsey urges organisations to tap into the power of influencers to better harness their positive traits and increase the odds of success for the organisation. An informal influencer can be broadly defined as someone who is respected and approached for advice or information. Influencers in a work environment could prove useful for implementing changes within the organisation, from planning and direction through to execution.

Further research revealed that, until recently, it had been hard to test whether the performance of organisations could depend on information flow between employees. Sources such as e-mail logs, web-based calendars and online work collaborations make it possible to track information flows within firms and assess the performance of teams, departments and the organisation as a whole. However, the question remained as to what could be done with the limited data that we were able to obtain from our client. Duncan Watts and his team had a more comprehensive data set, with employee surveys to supplement their e-mail data and organisation charts. It is important to note that e-mail data allows for more timely feedback than employee surveys. While our data was lacking in terms of employee feedback, we set out to define our research parameters based on what we had access to, with the takeaway that “computational social science works well only when powerful computation is matched with careful social science”. There are a number of key centrality measures that can be gathered from Social Network Analysis data to help us better understand interactions. The following are a few of them:

  • Degree

Degree is the simplest of the node centrality measures, as it uses only the local structure around a node. In a binary network, the degree is the number of ties a node has. In a directed network, a node may have a different number of outgoing and incoming ties, so degree is split into outdegree and indegree, respectively.

  • Betweenness centrality

Betweenness centrality is an indicator of a node's centrality in a network. It is equal to the number of shortest paths from all vertices to all others that pass through that node. A node with high betweenness has great influence over what flows -- and does not -- in the network.

  • Closeness

Closeness is defined as the inverse of farness, which in turn, is the sum of distances to all other nodes (Freeman, 1978). The intent behind this measure was to identify the nodes which could reach others quickly. A main limitation of closeness is the lack of applicability to networks with disconnected components: two nodes that belong to different components do not have a finite distance between them. Thus, closeness is generally restricted to nodes within the largest component of a network.

  • Eigenvector centrality

A measure of how connected an individual is to those who are highly connected.
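To make the first three definitions concrete, here is a minimal pure-Python sketch on a hypothetical four-person chain. The names and the toy graph are assumptions for illustration, not TrustSphere data. Betweenness is computed with Brandes' algorithm using the standard fractional definition (when several shortest paths tie, each contributes a share), and closeness assumes a connected, undirected graph. Eigenvector centrality is left out to keep the sketch short.

```python
from collections import deque

# Hypothetical undirected network: Ana - Ben - Cal - Dee
graph = {
    "Ana": ["Ben"],
    "Ben": ["Ana", "Cal"],
    "Cal": ["Ben", "Dee"],
    "Dee": ["Cal"],
}

def degree(adj):
    """Number of ties each node has."""
    return {node: len(nbrs) for node, nbrs in adj.items()}

def hop_distances(adj, source):
    """Breadth-first search: hop distance from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def closeness(adj):
    """Inverse of farness (sum of distances); assumes a connected graph."""
    result = {}
    for node in adj:
        farness = sum(hop_distances(adj, node).values())
        result[node] = 1 / farness if farness else 0.0
    return result

def betweenness(adj):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalised)."""
    score = {node: 0.0 for node in adj}
    for s in adj:
        order = []                       # nodes in order of discovery
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):        # accumulate from the farthest nodes back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints, so halve the totals.
    return {v: total / 2 for v, total in score.items()}
```

In this chain, the two middle nodes sit on every path between the ends, so they carry all of the betweenness while the end nodes carry none.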

Methodology - Iteration 1

Initial Data Cube and Data cleaning

The initial dataset given to us was pulled from the TrustSphere Outlook server. It contained 890,000 email exchanges and 13 variables, namely:

  1. Date: Timestamp of when it was sent
  2. IP Address: IP Address used
  3. Local: Server used for external communications
  4. Remote: Receiver email address
  5. Local Domain: Sender email address
  6. Remote Domain: Nature - Internal or External
  7. Originator: Too many missing values to make any sense of the column
  8. Direction: Too many missing values to make any sense of the column
  9. Domain Group: Too many missing values to make any sense of the column
  10. Inbound Count: Too many missing values to make any sense of the column
  11. Outbound Count: Too many missing values to make any sense of the column
  12. Size: Size of the email, but irrelevant to analysis
  13. Msg ID: Too many missing values to make any sense of the column

The data included both internal and external communications, denoted by the variable “Remote Domain”. We chose to move forward with only internal interactions because getting context for external data would be extremely difficult, so external communications were removed from our scope. We then dropped variables 6 to 13, most of which had too many missing values to be usable. We also deleted IP Address and Local because they would not add any value to the analysis.
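The cleaning steps above could be sketched roughly as follows. This is an illustration only: the function name `clean_email_log` is hypothetical, and the assumption that internal rows carry the literal value "Internal" in the Remote Domain column is ours, not something stated in the dataset description.

```python
# Columns kept after cleaning; everything else (IP Address, Local,
# and variables 6 to 13) is discarded.
KEEP = ("Date", "Remote", "Local Domain")

def clean_email_log(rows, internal_label="Internal"):
    """Keep internal emails only and strip each row down to the KEEP columns.

    rows: iterable of dicts, one per email exchange.
    internal_label: assumed marker for internal mail in "Remote Domain".
    """
    cleaned = []
    for row in rows:
        if row.get("Remote Domain") != internal_label:
            continue  # external communication: out of scope
        cleaned.append({col: row[col] for col in KEEP})
    return cleaned
```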

Additional Data

We then asked our client for a staff list to populate sender and receiver positions and departments, and ended up with the following variables:

  1. Date: Timestamp of when it was sent
  2. Remote (Renamed to Receiver): Receiver email address
  3. Local Domain (Renamed to Sender): Sender email address
  4. Sender FName: first name extracted from email address
  5. Receiver FName: first name extracted from email address
  6. Sender Position
  7. Receiver Position
  8. Sender Group
  9. Receiver Group
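One way the staff list could be joined onto the cleaned log is sketched below. The function names, the staff-list layout (a dict keyed by email address with "Position" and "Group" fields) and the rule for extracting a first name (the text before the first "." or "@") are all assumptions made for illustration; the source does not describe how the join or the name extraction was actually done.

```python
def first_name(email):
    """Assumed rule: the first name is the part before the first '.' or '@'."""
    local_part = email.split("@", 1)[0]
    return local_part.split(".", 1)[0].capitalize()

def enrich(rows, staff):
    """Rename the address columns and attach position and group from the staff list.

    rows: cleaned log rows with "Date", "Remote" and "Local Domain".
    staff: hypothetical mapping of email -> {"Position": ..., "Group": ...}.
    """
    enriched = []
    for row in rows:
        sender, receiver = row["Local Domain"], row["Remote"]
        sender_info = staff.get(sender, {})
        receiver_info = staff.get(receiver, {})
        enriched.append({
            "Date": row["Date"],
            "Receiver": receiver,
            "Sender": sender,
            "Sender FName": first_name(sender),
            "Receiver FName": first_name(receiver),
            "Sender Position": sender_info.get("Position"),
            "Receiver Position": receiver_info.get("Position"),
            "Sender Group": sender_info.get("Group"),
            "Receiver Group": receiver_info.get("Group"),
        })
    return enriched
```

Employees missing from the staff list end up with empty position and group fields, which makes gaps easy to spot before the network analysis.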

Social Network Analysis

To analyze social networks, we considered using the following tools:

  • R (library: igraph): Used to create routines for simple graphs and network analysis. It can handle large graphs very well and provides functions for generating random and regular graphs, graph visualization, centrality methods and much more. We considered using this because our analysis requirements fit well into what igraph has to offer.
  • Alteryx (Predictive Analysis using R): The network analysis tool provides a way to visually interact with all kinds of data. It requires the above discussed Nodes and Edges tables to do the analysis using a workflow UI. Alteryx also gives you access to key network data such as betweenness, degree and closeness. This data can further be used on software such as JMP Pro and Tableau to run regressions to predict how employees behave or will behave based on their network statistics.
  • Gephi: We used Gephi towards the end of our project to visualize our social networks. This tool too requires nodes and edges tables similar to Alteryx.

Use Of Alteryx

Alteryx workflow.png

We used the Alteryx Network Analysis tool to generate our preliminary Social Network Graph and to output key network statistics measures like Betweenness, Degree Centrality and Closeness Centrality. Alteryx helped us extract key network statistics that were used in further analysis. The data required was in the form of two separate nodes and edges lists.

Alteryx SNA.png

We found that the graph produced was insightful, but the real value of the tool to us was the centrality measures.

We then used these network statistics to compute what we called our Influencer Score. Betweenness and Degree were each rescaled using the following formula:

\[
\text{Scaled Value} = \frac{\text{new max} - \text{new min}}{\text{old max} - \text{old min}} \times (\text{value} - \text{old max}) + \text{new max}
\]

These scaled values (on scales of 0 - 100) were then summed up to calculate our Influencer Score on a total scale of 200.
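The scaling and summation can be sketched in a few lines of Python. This sketch reads the formula as a linear min-max rescale, (new max - new min) / (old max - old min) x (value - old max) + new max, which maps the smallest observed value to 0 and the largest to 100. The function names and the handling of the case where all values are identical are assumptions.

```python
def rescale(value, old_min, old_max, new_min=0.0, new_max=100.0):
    """Linearly map value from [old_min, old_max] onto [new_min, new_max]."""
    if old_max == old_min:
        return new_max  # assumption: identical values all map to the top of the scale
    return (new_max - new_min) / (old_max - old_min) * (value - old_max) + new_max

def influencer_scores(betweenness, degree):
    """Sum of rescaled betweenness and rescaled degree, on a total scale of 200.

    betweenness, degree: dicts mapping employee name -> raw centrality value.
    """
    b_min, b_max = min(betweenness.values()), max(betweenness.values())
    d_min, d_max = min(degree.values()), max(degree.values())
    return {
        name: rescale(betweenness[name], b_min, b_max)
        + rescale(degree[name], d_min, d_max)
        for name in betweenness
    }
```

An employee with both the highest betweenness and the highest degree scores 200, and one with the lowest of both scores 0.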

Review of Previous Work

Our preliminary course of action had us importing the cleaned communication log data file into Tableau. The following describes the areas covered in our initial analysis:

a) Understanding the number of active employees in the organization on a monthly basis
We defined an active employee as one attending work during that particular month. It is important to note that an employee on leave for that month would be counted as a non-active employee. We assumed that an active employee would send at least 1 email per month. The insights gained from these graphs will allow managers to understand how to increase efficiency (for example, optimal planning of work based on trends in workforce participation across different months of the year).
This was done on a monthly time axis from May 2014 to December 2015. The term "active" refers specifically to email activity, as that is what is relevant to finding social ties under our hypothesis that a higher number of emails signifies a stronger network in an organization.
As the chart below shows, the monthly count of active employees increases steadily from around 40 to as many as 45.

Number of active employees by month.png
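The monthly count of active employees was produced in Tableau; an equivalent calculation can be sketched in Python as below. It applies the definition above (at least one email sent in the month) and assumes, for illustration, that each row has a "Sender" field and a "Date" string in ISO form beginning with YYYY-MM.

```python
from collections import defaultdict

def active_employees_by_month(rows):
    """Count distinct senders per month; a sender with >= 1 email counts as active.

    rows: iterable of dicts with "Date" (assumed ISO, e.g. "2014-05-01 09:00:00")
    and "Sender".
    """
    senders_by_month = defaultdict(set)
    for row in rows:
        month = row["Date"][:7]  # "YYYY-MM" prefix of the ISO date string
        senders_by_month[month].add(row["Sender"])
    return {month: len(senders)
            for month, senders in sorted(senders_by_month.items())}
```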



b) Mapping the absolute number of emails sent per person per month through the use of a Heatmap
The absolute volume of emails sent and received by an employee per month is one of many indicators of the workload of an employee. For illustration purposes, we used a heatmap to describe workload per month for each employee. This information can help:

  • Managers allocate work across employees in a single department better
  • Managers understand how responsive employees are to collaboration: an employee who receives far more email than he or she sends could be assumed not to collaborate well. However, these insights only flag potential issues for a manager; getting to the root cause requires more information.


Emails received nd sent by employee per month.png
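The numbers behind such a heatmap are simple per-employee, per-month counts. A minimal sketch follows, assuming rows with "Date" (ISO string), "Sender" and "Receiver" fields; the function names and the simple "receives more than sends" flag are illustrative assumptions rather than the thresholds used in the Tableau workbook.

```python
from collections import Counter

def email_volume(rows):
    """Return (sent, received) counters keyed by (employee, "YYYY-MM")."""
    sent, received = Counter(), Counter()
    for row in rows:
        month = row["Date"][:7]  # assumes ISO date strings
        sent[(row["Sender"], month)] += 1
        received[(row["Receiver"], month)] += 1
    return sent, received

def receives_more_than_sends(sent, received):
    """Flag (employee, month) pairs where received volume exceeds sent volume."""
    return sorted(key for key in received if received[key] > sent[key])
```

As noted above, a flag of this kind only points a manager at a potential issue; it says nothing about the cause.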


c) Measuring growth or decline of the size of an employee’s network

The number of relationships an employee has at different points of time can provide an array of insights for a manager. We have assumed that a single communication flow (either forward or backward) will constitute a relationship in an organization. For the purpose of this project we focus on two insights:

  • For a new employee, the number of relationships at different points in time is a suitable estimate of his or her immersion into the organization. Slow growth or a decline in a new employee's network could indicate a failure to integrate well into the company.
  • Number of relationships can be used as a base for comparison of employees on the same hierarchical level. For example, Sahil (a marketing executive with 50+ relationships) can be said to be doing better compared to Ananya (a marketing executive with 30+ relationships).


Growth and absolute relationships.png
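Network size over time can be tracked by accumulating each employee's distinct contacts as emails arrive in chronological order. The sketch below uses the relationship definition above (one email in either direction creates a tie). The field names and the ISO date format are assumptions, and months in which an employee has no email activity are simply not recorded.

```python
from collections import defaultdict

def network_size_by_month(rows):
    """Cumulative number of distinct contacts per employee, recorded per month.

    rows: iterable of dicts with "Date" (assumed ISO string), "Sender", "Receiver".
    Returns {employee: {"YYYY-MM": cumulative contact count}}.
    """
    contacts = defaultdict(set)
    sizes = defaultdict(dict)
    # ISO date strings sort chronologically when sorted as text.
    for row in sorted(rows, key=lambda r: r["Date"]):
        month = row["Date"][:7]
        sender, receiver = row["Sender"], row["Receiver"]
        if sender == receiver:
            continue  # an email to oneself is not a relationship
        contacts[sender].add(receiver)
        contacts[receiver].add(sender)
        sizes[sender][month] = len(contacts[sender])
        sizes[receiver][month] = len(contacts[receiver])
    return {employee: dict(by_month) for employee, by_month in sizes.items()}
```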


d) Understanding all-organization network performers
Through an initial look at our client organization’s network centrality measures for the overall organization, we drew the following insights:

  • Betweenness, a measure of influence in the organization, indicated that the client company had three highly influential individuals. These three individuals were C-Suite level employees. According to research, the “perceived power distance” between executives at the top of the organization and employees at the front lines is one reason behind poor adoption of corporate-level initiatives.
  • Degree, an overall measure of the number of relationships employees have, indicated that a majority of the employees are well immersed into the organization.


Screen Shot 2016-02-28 at 10.12.12 pm.png


Methodology

Over the course of the last few weeks, our client expressed interest in generating actionable insights as opposed to creating dashboards. His rationale lay in the fact that interactive dashboards have not been the best tool for creating action in the realm of people analytics. Hence, our methodology has taken a shift as we have started focusing on conducting in-depth Social Network Analysis to identify insights that could potentially help the client in developing his product further.

We conducted secondary research to learn more about the terms and processes involved in social network analysis and other advancements in the field. Through our research, we identified certain core measures, such as the degree, betweenness and closeness centrality measures described in the Literature Review.


We have been able to generate these values for all the subjects in our network using Alteryx. Our next step will involve running further analyses in JMP Pro and Tableau to find actionable insights for the client. Our final step will be to display our results using Gephi, visualising the social network and highlighting our findings.

Data Preparation for Social Network Analysis

Following from our methodology, the next logical step is to try and manipulate the data to visualize and analyze the social networks that exist through these email exchanges.

A quick search online showed us that most of the social network analysis tools available require your data to be in either of these forms:

  • An adjacency matrix: A square matrix in which receiver names form the column headers and sender names form the row headers (the first value of each row). A snapshot of our adjacency matrix can be seen below. For example, Alistair in row 3 has sent a total of 648 emails to Adesh.

YSR adjacency.png

  • Nodes and edges lists: Some tools required us to supply separate nodes and edges lists.

- A nodes list defines all the nodes (employee names in our case) in the network.
- An edges list defines the connections (emails sent from one employee to another in our case) in the network.
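Both forms can be derived from the same email log. The sketch below is a hypothetical illustration, assuming rows with "Sender" and "Receiver" fields: it builds a weighted edges list, a nodes list and a sender-by-receiver adjacency matrix. The function names are ours and do not correspond to any tool's API.

```python
from collections import Counter

def build_edges(rows):
    """Edges list: (sender, receiver) -> number of emails sent in that direction."""
    return Counter((row["Sender"], row["Receiver"]) for row in rows)

def build_nodes(edges):
    """Nodes list: every employee who appears as a sender or a receiver."""
    return sorted({name for pair in edges for name in pair})

def adjacency_matrix(nodes, edges):
    """Square matrix: rows are senders, columns are receivers, cells are email counts."""
    return [[edges.get((sender, receiver), 0) for receiver in nodes]
            for sender in nodes]
```

The edges and nodes lists are the shape of input described for Alteryx and Gephi, while the matrix is the alternative form; keeping one builder for each makes it easy to switch between tools.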

To analyze social networks, we considered using the following tools:

  • R (library: igraph): Used to create routines for simple graphs and network analysis. It can handle large graphs very well and provides functions for generating random and regular graphs, graph visualization, centrality methods and much more. We considered using this because our analysis requirements fit well into what igraph has to offer.
  • Alteryx (Predictive Analysis using R): The network analysis tool provides a way to visually interact with all kinds of data. It requires the above discussed Nodes and Edges tables to do the analysis using a workflow UI. Alteryx also gives you access to key network data such as betweenness, degree and closeness. This data can further be used on software such as JMP Pro and Tableau to run regressions to predict how employees behave or will behave based on their network statistics.
  • Gephi: We will use Gephi towards the end of our project to visualize our social networks. This tool too requires nodes and edges tables similar to Alteryx.


Looking Ahead

In the coming weeks, we plan to immerse ourselves in deeper analysis of our client's social network. We plan to add a further layer of context by comparing group (geographical, hierarchical and departmental) metrics to understand performance levels across these groups.

Our current analysis has been limited to communication flows within the organization. This limited view is in line with our client’s requirements of us. However, in the coming weeks we plan to include external communication flows in our analysis. It is important to note that the analysis we plan to run on external networks will differ from the analysis run previously. Coupled with research, we hope to pin down some of the key metrics that are flagged when an employee leaves an organization.

References

http://www.mckinsey.com/insights/organization/power_to_the_new_people_analyticsa

https://en.wikipedia.org/wiki/People_analytics

https://www.crunchbase.com/organization/trustsphere#/entity

https://en.wikipedia.org/wiki/Betweenness_centrality

http://toreopsahl.com/tnet/weighted-networks/node-centrality/