ISSS608 2017-18 T3 Assign Vishal Badrinarayanan Data Preparation

TABLE NAME	SIZE
CALLS	10.6 million calls
EMAILS	14.6 million Emails
PURCHASES	762000 Purchase
MEETINGS	127000 Meetings
OTHER SUSPICIOUS PURCHASES	7 Purchase
SUSPICIOUS CALLS	71 Calls
SUSPICIOUS EMAILS	61 Emails
SUSPICIOUS MEETINGS	5 Meetings
SUSPICIOUS PURCHASES	1 Purchase
COMPANY INDEX	642,631 Employee Records

Dataset Description:

The datasets Calls, Emails ,Purchases ,Meetings , Other Suspicious Purchases , Suspicious Calls , Suspicious Emails , Suspicious Meetings , Suspicious Purchases have the same data format.

Column Name	Description
Source	Source for the communication
E-Type	Type of communication . They are divided into 4 categories: 1. Calls designated as 0 2. E-mails designated as 1 3. Purchases designated as 2 4. Meetings designated as 3
Target	Target of the Communication
Timestamp	Time of communication

The company Index dataset has details about the employees:

Column Name	Description
ID	ID of the employee
First Name	First Name of the Employee
Last Name	Last Name of the Employee

Data Preparation Step	Description
Combine the email , call , meeting and purchase dataset	The Data Preparation has been done in Tableau and R.Initially we are provided with 10 Datasets. Our aim is to get all the dataset in a similar format so that it can be used for multiple functions and visualizations. Since we cannot manually combine all the datasets using any commercial on the shelf software like jmp , Tableau due to time constraint we use R to merge all the dataset. To merge datasets we need to ensure they have the same column names. So the first step is to load the data and provide the column names as "Source","Etype","Target","Timestamp". The datasets calls , emails,meetings ,purchases are merged together as total_data.The combined size is 26046478 rows and 4 variables.
Combine Suspicious calls , emails , meeting , purchase dataset	The datasets Suspicious calls,Suspicious Emails,Suspicious Purchases,Other Suspicious purchases are merged together. But before that the column names for these tables need to be assigned and then merged by rows using rbind. The combined size is 137 observations and 4 variables.
Filter Total Dataset using aggregated suspicious list.	Since we have an aggregated suspicious data table we extract the nodes from source and target. After getting a list we filter the total_data using this suspicious list of nodes where we filter any row whose source or target will have a value from the list. We get 2035 observations and 5 variables. total[total$Source %in% susp_list \| total$Target %in% susp_list,]
Adjust Time Format	Next the time format of the timestamp needs to be altered to be used in GEPHI. WE convert it from '%d-%m-%Y %H:%M:%S' to "%Y/%m/%d %H:%M:%S" for all the above derived datasets.
Company Index Data Preparation	Based on the 20 people list we got as suspects we need to create a field called Suspect_List with values ‘Yes’ or ‘No’ which are matched with the ID’s in the company index data. Also we need to filter the company index table based on the combined suspicious list.

Tableau Data Preparation Step

Data Preparation	Description
Calculate Field – Time of the Day	Which groups timestamp to 4 categories: 1. Early Morning 2. Morning 3. Evening 4. Night
Calculated Field – E_Type Value	Gives E_Type values as: 1. 0 – Calls 2. 1 – Emails 3. 2 – Purchases 4. 4 - Meetings
Calculated Field – Max Date	Calculates Max value of Date by Source
Calculate Field – Min Date	Calculates Min value of Date by Source
Calculate Field - Tenure	Calculates tenure which is Max Date – Min Date
Calculated Filed - Emp_Status	Calculates if employee is ‘Active’ or ‘Terminated’ based on Timestamp value and Max Date.
Calculated Field- Source Target Combination	Concatenate Source and Target

Gephi Data Preparation Step

Data Preparation	Description	Images
Node List	We need to use the filtered Company Index table. Also we need to get the max and min date values from the data table we used for tableau which can be used as ‘Start Date’ and ‘End Date’ for an observation.
Edge List	We can use the Filtered Total Dataset(used in tableau data prepartion) based on the suspicious list for our edge table as they have a ‘Source’ and ‘Target’ field which gephi takes as input for an edge. Also we would have ‘Start Date’ and ‘End Date’ based on the source . The ‘Start Date’ and ‘End Date’ field can be used for Dynamic Network graph in gephi.
Import Dataset with appropriate Format selection.	There has to be columns named ‘Source’ and ‘Target’. When importing the date has to be set in String format in Gephi.
Aggregate Selection	For Dynamic graph we should not have aggregated values for the observations. So we need to select the ‘Don’t Merge’. But for Static network graphs we can merge the rows.
Create Time Interval for Dynamic Network Graph.	After importing Intervals have to be created which can be used by gephi to recognize the time range for which a value will be present in graph and how it can change dynamically. For time interval we need to merge Start Date and End Date columns in the merge column option and choose appropriate time range to create the correct interval values. This has to be done for both Node and Edge list dataset.
Enable TimeLine	Once Interval values are created , the timeline option will be enabled which can set time range and gives a flexible frame to be used as slider. With this the changes in network graph with time can be observed once slider is moved.