ISSS608 2017-18 T3 Assign Manu George Mathew Data Preparation

From Visual Analytics and Applications
Jump to navigation Jump to search

MGM Title Band.jpg VAST Challenge 2018:Suspense at the Wildlife Preserve

Background

Data Preparation

Question 1

Question 2

Question 3

Conclusion

Back to Dropbox

 


Data Preparation

Given Data Description

We are provided data OF PAST 2.5 YEARS from across the company. There are call records, emails, purchases, and meetings. The data only includes the source of each transaction, the recipient (destination), and the time of the transaction. Contents of emails or phone calls are not available. All of the provided data files have the same format.

The data are provided in comma-separated format with four columns:

  1. Source (contains the company ID# for the person who called, sent an email, purchased something, or invited people to a meeting)
  2. Etype (contains a number designating what kind of connection is made)
    1. 0 is for calls
    2. 1 is for emails
    3. 2 is for purchases
    4. 3 is for meetings
  3. Destination (contains company ID# for the person who is receiving a call, receiving an email, selling something to a buyer, or being invited to a meeting).
  4. Time stamp – in seconds starting on May 11, 2015 at 14:00.

There is a company index that shows the name of everyone in the company and their associated ID#. There are 642,631 individuals in the index.

There are four data files that cover the whole company:

  • calls.csv has information on 10.6 million calls (251 MB uncompressed)
  • emails.csv has information on 14.6 million emails (345 MB uncompressed)
  • purchases.csv has information on 762 thousand purchases (18.8 MB uncompressed)
  • meetings.csv has information on 127 thousand meetings (3.26 MB uncompressed)

There are four data files that contain information about individuals that the Insider has indicated as suspicious:

  • Suspicious_calls.csv (1.76 KB uncompressed)
  • Suspicious_emails.csv (1.55 KB uncompressed)
  • Suspicious_purchases.csv (27 B uncompressed)
  • Suspicious_meetings.csv (130 B uncompressed)

We are also provided with a list of 20 people that the insider finds suspicious.

Alex Hall, Lizbeth Jindra, Patrick Lane, Richard Fox, Sara Ballard, May Burton, Glen Grant, Dylan Ballard, Meryl Pastuch, Melita Scarpaci, Augusta Sharp, Kerstin Belveal, Rosalia Larroque, Lindsy Henion, Julie Tierno, Jose Ringwald, Ramiro Gault, Tobi Gatlin, Refugio Orrantia, and Jenice Savaria.

Steps followed for Data Preparation

The data preparation for the assignment was done entirely in r if not for the dashboard specific calculations done in Tableau. The following packages in R was used.

  • tidyverse - For Data manipulation and cleaning in smaller datasets.
  • data.table - For cleaning and summarising larger datasets.
  • lubridate - For manipulating dates.
Stage Steps Followed
1.Importing the originally available data
  • The datasets were read in using fread function in data table.
  • The 4 large data sets for calls, meetings, emails and purchases were concatenated to get a large dataset of 26,046,071 rows
  • The same was done for the suspicious datasets to come up with a smaller table with 137 records.
  • Here onwards, any operation done on the larger dataset is done conccurently on the suspicious dataset unless mentioned otherwise.
  • Also, any additional transformations or mutations to the data are done such that a seperate column is created so that, the integrity of the original data is not lost.
2.Renaming and Creating Recoded columns to match the given data description.
  • The column names in the imported data were different from the provided description. So they were renamed accordingly.
  • The Etype column was coded in integer format. A new column Label was created that had the descriptions of the codes as given in the data description.
3.Convert the TimeStamp column into the right format
  • We were given information that the time stamp is an integer counted in seconds from May 2015 14:00. This was converted using R to the right format and assigned to a new column adjusted_time
4.Mark out the suspicious transactions into the larger dataset
  • A new column Is_Suspicious_Edge was created in the larger dataset to keep track of the suspicious transactions.
  • After cross checking, we find out that the suspicious purchase is not marked in the bigger data as suspicious
  • Checking for the reason, it was found that the time Stamp was not matching in the record. the record was manually marked as suspicious.
5.Prepare the Data for working in Gephi


The big dataset here fits the description of a edge table in Gephi and the Employee Index data fits the description of the node table.

  • In the Edge data, The column "Destination" was renamed to "Target" as that is the name Gephi recognizes.
  • The Employee Index table was taken and a Label column was created by concatenating the "First" and "Last" Column.
6.Mark out Suspicious Nodes in the nodes table and their association time with Kasios(Needed for Question 3).
  • A list of suspicious people was extracted from the suspicious data by taking unique values of Employee ID# in the "Source" and "Target" column.
  • A start_date and end_date column was created for these employees by summarising the edge table, looking for transactions where these parties are involved either as the "Source" or the "Target".
  • Convert the above 2 columns into string format dates, so that gephi reads them in without error for creating intervals.
7.Label the Source and Target Nodes in the edges table using the Employee Index Table.
  • 2 new columns, named Source_Label and Target_Label were created to hold these labels.
  • Save the Data onto a .CSV File
8.(Question 2 and 3 Specific).
  • Filter out only the suspicious nodes (Extended Suspicious Nodes in Question 3) and their related edges in the Nodes and Edges table respectively.
  • Convert the adjusted time stamp column into a string so that gephi reads in the date intervals properly.
  • Save the Data.