ISSS608 2017-18 T3 Assign Manu George Mathew Data Preparation
Data Preparation
Given Data Description
We are provided with data covering the past 2.5 years from across the company: call records, emails, purchases, and meetings. Each record includes only the source of the transaction, the recipient (destination), and the time of the transaction. The contents of emails and phone calls are not available. All of the provided data files have the same format.
The data are provided in comma-separated format with four columns:
- Source (contains the company ID# for the person who called, sent an email, purchased something, or invited people to a meeting)
- Etype (contains a number designating what kind of connection is made):
  - 0 is for calls
  - 1 is for emails
  - 2 is for purchases
  - 3 is for meetings
- Destination (contains company ID# for the person who is receiving a call, receiving an email, selling something to a buyer, or being invited to a meeting).
- Time stamp – in seconds starting on May 11, 2015 at 14:00.
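The column layout above can be illustrated with a minimal R sketch that reads records in this format, recodes Etype into a readable label, and converts the time stamp into a date-time. The two sample rows are made up, and the absence of a header row, the column names, the `EtypeLabel` and `DateTime` columns, and the UTC time zone are all assumptions, since the data description does not state them.

```r
library(data.table)
library(lubridate)

# Two made-up rows in the four-column layout (assumption: no header row)
raw <- fread(
  text = "101,0,202,0\n303,1,404,86400",
  header = FALSE,
  col.names = c("Source", "Etype", "Destination", "TimeStamp")
)

# Recode Etype 0..3 into labels; R vectors are 1-based, hence the + 1L
etype_labels <- c("call", "email", "purchase", "meeting")
raw[, EtypeLabel := etype_labels[Etype + 1L]]

# TimeStamp counts seconds from May 11, 2015 at 14:00
# (assumption: UTC, the source does not name a time zone)
origin <- ymd_hms("2015-05-11 14:00:00", tz = "UTC")
raw[, DateTime := origin + TimeStamp]

print(raw)
```

Adding a number to a date-time in R advances it by that many seconds, so no further unit conversion is needed here.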
There is a company index that shows the name of everyone in the company and their associated ID#. There are 642,631 individuals in the index.
There are four data files that cover the whole company:
- calls.csv has information on 10.6 million calls (251 MB uncompressed)
- emails.csv has information on 14.6 million emails (345 MB uncompressed)
- purchases.csv has information on 762 thousand purchases (18.8 MB uncompressed)
- meetings.csv has information on 127 thousand meetings (3.26 MB uncompressed)
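Because the four files share one format, they can be stacked into a single table. The sketch below shows one possible way to do this with data.table. It writes tiny made-up stand-in files to a temporary folder so that it runs on its own. The folder, the sample rows, the column names, the `File` column, and the no-header assumption are illustrative and not taken from the actual data.

```r
library(data.table)

# Hypothetical stand-ins for the four company-wide files (made-up rows)
data_dir <- tempfile("kasios_")
dir.create(data_dir)
writeLines(c("1,0,2,10", "2,0,3,20"), file.path(data_dir, "calls.csv"))
writeLines("3,1,1,30", file.path(data_dir, "emails.csv"))
writeLines("1,2,3,40", file.path(data_dir, "purchases.csv"))
writeLines("2,3,1,50", file.path(data_dir, "meetings.csv"))

col_names <- c("Source", "Etype", "Destination", "TimeStamp")
files <- file.path(
  data_dir,
  c("calls.csv", "emails.csv", "purchases.csv", "meetings.csv")
)

# Read each file with the same column names (assumption: no header row)
tables <- lapply(files, fread, header = FALSE, col.names = col_names)
names(tables) <- basename(files)

# Stack the tables; idcol records which file each row came from
all_tx <- rbindlist(tables, idcol = "File")
print(all_tx)
```

For files of the sizes listed above, `fread` is generally much faster than base `read.csv`, which is one reason to prefer data.table for the larger datasets.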
There are four data files that contain information about individuals that the Insider has indicated as suspicious:
- Suspicious_calls.csv (1.76 KB uncompressed)
- Suspicious_emails.csv (1.55 KB uncompressed)
- Suspicious_purchases.csv (27 B uncompressed)
- Suspicious_meetings.csv (130 B uncompressed)
We are also provided with a list of 20 people whom the Insider finds suspicious:
Alex Hall, Lizbeth Jindra, Patrick Lane, Richard Fox, Sara Ballard, May Burton, Glen Grant, Dylan Ballard, Meryl Pastuch, Melita Scarpaci, Augusta Sharp, Kerstin Belveal, Rosalia Larroque, Lindsy Henion, Julie Tierno, Jose Ringwald, Ramiro Gault, Tobi Gatlin, Refugio Orrantia, and Jenice Savaria.
Steps followed for Data Preparation
The data preparation for the assignment was done entirely in R, except for the dashboard-specific calculations, which were done in Tableau. The following R packages were used:
- tidyverse - for data manipulation and cleaning of smaller datasets.
- data.table - for cleaning and summarising larger datasets.
- lubridate - for manipulating dates.
Stage | Steps Followed |
---|---|
1. Importing the originally available data | |
2. Renaming and creating recoded columns to match the given data description | |
3. Convert the TimeStamp column into the right format | |
4. Mark out the suspicious transactions in the larger dataset | |
5. Prepare the data for working in Gephi | |
6. Mark out suspicious nodes in the nodes table and their association time with Kasios (needed for Question 3) | |
7. Label the Source and Target nodes in the edges table using the Employee Index table | |
8. (Question 2 and 3 Specific) | |
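Stages 4 to 7 can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact code used for the assignment: the sample rows, the index column names `ID` and `Name`, the placeholder names, and the flag and label column names are all hypothetical. The association-time calculation from stage 6 is not covered.

```r
library(data.table)

# Made-up transactions, suspicious records, and company index
all_tx <- data.table(
  Source = c(1L, 2L, 3L), Etype = c(0L, 1L, 0L),
  Destination = c(2L, 3L, 1L), TimeStamp = c(10L, 20L, 30L)
)
suspicious <- data.table(Source = 2L, Etype = 1L, Destination = 3L, TimeStamp = 20L)
index <- data.table(ID = 1:3, Name = c("Person A", "Person B", "Person C"))

# Stage 4: flag rows of the larger table that also appear in the suspicious set
key_cols <- c("Source", "Etype", "Destination", "TimeStamp")
all_tx[, Suspicious := FALSE]
all_tx[suspicious, on = key_cols, Suspicious := TRUE]

# Stage 5: Gephi reads an edges table with Source/Target columns
# and a nodes table with Id/Label columns
edges <- all_tx[, .(Source, Target = Destination, Etype, TimeStamp, Suspicious)]
nodes <- index[, .(Id = ID, Label = Name)]

# Stage 6 (partial): mark nodes that take part in any suspicious record
suspicious_ids <- unique(c(suspicious$Source, suspicious$Destination))
nodes[, Suspicious := Id %in% suspicious_ids]

# Stage 7: attach names to both ends of each edge via the index
edges[nodes, on = .(Source = Id), SourceLabel := i.Label]
edges[nodes, on = .(Target = Id), TargetLabel := i.Label]

print(edges)
print(nodes)
```

The update joins (`x[i, on = ..., col := value]`) modify the larger table in place, which avoids copying tables with millions of rows.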