ISSS608 2017-18 T3 Assign Manu George Mathew Data Preparation
Latest revision as of 18:42, 8 July 2018
Data Preparation
Given Data Description
We are provided with data covering the past 2.5 years from across the company. There are call records, emails, purchases, and meetings. The data include only the source of each transaction, the recipient (destination), and the time of the transaction. The contents of emails and phone calls are not available. All of the provided data files have the same format.
The data are provided in comma-separated format with four columns:
- Source (contains the company ID# for the person who called, sent an email, purchased something, or invited people to a meeting)
- Etype (contains a number designating what kind of connection is made)
- 0 is for calls
- 1 is for emails
- 2 is for purchases
- 3 is for meetings
- Destination (contains company ID# for the person who is receiving a call, receiving an email, selling something to a buyer, or being invited to a meeting).
- Time stamp – in seconds starting on May 11, 2015 at 14:00.
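The column layout above maps onto a small recode-and-convert step. The sketch below is a minimal illustration in R, not the code used for the assignment: the sample rows and the `Etype_Label` and `DateTime` column names are made up, and the UTC time zone is an assumption, since the data description does not state one.

```r
library(data.table)
library(lubridate)

# Made-up sample rows in the four-column layout described above.
records <- data.table(
  Source      = c(101L, 102L, 103L),
  Etype       = c(0L, 1L, 3L),
  Destination = c(102L, 103L, 101L),
  TimeStamp   = c(0, 3600, 86400)
)

# Etype codes 0..3 map to the four transaction kinds; +1 because R indexes from 1.
etype_labels <- c("Call", "Email", "Purchase", "Meeting")
records[, Etype_Label := etype_labels[Etype + 1L]]

# Time stamps count seconds from May 11, 2015 at 14:00.
# The time zone is not given in the description; UTC is assumed here.
origin <- ymd_hm("2015-05-11 14:00", tz = "UTC")
records[, DateTime := origin + TimeStamp]

print(records)
```

Adding a plain number to a date-time in R shifts it by that many seconds, which is why the raw `TimeStamp` column can be added to the origin directly.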
There is a company index that shows the name of everyone in the company and their associated ID#. There are 642,631 individuals in the index.
There are four data files that cover the whole company:
- calls.csv has information on 10.6 million calls (251 MB uncompressed)
- emails.csv has information on 14.6 million emails (345 MB uncompressed)
- purchases.csv has information on 762 thousand purchases (18.8 MB uncompressed)
- meetings.csv has information on 127 thousand meetings (3.26 MB uncompressed)
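Files of this size are usually read with `fread()` from data.table. The sketch below is a hedged illustration only: it writes a tiny temporary file as a stand-in for calls.csv, and it assumes the files have no header row, which the data description does not confirm. The sample rows and the `all_records` name are made up.

```r
library(data.table)

# Stand-in for calls.csv: a temporary file with two made-up rows.
# Assumption: the real files carry no header row.
calls_path <- tempfile(fileext = ".csv")
writeLines(c("101,0,102,0", "102,0,103,3600"), calls_path)

col_names <- c("Source", "Etype", "Destination", "TimeStamp")

# fread() is data.table's fast reader, suited to files of several hundred MB.
calls <- fread(calls_path, header = FALSE, col.names = col_names)

# Further files in the same format can be stacked with rbindlist().
emails <- data.table(Source = 103L, Etype = 1L, Destination = 101L, TimeStamp = 7200L)
all_records <- rbindlist(list(calls, emails), use.names = TRUE)

print(all_records)
```

Because all four files share one format, the same `col_names` vector can be reused for each of them.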
There are four data files that contain information about individuals that the Insider has indicated as suspicious:
- Suspicious_calls.csv (1.76 KB uncompressed)
- Suspicious_emails.csv (1.55 KB uncompressed)
- Suspicious_purchases.csv (27 B uncompressed)
- Suspicious_meetings.csv (130 B uncompressed)
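Since the suspicious files share the format of the company-wide files, one way to flag suspicious transactions in the larger dataset is an update join on all four columns. This is a sketch under that assumption, with made-up rows and a hypothetical `Suspicious` column; it is not necessarily how the flag was built for the assignment.

```r
library(data.table)

key_cols <- c("Source", "Etype", "Destination", "TimeStamp")

# Made-up stand-ins for calls.csv and Suspicious_calls.csv.
all_calls <- data.table(
  Source      = c(101L, 102L, 103L),
  Etype       = c(0L, 0L, 0L),
  Destination = c(102L, 103L, 101L),
  TimeStamp   = c(0L, 3600L, 7200L)
)
suspicious_calls <- data.table(
  Source = 102L, Etype = 0L, Destination = 103L, TimeStamp = 3600L
)

# Default every record to not suspicious, then flip the flag on rows that
# also appear in the suspicious file (update join on all four columns).
all_calls[, Suspicious := FALSE]
all_calls[suspicious_calls, on = key_cols, Suspicious := TRUE]

print(all_calls)
```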
We are also provided with a list of 20 people whom the Insider finds suspicious:
Alex Hall, Lizbeth Jindra, Patrick Lane, Richard Fox, Sara Ballard, May Burton, Glen Grant, Dylan Ballard, Meryl Pastuch, Melita Scarpaci, Augusta Sharp, Kerstin Belveal, Rosalia Larroque, Lindsy Henion, Julie Tierno, Jose Ringwald, Ramiro Gault, Tobi Gatlin, Refugio Orrantia, and Jenice Savaria.
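To use this list with the transaction data, the names have to be translated into company IDs through the company index. The sketch below assumes an index with hypothetical `ID` and `Name` columns; the real column names, the IDs, and the non-listed name shown here are all made up for illustration.

```r
library(data.table)

# Made-up stand-in for the company index; the real column names may differ.
company_index <- data.table(
  ID   = c(1L, 2L, 3L),
  Name = c("Alex Hall", "Jane Doe", "Richard Fox")
)

# A few entries from the list above; the full list has 20 names.
suspicious_names <- c("Alex Hall", "Lizbeth Jindra", "Richard Fox")

# Flag index rows whose name is on the list and collect their IDs.
company_index[, Suspicious := Name %in% suspicious_names]
suspicious_ids <- company_index[Suspicious == TRUE, ID]

# Names with no match in the index are worth checking for spelling differences.
unmatched <- setdiff(suspicious_names, company_index$Name)

print(suspicious_ids)
print(unmatched)
```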
Steps followed for Data Preparation
The data preparation for the assignment was done entirely in R, apart from the dashboard-specific calculations done in Tableau. The following R packages were used.
- tidyverse - For Data manipulation and cleaning in smaller datasets.
- data.table - For cleaning and summarising larger datasets.
- lubridate - For manipulating dates.
Stage | Steps Followed |
---|---|
1. Importing the originally available data | |
2. Renaming and creating recoded columns to match the given data description | |
3. Convert the TimeStamp column into the right format | |
4. Mark out the suspicious transactions in the larger dataset | |
5. Prepare the data for working in Gephi | |
6. Mark out suspicious nodes in the nodes table and their association time with Kasios (needed for Question 3) | |
7. Label the Source and Target nodes in the edges table using the Employee Index table | Two new columns, Source_Label and Target_Label, were created to hold these labels. The data was then saved to a .csv file. |
8. (Question 2 and 3 specific) | |
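The Gephi-related stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the assignment's actual code: the index columns `ID` and `Name`, the sample rows, and the output file names are hypothetical. The `Id`/`Label` and `Source`/`Target` column names follow the convention Gephi's spreadsheet import recognises, and `Source_Label` and `Target_Label` are the label columns named in the table.

```r
library(data.table)

# Made-up stand-ins for the employee index and the transaction records.
company_index <- data.table(ID = c(101L, 102L, 103L),
                            Name = c("Person A", "Person B", "Person C"))
records <- data.table(Source = c(101L, 102L), Etype = c(0L, 1L),
                      Destination = c(102L, 103L), TimeStamp = c(0L, 3600L))

# Edges table: rename Destination to Target for Gephi.
edges <- copy(records)
setnames(edges, "Destination", "Target")

# Nodes table: every ID that appears at either end of an edge.
nodes <- data.table(Id = sort(unique(c(edges$Source, edges$Target))))
nodes[company_index, on = c(Id = "ID"), Label := i.Name]

# Label both ends of every edge from the index.
edges[company_index, on = c(Source = "ID"), Source_Label := i.Name]
edges[company_index, on = c(Target = "ID"), Target_Label := i.Name]

# Save both tables as CSV files (written to a temporary directory here).
out_dir <- tempdir()
fwrite(nodes, file.path(out_dir, "nodes.csv"))
fwrite(edges, file.path(out_dir, "edges.csv"))
```

The update joins add the label columns in place, which avoids copying the edges table when it holds millions of rows.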