ISSS608 2017-18 T3 Assign Vigneshwar Ramachandran Vadivel - Data Cleaning


Data Cleaning

Communication Data

The initial part of our analysis requires us to explore the communication patterns within the company. To analyse these patterns, we have to combine the four Kasios communication datasets into a single dataset that gives a holistic view. Using Tableau Prep, all four datasets can be merged and transformed into the format we need.
The timestamps in the data are given as Unix epoch values, with the data beginning on May 11, 2015 at 14:00. To transform these into readable timestamp values, we derive a calculated field in Tableau Prep.
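A rough equivalent of that conversion in Python/pandas, as a minimal sketch: the file name and the Time column are assumptions for illustration, and the values are assumed to be standard Unix epoch seconds.

import pandas as pd

# Assumed input: the merged communication data with an epoch-seconds
# column named Time (both names are assumptions for illustration).
calls = pd.read_csv("merged_communication.csv")

# Decode epoch seconds into datetimes; the earliest value should
# come out as 2015-05-11 14:00, matching the start of the data.
calls["Timestamp"] = pd.to_datetime(calls["Time"], unit="s")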

Network Data

Once we transform our data, we can explore the communication patterns and the growth of the company with this merged dataset.

For the later part of the analysis, the data has to be transformed so that it is readable by Gephi for basic network visualisations, using the suspicious-communication data supplied by the Kasios insider. The following flow shows the data processing involved, again using Tableau Prep:

[Figure: A1_MC3_Data_Flow.PNG - Tableau Prep flow for producing the Gephi edge and node files]

Aggregated Communication Data

To create the edges csv file:

  1. Concatenate the communication data for all the different transactions into one source of information.
  2. Change the column data type of "From" and "To" from continuous to nominal.
  3. Rename the columns "From" and "To" to "Source" and "Target".
  4. Create a column "Type" and fill every field with "Directed" - this specifies the direction of the communication.
  5. Create a new column with the formula "Source || Target", which concatenates the Source and Target columns.
  6. Use the Tabulate function to count the number of times each concatenated value appears in the column - this gives the weight of each unique direction of communication.
  7. Update the table with this count and rename the column to "Weight" (a pandas sketch of these steps follows the list).
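As a minimal sketch of the same edge-building steps in pandas, continuing from the conversion sketch above (file and column names remain assumptions):

# Steps 2-4: treat the ids as nominal values (strings), rename the
# columns, and mark every edge as directed.
edges = (calls.astype({"From": str, "To": str})
              .rename(columns={"From": "Source", "To": "Target"}))
edges["Type"] = "Directed"

# Steps 5-7: count how often each Source -> Target pair occurs; the
# count becomes the Weight of that directed edge.
edges = (edges.groupby(["Source", "Target", "Type"])
              .size()
              .reset_index(name="Weight"))
edges.to_csv("edges.csv", index=False)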
To create the nodes csv file:
  1. Use the Tabulate function in JMP to count the number of times each ID appears in both the "Source" and "Target" columns. Merge the files together and tabulate again to find the unique nodes present in the edges csv file.
  2. Name the column "id", so that Gephi can detect the ids of the nodes (see the sketch below).
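Continuing the pandas sketch, the node file reduces to collecting and de-duplicating the ids:

# Steps 1-2: every id seen as a Source or Target becomes a node;
# Gephi expects the column to be called "id".
ids = pd.concat([edges["Source"], edges["Target"]]).drop_duplicates()
ids.to_frame(name="id").to_csv("nodes.csv", index=False)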

Communication Data from 11am to 1.30pm on Sunday

To create the edges csv file:

  1. Carry out Steps 2-5 above for the communication data on Sunday.
  2. Retain only the communication data between 11am and 1.30pm (see the sketch after this list).
  3. To create the Timestamp column, use the timestamp format mm/dd/yyyy hh:mm.
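A rough pandas version of the Sunday subset, reusing the Timestamp column derived earlier:

# Step 1 (i.e. Steps 2-4 above): nominal ids, renamed columns,
# and a directed edge type.
sunday = (calls.astype({"From": str, "To": str})
               .rename(columns={"From": "Source", "To": "Target"}))
sunday["Type"] = "Directed"

# Step 2: keep only Sunday rows (dayofweek: Monday=0 ... Sunday=6)
# inside the 11:00-13:30 window.
sunday = sunday[sunday["Timestamp"].dt.dayofweek == 6]
window = (sunday.set_index("Timestamp")
                .between_time("11:00", "13:30")
                .reset_index())

# Step 3: write the Timestamp column in mm/dd/yyyy hh:mm format.
window["Timestamp"] = window["Timestamp"].dt.strftime("%m/%d/%Y %H:%M")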

To create the nodes csv file:

  1. Merge the company index with the combined file and find the unique nodes present in the edges csv file.
  2. Name the column "id", so that Gephi can detect the ids of the nodes.
  3. Add each employee's name as the node label (a sketch of this join follows the list).
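A minimal sketch of that label join, assuming a company index file with columns "id" and "name" mapping each id to an employee (both the file and its column names are assumptions):

# Collect the unique ids from the Sunday edges, then attach each
# employee's name as the Label column that Gephi displays.
index = pd.read_csv("company_index.csv", dtype={"id": str})
ids = (pd.concat([window["Source"], window["Target"]])
         .drop_duplicates()
         .to_frame(name="id"))
nodes = ids.merge(index, on="id", how="left").rename(columns={"name": "Label"})
nodes.to_csv("sunday_nodes.csv", index=False)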