Difference between revisions of "Assign Preparations"

From Visual Analytics and Applications
Jump to navigation Jump to search
(Created page with "<div style=background:#000000 border:#D1F2EB> 210px <font size = 6; color="#FFFFFF">VAST 2018 Mini Case 3: Net of Justice Has Long Arms </font> </...")
 
Line 35: Line 35:
  
 
__TOC__
 
__TOC__
== Background ==
+
== Data Manipulations ==
 +
 
 +
=== About the Dataset ===
 +
 
 +
The insider has provided us with 10 datasets consisted of 4 files (calls, emails, meetings, purchases) with daily connections records among every staff in Kasios from year 2015 to year 2017, 5 files of suspicious records relative to each connections channels, and 1 file that has revealed all the staff ID in Kasios. And there is also a letter from the insider, that has revealed another group of people that he (she) thinks to be involved which may need further investigations. All connections data are in the same format as decriptions below:
 +
 
 +
<p>[[File:YL DPre Col.JPG|500px|center]]</p>
 +
 
 +
It's made up by 4 columns with source, connection type that was designated by number, destination and time stamp. Data needs to be further formatted because there is no title, and timestamps are seconds counted from may 11, 2015 at 14:00. All connection types will union so that all interactions can be captured. All data manipulations were done in R.
 +
 
 +
=== Packages ===
 +
* Package '''anytime''' will cast strings into parsed date or datetime, so that we will be able add seconds to timestamp and get exact date time.
 +
* Package '''dplyr''' and '''tidyr''' will give us options for data manipulations like mutate, select and etc..
 +
 
 +
  library(tidyverse)
 +
  library(knitr)
 +
  library(readr)
 +
  library(anytime)  # For time formating
 +
  library(dplyr)  # For data manipulations
 +
  library(tidyr)  # For data manipulations
 +
 
 +
=== Import Data ===
 +
* Daily connection records
 +
  # ALL source files calls.csv, emails.csv, meetings.csv and purchases.csv have been loaded into the workspace
 +
 
 +
  # Union all sources and drop original column 'timestamp'
 +
  sources <- select(mutate(rbind(calls, emails, meetings, purchases),
 +
                          DT = anytime("2015-05-11 14:00:00") + Timestamp)
 +
                    ,c(1,2,3,5))
 +
 
 +
* Suspicious Connections Records
 +
Same formatting has also been done to suspicious connections lists. As a result, all connections from suspected staffs can be filtered out. This filtered dataset can be used as a breaking point for analysing suspected staffs. In the filtered connection dataset of suspected staffs, 945 records come from calls, 1,179 records come from emails.
 +
 
 +
  # Suspucious
 +
  sps <- rbind(read.csv("MC3/Suspicious_calls.csv",header = FALSE),
 +
              read.csv("MC3/Suspicious_emails.csv",header = FALSE),
 +
              read.csv("MC3/Suspicious_meetings.csv",header = FALSE),
 +
              read.csv("MC3/Suspicious_purchases.csv",header = FALSE))
 +
  colnames(sps) <- c("Source","Etype","Destination","Timestamp")
 +
 
 +
        # Format Datetime
 +
  sps <- mutate(sps, DT = anytime("2015-05-11 14:00:00") + Timestamp)
 +
  sps <- select(sps, c(1,2,3,5))
 +
 
 +
  # filter only suspicious staff from source file
 +
  sps_list <- unique(rbind(sps$Source,sps$Destination))
 +
  filter_sps <- rbind(subset(sources, Source %in% sps_list),
 +
                      subset(sources, Destination %in% sps_list))
 +
 
 +
  # Summarise suspicios connections
 +
  filter_sps %>%
 +
      group_by(as.factor(filter_sps$Etype)) %>%
 +
      summarise(count = n())

Revision as of 23:59, 8 July 2018

YL title bg.jpg VAST 2018 Mini Case 3: Net of Justice Has Long Arms


BACKGROUND

DATA PREPARATION

EXPLORATIONS

NETWORK ANALYSIS

CONCLUSIONS


Data Manipulations

About the Dataset

The insider has provided us with 10 datasets consisted of 4 files (calls, emails, meetings, purchases) with daily connections records among every staff in Kasios from year 2015 to year 2017, 5 files of suspicious records relative to each connections channels, and 1 file that has revealed all the staff ID in Kasios. And there is also a letter from the insider, that has revealed another group of people that he (she) thinks to be involved which may need further investigations. All connections data are in the same format as decriptions below:

YL DPre Col.JPG

It's made up by 4 columns with source, connection type that was designated by number, destination and time stamp. Data needs to be further formatted because there is no title, and timestamps are seconds counted from may 11, 2015 at 14:00. All connection types will union so that all interactions can be captured. All data manipulations were done in R.

Packages

  • Package anytime will cast strings into parsed date or datetime, so that we will be able add seconds to timestamp and get exact date time.
  • Package dplyr and tidyr will give us options for data manipulations like mutate, select and etc..
 library(tidyverse)
 library(knitr)
 library(readr)
 library(anytime)   # For time formating
 library(dplyr)  # For data manipulations
 library(tidyr)  # For data manipulations

Import Data

  • Daily connection records
 # ALL source files calls.csv, emails.csv, meetings.csv and purchases.csv have been loaded into the workspace
 # Union all sources and drop original column 'timestamp'
 sources <- select(mutate(rbind(calls, emails, meetings, purchases),
                          DT = anytime("2015-05-11 14:00:00") + Timestamp) 
                   ,c(1,2,3,5))
  • Suspicious Connections Records

Same formatting has also been done to suspicious connections lists. As a result, all connections from suspected staffs can be filtered out. This filtered dataset can be used as a breaking point for analysing suspected staffs. In the filtered connection dataset of suspected staffs, 945 records come from calls, 1,179 records come from emails.

 # Suspucious
 sps <- rbind(read.csv("MC3/Suspicious_calls.csv",header = FALSE),
              read.csv("MC3/Suspicious_emails.csv",header = FALSE),
              read.csv("MC3/Suspicious_meetings.csv",header = FALSE),
              read.csv("MC3/Suspicious_purchases.csv",header = FALSE))
 colnames(sps) <- c("Source","Etype","Destination","Timestamp")
       # Format Datetime
 sps <- mutate(sps, DT = anytime("2015-05-11 14:00:00") + Timestamp)
 sps <- select(sps, c(1,2,3,5))
 # filter only suspicious staff from source file
 sps_list <- unique(rbind(sps$Source,sps$Destination))
 filter_sps <- rbind(subset(sources, Source %in% sps_list),
                     subset(sources, Destination %in% sps_list))
 # Summarise suspicios connections
 filter_sps %>%
     group_by(as.factor(filter_sps$Etype)) %>%
     summarise(count = n())