ISSS608 2017-18 T3 Assign Yu Zhecheng

From Visual Analytics and Applications
Revision as of 01:24, 8 July 2018 by Zcyu.2017

VAST Challenge 2018: Secrets of Kasios



Mini Challenge 3 Background

The Kasios Insider has provided data from across the company: call records, emails, purchases, and meetings. Each record includes only the source of the transaction, the recipient (destination), and the time of the transaction. We do not know any details about the content.

About 640000 employees work for Kasios. The list includes those who left the company between 2015 and 2017. Although we do not know exactly what was communicated, the participants and frequency of the calls, emails, purchases, and meetings reveal the development and structure of Kasios. Most importantly, the suspicious people and activities must have left records in the data sets.


Data Preparation

Four Variables

 Source: contains the company ID of the person who called, sent an email, purchased something, or invited people to a meeting.
 E-type: contains a number designating what kind of connection is made.
           a. 0 is for calls
           b. 1 is for emails
           c. 2 is for purchases
           d. 3 is for meetings
 Destination: contains the company ID of the person who is receiving a call, receiving an email, selling something to a buyer,
                 or being invited to a meeting.
 Time stamp: in seconds starting on May 11, 2015 at 14:00.
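
As a minimal sketch of how these four variables can be decoded, the snippet below maps the E-type codes to labels and converts the time stamp offsets to dates using base R only. The toy data frame `records`, its column names, and its values are assumptions made for illustration. The UTC time zone is also an assumption, since the variable description does not state one.

```r
# Toy records in the four-variable layout (values are made up for illustration)
records <- data.frame(
  source    = c(101, 102, 103),
  etype     = c(0, 1, 3),
  target    = c(201, 202, 203),
  timestamp = c(0, 3600, 86400)
)

# Map the numeric E-type codes to readable labels
records$etype_label <- factor(records$etype,
                              levels = 0:3,
                              labels = c("call", "email", "purchase", "meeting"))

# Offsets are seconds since 2015-05-11 14:00 (UTC assumed here)
origin <- as.POSIXct("2015-05-11 14:00:00", tz = "UTC")
records$datetime <- origin + records$timestamp
records$date <- as.Date(records$datetime, tz = "UTC")

print(records)
```

Keeping the label as a factor with all four levels means that a type missing from a subset still shows up with a count of zero in later tables.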

Nine original data sets


Four data sets cover the whole company.
Four further data sets contain the records of the individuals that the Insider has indicated as suspicious.
The last is a company index that lists the name of everyone in the company and their associated ID.

[Figure: Yzc csv.png]

Clean all data sets.

  • To transform the time stamp into a date, I use the R package lubridate. The code is shown below:
 library(lubridate)
 # Time stamps are second offsets from 2015-05-11 14:00
 x<-ymd_h("2015-5-11 14")
 emails$timestamp<-x+dseconds(emails$timestamp)
  • Because I only care about the date, I drop the time of day.
 emails$date<-as_date(emails$timestamp)
  • To get an overview of the development of the whole company, I calculate how the number of emails, calls, meetings, and purchases changes over time.
 # Count the emails sent on each date
 company_emails<-as.data.frame(table(emails$date))
  • I then combine the counts from all four data sets into a new data set that describes the development of the company.
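
The combining step is described without code. One possible way to do it, sketched here in base R, is to count the records per day for each type and merge the four count tables on the date. The helper `count_by_day`, the toy inputs, and the result column names `emails`, `calls`, `meetings`, and `purchases` are hypothetical and only serve the illustration.

```r
# Hypothetical helper: number of records per day, with a named count column
count_by_day <- function(df, name) {
  counts <- as.data.frame(table(date = df$date), stringsAsFactors = FALSE)
  names(counts)[2] <- name
  counts$date <- as.Date(counts$date)
  counts
}

# Toy inputs: each row is one record, already reduced to its date
d <- as.Date("2015-05-11")
emails    <- data.frame(date = d + c(0, 0, 1))
calls     <- data.frame(date = d + c(0, 1, 1))
meetings  <- data.frame(date = d + 1)
purchases <- data.frame(date = d + 0)

daily <- list(
  count_by_day(emails, "emails"),
  count_by_day(calls, "calls"),
  count_by_day(meetings, "meetings"),
  count_by_day(purchases, "purchases")
)

# Full outer join on date so that days missing from one table are kept
company <- Reduce(function(a, b) merge(a, b, by = "date", all = TRUE), daily)
company[is.na(company)] <- 0   # a missing day means zero records of that type
print(company)
```

The outer join (`all = TRUE`) matters here: an inner join would silently drop any day on which one of the four activity types has no records.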

[Figure: Yzc company.png]

  • Because the counts of the four activity types differ greatly in magnitude, it is difficult to compare them on the same scale. To understand the development of the whole company, I scale all the counts.
  • To make the data set easier to analyze, I reshape it from wide to long format.
 library(tidyr)
 # One row per date and attribute instead of one column per attribute
 company<-gather(company,attribute,value,-date)
 companyScale<-gather(companyScale,attribute,value,-date)
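
The scaling step that produces `companyScale`, which has to happen before the reshape, is not shown. The text does not say which scaling was used, so the sketch below assumes z-score standardization with base R's `scale()`. The toy counts and the column names are made up for illustration.

```r
# Toy daily counts (made-up numbers) in the wide layout, one column per type
company <- data.frame(
  date      = as.Date("2015-05-11") + 0:3,
  emails    = c(1200, 1500, 900, 1400),
  calls     = c(300, 280, 350, 310),
  meetings  = c(12, 15, 9, 14),
  purchases = c(40, 42, 38, 44)
)

# Standardize every count column to mean 0 and standard deviation 1
count_cols <- setdiff(names(company), "date")
companyScale <- company
companyScale[count_cols] <- lapply(company[count_cols],
                                   function(x) as.numeric(scale(x)))

print(companyScale)
```

After this step each column is expressed in standard deviations from its own mean, so the four series can share one axis even though emails outnumber meetings by two orders of magnitude in the toy data.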

[Figure: Yzc companyScale.png]

  • To find every ID recorded in the four new data sets, I write a loop.
      # Stack the four suspicious data sets and list every ID they mention
      sus<-rbind(sc,se,sm,sp)
      sus_list<-data.frame(list=c(sus$source,sus$target))
      sus_list<-unique(sus_list)
      # Stack the four company-wide data sets and start from an empty edge table
      whole<-rbind(calls,emails,purchases,meetings)
      edge<-whole[0,]
      # Collect every record that involves a suspicious ID
      for (i in sus_list$list) {
         temp<-subset(whole,whole$source==i | whole$target==i)
         edge<-rbind(edge,temp)}
      # Look up the name and ID of everyone who appears in those records
      edge_list<-data.frame(list=c(edge$source,edge$target))
      edge_list<-unique(edge_list)
      node<-subset(CompanyIndex,CompanyIndex$ID %in% edge_list$list)
      node<-unique(node)
      sus_list<-subset(CompanyIndex,CompanyIndex$ID %in% sus_list$list)
      sus_list<-unique(sus_list)
      write.csv(node,"node.csv",row.names = FALSE)
      write.csv(edge,"edge.csv",row.names = FALSE)
      write.csv(sus,"sus.csv",row.names = FALSE)
      write.csv(sus_list,"sus_list.csv",row.names = FALSE)

I now have four files that describe the network.
edge.csv contains every record (calls, meetings, emails, and purchases) in which a suspicious person takes part.
node.csv lists the name and ID of everyone who appears in edge.csv.
sus.csv contains every record from the data sets that the Insider indicated as suspicious.
sus_list.csv lists the name and ID of everyone who appears in sus.csv.
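
The record selection behind edge.csv can also be written as a single vectorized filter instead of a loop. The sketch below uses toy data with made-up IDs. `whole` mirrors the stacked table from the code above, while `sus_ids` and the column name `etype` are assumed names for illustration. Because `%in%` tests each row once, a record between two suspicious IDs is selected only once.

```r
# Toy stand-in for the stacked company-wide table (IDs are made up)
whole <- data.frame(
  source = c(1, 2, 3, 4, 5, 1),
  etype  = c(0, 1, 2, 3, 0, 1),
  target = c(2, 3, 4, 5, 1, 3)
)
sus_ids <- c(1, 3)   # IDs flagged as suspicious in this toy example

# Keep every record whose source or target is a suspicious ID
edge <- whole[whole$source %in% sus_ids | whole$target %in% sus_ids, ]

# Every ID that appears in the selected records
edge_ids <- unique(c(edge$source, edge$target))

print(edge)
print(sort(edge_ids))
```

On a table with millions of rows this avoids growing a data frame with `rbind` inside a loop, which copies the accumulated result on every iteration.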

[Figure: YzcEdge.png]