ISSS608 2017-18 T3 Assign Minami Yusuke
Author
Yusuke MINAMI
https://www.linkedin.com/in/yusukeminami
Background
This page is my submission for ISSS608 Visual Analytics
and Applications course assignment based on VAST Challenge 2018
Mini-Challenge 3.
The datasets were downloaded from:
VAST Challenge 2018 Mini-Challenge 3
http://vacommunity.org/VAST+Challenge+2018+MC3
Tasks
1. Elucidate changes of the company over time
2. Locate the suspicious group, elucidate the employees closely associated with the suspicious group, and highlight the employees who are making suspicious purchases
3. Elucidate the interactions within the suspicious group over time
4. Elucidate the Potentially illicit purchases
Data Preparation
Data was prepared by Python in Jupyter notebook environment taking advantage of
Pandas library.
1.
Read the following 4 CSV files including connections and
concatenate.
calls.csv
emails.csv
meetings.csv
purchases.csv
Convert the seconds since May 11, 2015 at 14:00 to Timestamp data type.
Calculate the hour-level timestamp for summarization later.
Convert the Etype code (0-3) to call, email, meeting, or purchase.
Change the long format to wide format with summarizing the counts for each connection type (call, email, meeting, or purchase) in hour-level time slots.
Save as CSV file "wide_df.to_csv('VAST_Summary_Wide.csv".
2. & 3.
Read the following 4 CSV files including the suspicious
connections and concatenate.
Suspicious_calls.csv
Suspicious _emails.csv
Suspicious _meetings.csv
Suspicious _purchases.csv
Extract the suspicious employee list.
Add Source Suspicious flag (whether the source node is in the suspicious list), Target Suspicious flag (whether the target node is in the suspicious list), and Employee Suspicious flag (whether either the source or target is in the suspicious list).
Extract connections with Employee Suspicious flag = True.
Add the employee name by joining "CompanyIndex.csv" on employee IDs.
Save as "Emp_Suspicious.csv".
Change the long format to wide format with summarizing the counts for each connection type (call, email, meeting, or purchase) in hour-level time slots.
Save as "Emp_Suspicious_Wide.csv".
Calculate the month-level timestamp.
Save as "Emp_Suspicious_Wide_YM.csv"
Extract the unique suspicious employee IDs.
Add the employee name by joining "CompanyIndex.csv" on employee IDs.
Save as "Suspicious_Nodes.csv".
4.
Read "Other_suspicious_purchases.csv"
Extract the list of employees who made potentially illicit purchases.
Add Source Potentially Illicit flag to the larger dataset, and extract only the Source Potentially Illicit flag = True.
Add Potentially Illicit flag.
Save as “Potentially_Illicit.csv”.
Design of Visualization
Tableau Desktop Professional Edition 2018.1.0 and Gephi 0.9.2 were used to
visualize the datasets.
1.
Counts for each of the 4 connection types (calls, emails, meetings, and purchases) in each month were visualized. Colors were used to differentiate the weekday, hour, and day to explore any patterns.
2. & 3.
In Graph Theory, Adjacency Matrix is used to describe the existence of edges
between nodes. Weighted Adjacency Matrix (or simply Adjacency Matrix in a broad
sense) is used to indicate the weight of each edges. The employees were treated
as nodes, the connections between them as edges, and the counted numbers as
weight of edges. Since the source nodes and target nodes are distinguished, the
given dataset can be treated as a directed graph. Source nodes are the
employees who called, sent an email, purchased something, or invited people to
a meeting. Each of the 4 connection types (calls, emails, meetings, and purchases)
that form directed graphs can be described by Adjacency Matrix. It is also
possible to include 4 weight values for each edge type in each cell of an
Adjacency Matrix of a multigraph, but multiple numeric values in each cell
sacrifices the interpretability.
In this assignment, a new
visualization methodology “Treemap Adjacency Matrix” is proposed. A Treemap
Adjacency Matrix indicates multiple weights in each cell by treemap utilizing
the rectangle area for weight and color for edge type. Treemap Adjacency
Matrices were generated for source nodes listed as suspicious and target nodes
listed as suspicious according to the anonymous insider. For source nodes
listed as suspicious, the edges representing connections (calls, emails,
meetings, and purchases) initiated by the 20 suspicious source nodes were
extracted. Both source node and target nodes were sorted in the descending
order of connection counts. Since there are too many target nodes not listed as
suspicious, only the target nodes with connections more than a threshold value
7. Treemap Adjacency Matrices were generated likewise for targets nodes listed
as suspicious with the threshold value 9.
To characterize the structure, communication within the group, and
change over time, Gephi was used. For layout, Fruchterman Reingold was used to
avoid overlapping of nodes. For nodes, only the 20 suspicious employees were
highlighted by red color and the labels were added. As the labels, the ID
numbers instead of employee names were used as names occupy more space that
causes overlapping each other. For edges, calls were highlighted by blue color
(the lighter blue, the more calls). The other connections were colored gray.
4.
The other connections on the days the 4 employees made
potentially illicit purchases were visualized.
Insights
Numbers of calls, emails, and purchases had been almost constant. Number of meetings had been increasing rapidly. The distributions in weekdays (Sunday - Saturday), hours (0 - 23), and days (1-31) were almost equally distributed over the whole period, and no change was observed. The pattern indicates that the datasets were artificially generated.
The 20 suspicious employees are listed in the columns. Employee ID 2040565 (Gail Feindt), ID 981554 (Sherrell Biebel), ID 786361 (Sheilah Stachniw), ID 944354 (Ferne Hards), and ID 175354 (Madeline Nindorf) received connections (calls, emails, meetings, and/or purchases) from the 20 suspicious employees and appear to be closely associated with the suspicions employees. Employee ID 2040565 (Gail Feindt) received the dominant number of purchase transactions; 15 by Employee ID 857138 (Richard Fox), 4 by Employee ID 1690582 (Meryl Pastuch), 4 by Employee ID 969089 (Tobi Gatin), and 1 by ID 623027 (Lizbeth Jinbra).
The 20 suspicious employees are listed in the columns. Employee ID 175354
(Madeline Nindorf), ID 786361 (Sheilah Stachniew), ID 981554 (Sherrel Biebel)
made connections to the 20 suspicious employees and appear to be closely
associated with the suspicious employees. The phone calls between ID 1847246
(Rosalia Larroque) and ID 728286 (Kerstin Belveal) are notable; 8 from ID
1847246 and 7 from ID 728286.
The connections with suspicious employees involved on 20th September
2017, on which suspicious purchase was recorded, were plotted. Suspicious phone
calls between ID 1847246 (Rosalia Larroque) and ID 728286 (Kerstin Belveal) and
purchase by ID 1847246 (Rosalia Larroque) were recorded at the same time
(3:38:53AM). Later in the afternoon on the same day, ID 1847246 (Rosalia
Larroque) sent an email to ID 728286 (Kerstin Belveal), and ID 728286 (Kerstin
Belveal) called , ID 1847246 (Rosalia Larroque), and then ID 1847246 (Rosalia
Larroque) made second purchase, which was recorded as suspicious.
The fiture above describes the counts of connections (calls, Emails, meetings,
andn purchases) involved with suspicious employees (as source and/or target)
including the ones listed as suspicious. There were 2 notable months; November
2015 and September 2017.
In Novmeber 2015 and September 2017,
comparatively more suspicous connections, especially calls were recorded. In
addition, there were compartively more meetings in November 2015. September
2017 was the month the suspicious purhchase was recorded as described earlier.
Whole period (2015-2017)
The figure above shows the connections between employees during the whole
period (May 2015 - Dec 2017). The ID of the 20 suspicious employees are
labelled. As discussed earlier, ID 1847246 (Rosalia Larroque) who made a
suspicious purchase and ID 728286 (Kerstin Belveal) made phone calls to each
other frequently thus connected
by light blue edge. The other suspicious employees are mostly connected with
each other as well by blue edges which means fewer phone calls. ID 857138 (Richard Fox) is connected with the most nodes
(has the largest degree), thus appears to be relatively higher position such as
a manager. It is assumed that most of the suspicious employees, including ID
1847246 (Rosalia Larroque) and ID 728286 (Kerstin Belveal), report to ID
857138 (Richard Fox).
The figures that show connections in each
month follow.
Oct 2015
Nov 2015
Dec 2015
August 2017
Sep 2017
Oct 2017
There were numerous connections between the suspicious employees in November
2015, followed by September 2017, while not so much in the other months.
The figure above shows the connections made by the 4 employees who made potentially illicit purchases. The potential illicit purchases were highlighted by orange color above.
The figure above zoomed in the days on which the employees made potentially illicit purchases. 4 out of 7 potentially illicit purchases occurred together with phone calls