Difference between revisions of "Forensic Ninja"

Latest revision as of 00:53, 12 October 2016

Problem and Motivation

Benford’s Law has been widely used by forensic data analysts to detect anomalies and possible fraudulent activity in an organisation. However, Benford’s Law applies only to numerical data, whereas the majority of organisational data is textual. For example, in accounts payable data, 70% of the data are textual fields whereas only 10% are numerical fields (Lanza, 2016).

Furthermore, fraudsters tend to work in groups rather than alone. In 2015, 62 percent of fraudsters colluded with others (KPMG International, 2016). As 74 percent of fraud is perpetrated by internal staff or through collusion between internal staff and external parties (KPMG International, 2016), fraud examiners need tools that not only analyse the firm’s available textual data but also visualise the interactions among employees of an organisation.

As email is one of the preferred modes of business communication, analysing emails can help to uncover potential red flags in an organisation’s structure or culture. Using GAStech’s email exchanges as a case study, we seek to analyse the connectivity and the frequently discussed topics among employees of an organisation.

Objectives

In this project, we seek to build an interactive visualisation application that helps users analyse the connectivity and the frequently discussed topics among employees of an organisation. This allows users to better visualise the organisation’s structure and the interactions among employees that might suggest potential wrongdoing.

Using GAStech’s email exchanges as a case study, the application aims to help users do the following:

Understand GAStech organisational structure
Analyse frequently discussed topics among GAStech employees

Data Source

The dataset that will be used in this project can be retrieved from VAST Challenge 2014.
It mainly consists of GAStech employee records and email headers from two weeks of internal GAStech company email.

References to Related Work

Screenshots	What we can learn

Parallel Coordinates of Employee Characteristics (Source: Link. Write-up: Link)
Pros:
Clearly outlines the characteristics that employees have in common
Uses parallel coordinates to better visualise common characteristics among employees
Cons:
May not be effective in showing the number of employees in the company
Could be improved by using more charts in the application and making it less wordy

Visualization of social network formed from 60,000 emails from personal archive (Source: Link)
Pros:
Uses a chord diagram to better visualise the connectivity among senders and recipients
Labels are adequately spaced and arranged in an orderly manner, which prevents them from overlapping even if the font size is increased
Uses a time filter at the bottom to show how the connectivity has changed over time
Cons:
Relationships among the senders and recipients might not be intuitive to users because of the large empty space inside the circle
A single colour of varying intensity should be used, instead of different colours, to represent the number of email exchanges between two parties, since only positive numbers are observed in the dataset

Thinkers’ perspectives with regards to topics discussed (Source: Link)
Pros:
The visualisation has a hierarchical structure, allowing users to explore the data by topic and by the thinkers associated with each topic
Uses a concept map to better visualise a thinker’s perspective based on the topics discussed
Appropriate highlighting when the user hovers over the list of names in the middle of the diagram
Uses appropriate animation and filters to let users further analyse the characteristics of the person they are interested in
Cons:
The animated transitions are too fast for users to follow the changes and may be distracting
User friendliness could be improved by providing a clearer and more convenient way to return to the initial concept map after analysing a particular thinker

ASTRI Entry for VAST2014 MC1 (Source: Link)
This displays the potential connection between GAStech and POK, a revolutionary force. This form of social network graph visualisation can display how the two separate organisations are connected, and may shed light on how the kidnapping of the employees took place.
Pros:
The social network diagram clearly shows who the broker of information is, that is, who connects POK and GAStech and allows information to flow between them
Cons:
It does not show the frequency of emails sent from person to person. The visualisation could be turned into a weighted graph, with numbers representing the frequency of email between each pair of people

KBSI VAST2014 MC1 Entry (Source: Link)
This is a timeline of the events occurring on 20 January 2014, based on keywords from the email headers and news articles. For our project, this can be done specifically for the email headers. Features such as a date slider can narrow the view down to specific dates. The size of each word represents how frequently it is mentioned in the emails. The timeline is in chronological order of when each email was sent.
Pros:
Commonly mentioned words are shown clearly, with size indicating how frequently they are mentioned
Cons:
When the frequency of a word or phrase is too high, the name labels may overlap, making the visualisation hard to understand

Storyboard

Our first proposed storyboard consists of a chord diagram and a word cloud. The chord diagram allows users to visualise the interactions among GAStech employees based on the sender and recipient information in the dataset. The thickness of each line represents the number of email exchanges between the two parties. A single colour will be used for the lines in the chord diagram since only positive numbers are involved. When the user hovers over an employee’s name, the lines connected to that employee will be highlighted so that the user can better visualise that employee’s social network.
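The line thickness described above depends on a count of exchanges per pair of employees. A minimal Python sketch of how such a count could be built is given below; the sample names and the `exchange_matrix` helper are hypothetical and are not taken from the GAStech files.

```python
# Hypothetical (sender, recipient) pairs; the real values would come from
# the email headers dataset.
emails = [
    ("alice", "bob"),
    ("bob", "alice"),
    ("alice", "carol"),
    ("alice", "bob"),
]


def exchange_matrix(pairs):
    """Return (names, matrix), where matrix[i][j] counts the emails
    exchanged between names[i] and names[j] in either direction."""
    names = sorted({person for pair in pairs for person in pair})
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for sender, recipient in pairs:
        if sender == recipient:
            continue  # skip self-addressed mail
        i, j = index[sender], index[recipient]
        matrix[i][j] += 1
        matrix[j][i] += 1  # keep the matrix symmetric
    return names, matrix


names, matrix = exchange_matrix(emails)
```

A square matrix of this shape is the kind of input that D3’s chord layout takes, so the counts could be exported as JSON and drawn from there.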

The word cloud allows users to visualise the frequently discussed topics in the emails among GAStech employees. Only one colour will be used for the word cloud, as the frequency of each word is represented by its font size. Furthermore, when the user hovers over an employee’s name, the words used in that employee’s email exchanges will be highlighted.
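The mapping from word frequency to font size could be sketched as follows. This is only an illustration: the sample subject lines, the stop-word list and the pixel range are assumptions, and a linear scale is one of several reasonable choices.

```python
import re
from collections import Counter

# Hypothetical subject lines standing in for the email headers.
subjects = [
    "RE: Budget meeting",
    "Budget review",
    "Meeting for budget review",
]

# Assumed stop words; a real list would be longer.
STOPWORDS = {"re", "fw", "the", "a", "of", "for", "to"}


def word_frequencies(lines):
    """Count lower-cased words across all lines, ignoring stop words."""
    counts = Counter()
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts


def font_sizes(freqs, min_px=12, max_px=48):
    """Scale each count linearly into the range [min_px, max_px]."""
    if not freqs:
        return {}
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts match
    return {
        word: min_px + (count - lo) * (max_px - min_px) / span
        for word, count in freqs.items()
    }


freqs = word_frequencies(subjects)
sizes = font_sizes(freqs)
```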

Filters will be added to the diagrams so that users can focus on a particular employee by clicking on that employee’s name in the chord diagram. In this case, the chord diagram and word cloud show only information related to that employee. Likewise, users can click on a word in the word cloud to find the employees involved in discussing that topic. A time slider will also be added to let users see how the connectivity and the frequently discussed topics among GAStech employees have changed over time.
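The filtering behaviour described above could be prototyped on the data side before any D3 code is written. The sketch below assumes a simple record layout (`sender`, `recipients`, `sent`, `subject`); these field names and the two helper functions are hypothetical.

```python
from datetime import date

# Hypothetical email records; field names are assumptions.
emails = [
    {"sender": "alice", "recipients": ["bob", "carol"],
     "sent": date(2014, 1, 6), "subject": "Budget review"},
    {"sender": "bob", "recipients": ["alice"],
     "sent": date(2014, 1, 8), "subject": "Lunch plans"},
    {"sender": "carol", "recipients": ["dave"],
     "sent": date(2014, 1, 15), "subject": "Budget update"},
]


def filter_emails(records, employee=None, start=None, end=None):
    """Keep records that involve `employee` (as sender or recipient)
    and fall within the inclusive date range [start, end]."""
    selected = []
    for record in records:
        involved = [record["sender"], *record["recipients"]]
        if employee is not None and employee not in involved:
            continue
        if start is not None and record["sent"] < start:
            continue
        if end is not None and record["sent"] > end:
            continue
        selected.append(record)
    return selected


def employees_discussing(records, word):
    """Return the sorted names of everyone on an email whose subject
    contains `word` as a whole word (case-insensitive)."""
    people = set()
    for record in records:
        if word.lower() in record["subject"].lower().split():
            people.add(record["sender"])
            people.update(record["recipients"])
    return sorted(people)


alice_only = filter_emails(emails, employee="alice")
later = filter_emails(emails, start=date(2014, 1, 7), end=date(2014, 1, 31))
budget_people = employees_discussing(emails, "budget")
```

Clicking a name would correspond to the `employee` argument, the time slider to `start` and `end`, and clicking a word to `employees_discussing`.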

Key Technical Challenges

1. Merging of Two Different Datasets
We will be working on two datasets, namely Employee Records and Email Headers. The two datasets need to be linked so that they can be used together effectively. A possible solution is to join them on the email address column, which is available in both datasets.
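A minimal Python sketch of such a join is shown below. The column names (`email`, `name`, `department`, `sender`) and the sample rows are assumptions, since the actual headings in the two files may differ; addresses are normalised first because case or whitespace differences would otherwise break the match.

```python
# Hypothetical rows from the two datasets; column names are assumptions.
employees = [
    {"email": "Alice.A@example.com", "name": "Alice A", "department": "Engineering"},
    {"email": "bob.b@example.com", "name": "Bob B", "department": "Security"},
]
headers = [
    {"sender": " alice.a@example.com", "subject": "Budget review"},
    {"sender": "unknown@example.com", "subject": "Hello"},
]


def normalise(address):
    """Trim whitespace and lower-case an address so equal ones compare equal."""
    return address.strip().lower()


def attach_sender_details(header_rows, employee_rows):
    """Left-join header rows to employee rows on the sender's address.
    Headers with no matching employee keep None for the added fields."""
    by_email = {normalise(e["email"]): e for e in employee_rows}
    merged = []
    for row in header_rows:
        employee = by_email.get(normalise(row["sender"]))
        merged.append({
            **row,
            "sender_name": employee["name"] if employee else None,
            "sender_department": employee["department"] if employee else None,
        })
    return merged


merged = attach_sender_details(headers, employees)
```

Keeping unmatched headers, rather than dropping them, makes it easy to spot addresses that do not appear in the employee records.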

2. Unfamiliarity with Programming Language
The final deliverable of this project requires us to publish our visualisations using D3.js, which involves JavaScript coding with the D3 library, HTML and CSS. Our group started learning these languages and the library only recently. As our group members come from non-coding backgrounds, the learning curve is steep. To bridge the gap between the expectations of the project and our programming ability, we will study the code of published D3 visualisations and learn best practices from them. This will allow us to better understand the logic of the code and use it to make our visualisations more interactive and meaningful to the end user.

3. Topic Modelling
The dataset consists of a large volume of unlabelled email headers. Different words are often used for the same meaning or theme. Thus, one of the first steps of data preparation is to automatically classify the email headers into themes. Because of our group’s unfamiliarity with programming, we will use commercial off-the-shelf tools such as JMP for topic modelling instead of Python.
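To illustrate what classifying headers into themes means in practice, the sketch below tags each subject line by keyword lookup. This is a much simpler stand-in for topic modelling, not what JMP does, and the `THEMES` dictionary and its keywords are hypothetical; a real topic model would discover the themes from the data instead of relying on a hand-written list.

```python
import re

# Hypothetical themes and keywords, written by hand for illustration.
THEMES = {
    "finance": {"budget", "invoice", "payment"},
    "travel": {"flight", "hotel", "trip"},
}


def tag_theme(subject):
    """Return the theme whose keywords overlap most with the subject,
    or "unclassified" when no keyword matches."""
    words = set(re.findall(r"[a-z]+", subject.lower()))
    scores = {theme: len(words & keywords) for theme, keywords in THEMES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"


tags = [tag_theme(s) for s in (
    "RE: Invoice and payment schedule",
    "Hotel booking",
    "Lunch?",
)]
```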

Project Schedule

References

KPMG International. (2016). Global profiles of the fraudster: Technology enables and weak controls fuel the fraud. Retrieved from: https://assets.kpmg.com/content/dam/kpmg/pdf/2016/05/profiles-of-the-fraudster.pdf
Lanza, R. B. (2016, March). Blazing a trail for the Benford’s Law of words, part 1. Retrieved from: http://www.fraud-magazine.com/article.aspx?id=4294991850

Our Team

Group 13
1. Lim Hui Ting
2. Jonathan Eduard Chua Lim

@@ Line 27: / Line 27: @@
-Furthermore, fraudsters tend to work in groups rather than relying on their own. In 2015, 62 percent of fraudsters colluded with others(KPMG International, 2016). As 74 percent of the fraud is perpetrated by internal staff or a collusion between internal staff and external parties (KPMG International, 2016), this highlights the need for complex tools for fraud examiners to not only analyse available textual data of the firm but also visualise the interactivity among employees of an organisation. <br />
+Furthermore, fraudsters tend to work in groups rather than relying on their own. In 2015, 62 percent of fraudsters colluded with others (KPMG International, 2016). As 74 percent of the fraud is perpetrated by internal staff or a collusion between internal staff and external parties (KPMG International, 2016), this highlights the need for complex tools for fraud examiners to not only analyse available textual data of the firm but also visualise the interactivity among employees of an organisation. <br />
@@ Line 45: / Line 45: @@
 ==<div style="background: #ffffff; padding: 17px; line-height: 0.1em;  text-indent: 10px; font-size:17px; font-family: Helvetica;  border-left:8px solid #0091b3"><font color= #000000><strong>Data Source</strong></font></div>==
 <div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left">
-The dataset that will be used in this project can be retrieved from VAST Challenge 2014.
+The dataset that will be used in this project can be retrieved from
-[http://www.vacommunity.org/tiki-index.php?page=VAST+Challenge+2014%3A+Mini-Challenge+1&ok=y&iTRACKER=1#wikiplugin_tracker1 Link to Dataset].<br />
+[http://www.vacommunity.org/tiki-index.php?page=VAST+Challenge+2014%3A+Mini-Challenge+1&ok=y&iTRACKER=1#wikiplugin_tracker1 VAST Challenge 2014].<br />
 It mainly consists of GAStech employee records and email headers from two weeks of internal GAStech company email.
 </div>
@@ Line 58: / Line 58: @@
 |-
 | style="font-family:Open Sans, Arial, sans-serif; font-size:16px; text-align: center; padding:5px; border-bottom:solid #0091b3" | <font color="#3c3c3c"><strong>Screenshots</strong></font>
-| style="font-family:Open Sans, Arial, sans-serif; font-size:16px; text-align: center; padding:5px; border-bottom:solid #0091b3" | <font color="#3c3c3c"><strong>What can we learn</strong></font>
+| style="font-family:Open Sans, Arial, sans-serif; font-size:16px; text-align: center; padding:5px; border-bottom:solid #0091b3" | <font color="#3c3c3c"><strong>What we can learn</strong></font>
 |-
 | style="font-family:Open Sans, Arial, sans-serif; text-align: center; padding:3px 10px; border-bottom:solid 1px #d8d8d8" | <strong>
-<b>Parallel Coordinates of Employee Characteristics</b>[[File:Forensic Ninja ParallelVizTianjin.png|300px]]<br />Source:[https://www.youtube.com/watch?v=f-p5TA35nCo&feature=youtu.be&t=205 Link].Write-up:[http://hcil2.cs.umd.edu/newvarepository/VAST%20Challenge%202014/challenges/MC1%20-%20Disappearance%20at%20GASTech/entries/Tianjin%20University%20-%20Cai/ Link] </strong>
+Parallel Coordinates of Employee Characteristics [[File:Forensic Ninja ParallelVizTianjin.png|300px]]<br />Source:[https://www.youtube.com/watch?v=f-p5TA35nCo&feature=youtu.be&t=205 Link].Write-up:[http://hcil2.cs.umd.edu/newvarepository/VAST%20Challenge%202014/challenges/MC1%20-%20Disappearance%20at%20GASTech/entries/Tianjin%20University%20-%20Cai/ Link] </strong>
 | style="font-family:Open Sans, Arial, sans-serif; text-align: left; padding:3px 10px; border-bottom:solid 1px #d8d8d8" |
-* Parallel Coordinates can display the Employee Characteristics in the  Y axis, and the lines are the employees themselves
+<b>Pros:</b>
-* From this method, common characteristics amongst employees will be visible
+*Outlines the common characteristics clearly among employees
-* Common characteristics that can be seen are who went to military service together, wh
+*Use of parallel coordinates to better visualise common characteristics among employees
-* Which military branch they were in and how they obtained their citizenship.
+<b>Cons:</b>
+*May not be effective in showing number of employees in the company
+*Visualisation can be further improved by utilising more charts in the application and making it less wordy
+|-
+| style="font-family:Open Sans, Arial, sans-serif; text-align: center; padding:3px 10px; border-bottom:solid 1px #d8d8d8" | <strong>Visualization of social network formed from 60,000 emails from personal archive<br />[[File:Forensic NinjaChord Diagram.PNG|300px]]<br />Source:[http://christopherbaker.net/projects/mymap/ Link]</strong>
+| style="font-family:Open Sans, Arial, sans-serif; text-align: left; padding:3px 10px; border-bottom:solid 1px #d8d8d8" |
+<b>Pros:</b>
+*Use of chord diagram to better visualise the connectivity among senders and recipients.
+*Adequate spacing between labels and arranged in an orderly manner. This prevents overlapping of labels even if font size is increased.
+*Use of time filter at the bottom to visualise how the connectivity has changed over time.
+<b>Cons:</b>
+*Relationships among the senders and recipients might not be intuitive to the users due to the huge gap in space, created by the circle.
+*One colour of different intensity should be used instead of using different colours to represent the number of email exchanges between two parties since only positive numbers are observed in the dataset
+|-
+| style="font-family:Open Sans, Arial, sans-serif; text-align: center; padding:3px 10px; border-bottom:solid 1px #d8d8d8" | <strong>Thinkers’ perspectives with regards to topics discussed<br />[[File:Forensic Ninja Concept Map.png|300px]]<br />Source:[http://www.findtheconversation.com/concept-map/# Link]</strong>
+| style="font-family:Open Sans, Arial, sans-serif; text-align: left; padding:3px 10px; border-bottom:solid 1px #d8d8d8" |
+<b>Pros:</b>
+*There is a hierarchical structure within the visualisation, allowing data exploration with topics and the respective thinker of the topic.
+* Use of concept map to better visualise a thinker’s perspective based on topics discussed.
+* Appropriate highlighting when the user hover over the list of names in the middle of the diagram.
+* Use of appropriate animation and filters to allow users to further analyse the characteristics of the person that they are interested in.
+<b>Cons:</b>
+*The animation transition is too fast for users to catch the changes occurring. It may cause a distraction.
+*User friendliness can be further improved by providing a clear and more convenient way for users to get back to the initial concept map after analysing a certain thinker.
 |-
-| style="font-family:Open Sans, Arial, sans-serif; text-align: center; padding:3px 10px; border-bottom:solid 1px #d8d8d8" | <strong><b>Visualization of social network formed from 60,000 emails from personal archive</b><br />[[File:Forensic NinjaChord Diagram.PNG|300px]]<br />Source:[http://christopherbaker.net/projects/mymap/ Link]</strong>
+| style="font-family:Open Sans, Arial, sans-serif; text-align: center; padding:3px 10px; border-bottom:solid 1px #d8d8d8" | <strong>ASTRI Entry for VAST2014 MC1<br />[[File:Forensic Ninja ConnectionsBetweenPOKandGasTech.jpg|300px]]<br />Source:[http://hcil2.cs.umd.edu/newvarepository/VAST%20Challenge%202014/challenges/MC1%20-%20Disappearance%20at%20GASTech/entries/ASTRI/# Link]</strong>
 | style="font-family:Open Sans, Arial, sans-serif; text-align: left; padding:3px 10px; border-bottom:solid 1px #d8d8d8" |
-* Use of chord diagram to better visualise the connectivity among senders and recipients
+This displays the potential connection between GASTech Company and POK, a revolutionary force. This form of (Social Networks) Graph Visualisation can display how the two separate organisations are connected, and may shed light on how the kidnapping of the employees took place. <br />
-* Use of different colours intensity to represent the number of email exchanges between two parties
+<b>Pros:</b>
-* Use of time filter at the bottom to visualise how the connectivity has changed over time
+* Social network diagram clearly shows who is the broker of information, meaning who is connecting POK and GAStech together and allowing the flow of information.
+<b>Cons:</b>
+*This does not particularly show the frequency of emails sent from person to person. This visualisation could be turned into a weighted graph, with numbers representing the frequency of email between each person.
 |-
-| style="font-family:Open Sans, Arial, sans-serif; text-align: center; padding:3px 10px; border-bottom:solid 1px #d8d8d8" | <strong><b>Thinkers’ perspectives with regards to topics discussed</b><br />[[File:Forensic Ninja Concept Map.png|300px]]<br />Source:[http://www.findtheconversation.com/concept-map/# Link]</strong>
+|style="font-family:Open Sans, Arial, sans-serif; text-align: center; padding:3px 10px; border-bottom:solid 1px #d8d8d8" | <strong>KBSI VAST2014 MC1 Entry<br />[[File:Forensic Ninja VAST2014MC1EntrybyASTRO.jpg|300px]]<br />Source:[http://hcil2.cs.umd.edu/newvarepository/VAST%20Challenge%202014/challenges/MC1%20-%20Disappearance%20at%20GASTech/entries/Knowledge%20Based%20Systems%20Inc/# Link]</strong>
 | style="font-family:Open Sans, Arial, sans-serif; text-align: left; padding:3px 10px; border-bottom:solid 1px #d8d8d8" |
-* Use of concept map to show a thinker’s perspective based on topics discussed
+This is the timeline for the events occurring on 20 January 2014 based from key words from the email headers and news articles. For our project, this can be done for specifically the email headers.  Features such as a date slider can narrow down specific dates. The size of the words represent the frequency of the words mentioned in the emails. The timeline is in chronological order of when the email was sent. <br />
-* Appropriate highlighting when the user hover over the list of names in the middle of the diagram
+<b>Pros:</b>
-* Use of appropriate animation and filters to allow users to further analyse the characteristics of the person that they are interested in
+*Commonly mentioned words are shown clearly, with size displaying frequency mentioned
+<b>Cons:</b>
+*When the frequency of a word/phrase is too high, it may cause overlapping of the name labels, making the visualisations hard to understand.
 |}
 </center>
@@ Line 88: / Line 120: @@
 ==<div style="background: #ffffff; padding: 17px; line-height: 0.1em;  text-indent: 10px; font-size:17px; font-family: Helvetica;  border-left:8px solid #0091b3"><font color= #000000><strong>Storyboard</strong></font></div>==
 <div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left">
+[[File:storyboard1.png|600px|center]]
+Our first proposed story board consists of a chord diagram and a word cloud. The chord diagram allows users to visualise the interactivity among GAStech employees based on the sender and recipient information provided in the dataset. The thickness of the line will represent the number of email exchanges between the two parties. One colour will be used for the lines in the chord diagram since there are only positive numbers involved. Highlighting features will also be added to this chord diagram by highlighting the lines to allow users to better visualise the social network of one employee when the user hover over the name of that particular employee.
+Word cloud would allow users to visualise the frequently discussed topics in the emails among GAStech employees. One colour will only be used for this word cloud as the frequency of the words can be represented by the different font size. Furthermore, when the user hover over the names of the employees, the words that are used in the email exchanges of that particular employee will also be highlighted.
+Filters will be added to the diagrams to allow users to only analyse on a particular employee by clicking on his name on the chord diagram. In this case, all the information provided by the chord diagram and word cloud are only related to that particular employee. Likewise, users can also click on the words on the word cloud to find out the employees involved in the discussion of this topic. A time slider will also be added to allow users to visualise how the connectivity and frequently discussed topics among GAStech employees have changed over time.
 </div>
 ==<div style="background: #ffffff; padding: 17px; line-height: 0.1em;  text-indent: 10px; font-size:17px; font-family: Helvetica;  border-left:8px solid #0091b3"><font color= #000000><strong>Key Technical Challenges</strong></font></div>==
 <div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left">
-Firstly, one of the key technical challenge is that we will be working on two datasets, namely Employee Records and Email Headers. This is because there will need to be a connection created between the two databases, so it can be used effectively and simultaneously. A possible solution to this would be to link the two databases by using the Email address information that is both available in the two databases.<br />
+<b>1. Merging of Two Different Datasets</b><br>
+We will be working on two datasets, namely Employee Records and Email Headers. There is a need to have a connection created between the two databases so that it can be used effectively and simultaneously. A possible solution to this would be to link the two databases by using the Email address information column that is both available in the two databases.<br>
+<b>2. Unfamiliarity with Programming Language</b><br>
+The final deliverable of this project requires us to publish our visualisations using D3.js which involve javascript coding, D3 library, HTML and CSS. Our group has started learning these programming languages and library recently. As our group members are from non-coding background, there is a steep learning curve. To bridge the gap between the expectations of the project and our programming ability, we will be looking into the published D3 visualizations code and learn best practices from these visualisations. This allow us to better understand the logic of the code and be able to use it to make our visualisations more interactive and meaningful to the end user. <br>
+<b>3. Topic Modelling</b><br>
+In this dataset, it consists of large volume of unlabelled email headers. Different words are used even though they have similar meaning and theme. Thus, one of the first few steps of data preparation is to automatically classify the email headers into different themes. Due to our group’s unfamiliarity with programming language, we will be utilising commercial off the shelf tools such as JMP to help us in Topic Modelling instead of using Python. <br>
-Another key technical challenge would be to publish our Visualisations using D3.JS. To link the databases, there would be a form of javascript coding involved using the D3 library. Our group has started learning the programming language and library recently, and there is a steep learning curve. To bridge the gap between the expectations of the project and our programming ability, we will be looking into the code of published D3 Visualizations, and learn best practices from these visualisations. This is so to better understand the logic of the code and be able to use it to make our visualisations more interactive and powerful to the end user.
 </div>
 ==<div style="background: #ffffff; padding: 17px; line-height: 0.1em;  text-indent: 10px; font-size:17px; font-family: Helvetica;  border-left:8px solid #0091b3"><font color= #000000><strong>Project Schedule</strong></font></div>==
 <div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left">
-<center>[[File:Forensic Ninja Timeline.PNG|900px]]</center>
+<center>[[File:Forensic Ninja Timeline.PNG|500px]]</center>
 </div>
@@ Line 106: / Line 150: @@
 <div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left">
-# KPMG International. (2016). Global profiles of the fraudster: Technology enables and weak controls fuel the fraud. Retrieved from [https://assets.kpmg.com/content/dam/kpmg/pdf/2016/05/profiles-of-the-fraudster.pdf here]
+# KPMG International. (2016). Global profiles of the fraudster: Technology enables and weak controls fuel the fraud. Retrieved from: https://assets.kpmg.com/content/dam/kpmg/pdf/2016/05/profiles-of-the-fraudster.pdf
-# Lanza, R. B. (2016, March). Blazing a trail for the Benford' s Law of words, part 1. Retrieved from [http://www.fraud-magazine.com/article.aspx?id=4294991850 here]
+# Lanza, R. B. (2016, March). Blazing a trail for the Benford' s Law of words, part 1. Retrieved from: http://www.fraud-magazine.com/article.aspx?id=4294991850
-# 3
-# 4
-# 5
 </div>
@@ Line 122: / Line 163: @@
 ==<div style="background: #ffffff; padding: 17px; line-height: 0.1em;  text-indent: 10px; font-size:17px; font-family: Helvetica;  border-left:8px solid #0091b3"><font color= #000000><strong>Comments</strong></font></div>==
 <div style="margin:0px; padding: 10px; background: #f2f4f4; font-family: Open Sans, Arial, sans-serif; border-radius: 7px; text-align:left">
+<!------- Credits to: https://wiki.smu.edu.sg/ANLY482/AY1516_T2_Team_CommuteThere---->
+<!--Please leave your comments here :) --->
 </div>
 <!-- End Body --->
