Difference between revisions of "Forensic Ninja"
Revision as of 22:34, 8 October 2016
Problem and Motivation
Benford’s Law has been widely used by forensic data analysts to detect anomalies or possible fraudulent activities in an organisation. However, Benford’s Law applies only to numerical data, and the majority of organisational data is textual. For example, in accounts payable data, 70% of the fields are textual whereas only 10% are numerical (Lanza, 2016).
Furthermore, fraudsters tend to work in groups rather than on their own. In 2015, 62 percent of fraudsters colluded with others (KPMG International, 2016). As 74 percent of fraud is perpetrated by internal staff or through collusion between internal staff and external parties (KPMG International, 2016), fraud examiners need more sophisticated tools that not only analyse the firm's available textual data but also visualise the interactions among employees of an organisation.
As email is one of the preferred modes of business communication in an organisation, analysing emails can help to uncover potential red flags in the organisation's structure or culture. Using GAStech's email exchanges as a case study, we seek to analyse the connectivity and frequently discussed topics among employees of an organisation.
Objectives
In this project, we seek to build an interactive visualisation application that helps users analyse the connectivity and frequently discussed topics among employees of an organisation. This allows users to better visualise the organisation's structure and spot interactions among employees that might suggest potential wrongdoing.
Using GAStech's email exchanges as a case study, the application aims to help users do the following:
- Understand GAStech organisational structure
- Analyse frequently discussed topics among GAStech employees
Data Source
The dataset that will be used in this project can be retrieved from VAST Challenge 2014.
It mainly consists of GAStech employee records and email headers from two weeks of internal GAStech company email.
References to Related Work
Screenshots | What can we learn
Parallel Coordinates of Employee Characteristics |
Visualization of social network formed from 60,000 emails from personal archive (Source: Link) |
Thinkers’ perspectives with regards to topics discussed (Source: Link) |
Storyboard
Key Technical Challenges
1. Merging of Two Different Datasets
We will be working on two datasets, namely Employee Records and Email Headers. The two datasets need to be connected so that they can be used together effectively. A possible solution is to link them on the email address column, which is present in both datasets.
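The linking step described above can be sketched in plain JavaScript. This is a minimal illustration rather than the project's implementation: the sample rows, the field names (`email`, `from`, `to`, `name`, `department`) and the helper `normalise` are all assumptions made for the example and do not reflect the actual column names in the data.

```javascript
// Hypothetical sample rows; field names are assumptions, not the real column names.
const employees = [
  { email: "alice@example.com", name: "Alice", department: "Engineering" },
  { email: "bob@example.com", name: "Bob", department: "Security" },
];

const emailHeaders = [
  { from: "Alice@example.com", to: "bob@example.com", subject: "Budget review" },
  { from: "bob@example.com", to: "carol@example.com", subject: "Lunch" },
];

// Normalise addresses so letter case or stray spaces do not block a match.
const normalise = (address) => address.trim().toLowerCase();

// Index employees by email address for direct lookup.
const employeeByEmail = new Map(
  employees.map((employee) => [normalise(employee.email), employee])
);

// Attach sender and recipient records to each header.
// null marks an address with no matching employee record.
const linkedHeaders = emailHeaders.map((header) => ({
  ...header,
  sender: employeeByEmail.get(normalise(header.from)) || null,
  recipient: employeeByEmail.get(normalise(header.to)) || null,
}));

console.log(linkedHeaders);
```

Keeping unmatched addresses as `null` instead of dropping the row makes it easy to check how many headers fail to link before any visualisation is built.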
2. Unfamiliarity with Programming Language
The final deliverable of this project requires us to publish our visualisations using d3.js, which involves JavaScript coding with the d3 library. Our group started learning the language and library only recently, and the learning curve is steep. To bridge the gap between the project's expectations and our programming ability, we will study the code of published d3 visualisations and learn best practices from them. This will help us understand the logic of the code and apply it to make our visualisations more interactive and meaningful to the end user.
3. Topic Modelling
The dataset consists of a large volume of unlabelled email headers, in which different words with similar meanings and themes are used. Thus, one of the first steps of data preparation is to automatically classify the email headers into themes. Because of our group’s unfamiliarity with programming, we will use commercial off-the-shelf tools such as JMP for topic modelling instead of Python.
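To give a rough sense of what grouping headers by theme involves, the sketch below counts how many subject lines mention each word. This is plain term counting, not topic modelling, and it is not a substitute for the JMP workflow described above. The subject lines, the `ignored` word list and the helper `tokenise` are hypothetical and chosen only for illustration.

```javascript
// Hypothetical subject lines standing in for the email header data.
const subjects = [
  "Budget review meeting",
  "RE: Budget review meeting",
  "Lunch plans",
  "FW: Lunch plans today",
];

// Reply and forward prefixes to ignore; this list is an assumption.
const ignored = new Set(["re", "fw"]);

// Lowercase a subject, split on non-letters, and drop ignored or empty tokens.
const tokenise = (subject) =>
  subject
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((word) => word.length > 0 && !ignored.has(word));

// Count how many subject lines mention each word (once per subject line).
const termCounts = new Map();
for (const subject of subjects) {
  for (const word of new Set(tokenise(subject))) {
    termCounts.set(word, (termCounts.get(word) || 0) + 1);
  }
}

// Sort terms from most to least frequent.
const rankedTerms = [...termCounts.entries()].sort((a, b) => b[1] - a[1]);

console.log(rankedTerms);
```

Words that recur across many subject lines hint at candidate themes, which can then be compared with the topics produced by a dedicated tool.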
Project Schedule
References
Our Team
Group 13
1. Lim Hui Ting
2. Jonathan Eduard Chua Lim