Group02 Report

From Visual Analytics and Applications
Revision as of 22:49, 3 December 2017 by Nurula.a.2016 (talk | contribs)


Introduction

Environmental criminology focuses on the relations between crime (including aspects such as victim characteristics and criminality) and spatial and behavioural factors. As crime data becomes increasingly available to the public, geo-spatial and temporal analysis of crime occurrence matures to provide better insights. This increased understanding will potentially contribute to enhanced law enforcement efforts and even urban management.

In our research, we take a step in this direction by examining how geographic and date-time variables interact with other variables to better understand crime occurrences in the city of Los Angeles (LA). Crime data, together with population data by zip code, were obtained from the official LA city data repository for analysis and visualization. The research culminates in an interactive application built on R Shiny that allows a casual user to explore, analyse and model the data to derive insights. R was chosen for the web application because of its rich library of packages for statistical analysis and data visualization. With the data visualizations and intuitive user interface in this application, the user can easily filter and transform crime data to derive the insights they require. Because R is a free software environment for statistical computing and graphics, the application is accessible to many users, which should further encourage the spread of such visual analytics initiatives across more fields.

This paper describes our analytical development efforts for the application and consists of 8 sections. The introduction is followed by the motivation and objectives of this research. Section 3 reviews previous works in the field. Section 4 describes the dataset and its preparation for modelling. Section 5 describes the design framework and visualization methodologies, whereas Section 6 presents the insights we derived while developing the application. Future works are stated in Section 7, and an installation and user guide is given in Section 8.

Motivation and Objectives

Governmental agencies in Singapore such as data.gov and the Ministry of Home Affairs provide crime data reports on a bi-annual and annual basis that display trends, for instance by crime type and across years. Even so, these data provide only an overview of crimes; there is no information on crime details (e.g. location, time), victim profiles or possible associations between the different crime variables. Our research aims to incorporate geo-spatial and temporal analytics for better insights on crime occurrence, modelled using the rich data on crime occurrences in Los Angeles. The approach may be replicated as similar data become increasingly available in Singapore.
This research aims to:
(a) Create a user-friendly and interactive visualization platform for data exploration that supports both macro and micro views that can be potentially used by members of the public and law enforcement agencies alike
(b) Provide statistical analysis on crime occurrences with data on population and location spatial area
(c) Build a predictive model of crime occurrence based on geo-spatial temporal data, crime details and victim profiles

Previous Works

Given the large set of variables in the Los Angeles crime dataset, we expected a wide range of analyses and visualisations to be available for it. One example is CrimeMapping.com (https://www.crimemapping.com/Share/dd1a50e5fa4d4da4a41c8989c6ee791d), which plots crimes across LA on a map, with options to filter the data by type of crime, by location and search radius, or by date of occurrence. The map then drops pins for the selected crime types according to the user's criteria (Figure 1).

Figure 1. Geographical map view of the crime sharing website

The same website also links to plots summarizing the data shown above, in which the crimes displayed on the map are aggregated into a stacked bar chart and a pie chart (Figure 2). Such plots are typically not ideal for representing this data, as they do not allow quick and easy deduction of which crime type was the most prevalent during the period, which we surmise to be the aim of the plots. Also, while the map with its built-in filters is highly useful, it does not use the whole dataset for statistical analysis but focuses only on the frequency of occurrence. Through our project, we therefore aim to build a more holistic view of the crimes occurring in LA by incorporating more features of each crime (such as victim profile and premise description), modelled with selected statistical methodologies and visualized with more effective charts.

Figure 2. Charts available on crime-sharing website

Nolan III (2004) [1] established the relationship between crime rate and population size based on crime and population data for the state of California. In his research, the author calculated the observed and expected crime rates of each jurisdiction in California, weighted by the population within each jurisdiction. The crime rates are expressed as the frequency of crime per 100,000 inhabitants. Meanwhile, there has been extensive research on disease mapping through the Empirical Bayes Estimate of relative risk (Clayton & Kaldor, 1987 [2]; Leyland & Davies, 2005 [3]). Our research combines these approaches by computing an Empirical Bayes Estimate of the posterior relative risk of crime occurrence in each Los Angeles Police Department (LAPD) reporting district, incorporating the population data of each district.
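To make the two quantities above concrete, the following Python sketch computes a crime rate per 100,000 inhabitants and a smoothed relative risk per district. It is illustrative only and is not taken from the application, which is built in R. It assumes a method-of-moments Empirical Bayes estimator under a Poisson-gamma model, which is one common formulation and may differ from the estimator actually used. The function names and the sample counts are hypothetical.

```python
def crime_rate_per_100k(crimes, population):
    """Crimes per 100,000 inhabitants."""
    return crimes / population * 100_000


def eb_relative_risk(observed, population):
    """Return (raw, smoothed) relative risks per district.

    Assumption: method-of-moments Empirical Bayes shrinkage towards the
    overall mean; not necessarily the estimator used in the application.
    """
    total_obs = sum(observed)
    overall_rate = total_obs / sum(population)
    # Expected counts if every district shared the overall rate.
    expected = [overall_rate * n for n in population]
    raw = [o / e for o, e in zip(observed, expected)]

    mu = total_obs / sum(expected)  # prior mean (1 by construction)
    mean_expected = sum(expected) / len(expected)
    # Between-district variance of the risk, truncated at zero.
    var = (sum(e * (r - mu) ** 2 for e, r in zip(expected, raw)) / sum(expected)
           - mu / mean_expected)
    var = max(var, 0.0)

    smoothed = []
    for e, r in zip(expected, raw):
        weight = var / (var + mu / e)  # small districts get small weights
        smoothed.append(mu + weight * (r - mu))
    return raw, smoothed


# Hypothetical counts and populations for three districts.
raw, smoothed = eb_relative_risk([5, 50, 500], [1000, 10000, 200000])
```

In this sketch, the two small districts have the same raw relative risk, but the estimate for the smaller one is pulled further towards the overall mean because it rests on fewer expected cases.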

Meanwhile, Kernel Density Estimation has been used extensively in many studies to identify hotspots of certain occurrences. Xie and Yan (2008) [4] used kernel density estimation for traffic accidents in the Bowling Green, Kentucky area over a 2-D geographic space. Yano and Nakaya (2010) [5] incorporated the element of time in their Kernel Density Estimation of crime occurrence in Kyoto to identify crime clusters. Our research also uses Kernel Density Estimation to analyse crime hotspots in LA, letting the user choose date ranges for comparison on a 2-D geographic map, a location of interest, and an area of interest surrounding that location.
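As a minimal illustration of the technique, the Python sketch below evaluates a 2-D kernel density estimate at a point and over a regular grid. It is not taken from the application, which is built in R. It assumes an isotropic Gaussian kernel with a fixed bandwidth and coordinates that have already been projected to a planar system; the function names, bandwidth and sample points are hypothetical.

```python
import math


def gaussian_kde_2d(points, query, bandwidth):
    """Density at `query` from 2-D `points` with an isotropic Gaussian kernel."""
    if not points:
        return 0.0
    qx, qy = query
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for x, y in points:
        dist_sq = (x - qx) ** 2 + (y - qy) ** 2
        total += math.exp(-dist_sq / (2.0 * bandwidth ** 2))
    return norm * total


def density_grid(points, x_range, y_range, steps, bandwidth):
    """Evaluate the density on a steps-by-steps grid (steps must be >= 2)."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = []
    for i in range(steps):
        y = y0 + (y1 - y0) * i / (steps - 1)
        row = []
        for j in range(steps):
            x = x0 + (x1 - x0) * j / (steps - 1)
            row.append(gaussian_kde_2d(points, (x, y), bandwidth))
        grid.append(row)
    return grid


# Hypothetical incident coordinates: a cluster near (0, 0) and one outlier.
incidents = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
grid = density_grid(incidents, (0.0, 3.0), (0.0, 3.0), steps=4, bandwidth=0.5)
```

Comparing two date ranges then amounts to filtering the incidents by date before calling the function and comparing the resulting grids.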

Market basket analysis through association rule mining is typically used by retailers to understand the purchase behaviours of their customers. However, Siti Azirah Asmai, Nur Izzatul Abidah Roslin, Rosmiza Wahida Abdullah & Sabrina Ahmad (2014) [6] developed a model based on association rule mining to map crime based on geographical and demographic variables and so evaluate crime occurrence at specific locations. Our research also implements association rule mining to assess crime occurrence, but our model extends beyond the variables in that research to include crime type, weapon, premise and temporal data such as date occurred and day of the week.
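The standard measures used to rank an association rule (support, confidence and lift) can be sketched in Python as follows. This is illustrative only and not taken from the application, which is built in R. Each crime record is treated as a "basket" of attribute=value items; the function name, the item labels and the sample records are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = n_both / n                        # share of records matching the whole rule
    confidence = n_both / n_a if n_a else 0.0   # P(consequent | antecedent)
    lift = confidence / (n_c / n) if n_c else 0.0  # confidence relative to the baseline
    return support, confidence, lift


# Hypothetical crime records encoded as sets of attribute=value items.
records = [
    {"premise=STREET", "crime=THEFT"},
    {"premise=STREET", "crime=THEFT"},
    {"premise=STREET", "crime=ASSAULT"},
    {"premise=HOME", "crime=THEFT"},
]
support, confidence, lift = rule_metrics(records, {"premise=STREET"}, {"crime=THEFT"})
```

A lift above 1 suggests the antecedent makes the consequent more likely than its baseline frequency; a lift below 1 suggests the opposite.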

Ozkan (2017) [7] built prediction models based on machine learning algorithms for the tendency of a convicted criminal to reoffend. In his research, Ozkan (2017) compared the accuracy of models built using logistic regression against machine learning algorithms that include random forests, support vector machines, XGBoost, neural networks and Search algorithm. XGBoost and neural networks outperformed the other predictive models. Our research thus implements the XGBoost algorithm to predict crime occurrence based on time category, reporting district area, premise, crime type, day of the week, gender, and age group of a person.

Dataset and Data Preparation

Both the LA crime and population data were obtained from #dataLA (https://data.lacity.org/).

LA City Crime Data

The full dataset consists of 1.6 million observations of crime occurrences between 2010 and 2017, with 26 variables defined for each occurrence (the full list of variables and descriptions is available in Appendix A). These include the dates and times of the crime occurrences, victim profiles, and the areas and locations of the crimes. Our research used only data from 1 January 2014 to 30 September 2017.
Because of the granularity of certain fields such as crime description, premise description, victim age, victim descent and time of occurrence, these variables had to be reclassified before meaningful analysis or visualization could be carried out. Categorical variables such as crime description, premise description and victim descent were manually regrouped into a smaller number of segments, with crime descriptions following the convention suggested in the draft International Classification of Crimes for Statistical Purposes by the United Nations Office on Drugs and Crime (Aug 2014). Interval variables such as victim age and time of occurrence were binned; the time bins were based loosely on LAPD shift times, in case the LAPD uses the application for deployment purposes.
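The binning step can be sketched in Python as below. This is illustrative only and not taken from the application, which is built in R. The cut points and labels are hypothetical placeholders, since the actual bins are not listed here, and the time of occurrence is assumed to be stored as a 24-hour HHMM integer (e.g. 1830).

```python
import bisect

# Hypothetical cut points and labels; not the bins used in the application.
TIME_EDGES = [600, 1400, 2200]                      # HHMM boundaries
TIME_LABELS = ["Night", "Day", "Evening", "Night"]  # one more label than edges
AGE_EDGES = [18, 30, 45, 65]
AGE_LABELS = ["<18", "18-29", "30-44", "45-64", "65+"]


def bin_value(value, edges, labels):
    """Map a number to the label of the interval it falls in.

    Each edge is the inclusive lower bound of the interval that follows it.
    """
    return labels[bisect.bisect_right(edges, value)]


time_category = bin_value(1830, TIME_EDGES, TIME_LABELS)
age_group = bin_value(42, AGE_EDGES, AGE_LABELS)
```

Keeping the edges and labels in plain lists makes it straightforward to swap in different shift times without touching the binning logic.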

LA City Population

The LA city population data were obtained by zip code from the 2010 census. The individual zip codes were matched to their respective LAPD reporting districts, and the zip-code populations were then aggregated to obtain the population of each reporting district.
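The matching and aggregation described above can be sketched in Python as follows. This is illustrative only and not taken from the application, which is built in R. It assumes each zip code maps to exactly one reporting district; how zip codes spanning several districts were handled is not stated here. The function name, zip codes, district identifiers and population figures are hypothetical.

```python
from collections import defaultdict


def population_by_district(zip_population, zip_to_district):
    """Sum zip-code populations per reporting district.

    Returns (totals, unmatched), where `unmatched` lists zip codes that
    have no district in the lookup table.
    """
    totals = defaultdict(int)
    unmatched = []
    for zip_code, population in zip_population.items():
        district = zip_to_district.get(zip_code)
        if district is None:
            unmatched.append(zip_code)
            continue
        totals[district] += population
    return dict(totals), unmatched


# Hypothetical lookup table and census figures.
zip_population = {"90001": 57000, "90002": 51000, "90003": 66000, "99999": 100}
zip_to_district = {"90001": "D1", "90002": "D1", "90003": "D2"}
totals, unmatched = population_by_district(zip_population, zip_to_district)
```

Reporting the unmatched zip codes separately makes gaps in the lookup table visible instead of silently dropping population.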

Design Framework and Visualisation Methodologies

Insights Derived

Future Works

Installation and User Guide