Grp2 Proposal

From Visual Analytics and Applications
Revision as of 03:13, 4 December 2017 by Rachel.tong.2016 (talk | contribs)


Motivation

Crime is prevalent in any society. In Singapore, the Singapore Police Force[1] provides bi-annual and annual updates on crime statistics, whilst the Ministry of Home Affairs[2][3][4] provides annual statistics on overall crime cases, crime rates, major offences and victim profiles up to 2015. Based on this data, media sources such as The Straits Times[5], Channel NewsAsia[6] and Home Team[7] publish visualizations that provide a year-on-year comparison of crime rates by type of crime. In comparison, visualizations of crime statistics for the United States of America offer more information, such as year-on-year comparisons of crime rate by type and state[8][9][10].

Nonetheless, these crime visualizations only provide an overview of crimes and are not interactive, so the user cannot obtain more details on the crimes such as neighborhoods, victim profiles and time. There are also no predictive visualization tools currently available for everyday use by members of the public and public agencies.

As such, this project serves the dual purpose of (i) improving upon current publicly available visualizations, and (ii) illustrating the benefits of releasing such data, in a bid to act as a driving force for similar action in Singapore; a move which would provide a great step forward for Singapore's Smart Nation Initiative, especially in further enhancing the law enforcement capabilities of our public agencies.

Objectives and Potential Benefits to Users

Our analysis attempts to fill the gaps highlighted above through an exploratory model and a predictive model, with the following objectives:

(1) To provide a visualization platform for user-interactive data exploration of crime statistics
The dashboard will incorporate functionalities that allow for exploratory analysis of the dataset based on the variables of time, date (day, week, month or year), type of weapon, type of crime, victim profile and geographic location. The visualizations will allow users to make their own comparisons of crime, for example across locations and periods of time.

(2) To develop a predictive model for crime patterns visualized in a geographical map for route planning
The model will predict the likelihood of a certain crime occurring on a certain day, at a certain time and geographic location, and affecting a particular victim profile.

The two models may be useful to both the public and law enforcement agencies as follows:

  • Members of the public: The public may use the information to make decisions such as the choice of schools and the purchase of homes in certain neighborhoods. Based on the predictive model, travelers can also plan their travel routes better, avoiding routes that have a higher likelihood of certain crimes occurring at a particular time and affecting people who match certain victim profiles.
  • Law enforcement agencies: This will facilitate more time- and location-appropriate deployment of patrol officers, in sufficient numbers and with suitable skills, allowing them to better respond to incidents. Using such information to plan manpower and patrol routes in particular not only allows swifter response times, but may also serve as a deterrent to potential crimes.

The Dataset

The dataset provided by the Los Angeles Police Department (LAPD) comprises 1.59 million crime incidents in Los Angeles, California from year 2010 until 19 September 2017. The dataset is updated on a weekly basis and consists of 26 variables. The exploratory section of the project will involve all records in the dataset. For the purpose of prediction, the dataset will be split into a training and validation portion (incorporating records from year 2010 till mid-September 2016) and a testing portion (non-overlapping records from mid-September 2016 to mid-September 2017).
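The date-based split described above could be sketched as follows. This is a minimal illustration rather than the project's actual code: the proposal only says "mid-September", so the cutoff of the 15th, the record layout and the field names are all assumptions.

```python
from datetime import date

# Assumed cutoffs: the proposal only states "mid-September", so the 15th is a guess.
TEST_START = date(2016, 9, 15)
TEST_END = date(2017, 9, 15)


def split_by_date(records, test_start=TEST_START, test_end=TEST_END):
    """Split records into (train_validation, test) by date of occurrence.

    Records dated before test_start go to the training/validation portion,
    records from test_start to test_end (inclusive) go to the testing portion,
    and anything later is left out so the two portions never overlap.
    """
    train_val, test = [], []
    for rec in records:
        occurred = rec["date_occurred"]  # hypothetical field name
        if occurred < test_start:
            train_val.append(rec)
        elif occurred <= test_end:
            test.append(rec)
    return train_val, test


# Tiny made-up sample to show the behaviour at the boundaries.
sample = [
    {"id": "A", "date_occurred": date(2010, 1, 1)},
    {"id": "B", "date_occurred": date(2016, 9, 14)},
    {"id": "C", "date_occurred": date(2016, 9, 15)},
    {"id": "D", "date_occurred": date(2017, 9, 19)},
]
train_val, test = split_by_date(sample)
```

Splitting by time rather than at random keeps the test period strictly after the training period, which mirrors how a predictive model would be used on future incidents.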

The 26 variables in the raw dataset have been summarized in the table below:

Variable Name | Description
DR Number | Division of Records Number: Official file number made up of a 2-digit year, area ID, and 5 digits
Date Reported | MM/DD/YYYY
Date Occurred | MM/DD/YYYY
Time Occurred | In 24-hour military time.
Area ID | The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21.
Area Name | The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for.
Reporting District | A four-digit code that represents a sub-area within a Geographic Area. All crime records reference the "RD" that it occurred in for statistical comparisons.
Crime Code | Indicates the crime committed. (Same as Crime Code 1)
Crime Code Description | Defines the Crime Code provided.
MO Codes | Modus Operandi: Activities associated with the suspect in commission of the crime.
Victim Age | Two-character numeric
Victim Sex | F - Female; M - Male; X - Unknown
Victim Descent | Descent Code: A - Other Asian; B - Black; C - Chinese; D - Cambodian; F - Filipino; G - Guamanian; H - Hispanic/Latin/Mexican; I - American Indian/Alaskan Native; J - Japanese; K - Korean; L - Laotian; O - Other; P - Pacific Islander; S - Samoan; U - Hawaiian; V - Vietnamese; W - White; X - Unknown; Z - Asian Indian
Premise Code | The type of structure, vehicle, or location where the crime took place.
Premise Description | Defines the Premise Code provided.
Weapon Used Code | The type of weapon used in the crime.
Weapon Description | Defines the Weapon Used Code provided.
Status Code | Status of the case. (IC is the default)
Status Description | Defines the Status Code provided.
Crime Code 1 | Indicates the crime committed. Crime Code 1 is the primary and most serious one. Crime Code 2, 3, and 4 have decreasing severity. Lower crime class numbers are more serious.
Crime Code 2 | May contain a code for an additional crime, less serious than Crime Code 1.
Crime Code 3 | May contain a code for an additional crime, less serious than Crime Code 1.
Crime Code 4 | May contain a code for an additional crime, less serious than Crime Code 1.
Address | Street address of crime incident rounded to the nearest hundred blocks to maintain anonymity.
Cross Street | Cross Street of rounded Address.
Location | The location where the crime incident occurred. Actual address is omitted for confidentiality. XY coordinates reflect the nearest 100 blocks.
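As an illustration of how some of the raw fields in the table might be parsed before analysis, the sketch below handles the DR Number and the date and time of occurrence. It rests on assumptions not stated in the table: that the area ID inside the DR Number is zero-padded to two digits (giving nine digits in total), that all years fall in the 2000s, and that Time Occurred may lose its leading zeros when stored as a number. The function names are hypothetical.

```python
from datetime import datetime


def parse_dr_number(dr):
    """Split a DR Number into (year, area_id, sequence).

    Assumed layout: 2-digit year + 2-digit zero-padded area ID + 5 digits.
    """
    text = str(dr)
    if len(text) != 9 or not text.isdigit():
        raise ValueError(f"unexpected DR Number format: {dr!r}")
    year = 2000 + int(text[:2])  # assumption: the data starts in 2010
    area_id = int(text[2:4])
    sequence = int(text[4:])
    return year, area_id, sequence


def parse_occurrence(date_text, time_value):
    """Combine 'MM/DD/YYYY' and 24-hour military time into one datetime.

    The time is zero-padded to four digits, so 30 is read as 00:30.
    """
    hhmm = str(time_value).zfill(4)
    return datetime.strptime(f"{date_text} {hhmm}", "%m/%d/%Y %H%M")


dr_parts = parse_dr_number("101504321")
occurred = parse_occurrence("09/19/2017", 30)
```

Combining the date and time into a single timestamp makes it straightforward to derive the hour, weekday, month and year used by the exploratory views.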

Visualization Deliverables

The deliverables may be classified into two main parts: an exploratory section and a predictive section.

Exploratory Section

In accordance with Shneiderman’s ‘Overview first, zoom and filter, then details-on-demand’, the exploratory section will be delivered with interactive capabilities, allowing the user to visualize data at various levels and across different dimensions.

Overview based on the 5 exploratory variables

Different angles of visualization will be provided. These could include aggregate plots (e.g. radar charts) of variables over a fixed window period (e.g. within a day, month or year), to better understand how the frequency of crime occurrences varies with each variable value.

Time-based small multiples of choropleth map

The use of the small multiples format for the choropleth map, drawn according to the Los Angeles community areas, will allow the user to compare crime occurrence density across community areas and time periods (e.g. across years). A filtering feature will also be built in, allowing the user to filter crime occurrence density by the victim descent and crime description of interest. This will allow the user to observe any crime occurrence trends within each community area for each variable included in the filter, as well as any trends across time.[11]
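The aggregation behind such small multiples could look roughly like the sketch below: one count per (year, area) pair, with optional filters on victim descent and crime description. The record layout and field names are hypothetical, and the sketch covers only the counting step, not the map drawing.

```python
from collections import Counter


def counts_by_year_and_area(records, descent=None, crime=None):
    """Count incidents per (year, area_name), optionally filtered.

    Each year then corresponds to one panel of the small multiples, and each
    area's count drives the shading of that area in the panel.
    """
    counts = Counter()
    for rec in records:
        if descent is not None and rec["victim_descent"] != descent:
            continue
        if crime is not None and rec["crime_description"] != crime:
            continue
        counts[(rec["year"], rec["area_name"])] += 1
    return counts


# Made-up records for illustration only.
records = [
    {"year": 2015, "area_name": "Central", "victim_descent": "H", "crime_description": "ROBBERY"},
    {"year": 2015, "area_name": "Central", "victim_descent": "W", "crime_description": "ROBBERY"},
    {"year": 2016, "area_name": "Central", "victim_descent": "H", "crime_description": "BURGLARY"},
    {"year": 2016, "area_name": "Hollywood", "victim_descent": "H", "crime_description": "ROBBERY"},
]
all_counts = counts_by_year_and_area(records)
robbery_h = counts_by_year_and_area(records, descent="H", crime="ROBBERY")
```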

Radar chart

The radar charts serve to provide an additional level of detail that complements the choropleth small multiples and the time aggregate plots. These plots take the same set of years or month-years selected by the user and can thus be compared with each other to look for possible trends in the data. With the radar chart, the user can compare multiple variables, by different crime types and by specific category and crime type.

Calendar plot

The calendar plot gives a view of the crime distribution in the city of LA over the span of a day, by hour. The calendar plot will be linked to a choropleth map: clicking a cell on the calendar plot will display the distribution of crime occurrence across all community areas for that cell.
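One way to picture the link between the calendar plot and the choropleth is an index from each calendar cell to per-area counts, sketched below. The grid layout (weekday by hour) is an assumption, as the proposal does not fix it, and the field names are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def build_cell_index(records):
    """Map each (weekday, hour) cell to a Counter of incidents per area.

    weekday follows datetime.weekday(): Monday is 0 and Sunday is 6.
    """
    index = defaultdict(Counter)
    for rec in records:
        when = rec["occurred"]
        index[(when.weekday(), when.hour)][rec["area_name"]] += 1
    return index


def areas_for_cell(index, weekday, hour):
    """Return the per-area counts a click on one calendar cell would show."""
    return dict(index.get((weekday, hour), Counter()))


# Made-up records; 18 September 2017 was a Monday.
records = [
    {"occurred": datetime(2017, 9, 18, 23, 5), "area_name": "Central"},
    {"occurred": datetime(2017, 9, 18, 23, 40), "area_name": "Hollywood"},
    {"occurred": datetime(2017, 9, 18, 23, 55), "area_name": "Central"},
    {"occurred": datetime(2017, 9, 19, 8, 0), "area_name": "Central"},
]
index = build_cell_index(records)
clicked = areas_for_cell(index, weekday=0, hour=23)
```

The total of each cell's counter gives the value shown in the calendar grid, while the counter itself supplies the values for the linked map.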

Network graphs

The use of a multimodal network graph will allow us to easily observe associations among the features of each crime occurrence, by observing which antecedent and consequent nodes (representing different features of the crime) are joined by rule nodes. These will add to the basic understanding of the landscape of crimes in Los Angeles depicted by the aggregate plots, and allow for the identification of areas where further investigation might be required.
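For readers unfamiliar with association rules, the sketch below computes the three standard measures (support, confidence and lift) for a single rule linking an antecedent to a consequent. It is a plain illustration of the definitions, not the mining procedure the project uses, and the "feature=value" items are made up.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    support    = share of transactions containing both sides of the rule
    confidence = share of antecedent transactions that also contain the consequent
    lift       = confidence divided by the overall share of the consequent
    """
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n
    confidence = n_both / n_ante if n_ante else 0.0
    lift = confidence / (n_cons / n) if n_cons else 0.0
    return support, confidence, lift


# Each crime record becomes a set of "feature=value" items (illustrative only).
transactions = [
    {"premise=STREET", "weapon=HANDGUN", "crime=ROBBERY"},
    {"premise=STREET", "weapon=HANDGUN", "crime=ROBBERY"},
    {"premise=STREET", "weapon=NONE", "crime=THEFT"},
    {"premise=DWELLING", "weapon=NONE", "crime=BURGLARY"},
]
support, confidence, lift = rule_metrics(
    transactions, {"weapon=HANDGUN"}, {"crime=ROBBERY"}
)
```

In the network graph, each such rule would appear as a rule node connecting its antecedent items to its consequent items, with measures like lift available for sizing or colouring the node.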

Predictive Section

The predictive section involves predicting the likelihood of becoming a victim of crime based on the values of a select few variables from the dataset. This will allow users of the application to identify any traits or conditions that are favourable to the occurrence of crime.

Analytical and Visualization Packages

The following is a tentative list of packages* in R that are relevant to the scope of the project:

ggplot2

ggplot2 is a well-established graphical package that provides a more systematic means of plotting graphs via leveraging on the grammar of graphics. Given the extensive involvement of visualization, this will be the central package used across the various parts of the project.

ggmap

ggmap, which is a separate package but builds upon the layering structure established in ggplot2, will give our application the functionality to display the underlying map information, and allow us to access the Google Maps server.

tmap

Using a layering structure similar to that of ggplot2, the tmap package offers effective visualisation of choropleth maps, and affords the end-user more detail when viewing the plot through its ability to incorporate elements commonly required in maps (e.g. compass points).

radarchart

Built on the Chart.js JavaScript library, the radarchart package plots aesthetically appealing interactive radar charts with tooltips, which will allow the end-user of the dashboard to obtain more information about each point in the radar chart on demand by hovering over the point of interest.

SpatialEpi

As Empirical Bayes estimation will also be applied in our project to visualise the relative risk of falling prey to crime in different areas, the Empirical Bayes function in the SpatialEpi package will be used in the calculation.
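To convey the idea of empirical Bayes smoothing, the sketch below applies Marshall's global method-of-moments estimator to raw area rates. This is a swapped-in illustration and not the SpatialEpi implementation, and it assumes a population figure for each area, which the LAPD dataset itself does not provide. All numbers are made up.

```python
def marshall_eb_rates(counts, populations):
    """Shrink raw area rates toward the overall rate (Marshall's global estimator).

    Areas with small populations have noisy raw rates and are pulled strongly
    toward the overall rate; large areas keep values close to their raw rates.
    """
    n_areas = len(counts)
    total_pop = sum(populations)
    overall = sum(counts) / total_pop
    raw = [c / p for c, p in zip(counts, populations)]

    # Population-weighted variance of the raw rates around the overall rate.
    weighted_var = sum(
        p * (r - overall) ** 2 for r, p in zip(raw, populations)
    ) / total_pop
    mean_pop = total_pop / n_areas

    # Estimated between-area variance, floored at zero.
    between = max(weighted_var - overall / mean_pop, 0.0)

    smoothed = []
    for r, p in zip(raw, populations):
        shrink = between / (between + overall / p) if between > 0 else 0.0
        smoothed.append(overall + shrink * (r - overall))
    return smoothed


# Hypothetical crime counts and resident populations for three areas.
counts = [5, 200, 90]
populations = [100, 10000, 3000]
smoothed = marshall_eb_rates(counts, populations)
```

In this made-up example the first area has a high raw rate based on very few residents, so its smoothed rate moves most of the way toward the overall rate, which prevents sparsely populated areas from dominating a risk map.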

visNetwork

As the association rules mined will be visualised through a network graph, visNetwork will be used to allow the user to manipulate the graph by manually dragging nodes and by zooming in to view a node of interest. As the network is multimodal, with antecedent, consequent and rule nodes all in one graph, the package's ability to plot different symbols for the different modes will enhance the user experience by preventing confusion between the types of nodes.

ggiraph

As an interactive wrapper over the ggplot2 package, ggiraph provides interactivity to otherwise static ggplot2 plots, allowing the interactive filtering of multiple plots at the same time for a better user experience.

Shiny

The visualizations will be built as a web application using Shiny that will thus allow for user interactivity. The interactive features include sliders, dropdown menus, date range inputs and zoom-ins.

xgboost/caret

The caret package was used in conjunction with the xgboost package to construct a predictive model for the probability of crime occurrence based on the extreme gradient boosting (xgboost) algorithm. This was chosen due to the ability of the model to handle categorical variables (in which our dataset was rich). Additionally, compared to the traditional gradient boosting algorithm (package 'gbm' in R), xgboost affords relatively more 'efficient and scalable' performance. Results were visualised using ggplot2.

plotly

Plotly was used to wrap ggmap in the application to provide tooltip information, since the base hover/tooltip option of Shiny's plot rendering function currently mainly supports base R graphics, and is not fully compatible with grid-based graphics like ggplot2 and tmap.



*Note: This list will be updated accordingly as the project progresses, depending on the suitability and extent of use of various packages in practical implementation.

References

[1] https://www.police.gov.sg/news-and-publications/statistics
[2] https://data.gov.sg/dataset/victims-of-selected-major-selected-offences
[3] https://data.gov.sg/dataset/islandwide-cases-recorded-for-selected-major-offences
[4] https://data.gov.sg/dataset/overall-crime-cases-crime-rate
[5] http://www.straitstimes.com/singapore/courts-crime/spike-in-online-scams-but-overall-crime-rate-still-low
[6] http://www.channelnewsasia.com/news/singapore/crime-rate-down-in-2016-but-online-scams-remain-a-concern-7623920
[7] https://www.hometeam.sg/article.aspx?news_sid=20160212zTCp2vhJHNa0
[8] https://ucr.fbi.gov/crime-in-the-u.s/2011/crime-in-the-u.s.-2011/offenses-known-to-law-enforcement/standard-links/region
[9] http://www.ncpc.org/resources/enhancement-assets/charts-and-graphs/uscrimestatistics010708.jpg/view
[10] http://www.huffingtonpost.com/brian-beltz/crime-at-the-top-100-colleges-in-the-us_b_6432864.html
[11] http://www.juiceanalytics.com/writing/better-know-visualization-small-multiples/