Group18 Report

From Visual Analytics and Applications
Revision as of 23:35, 13 August 2018 by Vigneshwarv.2017 (talk | contribs)

A sanctuary for women – Is there one?



Motivation

Women in India face challenges to living safely from the time of birth. Cultural differences and peculiarities, male domination, a skewed sex ratio, age-old customs such as Sati and dowry, and the lower status women hold in society leave them at higher risk of becoming victims of violence. The myth has spread across the world that India is unsafe for women and for travel; through our research we analyse the situation and the crime hotspots at the state and district levels. We aim to provide a holistic perspective on crimes against women in India using time series, geospatial and relationship analytics for better insight into crime occurrence. This research has the following objectives:

a) Create a user-friendly, interactive visualization platform for data exploration and trend analysis of crime patterns against women over the years 2001-2014 at the state and district levels.
b) Visualize the geospatial distribution of the number of incidents and the variation in location quotient at the state and district levels for a specific year and crime type.
c) Analyse the relationship between a specific crime, total crime against women and the influence of social factors across the states of the country.


Review and Critique of Prior Work

Tinniam V Ganesh (2015) developed a Shiny app to visualize crime against women in India for the period 2001 to 2012, using a choropleth map and a linear model to project future incidences of crime in each state. The app allows the user to select the year and crime type for the choropleth map. Similarly, the Open Government Data (OGD) Platform India used a choropleth map to visualize total crime against women in each state for the year 2013. In both cases the areas of the choropleth map were shaded in proportion to the absolute count of crimes in each state for the selected year. Shading by absolute counts obscures the truth about crime incidence in a region with respect to its actual population: a state with a smaller population can be expected to record fewer crimes. Using the proportion of crime in a state relative to its population gives more intuitive information for understanding crime patterns across the different states of India. Our work tries to overcome this issue by using a cartogram plot and a location quotient, which considers the relative proportion of crime to population in each state, to shade the areas in the map.


Data Cleaning, Preparation and Modeling

The Indian women crime dataset was obtained from the official website of the National Crime Records Bureau (NCRB), Govt of India. The crime data is supplemented with the Indian census data of 2011 to provide a holistic view of the relationship between crime and social factors at district level across the country. The shapefiles for India were taken from the GADM website, using administrative layers 1 and 2 for states and districts respectively.

The complete dataset consists of 8629 observations of crimes against women recorded in 29 states and 7 union territories, across 640 districts in India, between 2001 and 2014. The seven types of crime against women that were recorded uniformly across all the years were used to prepare the full dataset. The Indian census data was used to obtain six new variables, namely Population, Male Population, Female Population, Literacy, Male Literacy and Female Literacy, across the 640 districts in the country.

Because district names are represented differently in the two datasets, the district names in the crime data were renamed manually before any meaningful analysis or visualization could be carried out. The state and district names were then matched to the respective columns in the crime data to obtain the external factor values for each district. The data was aggregated at the state and district levels in the R Shiny computing environment to suit the needs of the analysis.
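The renaming and matching step can be illustrated with a minimal sketch. The report's own pipeline was built in R; the Python below only shows the idea, and the sample rows, census values, alias tables and function names are all hypothetical.

```python
# Minimal sketch of harmonising district names before joining crime and census data.
# All rows, values and alias entries are illustrative assumptions, not the report's data.

crime_rows = [
    {"state": "ODISHA", "district": "KHORDHA", "year": 2011, "cases": 120},
    {"state": "ORISSA", "district": "Khurda ", "year": 2012, "cases": 135},
]

# Census lookup keyed by (state, district) in canonical lower-case form.
census = {("odisha", "khordha"): {"population": 2000000, "female_literacy": 80.0}}

# Manual rename tables: variant spelling -> canonical spelling.
STATE_ALIASES = {"orissa": "odisha"}
DISTRICT_ALIASES = {("odisha", "khurda"): "khordha"}


def normalise(state, district):
    """Lower-case, collapse whitespace, then apply the manual rename tables."""
    s = " ".join(state.lower().split())
    s = STATE_ALIASES.get(s, s)
    d = " ".join(district.lower().split())
    d = DISTRICT_ALIASES.get((s, d), d)
    return s, d


def join_crime_with_census(rows, census_lookup):
    """Attach census variables to each crime row; collect keys that fail to match."""
    joined, unmatched = [], []
    for row in rows:
        key = normalise(row["state"], row["district"])
        extra = census_lookup.get(key)
        if extra is None:
            unmatched.append(key)
            continue
        joined.append({**row, "state": key[0], "district": key[1], **extra})
    return joined, unmatched


joined, unmatched = join_crime_with_census(crime_rows, census)
```

Keeping the list of unmatched keys makes it easy to see which names still need an entry in the rename tables.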


Design Framework

In designing the framework and visualizations, we followed an iterative process of design, development and visualization, adding granular detail and interactive features so that the application conveys the truth hidden in the data. The application provides an overview, zoom and filter, and details on demand to add value for the end user. The user can work stage by stage through the time series, geospatial and relationship analyses to understand the occurrence of crimes against women in India.

5.1 Time Series Analysis

Time series plots give the user an overview of the historical pattern of crime cases across the different states and allow a drill-down to district level for each selected state, based on the crime type.

5.2 Geo-Spatial Analysis

This section of the analysis focuses on the distribution of crime cases across the states and districts of India through choropleth and cartogram visualizations. The GADM shapefile for India was used to plot the boundaries of the country using two layers: administrative layer 1 for states and layer 2 for districts. State names that did not match were recoded in R according to the shapefile so that the data could be joined accurately by state name.

As absolute crime counts do not show how a region's crime occurrence compares with the country as a whole, a new layer parameter named Location Quotient has been derived. The location quotient (LQ) quantifies how concentrated a particular crime is in a region compared to the nation. The LQ accounts for both the crime occurrence in the region relative to the nation and the population of the region relative to the nation, giving a statistical measure on a relative scale. LQ for each state for a specific crime is derived using the below formula:
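The formula itself is not present in this text version of the report. A reconstruction consistent with the description above, assuming the usual location quotient definition, is given here; the symbols $x_{s,c}$, $X_c$, $p_s$ and $P$ are notation introduced for this sketch, not taken from the report.

```latex
\mathrm{LQ}_{s,c} \;=\; \frac{x_{s,c} \,/\, X_c}{p_s \,/\, P}
% x_{s,c} : number of cases of crime type c recorded in state s
% X_c     : number of cases of crime type c recorded in the whole nation
% p_s     : population of state s
% P       : population of the nation
```

Under this definition, a value above 1 means the state carries a larger share of the nation's cases of that crime than its share of the population, and a value below 1 means a smaller share.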

In the same way as the state-level LQ, the parameter is derived at district level using the below formula, to give a clearer picture of crime occurrence when comparing the districts within a state.
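As a worked illustration, the location quotient can be computed as a ratio of two shares. This is a minimal Python sketch with hypothetical figures; the function name is an assumption, and whether the district-level reference region is the state or the nation is not recoverable from this text, so the reference totals are left as parameters.

```python
def location_quotient(local_cases, ref_cases, local_pop, ref_pop):
    """Share of the reference region's cases divided by share of its population.

    Returns None when a denominator is zero, since the ratio is undefined.
    """
    if ref_cases == 0 or local_pop == 0 or ref_pop == 0:
        return None
    return (local_cases / ref_cases) / (local_pop / ref_pop)


# Hypothetical states: name -> (cases of one crime type, population).
states = {
    "A": (300, 1_000_000),
    "B": (100, 1_000_000),
    "C": (200, 2_000_000),
}

national_cases = sum(cases for cases, _ in states.values())
national_pop = sum(pop for _, pop in states.values())

state_lq = {
    name: location_quotient(cases, national_cases, pop, national_pop)
    for name, (cases, pop) in states.items()
}
```

In this made-up example state A holds half of the cases but a quarter of the population, so its LQ is above 1, while B and C fall below 1 even though C records more cases than B in absolute terms.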

The geospatial plots represent both the absolute crime values and the location quotient values on the map of India, as described in detail in the sections below. Using the sf R package, the Indian spatial file was read with the st_read function and imported into R as a simple feature data frame.

5.1.1. Geofacet and Facets graphs

The geofacet package was used to show the changing trend in the number of crime cases at state level over the 14-year period. It arranges the facet for each state in its respective geographical position. By selecting a particular crime type and y-axis scaling type, the user can see the crime case trends for all states in a single view. Furthermore, the user can select a particular state to view its trend in detail in a separate interactive plot, which displays the data labels on hover.

For crime patterns at district level, geofacet did not have any pre-defined district facet grid for Indian states. The alternative was to customize the district facet positions for each state manually, which was very tedious as there were 640 districts in total to arrange. Therefore, we used the ggplot2 package to place the district facets adjacent to each other to show historical crime trends. The user can select the state and the crime type of interest to view the district facets with their respective trends.

5.1.2. Slope graph

Slope graphs were used to show the change over time between two fixed years, in our case 2004 and 2014. The graph takes the crime type selected by the user and shows whether there is a growing or declining trend between the two years for each state. The ggplot2 package was used as it provides slope graphs that are customizable in terms of aesthetics and easy to use. The data used for this plot had a different range of values for each state, and the labels of some states cluttered towards the bottom as they shared similar values. To avoid this problem, the data was transformed to a log scale before plotting with ggplot2, which compressed the range of values and spaced out the state name labels on the y-axis. A summary table is provided alongside the slope graph so that the user can obtain the actual number of crime occurrences in 2004 and 2014 for each reporting state.
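The effect of the log transformation on label spacing can be shown with a small sketch. The counts below are hypothetical and the helper names are assumptions; the report performed this step in R before plotting with ggplot2.

```python
import math

# Hypothetical case counts for one crime type in one year: two small states, one large.
counts = {"A": 12, "B": 15, "C": 9000}


def log_scale(values):
    """Base-10 log of each positive count; zero counts are dropped (log undefined)."""
    return {k: math.log10(v) for k, v in values.items() if v > 0}


def relative_gap(values, first, second):
    """Gap between two entries as a fraction of the full plotted range."""
    span = max(values.values()) - min(values.values())
    return abs(values[first] - values[second]) / span


raw_gap = relative_gap(counts, "A", "B")             # share of the axis on the raw scale
log_gap = relative_gap(log_scale(counts), "A", "B")  # share of the axis on the log scale
```

On the raw scale the two small states sit almost on top of each other because the large state stretches the axis; on the log scale they take up a visibly larger share of it, while the ordering of the states is unchanged.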


5.2.1. Choropleth Plot

A choropleth plot is a thematic map in which regions are shaded in proportion to the statistical variable displayed on the map. In our case, we used the absolute crime values and the location quotient as the parameters of two district-level choropleth plots placed adjacent to each other. This makes it possible to view the absolute measure of crime occurrence and the relative LQ measure for a region side by side.

The Indian district-level data is merged with the crime data using the state and district as join keys. Both keys are required because district names are unique only within each state. For the location quotient, manual binning into 5 ranges was done to distinguish regions with location quotients below and above 1. The maximum and minimum values were used to create two bins before and after the location quotient value of 1. Using the tmap package, the choropleth for absolute crime values was created with five bins using the quantile style to shade the regions, whereas for LQ the breaks were set manually to create a better visualization.
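The two binning approaches can be sketched as follows. The report does not list its actual break points, so the midpoints used for the LQ breaks here are an assumption, as are the function names and sample values; the quantile helper is only a Python analogue of tmap's quantile style, not the tmap implementation.

```python
import bisect
import statistics


def quantile_breaks(values, n_bins=5):
    """Break points that split the values into n_bins classes of roughly equal size."""
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    return [min(values)] + cuts + [max(values)]


def lq_breaks(lq_values):
    """Manual breaks pivoting on 1: two bins below and two above (assumed midpoints)."""
    lo, hi = min(lq_values), max(lq_values)
    if not lo < 1.0 < hi:
        raise ValueError("LQ values must fall on both sides of 1 for these breaks")
    return [lo, (lo + 1.0) / 2, 1.0, (1.0 + hi) / 2, hi]


def bin_index(value, breaks):
    """Index of the class containing value; the top break is treated as inclusive."""
    i = bisect.bisect_right(breaks, value) - 1
    return min(max(i, 0), len(breaks) - 2)


# Hypothetical district LQ values.
lq_values = [0.2, 0.7, 0.95, 1.3, 2.6]
breaks = lq_breaks(lq_values)
classes = [bin_index(v, breaks) for v in lq_values]
```

Fixing a break at exactly 1 keeps districts with a below-average concentration and those with an above-average concentration in separate colour classes, which a purely quantile-based scheme would not guarantee.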

Both maps were rendered using the tmap and leaflet packages in R, which add interactivity to the dashboard such as zooming in and out and pop-ups showing the crime occurrence and LQ values when hovering over a district. By selecting a particular state, crime type and year, the user can see the geospatial distribution of crime occurrences for all the districts in the state, for both the absolute and LQ measures. A choropleth plot was also created for the state-level LQ values on the map of India for the user-selected crime type and year. These maps help to determine the hotspots of crime against women for a specific crime type at the state and national levels.


5.2.2. Cartogram

The cartogram package in R was selected because a cartogram is a unique type of map that combines statistical information with geographic location. An area cartogram uses a measurable variable to resize each place's area accordingly. Cartogram visualizations are commonly used to portray geographic or social data such as the human population of the countries of the world.

In our research, the choropleth plot shows crime occurrences through shading at district level for absolute and LQ values; the cartogram adds a visualization that takes both geographical location and crime occurrence into account. The shapefile is read with the readOGR function from the rgdal package to create a spatial data frame object. It is merged with the crime data and converted into a SpatialPolygonsDataFrame for map projection using the spTransform function in R. Based on the user-selected crime type and year, the cartogram is plotted at state level to represent the absolute crime values on the map of India. Different variations of the cartogram, namely continuous, non-contiguous and non-overlapping circles (Dorling), were created using the cartogram package with the tmap package.

Because of the long processing time of the other cartogram variations, the Shiny app was built using the Dorling cartogram. The plot shows the crime occurrences in a state through the size of the circle at the corresponding geographical location. The cartogram for the absolute crime values and the choropleth for the location quotient provide two different kinds of geospatial plot at state level. The two plots are placed next to each other and synchronized using the sync function in the mapview package, so that the user can hover over a specific location and derive insights from both the absolute crime occurrence and the location quotient values.


Of the three variations that can be developed in R, the continuous cartogram is formed by specifying a number of iterations, which makes it slower to process and render. Since the Shiny web application accepts many combinations of user inputs such as crime type and year, the Dorling circle cartogram was chosen to avoid a performance delay each time the visualization is rendered.
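The sizing rule behind a circle cartogram can be sketched in a few lines: the circle's area, not its radius, is made proportional to the value. The Python below covers only this sizing step with hypothetical counts and an assumed function name; a full Dorling layout also moves the circles so that they do not overlap, which is not shown, and the R cartogram package may scale the circles differently.

```python
import math


def circle_radii(values, max_radius=1.0):
    """Radii such that circle area is proportional to each value.

    The largest value gets max_radius; other radii scale with the square root
    of the value's share of the largest value.
    """
    largest = max(values.values())
    return {k: max_radius * math.sqrt(v / largest) for k, v in values.items()}


# Hypothetical absolute crime counts per state.
radii = circle_radii({"A": 400, "B": 100, "C": 0})
```

A state with a quarter of the cases therefore gets half the radius, so that the areas the reader perceives stay in proportion to the counts.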


5.3 Relationships Analysis

5.3.1. Funnel Plots

The funnel plot allows the user to detect variation in crime incidence across the states for a selected crime type. It is a statistical method, in the form of a scatter plot, for examining the pattern of a crime type against the total crime level in each state. The Funnel R package was used to produce a funnel plot of each crime type against total crime for each state. It takes the number of crimes of a particular type and the total number of crimes in each state as input, evaluates the z-score and plots the 80% and 95% confidence limits. A new variable, the total number of crimes against women in each state, was created using the dplyr package. As the data for some states was found to be skewed, it was transformed to a log scale before plotting. ggplot2 was then used to improve the aesthetic elements of the funnel plot, such as labelling the states above the upper bound and repelling the label text to prevent overlap.


The user interface allows the user to select the year and crime type of interest and to identify extreme outliers, namely the states that lie above the upper 95% confidence limit.
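The outlier rule described above can be sketched with control limits around the pooled proportion. This Python version uses a normal approximation to the binomial, which is an assumption; the R package used in the report may compute its limits differently, and the log transformation mentioned above is not applied here. The data and function names are hypothetical.

```python
import math
from statistics import NormalDist


def funnel_limits(p0, n, level):
    """Lower and upper limits for a proportion at sample size n (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = math.sqrt(p0 * (1 - p0) / n)
    return max(0.0, p0 - z * se), min(1.0, p0 + z * se)


def states_above_upper_limit(data, level=0.95):
    """Return the pooled proportion and the states above the upper limit.

    data maps state -> (cases of the selected crime type, total cases in the state).
    """
    pooled = sum(c for c, _ in data.values()) / sum(t for _, t in data.values())
    flagged = []
    for state, (cases, total) in data.items():
        _, upper = funnel_limits(pooled, total, level)
        if cases / total > upper:
            flagged.append(state)
    return pooled, flagged


# Hypothetical counts: (selected crime type, total crimes against women).
sample = {"A": (300, 1000), "B": (200, 1000), "C": (195, 1000), "D": (205, 1000)}
pooled, flagged = states_above_upper_limit(sample, level=0.95)
```

Because the limits narrow as the total number of cases grows, the same proportion can be unremarkable in a state with few recorded crimes and an outlier in a state with many.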

5.3.2. Choropleth Plot

A choropleth plot was used in addition to the funnel plots to illustrate the funnel plot findings in geospatial form. The geospatial representation allows the user to view the outliers above the 95% confidence limit, plotted and shaded in their respective geographical positions according to their intensity. This choropleth helps the user to look for patterns or influences among the outliers identified for each crime type across the whole nation.


5.3.3. Clustered Heat Maps

A clustered heat map shows variance across multiple variables, revealing patterns, showing whether any variables are similar to each other and detecting whether any correlations exist between them. It was used to identify correlations between variables such as rape, kidnapping, domestic violence, dowry death, total crime, literacy rate and sex ratio across the different states in India. The heatmaply package offers a user-friendly interactive clustered heat map with a tooltip that displays values when hovering over cells, as well as the ability to zoom in to specific sections of the figure from the data matrix, the side dendrograms or the annotated labels. The user can calibrate the clustered heat map by selecting the year, the number of clusters, the type of data transformation and the hierarchical clustering algorithm, to visualize how external factors contribute to crime occurrences across the different states. The plot also allows the user to view the clusters formed among states with similar patterns in terms of crime type and external factors for a selected year. A summary table is provided so that the user can look up the actual value of each crime type and external factor used in this analysis.
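Two of the ingredients behind such a heat map, a data transformation and the pairwise correlation between variables, can be sketched as follows. This is not the heatmaply implementation and omits the hierarchical clustering step; the table values, variable names and helper names are hypothetical.

```python
import math


def min_max(column):
    """Rescale a column to the 0-1 range; a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


# Hypothetical values for four states (one list entry per state).
table = {
    "rape": [10, 20, 30, 40],
    "dowry_death": [1, 2, 3, 5],
    "literacy": [80, 70, 60, 50],
}

scaled = {name: min_max(col) for name, col in table.items()}
corr = {
    (a, b): pearson(scaled[a], scaled[b])
    for a in scaled
    for b in scaled
}
```

Rescaling puts counts and percentages on a comparable range for colouring the cells; the Pearson correlation itself is unaffected by this linear rescaling.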


Future Scope

  • The continuous cartogram can be implemented at state and district level with better performance and interactive features.
  • Additional socio-economic factors can be incorporated to give a wider view of the influence of external factors on crime against women.
  • With the availability of more recent NCRB data for 2015-2017, a deeper understanding of the recent situation can be inferred.