Difference between revisions of "Group 3 Report"

From Visual Analytics and Applications

Revision as of 22:56, 2 December 2017

Background

Unbalanced economic development has been a long-standing issue in China. Benefiting from their geographic location as well as the national policies introduced in the 1980s, the east coast areas of China, especially Shanghai, Zhejiang and Jiangsu, have grown at a remarkable pace over the last few decades. Economic growth in east China shows a geographic radiation pattern, and the contributors to GDP differ from area to area.
In this project, we use regression models to study the different GDP indicators in east China. We use R to build an interactive application so that users can easily explore the economic contributors they are interested in.

Data description

The data set we are using consists of 2 parts:
1. GDP and indicators data
We have downloaded statistical data for 78 regions in China. The variables include GDP volume (total GDP and GDP for each industry) and more than 20 variables that we think might influence GDP volume.
2. Shape files
The shapefile of Chinese regions (CHN_adm_shp) is available from ESRI (an organization providing geographic information systems). The shapefile includes 3 levels. In our project, we use the 2nd level of the shapefile ("prefecture-level city").

Analysis flow

In our research, both a linear regression model and a geographically weighted regression model are used to analyze the effect of each indicator.
The analysis flow consists of 3 parts:
1. Data exploration and variable selection
This includes a variable correlation and distribution matrix, which enables users to exclude highly correlated variables from the regression model. A parallel coordinates chart is also provided to give users a general impression of these correlations.
2. Modelling and visualization
In the output of the regression model, we display the parameter estimate of each variable as well as its significance level, which is determined from the p-value.
3. Data analysis
We will use the interface to analyze the different effects of the selected indicators.
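The variable-selection step can be illustrated with a small sketch. It is written in Python purely for illustration (the application itself is built in R), the indicator names and values are hypothetical, and the 0.8 cut-off for "highly correlated" is an assumption, not a value taken from the project.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair of columns with |r| >= threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged

# Hypothetical indicator values for five regions (illustrative only).
indicators = {
    "population": [1.0, 2.0, 3.0, 4.0, 5.0],
    "retail_sales": [2.1, 3.9, 6.2, 8.0, 9.9],
    "rainfall": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = highly_correlated_pairs(indicators)
for a, b, r in flagged:
    print(f"{a} and {b} are highly correlated (r = {r:.3f})")
```

A pair flagged this way is a candidate for dropping one of its two variables before fitting the regression model.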

Tools used

The application is built in the R language. R's main advantage is its analytical capability, which will be especially beneficial for future model evaluation. R also provides an interactive application framework that helps visualize the analysis process.
Listed below are the R packages we used in building the application:
rgdal package
rgdal is a widely used package for reading shapefiles. The shapefile we are using is “CHN_adm_shp”, which is provided by ESRI. The selected level is “CHN_adm2”, meaning prefecture-level region.

rmapshaper package
The ms_simplify function allows us to simplify the outlines in the shapefile. The simplified shapefile greatly improves processing efficiency when we build statistical maps.

GWmodel package
GWmodel is an R package for spatial data analysis. It provides GW summary statistics, GW principal components analysis, GW discriminant analysis and various GW regression models [--R GW model documentation]. Its advantage is that it combines parameter estimates with adjusted test results, which give p-values as significance indicators for the model.

leaflet package
This is the R version of leaflet.js, an interactive map interface. It uses htmlwidgets and enables users to adjust and interact with the map objects.

shiny package
The shiny package provides a reactive web application framework so that users can manipulate the data and focus on the analysis.

Model Explanation

1. Traditional linear regression

When applying the traditional linear regression model, each location is treated as an observation. The regression line is fitted by least squares, that is, by minimizing the sum of squared residuals, and the model returns an intercept and a coefficient for each variable.
When running an ordinary linear regression model, we should keep these points in mind:
(1) Only one global regression model is returned for the whole area, meaning that every region shares the same parameter estimates. This is not very reliable when analyzing an area with different economic patterns.
(2) Linear regression fails to take the mutual effects of neighboring places into consideration. Each observation is treated equally. In real cases, however, the mutual effect between neighboring regions is one of the major contributors to the local economy, so we should also account for this effect in our model.
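As a minimal sketch of what "one global model" means, the following Python snippet fits a single straight line to all observations by least squares. It is illustrative only: the data values are hypothetical, only one indicator is used, and the application itself relies on R rather than this code.

```python
def fit_ols(xs, ys):
    """Fit y = intercept + slope * x by minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Quantity that least squares minimizes."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: one indicator (x) and GDP (y) for four regions.
indicator = [1.0, 2.0, 3.0, 4.0]
gdp = [3.1, 4.9, 7.2, 8.8]

# A single (intercept, slope) pair is shared by every region.
intercept, slope = fit_ols(indicator, gdp)
print(intercept, slope)
```

Every region is predicted with the same intercept and slope, which is exactly the limitation described in point (1).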

2. Geographically weighted regression:

Weighted regression is a methodology that brings the observations near each city into model building. For city A, for example, each nearby area is given a weight, and all of these selected observations are used to build the individual model for city A.
Geographically weighted regression is a methodology for explaining geospatial statistical patterns. As its name suggests, it is a sub-category of weighted regression. Unlike weighted linear regression (where the distance is based on the x value), the “distance” in geographically weighted regression is the actual geographic distance.
When running the geographically weighted regression model, in addition to a global regression model (as in linear regression), an individual regression model is built for each observation. The individual model is based on the observation and its neighboring areas within the “distance”, and each neighboring area is weighted according to how far it is from the observation.
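The idea of one individual model per observation can be sketched as follows. This is an illustrative Python sketch, not the GWmodel implementation: it assumes a single predictor, planar coordinates with Euclidean distance, a Gaussian weighting of the form exp(-0.5 (d/b)^2), and hypothetical data in which the west and east clusters follow different relationships.

```python
import math

def gaussian_weight(d, b):
    """Weight for an observation at distance d, given bandwidth b (assumed form)."""
    return math.exp(-0.5 * (d / b) ** 2)

def fit_wls(xs, ys, ws):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / sw
    mean_y = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_local_models(coords, xs, ys, bandwidth):
    """Fit one local model per observation; returns a list of (intercept, slope)."""
    models = []
    for cx, cy in coords:
        ws = [gaussian_weight(math.hypot(cx - px, cy - py), bandwidth)
              for px, py in coords]
        models.append(fit_wls(xs, ys, ws))
    return models

# Hypothetical regions: a west cluster where y = x and an east cluster where y = 3x.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
xs = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]

local_models = fit_local_models(coords, xs, ys, bandwidth=1.0)
for (cx, _), (_, slope) in zip(coords, local_models):
    print(f"location {cx:>4}: local slope = {slope:.3f}")
```

Because distant observations receive almost no weight, the local slopes in the west and east clusters differ, which a single global model cannot express.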
For geographically weighted regression, several parameters control the model:

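(a) Kernel function:
The kernel function is the formula that assigns weights to the neighboring observations.

Gwr.gauss (standard Gaussian kernel form)

$$w(d) = \exp\left(-\frac{1}{2}\left(\frac{d}{b}\right)^{2}\right)$$

("d" is the distance, and "b" is the bandwidth)

The Gaussian weighting function assigns weights to surrounding areas that decay exponentially with the squared distance. This method takes all observations into the model.

Gwr.bisquare (standard bisquare kernel form)

$$w(d) = \begin{cases} \left(1 - \left(\frac{d}{b}\right)^{2}\right)^{2} & \text{if } d < b \\ 0 & \text{otherwise} \end{cases}$$

The bisquare kernel function calculates the weights based on this formula. Unlike the Gaussian function, this method only uses the observations within the bandwidth; observations outside it are excluded from the model.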
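To make the difference between the two kernels concrete, here is a small Python sketch. It assumes the commonly used forms w = exp(-0.5 (d/b)^2) for the Gaussian kernel and w = (1 - (d/b)^2)^2 for d < b (0 otherwise) for the bisquare kernel; the distances and bandwidth are hypothetical values chosen for illustration.

```python
import math

def gaussian_kernel(d, b):
    """Gaussian kernel: every observation gets a positive weight."""
    return math.exp(-0.5 * (d / b) ** 2)

def bisquare_kernel(d, b):
    """Bisquare kernel: observations at or beyond the bandwidth get weight 0."""
    if d >= b:
        return 0.0
    return (1.0 - (d / b) ** 2) ** 2

# Hypothetical distances (e.g. in km) from one city to four neighbors.
distances = [0.0, 50.0, 100.0, 150.0]
bandwidth = 100.0

gauss_weights = [gaussian_kernel(d, bandwidth) for d in distances]
bisquare_weights = [bisquare_kernel(d, bandwidth) for d in distances]

for d, g, q in zip(distances, gauss_weights, bisquare_weights):
    print(f"d = {d:>5}: gaussian = {g:.4f}, bisquare = {q:.4f}")
```

The Gaussian weights shrink with distance but never reach zero, while the bisquare weights drop to exactly zero once the distance reaches the bandwidth.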

(b) Bandwidth type:
The bandwidth decides how many neighboring observations are used when building each model. Normally we use an "adaptive" bandwidth, where the system automatically generates an optimized bandwidth according to the optimization criterion described in (c).
A global bandwidth is also available, which takes all observations into the weighting function.
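One common way to read "adaptive" is that the bandwidth at each location is the distance to its k-th nearest neighbor, so that dense and sparse areas use the same number of neighbors. The Python sketch below illustrates that reading only; it is an assumption about the convention, the coordinates are hypothetical, and k = 2 is an arbitrary illustrative choice.

```python
import math

def adaptive_bandwidth(i, coords, k):
    """Distance from observation i to its k-th nearest other observation."""
    cx, cy = coords[i]
    dists = sorted(
        math.hypot(cx - px, cy - py)
        for j, (px, py) in enumerate(coords)
        if j != i
    )
    return dists[k - 1]

# Hypothetical locations: a dense cluster on the left and an isolated point on the right.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (30.0, 0.0)]

bandwidths = [adaptive_bandwidth(i, coords, k=2) for i in range(len(coords))]
for (cx, _), b in zip(coords, bandwidths):
    print(f"location {cx:>4}: bandwidth covering 2 neighbors = {b}")
```

The isolated location gets a much larger bandwidth than the clustered ones, which is the practical benefit of an adaptive bandwidth over a single fixed distance.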

(c) Bandwidth optimization criteria:
This is the criterion used to optimize bandwidth selection. By default, we choose the bandwidth with the minimum AIC (Akaike information criterion) value. You can also choose cross-validation as the bandwidth selection criterion.
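The cross-validation option can be sketched as a leave-one-out search over candidate bandwidths. This Python sketch is an illustration under stated assumptions, not the procedure used by the application: it assumes a single predictor, a Gaussian kernel, planar coordinates, hypothetical data, and a hand-picked list of candidate bandwidths. The AIC criterion is not shown.

```python
import math

def gaussian_weight(d, b):
    """Assumed Gaussian kernel form."""
    return math.exp(-0.5 * (d / b) ** 2)

def predict_without(i, coords, xs, ys, b):
    """Predict ys[i] from a local weighted fit that leaves observation i out."""
    cx, cy = coords[i]
    rows = [
        (x, y, gaussian_weight(math.hypot(cx - px, cy - py), b))
        for j, ((px, py), x, y) in enumerate(zip(coords, xs, ys))
        if j != i
    ]
    sw = sum(w for _, _, w in rows)
    mean_x = sum(w * x for x, _, w in rows) / sw
    mean_y = sum(w * y for _, y, w in rows) / sw
    sxx = sum(w * (x - mean_x) ** 2 for x, _, w in rows)
    sxy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * xs[i]

def cv_score(coords, xs, ys, b):
    """Sum of squared leave-one-out prediction errors for bandwidth b."""
    return sum(
        (ys[i] - predict_without(i, coords, xs, ys, b)) ** 2
        for i in range(len(ys))
    )

def select_bandwidth(coords, xs, ys, candidates):
    """Pick the candidate bandwidth with the lowest cross-validation score."""
    return min(candidates, key=lambda b: cv_score(coords, xs, ys, b))

# Hypothetical data: west cluster follows y = x, east cluster follows y = 3x.
coords = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0), (30.0, 0.0),
          (200.0, 0.0), (210.0, 0.0), (220.0, 0.0), (230.0, 0.0)]
xs = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 3.0, 4.0, 3.0, 6.0, 9.0, 12.0]

candidates = [15.0, 1000.0]  # illustrative candidates only
best_bandwidth = select_bandwidth(coords, xs, ys, candidates)
print("selected bandwidth:", best_bandwidth)
```

In this toy data the small bandwidth wins, because a very large bandwidth behaves like a single global model and predicts both clusters poorly.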

Test Sample of Regression Modelling Application

At first you may notice several tabs at the top of the interface. The first tab is for uploading raw data. The second is for initial data exploration and variable selection. The model results are generated in the third tab.
The upload tab allows you to upload Chinese statistical data and explore the data structure.

Uploadcsv.png
Csvexplore.png