Difference between revisions of "Group 3 Report"

From Visual Analytics and Applications
Revision as of 23:17, 2 December 2017

Background

Imbalanced economic development has been a long-standing issue in China. Benefiting from their geographic location as well as the national policies introduced in the 1980s, the east coast areas of China, especially Shanghai, Zhejiang and Jiangsu, have grown at an incredible speed during the last few decades. Economic growth in east China shows a geographic radiation pattern, and the contributors to GDP differ from area to area.
In this project, we use regression models to study the different GDP indicators in east China. We use R to build an interactive application so that users can easily explore the economic contributors they are interested in.

Data description

The data set we are using includes 2 parts:
1. GDP and indicator data
We downloaded statistical data for 78 regions in China. The variables include GDP volume (total GDP and GDP for each industry) and more than 20 variables that we think might influence GDP volume.
2. Shapefiles
The shapefile of Chinese regions (CHN_adm_shp) is available from ESRI (an organization providing geographic information systems). The shapefile includes 3 levels. In our project, we use the 2nd level of the shapefile ("prefecture-level city").

Analysis flow

In our research, both a linear regression model and a geographically weighted regression model are used to analyze the effect of each indicator.
The analysis flow includes 3 parts:
1. Data exploration and variable selection
This includes a variable correlation and distribution matrix, which enables users to exclude highly correlated variables from the regression model. A parallel coordinates chart is also provided to give users a general impression of these correlations.
2. Modelling and visualization
In the output of the regression model, we display the parameter estimate of each variable, as well as its significance level, which is given by the p-value.
3. Data analysis
We use the interface to analyze the different effects of the selected indicators.

Tools used

The application is built in the R language. The advantage of R is its strength in statistical analysis, which will be especially beneficial for future model evaluation. R also provides an interactive application framework, which helps to visualize the analysis process.
Listed below are the R packages we used in building the application:
rgdal package
rgdal is a widely used package for reading shapefiles. The shapefile we are using is “CHN_adm_shp”, which is provided by ESRI. The selected level is “CHN_adm2”, meaning prefecture-level region.

rmapshaper package
The ms_simplify function allows us to simplify the outlines in the shapefile. The simplified shapefile greatly improves processing efficiency when we build statistical maps.

GWmodel package
GWmodel is an R package for spatial data analysis. The package provides GW summary statistics, GW principal components analysis, GW discriminant analysis and various GW regression models [--R GW model documentation]. Its advantage is that it integrates parameter estimates with adjusted test results, which give the p-value as a significance indicator of the model.

leaflet package
This is the R interface to leaflet.js, an interactive map library. It uses htmlwidgets to let users adjust and interact with the map objects.

shiny package
The shiny package provides a reactive web application framework so that users can manipulate the data and focus on the analysis.

Model Explanation

1. Traditional linear regression

When applying a traditional linear regression model, each location is treated as an observation. The regression line is fitted by least squares (minimizing the sum of squared residuals), and the model returns an intercept and a coefficient for each variable.
When running an ordinary linear regression model, we should note these points:
(1) Only one global regression model is returned for the whole area, meaning that every region shares the same parameter estimates. This is not very reliable when analyzing areas with different economic patterns.
(2) Linear regression fails to take the mutual effects of neighboring places into consideration. Each observation is treated equally. In real cases, however, the mutual effects between neighboring regions are one of the major contributors to the local economy, so we should also account for this effect in our model.
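As a rough illustration of the single global fit described above, the sketch below fits one least-squares line to a handful of made-up values and reports its R-squared. It is written in plain Python only to make the arithmetic explicit; it is not the application's R code, and the function names and numbers are hypothetical.

```python
from statistics import mean

def ols_fit(x, y):
    """Fit y = intercept + slope * x by minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical values: one indicator (x) and GDP (y) for five regions.
indicator = [1.0, 2.0, 3.0, 4.0, 5.0]
gdp = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = ols_fit(indicator, gdp)
print("intercept:", round(intercept, 3), "slope:", round(slope, 3))
print("R-squared:", round(r_squared(indicator, gdp, intercept, slope), 4))
```

Every region is described by the same intercept and slope here, which is exactly the limitation noted in point (1).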

2. Geographically weighted regression

Weighted regression is a methodology that brings the observations near each city into the model-building. For example, for city A, its nearby areas are each given a weight, and all these selected observations are used to build the individual model for city A.
Geographically weighted regression is a methodology for explaining geospatial statistical patterns. As its name suggests, it is a sub-category of weighted regression. Unlike weighted linear regression (where the distance is based on the x value), the “distance” in geographically weighted regression is the actual geographic distance.
When running the geographically weighted regression model, besides a global regression model (just as in linear regression), an individual regression model is also built for each observation. The individual model is based on the observation and its neighboring areas within the “distance”, and each neighboring area is given a weight based on how far it is from this observation.
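The idea of one individual model per observation can be sketched as a weighted least-squares fit centred on each location. The example below is a minimal Python sketch with a single predictor and a Gaussian weight of the assumed form exp(-0.5 (d/b)^2); the coordinates, values and function names are hypothetical, and the application itself relies on the GWmodel package in R rather than this code.

```python
import math

def gaussian_weight(d, b):
    # Assumed Gaussian kernel form; d is the distance, b is the bandwidth.
    return math.exp(-0.5 * (d / b) ** 2)

def local_fit(target, coords, x, y, bandwidth):
    """Weighted least-squares fit of y = a + b * x centred on coords[target]."""
    tx, ty = coords[target]
    w = [gaussian_weight(math.hypot(cx - tx, cy - ty), bandwidth)
         for cx, cy in coords]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: a western cluster where y = x and an eastern one where y = 3x.
coords = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
x = [1, 2, 3, 1, 2, 3]
y = [1, 2, 3, 3, 6, 9]

west = local_fit(0, coords, x, y, bandwidth=1.0)
east = local_fit(3, coords, x, y, bandwidth=1.0)
print("local slope in the west:", round(west[1], 3))
print("local slope in the east:", round(east[1], 3))
```

With a small bandwidth each location recovers the slope of its own cluster, while a very large bandwidth makes all weights nearly equal and the fit approaches the single global line.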
For geographically weighted regression, several parameters can be used to tune the model:

(a) Kernel function:
The kernel function is the formula that assigns weights to the neighboring observations.

Gwr.gauss

("d" is the distance, and "b" is the bandwidth)

The Gaussian weighting function gives surrounding areas weights that decay exponentially as distance increases. With this method, all observations are taken into the modelling.

Gwr.bisquare

The bisquare kernel function calculates the weights so that, unlike the Gaussian function, only the observations within the bandwidth are used. Observations outside the bandwidth are excluded from the modelling.
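To make the contrast between the two kernels concrete, here is a small Python sketch. It assumes the commonly documented forms, Gaussian w = exp(-0.5 (d/b)^2) and bisquare w = (1 - (d/b)^2)^2 for d < b and 0 otherwise; the formula images from this page are not available, so the exact expressions shown originally may differ (for example in the 0.5 factor). The distances and the bandwidth are made-up values.

```python
import math

def gaussian_kernel(d, b):
    """Assumed Gaussian form: never exactly zero, so every observation contributes."""
    return math.exp(-0.5 * (d / b) ** 2)

def bisquare_kernel(d, b):
    """Assumed bisquare form: zero at and beyond the bandwidth."""
    if d >= b:
        return 0.0
    return (1.0 - (d / b) ** 2) ** 2

bandwidth = 100.0  # hypothetical bandwidth, in the same unit as the distances
for d in [0.0, 50.0, 100.0, 200.0]:
    print(d, round(gaussian_kernel(d, bandwidth), 4),
          round(bisquare_kernel(d, bandwidth), 4))
```

At twice the bandwidth the Gaussian weight is small but still positive, whereas the bisquare weight is exactly zero, which is why the bisquare kernel drops distant observations from the local model.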

(b) Bandwidth type:
The bandwidth decides how many neighboring observations are taken into account when building each local model. Normally we use an "adaptive" bandwidth, where the system automatically generates an optimized bandwidth; in GWmodel, an adaptive bandwidth is expressed as a number of nearest neighbors rather than a fixed distance.
A global bandwidth is also available, which takes all observations into the weighting function.
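The difference between a bandwidth given as a distance and one given as a number of neighbors can be sketched as follows. This Python sketch assumes that an adaptive bandwidth of k means "the distance to the k-th nearest neighbor", which is a simplification of how packages such as GWmodel handle it; the coordinates and helper names are hypothetical.

```python
import math

def distances_from(target, coords):
    """Sorted distances from coords[target] to every other location."""
    tx, ty = coords[target]
    return sorted(math.hypot(cx - tx, cy - ty)
                  for i, (cx, cy) in enumerate(coords) if i != target)

def adaptive_bandwidth(target, coords, k):
    """Distance to the k-th nearest neighbor: small in dense areas, large in sparse ones."""
    return distances_from(target, coords)[k - 1]

def neighbors_within(target, coords, b):
    """Number of other locations inside a fixed distance b."""
    return sum(1 for d in distances_from(target, coords) if d < b)

# Hypothetical layout: a dense cluster near the origin and two isolated locations.
coords = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 0), (20, 0)]

print("fixed b=2, dense location:", neighbors_within(0, coords, 2.0))
print("fixed b=2, isolated location:", neighbors_within(5, coords, 2.0))
print("adaptive k=2, dense location:", adaptive_bandwidth(0, coords, 2))
print("adaptive k=2, isolated location:", adaptive_bandwidth(5, coords, 2))
```

A fixed distance leaves the isolated location with no neighbors at all, while the adaptive form stretches the bandwidth until two neighbors are included, so every local model has data to work with.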

(c) Bandwidth optimization criteria:
This is the criterion used to optimize the bandwidth selection. By default, we choose the bandwidth with the minimum AIC value. You can also choose cross-validation as the bandwidth selection criterion.
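Cross-validation as a selection criterion can be sketched like this: for each candidate bandwidth, every observation is predicted from a local fit that leaves it out, and the bandwidth with the smallest sum of squared prediction errors wins. The Python code below is a simplified illustration with one predictor, an assumed Gaussian kernel and a hypothetical list of candidates; it does not reproduce the AIC-based default, nor the search routine of the R package.

```python
import math

def gaussian_weight(d, b):
    # Assumed Gaussian kernel form; d is the distance, b is the bandwidth.
    return math.exp(-0.5 * (d / b) ** 2)

def predict_without(i, coords, x, y, bandwidth):
    """Predict y[i] from a local weighted fit that leaves observation i out."""
    ti, tj = coords[i]
    rows = [(gaussian_weight(math.hypot(cx - ti, cy - tj), bandwidth), xk, yk)
            for k, ((cx, cy), xk, yk) in enumerate(zip(coords, x, y)) if k != i]
    sw = sum(w for w, _, _ in rows)
    x_bar = sum(w * xk for w, xk, _ in rows) / sw
    y_bar = sum(w * yk for w, _, yk in rows) / sw
    sxx = sum(w * (xk - x_bar) ** 2 for w, xk, _ in rows)
    sxy = sum(w * (xk - x_bar) * (yk - y_bar) for w, xk, yk in rows)
    slope = sxy / sxx
    return y_bar + slope * (x[i] - x_bar)

def cv_score(coords, x, y, bandwidth):
    """Sum of squared leave-one-out prediction errors for one bandwidth."""
    return sum((y[i] - predict_without(i, coords, x, y, bandwidth)) ** 2
               for i in range(len(y)))

def select_bandwidth(coords, x, y, candidates):
    """Return the candidate bandwidth with the lowest cross-validation score."""
    return min(candidates, key=lambda b: cv_score(coords, x, y, b))

# Hypothetical data: y = x in the western cluster, y = 3x in the eastern one.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]
x = [1, 2, 3, 4, 1, 2, 3, 4]
y = [1, 2, 3, 4, 3, 6, 9, 12]

best = select_bandwidth(coords, x, y, candidates=[1.0, 3.0, 1000.0])
print("selected bandwidth:", best)
```

Because the two clusters follow different relationships, the smallest candidate predicts the held-out observations best; on data with a single global relationship a wide bandwidth would score just as well.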

Test Sample of Regression Modelling Application

At the top of the interface there are several tabs. The first tab is for uploading raw data. The second is for initial data exploration and variable selection. The model results are generated in the third tab.
The upload tab allows you to upload Chinese statistical data and explore the data structure.

Uploadcsv.png


Csvexplore.png


After loading the data, we first need to decide which variables to input into the regression model. Start by selecting all the variables that you are interested in, and then explore their individual distributions and their correlations with each other. In this example, I am interested in the following 6 variables: (1) retail sales of consumer goods (2) government science expenditure (3) government fixed assets investment (4) foreign direct investment (5) city construction rate (6) number of higher institutions

Variable comparison.png


As the correlation & distribution matrix suggests, some pairs of variables show high correlations, which means we need to exclude one variable of each such pair from the model.

High correlation.png


Correlation between “government science expenditure” and “foreign direct investment” is extremely high. In this case, I will remove “education expenditure”.
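The screening step behind the correlation matrix can be sketched as computing the Pearson correlation for every pair of candidate variables and flagging the pairs above a threshold. The Python sketch below uses made-up values and a hypothetical threshold of 0.9; the application itself produces the matrix in R, and which variable of a flagged pair to drop remains the user's decision.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def highly_correlated_pairs(columns, threshold=0.9):
    """List the variable pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for (name1, v1), (name2, v2) in combinations(columns.items(), 2):
        r = pearson(v1, v2)
        if abs(r) >= threshold:
            pairs.append((name1, name2, round(r, 3)))
    return pairs

# Hypothetical values for three of the candidate variables in five regions.
columns = {
    "science_expenditure": [1.0, 2.0, 3.0, 4.0, 5.0],
    "foreign_investment": [2.1, 3.9, 6.2, 8.0, 9.9],
    "institutions": [5.0, 1.0, 4.0, 2.0, 3.0],
}

pairs = highly_correlated_pairs(columns)
for name1, name2, r in pairs:
    print(name1, "and", name2, "are highly correlated: r =", r)
```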

Model input.png


In this case, I first choose the global linear regression model because I want a general impression of the different effects of the selected factors. The selected variables are then passed to the backend to run the regression model. A few seconds later, you will get the result in the “Modelling” tab.


OLS result.png


(The leaflet map shows the residual of each area, while the table shows the regression results.) I found that for the variable “City construction rate”, the p-value is relatively high, meaning that this variable is less significant than the others. So we go back, remove this variable and run the model again. (Variable selection is a stepwise process; users can keep doing this until the p-values are at an acceptable level.)

OLS result2.png


Now we have the linear model. The p-values look quite good. The model shows a negative effect of the number of institutions and a positive effect of retail sales on GDP. The R-squared value is 97%, meaning that 97% of the variance in GDP can be explained by this model. But if we look carefully at the table, we notice that many parameter estimates are close to zero, while the intercept is quite large. Moreover, if we check the residual distribution on the left side, it shows “clustered patterns” (for example, in the northeast part residuals tend to be lower, while in the southwest part the residuals are very close to 0), which means the estimation error is strongly related to geographic location instead of being randomly distributed as we expected. The global linear model fails to take the geographic relationship into account, so we turn to the geographically weighted model. Choose the model in the variable selection tab:

Model selection.png


Then check the model results in the modelling tab:

Gwr output.png


The left map is colored by the coefficient of the selected variable. The right one shows the p-value of the coefficient. You can also change the variable displayed on the leaflet map by using the dropdown list at the bottom left.

Gwr output2.png


First, let’s take a look at the overall model performance:

Gwr residual.png


The R-squared value from GWR is 0.988, slightly better than that of the OLS model. The residual distribution also appears more geographically random. Next comes the explanatory analysis. We can focus on the coefficient displayed on the left, combined with the p-value on the right, which indicates the significance level of each individual model.

Gwr science.png


This one shows the effect of government scientific expenditure on local GDP. The east coast regions tend to have higher coefficients. The model significance level is relatively lower in the inland area, partly because of the lack of training data from neighboring areas.

Gwr institution.png


When we study the effect of higher institutions, it is surprising to find that the central part has the lowest coefficient. The central part includes Shanghai, Jiangsu and Zhejiang, which have the most universities in the whole area and are also well known for their education quality. Fudan University, Jiaotong University and Zhejiang University are among the top universities in China, and every year a large number of students from other regions move to these areas for high-level education. After graduation, many of them go back to their hometowns and contribute to the GDP there. That would explain why the surrounding areas have a small number of institutions but higher coefficients.
The above is a test case showing how to use the interactive application to build a spatial regression model. The advantage of the model is that it provides parameter estimates together with statistical figures such as R-squared and the p-value. More than 20 variables are available in the model, so many selection scenarios can serve as model input. This interactive interface allows users to view their variable selection scenarios and balance their business requirements against the model significance level.