Difference between revisions of "Group09 proposal"

From ISSS608-Visual Analytics and Applications
 
(47 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 +
[[File:Poor.png|900px|frameless|center]]
 +
<div>
 +
{|style="background-color:#607080;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0"  |
 +
| style="font-family:Century Gothic; font-size:100%; solid #c0c0c0; background:#a9a9a9; text-align:center;" width="25%" |
 +
;
 +
[[Group09_proposal| <font color="#FFFFFF">Proposal</font>]]
 +
 +
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#607080; text-align:center;" width="25%" |
 +
;
 +
[[Group09_poster| <font color="#FFFFFF">Poster</font>]]
 +
 +
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#607080; text-align:center;" width="25%" |
 +
;
 +
[[Group09_application| <font color="#FFFFFF">Application</font>]]
 +
 +
| style="font-family:Century Gothic; font-size:100%; solid #103080; background:#607080; text-align:center;" width="25%" |
 +
;
 +
[[Group09_research_paper| <font color="#FFFFFF">Research Paper</font>]]
 +
 +
|}
 +
</div>
 +
<br>
 +
 
== Abstract ==
 
== Abstract ==
  
== Objective ==
+
Income inequality has been a hot issue in recent years, and many sub-topics have been derived from this theme. Scholars have studied the root causes of this situation, the downstream impact of income inequality, and so on. Our project explores the relationship between individual income inequality and health. It is an extension of an existing work by Bakkeli, N., ''Income inequality and health in China: A panel data analysis.'' <br>
 +
 
 +
The CHNS (China Health and Nutrition Survey) offers a good resource for exploring the income inequality problem in China, with more than 10 years of data. Beyond the scope of the original work, we will include updated survey data in the results. In addition, the Chinese context, namely the income inequality gap among regions, will be taken into consideration in the analysis. The final results will be presented in an interactive visualization.<br>
  
 
== Data Source ==
 
== Data Source ==
 +
The data come from the China Health and Nutrition Survey. [https://www.cpc.unc.edu/projects/china Click here to see the data.]<br>
 +
 +
===About Data Set===
 +
The survey took place over a 7-day period using a multistage, random cluster process to draw a sample of about 7,200 households with over 30,000 individuals in 15 provinces and municipal cities that vary substantially in geography, economic development, public resources, and health indicators. In addition, detailed community data were collected in surveys of food markets, health facilities, family planning officials, and other social services and community leaders.<br>
 +
 +
===Variables===
 +
(Dependent variable) Health <br>
 +
(Independent variable) Income inequality, Income Variables, Individual controls, Occupation and Sector <br>
 +
The table below shows the description of main variables that we will be using for our analysis:
 +
{| class="wikitable"
 +
|-
 +
! Variable !! Description
 +
|-
 +
| <b><i>Health</i></b>||
 +
|-
 +
| <b><i>Disease History</i></b>|| Binary variable. Coded as 0 if the individual has no chronic disease or cancer, and 1 if the individual has such a medical history.
 +
|-
 +
| <b><i>Income inequality</i></b>||
 +
|-
 +
| <b><i>Gini</i></b>|| The Gini coefficient at the county level, sensitive to changes at middle income levels.
 +
|-
 +
| <b><i>Theil L</i></b>|| The mean logarithmic deviation, one of the Generalized Entropy (Theil) indices, which is sensitive to changes at the bottom income levels.
 +
|-
 +
| <b><i>Theil T</i></b>|| The Theil index, which is sensitive to changes at upper income levels.
 +
|-
 +
| <b><i>Income variables</i></b>||
 +
|-
 +
| <b><i>Individual income</i></b>|| The sum of each individual's income sources: all individual income and revenue, minus individual expenditures. Household subsidies and other income that cannot be allocated to individuals in the household are not considered part of individual income.
 +
|-
 +
| <b><i>Province mean income(ind.)</i></b>|| Captures the degree of economic development in a province-level unit, calculated by averaging individual income in a province over all observations in the CHNS. “Ind” refers to individual.
 +
|-
 +
| <b><i>Household income</i></b>|| The sum of all individual incomes in a household.
 +
|-
 +
| <b><i>Province mean income(hh.)</i></b>|| Calculated by averaging household income in a province over all observations in the CHNS. “hh” refers to household.
 +
|-
 +
| <b><i>Individual controls</i></b>||
 +
|-
 +
| <b><i>Age</i></b>|| The age of respondent.
 +
|-
 +
| <b><i>Gender</i></b>|| Binary variable. Male is coded as 0 and female as 1.
 +
|-
 +
| <b><i>Marriage Status</i></b>|| Binary variable. Married as 1, unmarried as 0.
 +
|-
 +
| <b><i>Majority</i></b>|| Binary variable. Coded as 1 if the respondent's ethnicity is Han, else 0.
 +
|-
 +
| <b><i>Years of education</i></b>|| Counted from the beginning of primary school: 6 years for primary school graduation, 9 years for junior high school graduation, 12 years for high school graduation, and 16 years for university graduation.
 +
|-
 +
| <b><i>urban</i></b>|| Binary variable. If respondent holds urban household registration then defined as 1, else 0. 
 +
|-
 +
| <b><i>Occupation</i></b>||
 +
|-
 +
| <b><i>Services class</i></b>|| Includes “senior professional/technical”, “administrator/executive/manager” and “army officer/police officer”.
 +
|-
 +
| <b><i>Non-manual worker</i></b>|| Includes “junior professional technical” and “office staff”.
 +
|-
 +
| <b><i>Skilled worker/supervisor</i></b>|| Includes “skilled worker”, “ordinary soldier, policeman”, “driver” and “athlete, actor, musician”.
 +
|-
 +
| <b><i>Semi-/non-skilled worker</i></b>|| Includes “non-skilled worker” and “service worker”.
 +
|-
 +
| <b><i>Farmer</i></b>|| As originally defined by CHNS data.
 +
|-
 +
| <b><i>Others</i></b>|| The remaining original occupations covered by CHNS data.
 +
|-
 +
| <b><i>Sector</i></b>||
 +
|-
 +
| <b><i>State</i></b>|| Includes “government”, “state service/institute” and “state-owned enterprise”.
 +
|-
 +
| <b><i>Collective</i></b>|| Includes “small collective enterprise” and “large collective enterprise”.
 +
|-
 +
| <b><i>Family farming</i></b>|| The original CHNS variable “family contract farming”.
 +
|-
 +
| <b><i>Individual enterprise</i></b>|| The original CHNS variable “private, individual enterprise”.
 +
|-
 +
| <b><i>Private three-cap Enterpr.</i></b>|| The same as “three- capital enterprise” in CHNS data.
 +
|-
 +
| <b><i>Others</i></b>|| Includes “unknown” data in CHNS.
 +
|}
  
 
== Visualization ==
 
== Visualization ==
# Time series: The‘highcharter',‘plotly',‘viridis',‘scatterplot3' and‘ggplot2’packages will be employed to create two interactive time series line charts, revealing how the historical and future global CO2 emission and temperature change during different period.
+
'''Exploratory Data Analysis''' <br>
# World map: Same packages as mentioned above will be applied to generate two interactive world maps, showing and comparing the CO2 emission in different country and year respectively.
+
EDA is a common approach to summarizing the main characteristics of data, often with visual methods. In the first part of our dashboard, we use charts to explain why we want to test the relationship between economic development and health risks in China, and to convey information about the data used in this project beyond the formal modeling or hypothesis-testing task.
 +
 
 +
[[File:BMI.png|300px|frame|none|Example of a bar chart showing the China health risk variables used in this project. ]]
 +
 
 +
After using the Linear Probability Model to examine the relation between Health Risks and Economic Development, we use Correlation Matrix Plots to show the results for our final hypothesis.
 +
 
 +
[[File:Correlation Chart.png|300px|thumb|none|Example of correlation matrix plots showing whether there is any relation between Economic Development and Health Risks in China.]]
  
 
== Methodology ==
 
== Methodology ==
[[File:Process.png|right|thumb]]
+
We aim to provide an interactive dashboard that helps visualise descriptive and predictive statistical models of the relationship between Health Risks and Economic Development in China. It helps users see how different economic indices relate to the health conditions of rural versus urban and male versus female respondents. The web app can then be deployed and used by others to explore the data and understand the relationship between the above variables. <br>
Three different approaches will be utilized to predict the global CO2 emissions and temperature in the next 10 years: <br>
+
 
 +
===Hypothesis===
 +
We hypothesize that greater economic development increases health risks. There is sufficient evidence that the development of the Chinese economy brings great pressure, which increases the rate of unhealthy behaviours such as smoking and alcohol abuse and leads to notably high occurrences of chronic diseases such as hypertension, heart disease, and diabetes. Therefore, Chinese nationals face higher health risks. <br>
  
# Holt exponential smoothing: By applying this approach, consequently each relevant variables’ (e.g. gas fuel, liquid fuel and solid fuel) future value will be obtained. And we can use them to predict the future CO2 emission by employing the linear regression model.
+
===Model===
# SARIMA: Seasonal Autoregressive Integrated Moving Average (SARIMA) model, an extension of ARIMA that explicitly supports univariate time series data with a seasonal component will be applied to conduct the prediction. We can use it gain the annual CO2 emission in the future with a lower and upper bound. 
+
The Linear Probability Model (LPM) is a popular model in social sciences research, and we use it to examine our hypothesis: whether there is any relationship between Health Risks and Economic Development in China. The LPM is a regression model in which the target (dependent) variable is binary and the independent variables can be binary or continuous. In our project, we use an equation similar to the following one to test the hypothesis. <br>
# Auto-Regression: The Auto-Regression model describes the relationship between current values and the historical values. And it uses the historical time data as the variable to predict its future value. The factors that influence the CO2 emission, such as solid fuel and gas fuel, can be predicted by Auto-Regression model. As a result, the future global CO2 emission will be predicted by employing the linear regression model. <br><br>
 
  
After completing all prediction methods mentioned above, we intend to compare their result respectively with the actual CO2 emission in recent years as an evaluation and determine which of them is the best fit one.
+
===Sensitivity Test===
 +
After applying the LPM to our hypothesis, we run robustness checks to test model stability, in two ways. First, we replace our original economic index with an alternative one and check whether the results differ. Second, we use household-based indices rather than individual-based indices and compare the results.
  
 
== Critics of Existing Works ==
 
== Critics of Existing Works ==
 +
 +
Our project is an extension of the research published by Bakkeli in 2016. The table below summarises our assessment of this work.<br>
 +
[https://doi.org/10.1016/j.socscimed.2016.03.041 Bakkeli, N. (2016). Income inequality and health in China: A panel data analysis. Social Science & Medicine, 157, 39–47.]
 +
 +
{| class="wikitable"
 +
|-
 +
! Critics !! Detail
 +
|-
 +
| <b><i>Lack of visualization</i></b>|| The paper focuses mainly on modelling and statistical analysis and presents its results with only a few line graphs. Our project adds more kinds of graphs and interactive functions.[[File:Critic.jpg|600px|frameless|centre]]
 +
|-
 +
| <b><i>Appropriate selection of variables </i></b>|| Appropriate selection of variables
 +
Different measures are available to evaluate income inequality and health. The paper selects physical health factors to avoid bias arising from differences in education, gender, etc. The author also assesses income inequality from multiple aspects: the Gini index, household income and individual income. These variables are well chosen.
 +
|-
 +
| <b><i>Eliminate the income gap among regions in China</i></b>||In the Chinese context, an evident economic development gap exists among regions, for example, Xinjiang versus Jiangsu. This is one aspect in which the paper can be improved: geographical analysis can show how the results differ by region.
 +
|}
 +
 +
 +
==Application Libraries & Packages==
 +
{| class="wikitable"
 +
|-
 +
! Package Name !! Descriptions
 +
|-
 +
| shiny|| Interactive web applications for data visualization
 +
|-
 +
| ggplot2 || High-quality graphs
 +
|-
 +
| ggstatsplot ||'ggplot2' Based Plots with Statistical Details
 +
|-
 +
|gglorenz||Plotting Lorenz Curve with the Blessing of 'ggplot2'
 +
|-
 +
| Tidyverse: tidyr, dplyr, ggplot2 || Tidying and manipulating data for visualizing in ggplot2
 +
|-
 +
| shinythemes || Apply themes to Shiny applications
 +
|-
 +
| ExPanDaR || Use for panel data modelling
 +
|-
 +
| corrplot || Create correlation matrix
 +
|-
 +
| plotly ||  Create interactive Web Graphics
 +
|}
  
 
== Team Members ==
 
== Team Members ==
Line 27: Line 177:
  
 
== References ==
 
== References ==
 +
* [https://doi.org/10.1093/geronb/gbr050 Does Self-reported Health Bias the Measurement of Health Inequalities in U.S. Adults? Evidence Using Anchoring Vignettes From the Health and Retirement Study.]
 +
* [https://doi.org/10.1016/j.socscimed.2016.03.041 Income inequality and health in China: A panel data analysis.]
 +
* [https://www.econometrics-with-r.org/11-1-binary-dependent-variables-and-the-linear-probability-model.html Using LPM with R]

Latest revision as of 06:23, 26 April 2020


Abstract

Income inequality has been a hot issue in recent years, and many sub-topics have been derived from this theme. Scholars have studied the root causes of this situation, the downstream impact of income inequality, and so on. Our project explores the relationship between individual income inequality and health. It is an extension of an existing work by Bakkeli, N., Income inequality and health in China: A panel data analysis.

The CHNS (China Health and Nutrition Survey) offers a good resource for exploring the income inequality problem in China, with more than 10 years of data. Beyond the scope of the original work, we will include updated survey data in the results. In addition, the Chinese context, namely the income inequality gap among regions, will be taken into consideration in the analysis. The final results will be presented in an interactive visualization.

Data Source

The data come from the China Health and Nutrition Survey. Click here to see the data.

About Data Set

The survey took place over a 7-day period using a multistage, random cluster process to draw a sample of about 7,200 households with over 30,000 individuals in 15 provinces and municipal cities that vary substantially in geography, economic development, public resources, and health indicators. In addition, detailed community data were collected in surveys of food markets, health facilities, family planning officials, and other social services and community leaders.

Variables

(Dependent variable) Health
(Independent variable) Income inequality, Income Variables, Individual controls, Occupation and Sector
The table below shows the description of main variables that we will be using for our analysis:

Variable | Description
Health
Disease History | Binary variable. Coded as 0 if the individual has no chronic disease or cancer, and 1 if the individual has such a medical history.
Income inequality
Gini | The Gini coefficient at the county level, sensitive to changes at middle income levels.
Theil L | The mean logarithmic deviation, one of the Generalized Entropy (Theil) indices, which is sensitive to changes at the bottom income levels.
Theil T | The Theil index, which is sensitive to changes at upper income levels.
Income variables
Individual income | The sum of each individual's income sources: all individual income and revenue, minus individual expenditures. Household subsidies and other income that cannot be allocated to individuals in the household are not considered part of individual income.
Province mean income(ind.) | Captures the degree of economic development in a province-level unit, calculated by averaging individual income in a province over all observations in the CHNS. “Ind” refers to individual.
Household income | The sum of all individual incomes in a household.
Province mean income(hh.) | Calculated by averaging household income in a province over all observations in the CHNS. “hh” refers to household.
Individual controls
Age | The age of respondent.
Gender | Binary variable. Male is coded as 0 and female as 1.
Marriage Status | Binary variable. Married as 1, unmarried as 0.
Majority | Binary variable. Coded as 1 if the respondent's ethnicity is Han, else 0.
Years of education | Counted from the beginning of primary school: 6 years for primary school graduation, 9 years for junior high school graduation, 12 years for high school graduation, and 16 years for university graduation.
urban | Binary variable. If respondent holds urban household registration then defined as 1, else 0.
Occupation
Services class | Includes “senior professional/technical”, “administrator/executive/manager” and “army officer/police officer”.
Non-manual worker | Includes “junior professional technical” and “office staff”.
Skilled worker/supervisor | Includes “skilled worker”, “ordinary soldier, policeman”, “driver” and “athlete, actor, musician”.
Semi-/non-skilled worker | Includes “non-skilled worker” and “service worker”.
Farmer | As originally defined by CHNS data.
Others | The remaining original occupations covered by CHNS data.
Sector
State | Includes “government”, “state service/institute” and “state-owned enterprise”.
Collective | Includes “small collective enterprise” and “large collective enterprise”.
Family farming | The original CHNS variable “family contract farming”.
Individual enterprise | The original CHNS variable “private, individual enterprise”.
Private three-cap Enterpr. | The same as “three- capital enterprise” in CHNS data.
Others | Includes “unknown” data in CHNS.
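
The three inequality measures in the table can be computed directly from a list of incomes. The following is a minimal Python sketch for illustration only, not the project's implementation (the proposal lists R packages). The function names and the toy incomes are hypothetical, and the sketch assumes every income is strictly positive, which net individual income in the CHNS may not satisfy.

```python
import math

def gini(incomes):
    """Gini coefficient via the sorted-rank formula."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    # Weight each income by its rank (1 = poorest)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

def theil_l(incomes):
    """Mean logarithmic deviation (Theil L): mean of ln(mu / x)."""
    mu = sum(incomes) / len(incomes)
    return sum(math.log(mu / x) for x in incomes) / len(incomes)

def theil_t(incomes):
    """Theil T index: mean of (x / mu) * ln(x / mu)."""
    mu = sum(incomes) / len(incomes)
    return sum((x / mu) * math.log(x / mu) for x in incomes) / len(incomes)

# Made-up incomes for one hypothetical county
county = [1.0, 3.0]
print(gini(county), theil_l(county), theil_t(county))
```

All three measures equal 0 when every income is identical and grow as incomes become more dispersed.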

Visualization

Exploratory Data Analysis
EDA is a common approach to summarizing the main characteristics of data, often with visual methods. In the first part of our dashboard, we use charts to explain why we want to test the relationship between economic development and health risks in China, and to convey information about the data used in this project beyond the formal modeling or hypothesis-testing task.

Example of a bar chart showing the China health risk variables used in this project.

After using the Linear Probability Model to examine the relation between Health Risks and Economic Development, we use Correlation Matrix Plots to show the results for our final hypothesis.

Example of correlation matrix plots showing whether there is any relation between Economic Development and Health Risks in China.
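
The numbers behind a correlation matrix plot are pairwise Pearson correlations. The sketch below, in Python rather than the R corrplot package the proposal lists, shows one way to compute such a matrix. The column names and values are made up for illustration and are not CHNS results.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical province-level values, for illustration only
data = {
    "gini": [0.30, 0.35, 0.40, 0.45],
    "mean_income": [10.0, 12.0, 15.0, 19.0],
    "disease_rate": [0.20, 0.18, 0.25, 0.30],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in data])
```

The matrix is symmetric with ones on the diagonal, so a plot only needs to show one triangle.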

Methodology

We aim to provide an interactive dashboard that helps visualise descriptive and predictive statistical models of the relationship between Health Risks and Economic Development in China. It helps users see how different economic indices relate to the health conditions of rural versus urban and male versus female respondents. The web app can then be deployed and used by others to explore the data and understand the relationship between the above variables.

Hypothesis

We hypothesize that greater economic development increases health risks. There is sufficient evidence that the development of the Chinese economy brings great pressure, which increases the rate of unhealthy behaviours such as smoking and alcohol abuse and leads to notably high occurrences of chronic diseases such as hypertension, heart disease, and diabetes. Therefore, Chinese nationals face higher health risks.

Model

The Linear Probability Model (LPM) is a popular model in social sciences research, and we use it to examine our hypothesis: whether there is any relationship between Health Risks and Economic Development in China. The LPM is a regression model in which the target (dependent) variable is binary and the independent variables can be binary or continuous. In our project, we use an equation similar to the following one to test the hypothesis.
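
In its general textbook form, an LPM is an ordinary least squares regression of a 0/1 outcome on the regressors, P(y = 1 | x) = b0 + b1*x1 + ... + bk*xk, so each coefficient reads as a change in probability. The Python sketch below fits such a model by solving the normal equations. It is an illustration under stated assumptions, not the project's exact specification: the function name fit_lpm and the toy data are hypothetical.

```python
def fit_lpm(X, y):
    """Fit a linear probability model by ordinary least squares.

    X: list of rows of regressors (without intercept); y: list of 0/1 outcomes.
    Returns [intercept, coef_1, ..., coef_k].
    """
    rows = [[1.0] + [float(v) for v in row] for row in X]
    k = len(rows[0])
    # Normal equations (X'X) beta = X'y, stored as an augmented matrix
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("regressors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Toy data: one binary regressor (e.g. urban = 1) and a binary outcome
urban = [[0], [0], [0], [0], [1], [1], [1], [1]]
disease = [0, 0, 0, 1, 0, 1, 1, 1]
intercept, slope = fit_lpm(urban, disease)
print(intercept, slope)
```

With a single binary regressor, the intercept is the share of positive outcomes in the group coded 0 and the slope is the difference in shares between the two groups. Fitted values of an LPM are not guaranteed to stay between 0 and 1.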

Sensitivity Test

After applying the LPM to our hypothesis, we run robustness checks to test model stability, in two ways. First, we replace our original economic index with an alternative one and check whether the results differ. Second, we use household-based indices rather than individual-based indices and compare the results.
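
The first check can be sketched as a loop that re-fits the model once per alternative index and compares the sign of the estimated effect. The Python sketch below uses a one-regressor least squares slope as a stand-in for the full LPM. The helper names and all values are made up for illustration.

```python
def ols_slope(x, y):
    """Least squares slope of y on x (single regressor with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def robustness_check(outcome, indices):
    """Re-fit with each alternative index; report slopes and sign agreement."""
    slopes = {name: ols_slope(values, outcome) for name, values in indices.items()}
    same_sign = len({s > 0 for s in slopes.values()}) == 1
    return slopes, same_sign

# Hypothetical outcome and two alternative inequality indices
disease = [0, 0, 1, 1]
indices = {
    "gini": [0.30, 0.35, 0.40, 0.45],
    "theil_t": [0.20, 0.22, 0.30, 0.33],
}
slopes, same_sign = robustness_check(disease, indices)
print(slopes, same_sign)
```

The same loop covers the second check if the dictionary holds household-based and individual-based versions of an index instead.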

Critics of Existing Works

Our project is an extension of the research published by Bakkeli in 2016. The table below summarises our assessment of this work.
Bakkeli, N. (2016). Income inequality and health in China: A panel data analysis. Social Science & Medicine, 157, 39–47.

Critics | Detail
Lack of visualization | The paper focuses mainly on modelling and statistical analysis and presents its results with only a few line graphs. Our project adds more kinds of graphs and interactive functions.
Critic.jpg
Appropriate selection of variables | Different measures are available to evaluate income inequality and health. The paper selects physical health factors to avoid bias arising from differences in education, gender, etc. The author also assesses income inequality from multiple aspects: the Gini index, household income and individual income. These variables are well chosen.
Eliminate the income gap among regions in China | In the Chinese context, an evident economic development gap exists among regions, for example, Xinjiang versus Jiangsu. This is one aspect in which the paper can be improved: geographical analysis can show how the results differ by region.


Application Libraries & Packages

Package Name | Descriptions
shiny | Interactive web applications for data visualization
ggplot2 | High-quality graphs
ggstatsplot | 'ggplot2' Based Plots with Statistical Details
gglorenz | Plotting Lorenz Curve with the Blessing of 'ggplot2'
Tidyverse: tidyr, dplyr, ggplot2 | Tidying and manipulating data for visualizing in ggplot2
shinythemes | Apply themes to Shiny applications
ExPanDaR | Use for panel data modelling
corrplot | Create correlation matrix
plotly | Create interactive Web Graphics

Team Members

References