Difference between revisions of "Group09 Proposal"

From Visual Analytics and Applications
 
Latest revision as of 22:37, 9 August 2018



Background


Healthcare and disease have been major subjects of study for centuries. The germ theory of disease in the 19th century led to cures for many infectious diseases; at the same time, public health measures developed as the rapid growth of cities required systematic sanitation. A century later, advanced research centers opened, and the following decades were characterized by new biological treatments, such as antibiotics. These advancements, along with developments in chemistry, genetics, and radiography, led to modern medicine.

Although disease patterns change constantly, communicable diseases remain the leading cause of mortality and morbidity in the least and less developed countries. Despite decades of economic growth and development in countries that belong to the World Health Organization (WHO) South-East Asia Region, most countries in this region still have a high burden of communicable diseases. This raises some urgent concerns. The first is that despite policies and interventions to prevent and control communicable diseases, most countries have failed to eradicate vaccine-preventable diseases. Second, sustainable financing to scale up interventions is lacking, especially for emerging and re-emerging diseases that can produce epidemics. Finally, in the present global economic and political context, it is important to understand how international aid agencies and donors prioritize their funding allocations for the prevention, control, and treatment of communicable diseases. Prioritization is especially critical if one accepts the global public good character of communicable diseases.

A contagious disease is an infectious disease that is transmitted to other persons, either by physical contact with the person suffering from the disease, by casual contact with their secretions or with objects they have touched, or by airborne and other routes. Non-contagious diseases, by contrast, do not spread readily from person to person, so medical isolation plays little role in containing them. Most epidemics are caused by contagious diseases, with occasional exceptions, such as yellow fever, phthisis and so on.


Motivation

Disease is prevalent in every society, and as economies develop, healthcare becomes a major concern in daily life. Contagious diseases such as TB, malaria, cholera, meningitis, influenza A (H5N1) (avian flu), severe acute respiratory syndrome (SARS) and chikungunya have recently reached epidemic proportions in some countries, especially developing countries. We therefore want to study and analyze historical records of seven contagious diseases in the US from 1916 to 2010: smallpox, rubella, hepatitis, measles, polio, mumps and pertussis. This can help us identify historical patterns for some typical contagious diseases and apply them to other diseases.


Scientific methods combined with a large body of reliable research tend to produce convincing and inspiring results. We intend to use a large-scale biomedical literature database to construct a symptom-based human disease network and investigate the connections between related diseases.



In addition, this project serves the following purposes: 1) exploratory analysis of the datasets; 2) helping domain experts look for unexpected associations among diseases and validate their research results; 3) bridging the gap between knowledge that is produced at the lab bench and held only by experts, and its use at the clinical bedside.

Relevant research

Although previous research has brought remarkable advances in the understanding of human diseases, polygenicity and pleiotropy remain two major factors hampering progress. In addition, the boundaries between diseases are blurring, as diseases can have multiple causes and can be related across dimensions, which slows research.



Constructing a disease network is considered an effective way to understand the complex relationships between diseases, and previous work supports this. For example, Hidalgo et al. constructed a phenotypic disease network using comorbidity patterns from more than 30 million Medicare patients. The network captured disease progression patterns: patients tend to develop diseases in the network vicinity of diseases they already have, and patients with highly interconnected diseases show higher mortality.



Furthermore, Rzhetsky et al. inferred the comorbidity links between 161 disorders from the disease history of 1.5 million patients and proposed models to estimate the genetic overlap between diseases.



Moreover, symptoms are the most directly observable characteristics of a disease and the very basis of clinical disease classification. 



Therefore, constructing a network in which two diseases are connected by their shared symptoms could be a useful way to discover associations between diseases, between symptoms, and between diseases and symptoms.

Last but not least, the US-related database was digitized and standardized by a team at the University of Pittsburgh, including Professor Wilbert van Panhuis, MD, PhD, Professor John Grefenstette, PhD, and Dean Donald Burke, MD.


Dataset Overview

1. Occurrences of disease terms and symptom terms, with their tf-idf scores

2. The symptom similarity score between paired diseases

3. US contagious diseases, 1916-2010


Methodology

Similarity

Disease similarity and symptom similarity are measured using tf-idf scores.
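The idea of a tf-idf based similarity can be sketched as follows. This is only an illustration in Python (the project itself plans to use R), and it rests on several assumptions that the proposal does not state: the toy symptom counts are invented, term frequency is taken as the raw count, idf is log(N / df), and cosine similarity is used to compare two diseases.

```python
import math
from collections import Counter

# Hypothetical toy data: symptom-term counts per disease (not from the project dataset).
disease_symptoms = {
    "measles": Counter({"fever": 8, "rash": 9, "cough": 4}),
    "rubella": Counter({"fever": 5, "rash": 7}),
    "pertussis": Counter({"cough": 10, "fever": 2}),
}


def tfidf_vectors(counts_by_doc):
    """Return {doc: {term: weight}} with tf = raw count and idf = log(N / df)."""
    n_docs = len(counts_by_doc)
    doc_freq = Counter()
    for counts in counts_by_doc.values():
        doc_freq.update(counts.keys())  # each term counted once per document
    return {
        doc: {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in counts.items()}
        for doc, counts in counts_by_doc.items()
    }


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors(disease_symptoms)
sim_measles_rubella = cosine_similarity(vectors["measles"], vectors["rubella"])
sim_rubella_pertussis = cosine_similarity(vectors["rubella"], vectors["pertussis"])
```

In this toy example "fever" appears for every disease, so its idf is zero and it contributes nothing to the similarity; the score between two diseases is driven by the less common symptoms they share.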

Centrality

The centrality of a node or edge measures how central (or important) that node or edge is in the network.

  • betweenness centrality: The betweenness centrality of a node is the number of shortest paths that pass through that node.
  • closeness centrality: Closeness centrality measures how many steps are required to reach every other node from a given node. It describes the distance of a node to all other nodes. The more central a node is, the closer it is to all other nodes.
  • eigenvector centrality: A node is important if it is linked to by other important nodes. The centrality of each node is proportional to the sum of the centralities of the nodes to which it is connected. In general, nodes with high eigenvector centrality are those linked to many other nodes which are, in turn, connected to many others (and so on).

We intend to try the above centrality algorithms and choose the one with the best result as the basis for our modularity detection.
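The three measures above can be sketched on a small example. This is an illustrative Python sketch, not the project's implementation (which is planned around R packages). The five-node graph is hypothetical, the graph is assumed to be undirected, unweighted and connected, betweenness is computed with Brandes' algorithm (which counts fractions of shortest paths when several exist), and eigenvector centrality uses power iteration on A + I, a shift chosen here only to keep the iteration stable.

```python
import math
from collections import deque

# Hypothetical toy graph: a triangle A-B-C with a tail C-D-E.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)


def closeness_centrality(adj):
    """(n - 1) / sum of shortest-path distances, assuming a connected graph."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def betweenness_centrality(adj):
    """Brandes' algorithm for an undirected, unweighted graph (unnormalized)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # accumulate dependencies in order of decreasing distance
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: c / 2 for v, c in bc.items()}  # each undirected pair was counted twice


def eigenvector_centrality(adj, iterations=200):
    """Power iteration on A + I, normalized to unit length at every step."""
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[w] for w in adj[v]) for v in adj}
        norm = math.sqrt(sum(x * x for x in new.values()))
        score = {v: x / norm for v, x in new.items()}
    return score


closeness = closeness_centrality(adj)
betweenness = betweenness_centrality(adj)
eigenvector = eigenvector_centrality(adj)
```

On this toy graph, node C ranks highest under all three measures because it joins the triangle to the tail, while D ranks second on betweenness only because every path to E passes through it.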

Community detection

To some extent, community detection is similar to clustering, because it consists of grouping nodes. The difference between the two is that community detection is based on the graph topology.
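One way to see how grouping can depend on topology alone is to score candidate groupings with modularity, which compares the number of edges inside each group with the number expected from the node degrees. The sketch below is an assumption-laden illustration in Python: the proposal does not say which community detection algorithm will be used, the graph (two triangles joined by one edge) is hypothetical, and only the scoring step is shown, not a search over partitions.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities of (internal_edges / m) - (degree_sum / (2 * m)) ** 2,
    where m is the total number of edges.
    """
    m = len(edges)
    community_of = {node: i for i, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)
    degree_sum = [0] * len(communities)
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[i] / m - (degree_sum[i] / (2 * m)) ** 2
        for i in range(len(communities))
    )


# Hypothetical toy graph: two triangles (1, 2, 3) and (4, 5, 6) joined by the edge 3-4.
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

q_triangles = modularity(toy_edges, [{1, 2, 3}, {4, 5, 6}])
q_pairs = modularity(toy_edges, [{1, 2}, {3, 4}, {5, 6}])
q_single = modularity(toy_edges, [{1, 2, 3, 4, 5, 6}])
```

The partition that follows the two triangles scores highest, and no node attributes are involved: the grouping is determined entirely by which nodes are connected.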


Visualization deliverables

The visualization deliverables consist of two sections:

  1. Exploratory section
  2. Analytical section

Relevant packages

Tentative R packages relevant to the scope of the project:
Data preparation and exploratory analysis:
tidyverse: designed for data science, tidyverse is a set of R packages that work well together for data wrangling and data visualization. Importantly, it keeps data manipulation consistent with the tidy approach.

Data preprocessing and visualization:
Graphs and networks are prevalent data structures. Although efficient packages exist for network analysis (e.g. the igraph package and the network package), network data itself is not tidy; it can instead be envisioned as two tidy tables, one for node data and one for edge data. Moreover, the advantages that ggplot2 and dplyr bring to data analysis and visualization are not sufficient on their own for network analysis.

For these reasons, we choose to make full use of ggraph and tidygraph to keep the structure of our project tidy.

ggraph: ggraph is built on ggplot2 and takes the same flexible approach of building graphs layer by layer. In contrast to ggplot2, it is better suited to visualizing networks.

tidygraph: This package provides a tidy API for graph/network manipulation. Although network data is not tidy by nature, the tidygraph framework treats it as two tidy data tables, one describing the node data and the other the edge data. tidygraph provides a way to switch between the two tables and provides dplyr verbs for manipulating them. Furthermore, it provides access to many graph algorithms with return values that facilitate their use in a tidy workflow.
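The "two tidy tables" view of a network can be illustrated with plain data structures. The following Python sketch mirrors only the idea, not the tidygraph API: the node and edge records, the column names, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter

# Node table: one row per disease (hypothetical rows).
nodes = [
    {"id": 1, "name": "measles"},
    {"id": 2, "name": "rubella"},
    {"id": 3, "name": "mumps"},
]

# Edge table: one row per disease pair, referring to node ids (hypothetical values).
edges = [
    {"from": 1, "to": 2, "similarity": 0.9},
    {"from": 1, "to": 3, "similarity": 0.2},
]

# Work on the edge table: keep only strong links (threshold chosen for illustration).
strong_edges = [edge for edge in edges if edge["similarity"] >= 0.5]

# Switch to the node table: add a degree column derived from the filtered edge table.
degree = Counter()
for edge in strong_edges:
    degree[edge["from"]] += 1
    degree[edge["to"]] += 1
nodes_with_degree = [{**node, "degree": degree[node["id"]]} for node in nodes]
```

Each step touches exactly one of the two tables, which is the pattern that switching between node data and edge data is meant to support.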

Interactive visualization:
networkD3: this package makes interactive network analysis easier to implement. Plots are not a static "as is" representation of the data, but allow users to explore data points and hierarchies among the data, filter data by groups, and so on.
highcharter: this package is an R wrapper for the Highcharts JavaScript library and its modules. Highcharts is a mature and flexible JavaScript charting library with a powerful API.