Group09 Proposal

Background

Healthcare and disease have been major subjects of study for centuries. The germ theory of disease in the 19th century led to cures for many infectious diseases; at the same time, public health measures developed as the rapid growth of cities required systematic sanitation. In the 20th century, advanced research centers opened, and the following decades were characterized by new biological treatments such as antibiotics. These advances, along with developments in chemistry, genetics, and radiography, led to modern medicine.

The existence of discrete inheritable units was first suggested by Gregor Mendel (1822-1884). Advances in understanding genes and inheritance continued throughout the 20th century, and experiments in the 1940s and 1950s showed deoxyribonucleic acid (DNA) to be the molecular repository of genetic information.

Motivation

Diseases are prevalent in every society, and as economies develop, healthcare becomes a major concern in daily life. Recent efforts in genetics and proteomics have paid off with remarkable advances in understanding. However, some diseases remain too complex to study in depth, and most aspects of the relationship between genotype and phenotype are still unclear.

The complexity of diseases originates from the nature of genes, which involves genetics and molecular biology. Although genes have long been studied, their power and complexity still exceed human imagination. With modern theories and tools, understanding the entangled relationships between diseases is less difficult than it used to be, and a number of resources can be constructed to explore those relationships without requiring medical expertise.

Scientific methods combined with a large body of reliable research tend to yield convincing and inspiring results. We intend to use a large-scale biomedical literature database to construct a symptom-based human disease network and investigate its connections.

In addition, this project serves the following purposes: 1) exploratory analysis of the datasets; 2) helping domain experts look for unexpected associations among diseases and validate their research results; 3) bridging the gap between knowledge that is held only by experts or produced at the lab bench and its use at the clinical bedside.

Relevant research

Although previous research has brought remarkable advances in the understanding of human diseases, polygenicity and pleiotropism remain two major factors hampering progress. In addition, the boundaries between diseases are blurred, since diseases can have multiple causes and can be related across several dimensions, which slows research.

Constructing a disease network is considered an effective way to understand the complex relationships between diseases, and previous work supports this. For example, Hidalgo et al. constructed a phenotypic disease network using comorbidity patterns from more than 30 million Medicare patients. The network captures disease progression patterns: patients tend to develop diseases in the network vicinity of diseases they already have, and patients with highly interconnected diseases show higher mortality.

Furthermore, Rzhetsky et al. inferred the comorbidity links between 161 disorders from the disease history of 1.5 million patients and proposed models to estimate the genetic overlap between diseases.

Moreover, symptoms are the most directly observable characteristics of a disease and the very basis of clinical disease classification.

Therefore, constructing a network that connects pairs of diseases through their shared symptoms could be a useful way to discover associations among diseases, among symptoms, and between diseases and symptoms.

Dataset Overview

1. Disease terms and occurrence
2. Symptom terms and occurrence
3. The co-occurrence of disease terms and symptom terms, with their tf-idf scores
4. The symptom similarity score between paired diseases

Methodology

Similarity

Disease similarity and symptom similarity are measured using tf-idf scores.
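As a minimal sketch of how tf-idf scores could yield a symptom similarity between two diseases, the Python example below weights each symptom count by an inverse document frequency and compares the resulting disease vectors. The toy counts, the function names, the log(N / n) weighting, and the use of cosine similarity are all assumptions made for illustration; the datasets provide their own scores, and the project itself is planned in R.

```python
import math

# Hypothetical toy data: how often each symptom term co-occurs with a disease term.
cooccurrence = {
    "influenza": {"fever": 8, "cough": 6, "fatigue": 3},
    "pneumonia": {"fever": 5, "cough": 9, "chest pain": 4},
    "migraine": {"headache": 10, "nausea": 4, "fatigue": 2},
}


def tfidf_vectors(counts):
    """Weight each count by log(N / n_s): N is the number of diseases,
    n_s the number of diseases in which symptom s appears (assumed weighting)."""
    n_diseases = len(counts)
    doc_freq = {}
    for symptoms in counts.values():
        for symptom in symptoms:
            doc_freq[symptom] = doc_freq.get(symptom, 0) + 1
    return {
        disease: {
            symptom: count * math.log(n_diseases / doc_freq[symptom])
            for symptom, count in symptoms.items()
        }
        for disease, symptoms in counts.items()
    }


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[symptom] for symptom, weight in u.items() if symptom in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors(cooccurrence)
sim_flu_pneumonia = cosine_similarity(vectors["influenza"], vectors["pneumonia"])
sim_flu_migraine = cosine_similarity(vectors["influenza"], vectors["migraine"])
print(sim_flu_pneumonia, sim_flu_migraine)
```

In this toy data, influenza and pneumonia share two heavily weighted symptoms, so their similarity comes out higher than that of influenza and migraine, which share only fatigue.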

Centrality

The centrality of a node or edge measures how central (or important) it is in the network.

betweenness centrality: The betweenness centrality of a node is the number of shortest paths that pass through that node.
closeness centrality: Closeness centrality measures how many steps are required to reach every other node from a given node. It describes the distance of a node to all other nodes: the more central a node is, the closer it is to all other nodes.
eigenvector centrality: A node is important if it is linked to by other important nodes. The centrality of each node is proportional to the sum of the centralities of the nodes to which it is connected. In general, nodes with high eigenvector centrality are those linked to many other nodes which are, in turn, connected to many others (and so on).
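To make the three definitions concrete, the following Python sketch computes them on a small undirected, unweighted toy graph. The graph, the function names, and the normalisation choices are assumptions for illustration only; in the project these measures would come from an R graph package rather than hand-written code. Betweenness follows Brandes' algorithm, which counts the fraction of shortest paths through a node when several shortest paths exist.

```python
import math
from collections import deque

# Hypothetical toy graph: a triangle A-B-C with a tail C-D-E.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)


def betweenness(adj):
    """Brandes' algorithm for unweighted graphs (unnormalised)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies, farthest first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was visited from both ends in an undirected graph.
    return {v: value / 2 for v, value in score.items()}


def closeness(adj):
    """(reachable nodes) / (sum of shortest-path distances); assumes a connected graph."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def eigenvector(adj, iterations=200, tol=1e-10):
    """Power iteration on (A + I); the identity shift keeps the iteration
    from oscillating on bipartite graphs and leaves the eigenvector unchanged."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = math.sqrt(sum(value * value for value in new.values()))
        new = {v: value / norm for v, value in new.items()}
        if max(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    return x


bc, cl, ev = betweenness(adj), closeness(adj), eigenvector(adj)
print(bc, cl, ev, sep="\n")
```

In this toy graph all three measures rank node C highest, but they need not agree in general, which is why several are worth comparing.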

We intend to try the centrality measures above and choose the one that gives the best result as the basis for our modularity-based community detection.

Community detection

To some extent, community detection is similar to clustering, since both consist of grouping nodes. The difference is that community detection is based on the graph topology.
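One common way to judge a grouping purely from graph topology is modularity, which compares the share of edges inside each community with the share expected if edges were placed at random given the node degrees. The Python sketch below scores two candidate partitions of a toy graph; the graph, the function name, and the choice of Newman's modularity as the quality measure are assumptions for illustration, since no specific community detection algorithm has been fixed yet.

```python
# Hypothetical toy graph: two triangles joined by a single bridge edge C-D.
edges = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("C", "D"),
    ("D", "E"), ("D", "F"), ("E", "F"),
]


def modularity(edges, community_of):
    """Newman's modularity Q = sum over communities of
    (internal edges / m) - (total degree / (2 * m)) ** 2,
    for an undirected, unweighted graph with m edges."""
    m = len(edges)
    internal = {}    # edges with both endpoints in the community
    degree_sum = {}  # sum of node degrees in the community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Candidate 1: one community per triangle.
two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
# Candidate 2: everything in a single community.
one_group = {node: 0 for node in "ABCDEF"}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_two, q_one)
```

Splitting the graph along the bridge gives the higher score, matching the intuition that each triangle is a densely connected group; a detection algorithm would search for the partition that maximises this value.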

Visualization deliverables

The visualization deliverables consist of two sections:

Exploratory section
Analytical section

Relevant packages

Tentative R packages relevant to the scope of the project:

Data preparation and exploratory analysis:

tidyverse: tidyverse is designed for data science and consists of a set of R packages well suited to data wrangling and data visualization. Importantly, it keeps data manipulation consistent with the tidy approach.

Data preprocessing and visualization:

Graphs and networks are prevalent data structures. Although efficient packages for network analysis exist (e.g. the igraph and network packages), network data itself is not tidy; it can, however, be envisioned as two tidy tables, one for node data and one for edge data. Moreover, the advantages that ggplot2 and dplyr bring to data analysis and visualization are not sufficient on their own for network analysis.

For these reasons, we choose to make full use of ggraph and tidygraph to keep the structure of our project tidy.

ggraph: ggraph is built on ggplot2 and takes the same flexible approach of building plots layer by layer. In contrast to plain ggplot2, it is better suited to visualizing networks.

tidygraph: This package provides a tidy API for graph/network manipulation. Although network data is not tidy by nature, tidygraph treats it as two tidy data tables, one describing the node data and the other the edge data. It provides a way to switch between the two tables and dplyr verbs for manipulating them. Furthermore, it provides access to many graph algorithms with return values that facilitate their use in a tidy workflow.

Interactive visualization:

networkD3: this package makes interactive network analysis easier to implement. Plots are not static "as is" representations of the data; they allow users to explore data points and hierarchies among the data, filter data by groups, and so on.
