Difference between revisions of "Group09 Report"

From Visual Analytics and Applications
Line 49: Line 49:
 
==Data process ==
 
==Data process ==
 
===Dataset Overview===
 
===Dataset Overview===
*The occurrence of disease terms and symptom terms and their tf-idf score<br>
+
<div style="margin:0px; padding: 2px; font-family: Arial; border-radius: 1px; text-align:left">
The symptoms-disease dataset comes from the Nature human symptoms-disease network (HSDN), which combines the MeSH vocabulary and the PubMed literature. We filtered it to seven contagious diseases (the same as below) for consistency.<br>
+
{| class="wikitable" style="background-color:#FFFFFF;" width="100%"
[[File:Group9_table1.png|800px|left]]<br>
+
|-
 
+
|
*US contagious diseases from 1916-2010 <br>
+
Table view
The contagious disease records come from Kaggle and include standardized counts at the state level for smallpox, polio, measles, mumps, rubella, hepatitis A, and whooping cough from weekly National Notifiable Disease Surveillance System (NNDSS) reports for the United States. The time period covered varies by disease, between 1916 and 2010.<br>
+
||
[[File:Group9_table2.png|800px|center]]<br>
+
Description
 
+
|-
*US population from 1916 -2010<br>
+
|
 +
[[File:Group9_table1.png|600px|left]]
 +
||
 +
The symptoms-disease dataset comes from the Nature human symptoms-disease network (HSDN), which combines the MeSH vocabulary and the PubMed literature. We filtered it to seven contagious diseases (the same as below) for consistency.
 +
|-
 +
|
 +
[[File:Group9_table2.png|600px|left]]
 +
||
 +
US contagious diseases from 1916-2010 <br>
 +
The contagious disease records come from Kaggle and include standardized counts at the state level for smallpox, polio, measles, mumps, rubella, hepatitis A, and whooping cough from weekly National Notifiable Disease Surveillance System (NNDSS) reports for the United States. The time period covered varies by disease, between 1916 and 2010.
 +
|-
 +
|-
 +
|
 +
[[File:Group9_table3.png|200px|center]]
 +
||
 +
US population from 1916 -2010<br>
 
Population records are collected from US statistics at the country level. <br>
 
Population records are collected from US statistics at the country level. <br>
[[File:Group9_table3.png|200px|center]]<br>
+
|}
 +
<br>
 
===Data Wrangling===
 
===Data Wrangling===
Prepare Network Data:<br>
+
<div style="margin:0px; padding: 2px; font-family: Arial; border-radius: 1px; text-align:left">
[[File:Group9 2.jpg|400px|center]]
+
{| class="wikitable" style="background-color:#FFFFFF;" width="100%"
And the processed table is following:  
+
|-
[[File:Group9 3.jpg|300px|center]]
+
|
[[File:Group9 4.png|320px|center]]<br>
+
Steps
 
+
||
Prepare Analysis Data: <br>
+
Procedure
 +
|-
 +
|-
 +
|
 +
Step 1: Prepare Network Data
 +
||
 +
[[File:Group9_table2.png|400px|left]]<br>
 +
After processing:
 +
[[File:Group9_table4.png|400px|left]]<br>
 +
[[File:Group9_table5.png|400px|left]]
 +
|-
 +
|-
 +
|
 +
Step 2: Prepare Analysis Data:  
 +
||
 
[[File:Group9 5.jpg|400px|center]]<br>
 
[[File:Group9 5.jpg|400px|center]]<br>
 
+
|-
<b>Calculated fields:</b>
+
|}
Filter: keep records with disease count > 1<br>
 
Real_cases = sum of cases<br>
 
Real_rate = sum of cases / population<br>
 
Estimated = (estimated_per_capita / 100000) * Population <br>
 
Estimated_rate = estimated sum of (incidence_per_capita / 52)<br>
 
Decade = the 10-year period that contains the record's year<br>
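The calculated fields above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record keys (`disease`, `year`, `cases`), the `population_by_year` lookup and both function names are hypothetical, and the per-capita figure is assumed to be expressed per 100,000 people, as the division by 100000 suggests. The filter and the Estimated_rate field are left out because their exact definitions are not clear from the list.

```python
from collections import defaultdict


def summarize(records, population_by_year):
    """Aggregate case records into one row per (disease, year).

    records: iterable of dicts with hypothetical keys 'disease', 'year', 'cases'.
    population_by_year: dict mapping year -> country-level population.
    """
    totals = defaultdict(int)
    for record in records:
        totals[(record["disease"], record["year"])] += record["cases"]

    rows = []
    for (disease, year), real_cases in sorted(totals.items()):
        population = population_by_year[year]
        rows.append({
            "disease": disease,
            "year": year,
            "decade": year // 10 * 10,             # e.g. 1928 -> 1920
            "real_cases": real_cases,              # Real cases = sum of cases
            "real_rate": real_cases / population,  # Real_rate = sum of cases / population
        })
    return rows


def estimated_cases(per_capita, population):
    """Estimated = (per-capita figure / 100000) * Population."""
    return per_capita / 100000 * population
```

Grouping by disease and year is a design choice for this sketch; the same aggregation could be done per state or per week if the dashboard needs finer detail.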
 
 
 
===Methodology===
 
*Centrality:<br>
 
 
 
The centrality of a node or edge measures how central (or important) that node or edge is in the network.<br>
 
 
 
 
 
Betweenness centrality of a node v is given by the expression<br>
 
[[File:Group9 6.png|200px|center]]
 
 
 
where σ<sub>st</sub> is the total number of shortest paths from node s to node t and σ<sub>st</sub>(v) is the number of those paths that pass through v.<br>
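As a rough illustration of this definition, the sketch below computes betweenness centrality for an unweighted, undirected graph using Brandes' accumulation scheme. The adjacency-dict input format, the function name and the toy node names are assumptions made for the example; this is not the project's own implementation.

```python
from collections import deque


def betweenness(adj):
    """Betweenness centrality for an unweighted, undirected graph.

    adj: dict mapping each node to an iterable of its neighbours.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s: count shortest paths (sigma)
        # and remember each node's predecessors on those paths.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s] = 1
        dist[s] = 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating the share of
        # shortest paths from s that pass through each node.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair (s, t) was visited from both ends.
    return {v: c / 2.0 for v, c in score.items()}


# Hypothetical toy network: one symptom linked to two diseases.
toy = {"fever": ["measles", "mumps"], "measles": ["fever"], "mumps": ["fever"]}
```

In the toy network, "fever" lies on the only shortest path between the two diseases, so it is the only node with a non-zero score.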
 
 
 
Closeness centrality of a node is a measure of centrality in a network, calculated as the reciprocal of the sum of the length of the shortest paths between the node and all other nodes in the graph. Thus, the more central a node is, the closer it is to all other nodes.<br>
 
[[File:Group9 7.png|200px|center]]
 
where d(x,y) is the distance between vertices x and y. When speaking of closeness centrality, people usually refer to its normalized form which represents the average length of the shortest paths instead of their sum. It is generally given by the previous formula multiplied by N-1, where N is the number of nodes in the graph. For large graphs this difference becomes inconsequential so the -1 is dropped resulting in:<br>
 
[[File:Group9 8.png|200px|center]]
 
This adjustment allows comparisons between nodes of graphs of different sizes.<br>
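The raw and normalized forms described above can be sketched as follows. The sketch assumes a connected, unweighted, undirected graph given as an adjacency dict; the function names and input format are illustrative assumptions. It uses the N-1 normalization rather than the large-graph approximation.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness(adj, node, normalized=True):
    """Reciprocal of the summed distances from node to all other nodes.

    With normalized=True the value is multiplied by N-1, which makes it the
    reciprocal of the average shortest-path length.
    """
    total = sum(bfs_distances(adj, node).values())
    if total == 0:  # isolated node or single-node graph
        return 0.0
    raw = 1.0 / total
    return raw * (len(adj) - 1) if normalized else raw
```

For a disconnected graph the sum only covers reachable nodes, so the values would need a different treatment before they can be compared.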
 
 
 
 
 
Taking distances from or to all other nodes is irrelevant in undirected graphs, whereas it can produce totally different results in directed graphs (e.g. a website can have a high closeness centrality from an outgoing link, but low closeness centrality from incoming links).<br>
 
 
 
 
 
Eigenvector centrality is a measure of the influence of a node in a network. Relative scores are assigned to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes. A high eigenvector score means that a node is connected to many nodes who themselves have high scores. <br>
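One common way to compute these scores is power iteration, sketched below for a connected, undirected graph. The function name, parameters and adjacency-dict input are assumptions for illustration. Each node keeps its own score in the update (equivalent to iterating with A + I), which leaves the leading eigenvector unchanged but avoids the oscillation that plain iteration can show on bipartite graphs such as a disease-symptom network.

```python
import math


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Approximate eigenvector centrality by power iteration.

    adj: dict mapping each node to an iterable of its neighbours.
    Scores are scaled so that their squares sum to 1.
    """
    start = 1.0 / math.sqrt(len(adj)) if adj else 0.0
    score = {v: start for v in adj}
    for _ in range(iterations):
        # New score = own score + neighbours' scores (the A + I shift).
        new = {v: score[v] + sum(score[w] for w in adj[v]) for v in adj}
        norm = math.sqrt(sum(x * x for x in new.values()))
        if norm == 0.0:
            return score
        new = {v: x / norm for v, x in new.items()}
        if max(abs(new[v] - score[v]) for v in adj) < tol:
            return new
        score = new
    return score
```

On a star-shaped network the hub receives the highest score, because every other node's score flows into it.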
 
 
 
 
 
*Community Detection:<br>
 
 
 
There are two common ways to detect communities in a network: one relies on clustering algorithms, the other on network structure (graph topology). We tried the k-means and PAM clustering algorithms for community detection, but neither worked well on our data; one possible reason is that only three attributes are available for clustering. Hence, we chose the modularity-based detection algorithms built into the R package “igraph” to perform community detection.<br>
 
 
 
 
 
Modularity was designed to measure the strength of division of a network into modules (also called groups, clusters or communities). Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. Modularity is often used in optimization methods for detecting community structure in networks.<br>
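To make the idea concrete, the sketch below evaluates the standard (Newman) modularity score of a given partition of a simple undirected graph. It only scores a partition; it does not search for the best one, which is what the igraph detection algorithms do. The function name and input formats are assumptions for illustration.

```python
def modularity(adj, community):
    """Modularity Q of a partition of a simple undirected graph.

    adj: dict mapping each node to a list of its neighbours (each edge
         appears in both endpoints' lists).
    community: dict mapping each node to a community label.
    """
    twice_m = sum(len(neighbours) for neighbours in adj.values())  # 2 * edge count
    if twice_m == 0:
        return 0.0
    internal = {}  # per community: internal edge endpoints (each edge counted twice)
    degree = {}    # per community: sum of node degrees
    for v, neighbours in adj.items():
        c = community[v]
        degree[c] = degree.get(c, 0) + len(neighbours)
        internal[c] = internal.get(c, 0) + sum(
            1 for w in neighbours if community[w] == c
        )
    # Q = sum over communities of (fraction of edges inside the community)
    #     minus (expected fraction if edges were placed at random).
    return sum(internal[c] / twice_m - (degree[c] / twice_m) ** 2 for c in degree)
```

A partition that follows the dense groups of the network scores clearly above zero, while putting every node into a single community scores exactly zero.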
 
 
 
 
 
*Purpose of Applying Above Algorithms<br>
 
 
 
Researchers want to explore the data using their domain knowledge, so we offer these algorithms as alternative options for exploring the relationships between diseases and symptoms from different but still scientific perspectives. To make the results readable for users who do not know graph theory, we calculate the mean of the clustering score as a network score that the user can use to choose an algorithm. The score indicates how reliable the clustering algorithm is: the higher the score, the better the algorithm works.<br>
 

Revision as of 22:53, 13 August 2018




Introduction

Infectious diseases are caused by pathogenic microorganisms, such as bacteria, viruses, parasites or fungi; the diseases can be spread, directly or indirectly, from person to person, even from animals to humans. Zoonotic diseases are infectious diseases of animals that can cause disease when transmitted to humans.

The 21st century has already been marked by major epidemics. Old diseases - cholera, plague and yellow fever - have returned, and new ones have emerged - SARS, pandemic influenza, MERS, Ebola and Zika. These epidemics and their impact on global public health are quite remarkable.

Although disease patterns change constantly, communicable diseases remain the leading cause of mortality and morbidity in the least and less developed countries. Despite decades of economic growth and development in countries that belong to the World Health Organization (WHO) South-East Asia Region, most countries in this region still have a high burden of communicable diseases. This raises some urgent concerns. The first is that despite policies and interventions to prevent and control communicable diseases, most countries have failed to eradicate vaccine-preventable diseases. Second, sustainable financing to scale up interventions is lacking, especially for emerging and re-emerging diseases that can produce epidemics.

Objectives and Motivations

Disease is prevalent in every society, and as economies develop, healthcare becomes a major concern in daily life. Many contagious diseases, such as TB, malaria, cholera, meningitis, influenza A(H5N1) virus (avian flu), severe acute respiratory syndrome (SARS) and chikungunya, still reach epidemic proportions in some countries, especially developing ones. We therefore apply visual analytics techniques to historical records of seven contagious diseases in the US from 1916 to 2010 (smallpox, rubella, hepatitis, measles, polio, mumps and pertussis), together with medical records of diseases and their corresponding symptoms. The patterns found in these historically typical contagious diseases may then be applied to other diseases.

Scientific methods backed by a large body of reliable research tend to produce convincing and inspiring results. We therefore intend to use a large-scale biomedical literature database to construct a symptom-based human disease network and investigate the connections between related diseases.

This project also serves the following purposes: 1) provide exploratory analysis of the datasets; 2) help domain experts look for unexpected associations among diseases and validate their research results; 3) bridge the gap between knowledge that is held only by experts or produced at the lab bench and its use at the clinical bedside; 4) give non-specialists straightforward and useful information from the application (e.g. which symptoms are associated with a specific contagious disease).

Previous Work

Summary

In the human symptoms-disease network study [3], previous researchers used a large-scale biomedical literature database to construct a symptom-based human disease network and investigated the connection between clinical manifestations of diseases and their underlying molecular interactions. They demonstrated that the similarity of two diseases correlates strongly with the number of shared symptoms.
Their research starts by crawling large-scale bibliographic records from PubMed. They used the associated Medical Subject Headings (MeSH) to extract symptom terms and disease terms from the bibliographic records, then applied text analysis techniques to generate co-occurrences and calculate the TF-IDF score of each symptom-disease pair. The resulting dataset contains hundreds of diseases and thousands of symptoms.
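As a rough illustration of TF-IDF weighting on symptom-disease co-occurrence counts, the sketch below multiplies each count by the log of (number of diseases / number of diseases in which the symptom appears). This is one common TF-IDF variant and is an assumption here; the exact formula used in [3] is not stated in this report. The function name and input format are also hypothetical.

```python
import math


def tfidf_weights(cooccurrence):
    """Weight symptom-disease co-occurrence counts by a TF-IDF style score.

    cooccurrence: dict mapping (symptom, disease) -> co-occurrence count.
    Returns a dict with the same keys (for positive counts) and weighted scores.
    """
    diseases = {disease for _, disease in cooccurrence}
    diseases_per_symptom = {}
    for (symptom, disease), count in cooccurrence.items():
        if count > 0:
            diseases_per_symptom.setdefault(symptom, set()).add(disease)
    n_diseases = len(diseases)
    return {
        (symptom, disease): count
        * math.log(n_diseases / len(diseases_per_symptom[symptom]))
        for (symptom, disease), count in cooccurrence.items()
        if count > 0
    }
```

Under this weighting, a symptom that co-occurs with every disease gets a weight of zero, while a symptom specific to few diseases keeps most of its raw count.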

Group9 1.jpg


Shortcomings

  1. Members of the public without domain knowledge find it very difficult to see the trend and spread area of a disease.
  2. The previous symptoms-disease network covers the majority of human diseases, so the relationships among diseases and symptoms are obscured by the sheer amount of data.
  3. Their data come from crawling medical reports on medical websites and, as the authors mention, the number of reports is lower than the real number of incidences.

To improve on this, we obtained contagious disease records from Kaggle, which record the number of contagious disease incidences in the US from 1916 to 2010.

Data process

Dataset Overview

Table view

Description

Group9 table1.png

The symptoms-disease dataset comes from the Nature human symptoms-disease network (HSDN), which combines the MeSH vocabulary and the PubMed literature. We filtered it to seven contagious diseases (the same as below) for consistency.

Group9 table2.png

US contagious diseases from 1916-2010
The contagious disease records come from Kaggle and include standardized counts at the state level for smallpox, polio, measles, mumps, rubella, hepatitis A, and whooping cough from weekly National Notifiable Disease Surveillance System (NNDSS) reports for the United States. The time period covered varies by disease, between 1916 and 2010.

Group9 table3.png

US population from 1916-2010
Population records are collected from US statistics at the country level.


Data Wrangling

Steps

Procedure

Step 1: Prepare Network Data

Group9 table2.png

After processing:

Group9 table4.png

Group9 table5.png

Step 2: Prepare Analysis Data:

Group9 5.jpg