ISSS608 2016-17T3 Group15 Report



ISSS608 Visual Analytics and Applications

PandemViz: An interactive analytics tool for understanding pandemic outbreaks through data visualisation


Proposal

Poster

Application

Report

 


By Chua Gim Hong, Huang Liwei and Ngo Siew Hui

Abstract

A pandemic is an epidemic or outbreak of infectious disease that spreads rapidly, not only to many people but also across countries. The unprecedented mobility of people and food over the last 30 years has been accompanied by a steady increase in the frequency and diversity of disease outbreaks. No country is immune to this growing global threat. Scientists predict that it is not a matter of if, but when, the next pandemic will happen. Singapore, a small city state with one of the highest population densities in the world and some of the heaviest air passenger traffic, is particularly vulnerable.

There are reasons to remain optimistic, as Singapore’s SMART Nation initiatives and modern healthcare systems’ electronic records have opened up new possibilities in the fight against potential infectious disease outbreaks in the country. Data will become increasingly ubiquitous as the world, including Singapore, continues to make significant advances in the digital age. Insights from the data have the potential to offer a critical line of preparedness through early identification, rapid and effective response, and containment of disease outbreaks. To leverage this increasing availability of data, we will need appropriate and affordable data exploration, visualisation and analysis tools.

In view of this, our project aims to develop an interactive visual analytics tool, PandemViz, using R Shiny and R data visualisation packages to produce joy plots, calendar heatmaps and trellis plots. PandemViz will be useful for understanding pandemic outbreaks through data visualisation. In our development, R is used to analyse a synthetic dataset (i.e. computer- and human-generated data) relating to a major disease outbreak that spanned several cities across the world in 2009. In an actual disease outbreak scenario, PandemViz could be used by health officials to analyse hospitalisation data and understand the spread of the pandemic across countries, so as to mount effective responses as part of overall efforts to contain it.

This presentation consists of four main sections. First, the motivation and objectives of the project will be discussed. This is followed by a detailed discussion of the principles and concepts of the key visualisation methods used. Next, the R packages used to develop the application and the user interface design will be discussed. Using the synthetic dataset, we will then demonstrate how the functions of our tool can be used to detect the patterns and attribute distributions that characterise a pandemic spread, and discuss the efficacy of each of these visual analytics techniques in detail. The presentation will conclude by sharing the insights gained through working on the project and the potential application areas of our visualisation tool. We will also suggest possibilities for future work that combines hospital records with other data sources.

[VAST Challenge 2010 - Characterisation of Pandemic Spread]

Motivation of the Application

Motivation perspectives

Several perspectives motivate our application:

Epidemiological

Figure 1 shows the staggering number of deaths caused by pandemics throughout history, not to mention the millions of sufferers who survived the ordeal.
A pandemic is an epidemic or outbreak of infectious disease that spreads rapidly, not only to many people but also across geographical areas. Many pandemic experts believe that a pandemic will occur sometime in the next two generations.

Figure 1 – Outbreak of Pandemic in History.
Deadliest Pandemic.jpg

Reference: https://www.good.is/infographics/infographic-the-deadliest-disease-outbreaks-in-history


The enemy is microscopic, but the effect is devastating. In some cases, part or all of a city was wiped out (see Figure 2). Epidemiologists are bracing themselves for what has been called the next "Big One": a disease that could kill tens of millions of people.

Figure 2 - A time perspective of deadliest pandemic outbreaks in history.
Pandemic History.png

References:
List of Epidemics: https://en.wikipedia.org/wiki/List_of_epidemics
Plague of Athens: https://en.wikipedia.org/wiki/Plague_of_Athens
Black Death: https://en.wikipedia.org/wiki/Black_Death
Antonine Plague: https://en.wikipedia.org/wiki/Antonine_Plague
Plague of Justinian: https://en.wikipedia.org/wiki/Plague_of_Justinian
Cholera's seven pandemics Disease has killed millions since 19th century: http://www.cbc.ca/news/technology/cholera-s-seven-pandemics-1.758504
Spanish Flu: https://en.wikipedia.org/wiki/1918_flu_pandemic
HIV/Aids: http://www.who.int/gho/hiv/en/


Why do experts believe that the next pandemic is a matter of when and not if (see Figure 3)?
The history and science of contagious diseases show that human beings have put themselves at risk by encroaching on wildlife habitats. About 60 percent of our new pathogens come from the bodies of animals. When we encroach on wildlife habitat, or disrupt it in ways that bring people and animals into close contact, their microbes start to spill over into our bodies. (Reference: 'Pandemic' Asks: Is A Disease That Will Kill Tens of Millions Coming? http://www.npr.org/sections/health-shots/2016/02/22/467637849/pandemic-asks-is-a-disease-that-will-kill-tens-of-millions-coming)
Another reason is the unprecedented mobility of people and food over the last 30 years, which has been accompanied by a steady increase in the frequency and diversity of disease outbreaks.

Figure 3 - Animal Pathogen.
Animal Pathogen.png

References:
Animal Pathogen: https://www.pinterest.com/pin/488218415827641787/
The Deadly Ebola virus: http://iacld.ir/en/index.php?option=com_content&view=article&id=829&Itemid=346
The MERS virus: https://www.yahoo.com/news/korea-reports-seven-mers-cases-one-suspect-flies-033321726.html
The next wave: https://s-media-cache-ak0.pinimg.com/736x/a7/61/9e/a7619eed12d93d40bab45dc102b42c8b--bird-flu-morning-post.jpg
Top 10 deadly disease in Africa: https://ask.naij.com/health/top-10-deadly-diseases-in-africa-i23506.html
What To Know More About The New Bird Flu Virus: http://news.northeastern.edu/2013/05/3qs-what-to-know-about-the-new-bird-flu-virus/

National Security

Are we prepared for the next pandemic? Singapore, a small city state with one of the world’s highest population densities and some of the heaviest air passenger traffic, is particularly vulnerable. International travel is a significant risk factor in the spread of disease. Air travel shapes epidemics so powerfully that scientists can estimate where and when an epidemic is likely to strike next by measuring the number of direct flights between infected and uninfected cities.

Figure 4 – Are We Prepared Against This National Security Threat?
Are We Prepared.png

Visual Analytics

We are motivated by the idea of “Democratising Data and Analytics with Visual Analytics”, which, drawing on the experience and observations of Prof Kam, must overcome two key barriers:
a. Data accessibility. Although there has been improvement in recent years in breaking down data accessibility barriers within organisations, such barriers still exist. Much of this data is stored or distributed in a format that casual users cannot easily understand or readily use.
b. Analysis tools. Another barrier to data democratisation is the availability of appropriate tools to help analyse the data. These tools are needed to allow those without a data analysis background to easily extract meaning from the data.

Leveraging National Initiatives & Health Systems

There are reasons to remain optimistic, as Singapore’s SMART Nation initiatives and modern healthcare systems’ electronic records have opened up new possibilities to democratise data in the fight against potential infectious disease outbreaks in the country. Data will become increasingly ubiquitous as the world, including Singapore, continues to make significant advances in the digital age. Insights from the data have the potential to offer a critical line of preparedness through early identification, rapid and effective response, and containment of disease outbreaks. To leverage this increasing availability of data, we will need appropriate and affordable data exploration, visualisation and analysis tools.

Why R?

Our team has chosen R Programming Language because of its strong package ecosystem and charting benefits:
1. R is not only free, but open-source
(Reference: Applications Of R Programming In R-eal World https://elearningindustry.com/applications-r-programming-r-eal-world, By Vaishnavi Agrawal, February 25, 2016.)
2. Integrates with other languages: C/C++, Java, Python.
(Reference: R (programming language) https://en.wikipedia.org/wiki/R_(programming_language)#cite_note-33)
3. Works on Windows, Macintosh, Linux and Unix platforms
(Reference: Statistical Consulting Group UCLA, Academic Technology Services, Technical Report Series, 2006 January 30, revised 2007 February 27, Report Number 1, Comment Number 1, R Relative to Statistical Packages: Comment 1 on Technical Report Number 1 (Version 1.0), Strategically using General Purpose Statistics Packages: A Look at Stata, SAS and SPSS http://www.burns-stat.com/pages/Tutor/R_relative_statpack.pdf)
4. Able to communicate with many data sources: ODBC-compliant databases (Excel, Access) and other statistical packages (SAS, Stata, SPSS, Minitab).
5. Records the steps of an analysis explicitly as code, which makes it easy to reproduce and update reports and to quickly try many ideas or correct issues.
6. The capabilities of R are extended through user-created packages, which allow specialized statistical techniques, graphical devices (such as the ggplot2 package developed by Hadley Wickham), import/export capabilities, reporting tools (knitr, Sweave), etc. These packages are developed primarily in R, and sometimes in Java, C, C++, and Fortran. A core set of packages is included with the installation of R, with more than 11,000 additional packages (as of July 2017) available at the Comprehensive R Archive Network (CRAN), Bioconductor, Omegahat, GitHub, and other repositories.
(Reference: R (programming language) https://en.wikipedia.org/wiki/R_(programming_language)#cite_note-33)
7. Shiny is an R package and web application framework that makes it easy to build interactive, web-ready data visualisations and applications (apps) straight from R, in the same environment. It has a comprehensive list of widgets for interactive features such as selection buttons and input sliders, and it supports user-interface interactions such as click, hover and brush so that users can explore the data more deeply.
(Reference: Why use Shiny? https://www.lynda.com/RStudio-tutorials/Why-use-Shiny/452087/490039-4.html)

Limitations of R

(Reference: Why R? The pros and cons of the R language http://www.infoworld.com/article/2940864/application-development/r-programming-language-statistical-data-analysis.html By Paul Krill, Editor at Large, InfoWorld | Jun 30, 2015.)
R's basic design principles derive from programming languages built in the 1960s. According to the article, its main shortcomings are in security and memory management.
1. Memory management, speed and efficiency are probably the biggest challenges R faces. The design of the language can sometimes pose problems when working with very large data sets, because data has to be stored in physical memory. As computers have gained more memory, this has become less of an issue.
2. Capabilities such as security were not built into the R language. R also cannot be embedded in a web browser, so it cannot be used directly for web-like or internet-like apps, and for a long time it was practically impossible to use R as a back-end server for calculations because of its lack of security over the web. This issue has been lessened by developments such as the use of virtual containers on the Amazon Web Services cloud platform.
However, the benefits of R outweigh the limitations. Strides have been made, and are still being made, on those fronts.

Review and Critique of Past Works

The scenario, which also applies to our project, was a major epidemic outbreak that spanned 11 cities across the world in 2009. The disease tended to move fast and was fairly difficult to combat. The past works that we review aim to analyse the illness across these cities to help understand the spread of the disease.
The dataset comprises 22 CSV files, 11 each for patient admissions and deaths, with dates ranging from 16 April to 29 June 2009:

  • There is a total of 14M Data Records for admission and 350K Data Records for death.
  • The records span 11 Cities in different countries.
  • The fields include patient ID, symptoms and date for the admission and death records, together with each patient's gender and age.
Figure 5: 92 out of 1,294 syndrome categories made up 97% of records


Syndrome.png

In our initial data exploration and analysis, we found that 92 out of 1,294 syndrome categories made up 97% of records. This was not highlighted in the VAST Challenge 2010 winner’s report. It means that although the dataset is large, with 1,294 syndrome categories, we need to focus on only 92 of them and group them appropriately. This is an example of what we mean by democratising data analytics by making data and tools accessible. Data has to be stored or distributed in a format that is easily understood and readily usable by casual users. This synthetic dataset shows how difficult analysis becomes when data is messy, with many inconsistent category names and spelling errors; it would have been far harder had 92 categories not made up the vast majority of values. Appropriate tools must also be available to help analyse the data, so that those without a data analysis background can easily extract meaning from it.
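The coverage figure above (a small number of categories accounting for most records) can be checked with a simple cumulative count. The following is a minimal sketch in Python using only the standard library; it is not the project's R code, and the function name `categories_needed` and the toy labels are invented for illustration.

```python
from collections import Counter


def categories_needed(labels, target_percent):
    """Count how many of the most frequent categories are needed to
    cover at least target_percent of all records."""
    counts = Counter(labels)
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= target_percent * total:
            return rank
    return len(counts)


# Toy labels (invented for illustration), including misspelt variants.
toy_labels = (["fever"] * 60 + ["cough"] * 30 + ["rash"] * 7
              + ["feverr", "caugh", "other"])
print(categories_needed(toy_labels, 97))  # 3 of the 6 categories cover 97%
```

In the toy data, the three misspelt or rare labels form a long tail that can be set aside or regrouped once the dominant categories are known, which mirrors the reasoning applied to the 1,294 syndrome categories.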

Critique of Past Works, with References
Figure 6: Lacks clarity and aesthetics because:

• Small fonts in title, axis labels & legend.
• Vertical text orientation on x-axis difficult to read.
• Dying patients graph too close to x-axis.
• Not enough fidelity on y-axis to read values.
• Dying data are not normalised by total number of patients in each syndrome.

Bar Admission Death.png

University of Constance - Applied Visual Analytics, VAST 2010 Challenge Hospitalization Records - Characterization of Pandemic Spread.

Authors and Affiliations:
Marc Rene Broghammer, University of Constance, broghama@inf.uni-konstanz.de
Juergen Schniertshauer, University of Constance, schniert@inf.uni-konstanz.de
Dr. Peter Bak, University of Constance, Peter.Bak@uni-konstanz.de
Tool(s): Konstanz Information Miner (KNIME).

Figure 7: Lacks clarity and aesthetics because:

• Vertical text orientation on x-axis difficult to read.
• Excessive salience in non-data components distracts from data.
• Unnecessary non-data ink.
• Bars neither sorted nor ranked.

Bar Admission.png

Bar death.png

Bangor - VASTvis - MC2 VAST 2010 Challenge Hospitalization Records - Characterization of Pandemic Spread

Authors and Affiliations:
Rick Walker, Research Institute of Visual Computing (RIVIC), Bangor University, rick.walker@bangor.ac.uk
Llyr Ap Cenydd, RIVIC, Bangor University, ees60d@bangor.ac.uk
Serban Pop, RIVIC, Bangor University, serban@bangor.ac.uk
Jonathan Roberts, RIVIC, Bangor University, j.c.roberts@bangor.ac.uk
Tool(s): The VASTVis tool was specifically developed, in Processing, for the 2010 challenge by Llyr Ap Cenydd and Rick Walker.

Figure 8: Lacks clarity and aesthetics because:

• Same colours used for different categories.
• Same symbols for different categories.
• Overlapping plot lines.
• Difficult to read small, cluttered & slanted labels on x-axis.
• Labels on y-axis too small.
• No light horizontal grid to read y-axis values.
• Legend not sorted by height of curves.

Line admission.png

University of Constance - Applied Visual Analytics, VAST 2010 Challenge Hospitalization Records - Characterization of Pandemic Spread

Authors and Affiliations:
Marc Rene Broghammer, University of Constance, broghama@inf.uni-konstanz.de
Juergen Schniertshauer, University of Constance, schniert@inf.uni-konstanz.de
Dr. Peter Bak, University of Constance, Peter.Bak@uni-konstanz.de
Tool(s): Konstanz Information Miner (KNIME).

Figure 9: Lacks clarity and aesthetics because:

• Excessive salience in non-data components distracts from data.
• Unnecessary non-data ink.
• Lines bundled up.

Line cumulative death.png

Bangor - VASTvis - MC2 VAST 2010 Challenge Hospitalization Records - Characterization of Pandemic Spread

Authors and Affiliations:
Rick Walker, Research Institute of Visual Computing (RIVIC), Bangor University, rick.walker@bangor.ac.uk
Llyr Ap Cenydd, RIVIC, Bangor University, ees60d@bangor.ac.uk
Serban Pop, RIVIC, Bangor University, serban@bangor.ac.uk
Jonathan Roberts, RIVIC, Bangor University, j.c.roberts@bangor.ac.uk
Tool(s): The VASTVis tool was specifically developed, in Processing, for the 2010 challenge by Llyr Ap Cenydd and Rick Walker.

Figure 10: Lacks clarity and aesthetics because:

• Difficult to differentiate sizes without numerical values.
• Too much non-data space below.

Recovery rates.png

Purdue University: Vaccinated VAST 2010 Challenge Hospitalization Records - Characterization of Pandemic Spread

Authors and Affiliations:
Abish Malik, Purdue University [Primary Contact], amalik@purdue.edu
Shehzad Afzal, Purdue University [Primary Contact], safzal@purdue.edu
Erin Hodgess, University of Houston Downtown [Faculty Advisor], HodgessE@uhd.edu
David S. Ebert, Purdue University [Faculty Advisor], ebertd@purdue.edu
Ross Maciejewski, Purdue University [Faculty Advisor], rmacieje@purdue.edu
Tool: A tool developed to utilize linked geographic and temporal views for exploring disease spread.

Figure 11: Lacks clarity and aesthetics because:

• Only 11 countries are affected, so a full world map is unnecessary.
• The world map, a non-data component, is excessively salient and distracts from the data.
• Excessive non-data space.

Geospaptial death.png

Periscopic Aggregate Symptoms Visualization VAST 2010 Challenge Hospitalization Records - Characterization of Pandemic Spread

Authors and Affiliations:
Kim Rees, Periscopic, kim@periscopic.com
Tool(s):
Tableau Desktop Software was used for the majority of visual analysis. Tableau is a data visualization and business intelligence tool.
http://tableausoftware.com
Microsoft Excel was also used for additional data formatting.
http://office.microsoft.com/en-us/excel/

Figure 12: Lacks clarity and aesthetics because:

• Only 11 countries are affected, so a full world map is unnecessary.
• The world map, a non-data component, is excessively salient and distracts from the data.
• Excessive non-data space.

Geospatial timeslider.png

Purdue University: Vaccinated VAST 2010 Challenge Hospitalization Records - Characterization of Pandemic Spread

Authors and Affiliations:
Abish Malik, Purdue University [Primary Contact], amalik@purdue.edu
Shehzad Afzal, Purdue University [Primary Contact], safzal@purdue.edu
Erin Hodgess, University of Houston Downtown [Faculty Advisor], HodgessE@uhd.edu
David S. Ebert, Purdue University [Faculty Advisor], ebertd@purdue.edu
Ross Maciejewski, Purdue University [Faculty Advisor], rmacieje@purdue.edu
Tool: A tool developed to utilize linked geographic and temporal views for exploring disease spread.

Figure 13: Lacks clarity and aesthetics because:

• Lines are bundled together and extremely messy.

Lines bundled.png

Periscopic Aggregate Symptoms Visualization VAST 2010 Challenge Hospitalization Records - Characterization of Pandemic Spread

Authors and Affiliations:
Kim Rees, Periscopic, kim@periscopic.com
Tool(s):
Tableau Desktop Software was used for the majority of visual analysis. Tableau is a data visualization and business intelligence tool.
http://tableausoftware.com
Microsoft Excel was also used for additional data formatting.
http://office.microsoft.com/en-us/excel/

Design Framework

Joy Plot

A joy plot is a series of histograms, density plots or time series for a number of data segments, all aligned to the same horizontal scale and presented with a slight overlap. Joy plots can be quite useful for visualising changes in distributions over time or space, and they avoid the overlapping issues of line plots.
(Reference: http://blog.revolutionanalytics.com/2017/07/joyplots.html)

Figure 14: Joy Plot: Novel visualisation approach with ‘ggjoy’ package. The pandemic first peaked at Nairobi (Kenya), followed by Aleppo (Syria).


Joy plot pandemic.jpg

The name "Joy Plot" was apparently coined by Jenny Bryan in April 2017, in response to one of Henrik Lindberg's earlier visualizations using this style. (The community appears to have settled on 'joyplot' since then.) The name refers to the classic 1979 Joy Division "Unknown Pleasures" album cover, which was in fact a joyplot of radio intensities from the first known pulsar. The album cover reproduced the design from a 1971 Scientific American article about pulsars.
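The layout idea behind a joy plot (shared horizontal scale, stacked baselines, slight overlap) can be sketched without any plotting library. This is a minimal illustration in Python, not the `ggjoy` implementation used by the project; the function name `ridgeline_layout`, the `overlap` parameter and the toy series are assumptions made for the example.

```python
def ridgeline_layout(series_by_group, overlap=0.5):
    """Return {group: (baseline, y_values)} for a ridgeline (joy plot) layout.

    All series share one x scale (their list index). Each ridge is scaled
    so the tallest ridge has height 1, and baselines are spaced
    (1 - overlap) apart, so tall ridges reach into the ridge above.
    """
    # Guard against an all-zero input to avoid dividing by zero.
    peak = max(max(values) for values in series_by_group.values()) or 1
    step = 1.0 - overlap
    layout = {}
    for i, (group, values) in enumerate(series_by_group.items()):
        baseline = i * step
        layout[group] = (baseline, [baseline + v / peak for v in values])
    return layout


# Toy daily counts for two hypothetical cities.
toy_series = {"City A": [0, 2, 4, 2, 0], "City B": [0, 1, 2, 4, 2]}
for group, (baseline, ys) in ridgeline_layout(toy_series).items():
    print(group, baseline, ys)
```

Because every ridge keeps its own baseline, the order in which peaks occur across groups can be read left to right even where the ridges overlap.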

Trellis Plot

The curse of dimensionality (Huber 1985) is found not only in mathematical statistics but also in visual analysis: displaying data with three or more dimensions on a two-dimensional surface is not easy.
• Histograms or boxplots can only handle one single variable.
• Pie charts & bar graphs do not allow easy comparisons across all variables.
• Scatterplots can cope with two continuous variables, rotating plots with three.
• Mosaic plots (UNWIN 1995) can deal with a lot of categorical variables – although interpretation may be hard.

Trellis displays show data in multiple panels, each plotting some variables while the other (conditioning) variables are held fixed.
Trellis displays are able to distinguish variables without the use of colour.
References:
https://pdfs.semanticscholar.org/59a8/cb97df43ddb70776325f43c8aeae4c0fc4fd.pdf
https://onlinelibrary.wiley.com/doi/10.1002/wics.121/abstract
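The conditioning step of a trellis display, splitting records into one panel per value of a variable while keeping common axis limits, can be sketched as follows. This is a minimal Python illustration and not the project's R code; the function name `trellis_panels`, the field names and the toy records are hypothetical.

```python
from collections import defaultdict


def trellis_panels(records, panel_key, x_key, y_key):
    """Split records into one panel per value of panel_key.

    Returns (panels, limits): panels maps each panel value to its sorted
    (x, y) points, and limits holds the axis ranges shared by all panels
    so that the panels stay directly comparable.
    """
    panels = defaultdict(list)
    for rec in records:
        panels[rec[panel_key]].append((rec[x_key], rec[y_key]))
    xs = [rec[x_key] for rec in records]
    ys = [rec[y_key] for rec in records]
    limits = {"x": (min(xs), max(xs)), "y": (min(ys), max(ys))}
    return {key: sorted(points) for key, points in panels.items()}, limits


# Toy records: day index and admission count for two hypothetical cities.
toy_records = [
    {"city": "City A", "day": 1, "admissions": 10},
    {"city": "City B", "day": 1, "admissions": 3},
    {"city": "City A", "day": 2, "admissions": 25},
    {"city": "City B", "day": 2, "admissions": 4},
]
panels, limits = trellis_panels(toy_records, "city", "day", "admissions")
print(panels["City A"], limits)
```

Sharing the limits across panels is a design choice: it makes a flat panel (such as City B here) visibly flat next to a rising one, at the cost of detail within low-count panels.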

Figure 15: Trellis Plot: Overview of all countries with no overlapping of data lines. Thailand and Turkey are likely not affected by the pandemic.


Trellis pandemic.jpg

Calendar Heatmap

A calendar heatmap avoids the occlusion found in dot plots, where every dot has equal emphasis and it is difficult to tell which regions have more or fewer points, especially at the farthest zoom levels.
With heat mapping, it is clearer which regions have more points than others.
Reference: https://www.r-bloggers.com/time-series-calendar-heat-maps-using-r/
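The data preparation behind a calendar heatmap amounts to assigning each record to a (week, weekday) cell and counting records per cell; the count then drives the cell colour. The sketch below uses Python's standard library and ISO weeks (Monday start), which is an assumption for illustration: epidemiological weeks may be defined differently, and this is not the project's R code. The function name `calendar_cells` and the toy dates are invented.

```python
from collections import Counter
from datetime import date


def calendar_cells(dates):
    """Count records per calendar cell.

    Each cell is keyed by (ISO year, ISO week, ISO weekday), where
    weekday 1 is Monday and 7 is Sunday.
    """
    cells = Counter()
    for d in dates:
        iso_year, iso_week, iso_weekday = d.isocalendar()
        cells[(iso_year, iso_week, iso_weekday)] += 1
    return cells


# Toy admission dates (invented): two on the same Thursday.
toy_dates = [date(2009, 4, 16), date(2009, 4, 16),
             date(2009, 4, 17), date(2009, 4, 20)]
cells = calendar_cells(toy_dates)
print(cells[(2009, 16, 4)])  # 2 records fall on Thursday of ISO week 16
```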

Figure 16: Calendar Heatmap: Day-by-day view of affected cases in Epi-Week. No prolonged peak periods with abnormally high no. of deaths for Thailand and Turkey. (Addendum: All countries showed patterned distribution of peaks except for Thailand and Turkey which have random peaks.)


Heatmap pandemic.png

Bar Chart

Old-fashioned bar charts can still be very useful (see Figure 17). However, as one of our key learning take-aways, we later explain why they should not be allowed to inhibit our creativity.

Figure 17: Customisation made easy with ‘ggplot2’ package. Unusually high mortality rate for affected cities; Aleppo (Syria) and Nairobi (Kenya) are the worst hit.

Bar mortality.png
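A mortality rate comparison of this kind depends on normalising deaths by admissions, the same normalisation noted as missing in the Figure 6 critique. The following is a minimal Python sketch under the assumption that the rate is deaths as a percentage of admissions; it is not the project's R code, and the function name `mortality_rates` and the toy counts are hypothetical.

```python
def mortality_rates(admissions, deaths):
    """Return (city, rate) pairs, highest first, where rate is deaths
    as a percentage of admissions. Cities with no admissions are skipped."""
    rates = {
        city: 100.0 * deaths.get(city, 0) / n
        for city, n in admissions.items()
        if n > 0
    }
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Toy counts (invented for illustration).
toy_admissions = {"City A": 200, "City B": 400, "City C": 100}
toy_deaths = {"City A": 20, "City B": 10}
for city, rate in mortality_rates(toy_admissions, toy_deaths):
    print(city, rate)
```

Sorting by rate before plotting also addresses the "not sorted nor ranked" criticism made of earlier bar charts.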

Demonstration

Please proceed to the Application tab to access the interactive demonstration.

Discussion

What Have We Learned from Our Work?

  • Democratising Data and Analytics with Visual Analytics (as discussed above under "Motivation of the Application")
  • Although some graphs from the VAST Challenge 2010 submissions served their purpose, closer observation reveals issues with clarity and aesthetics that could be addressed through Joy Plots, Trellis Plots and Heat Maps.
  • Constraint drives creativity. While the good old-fashioned bar graph is still useful for comparing data among categories, imposing a constraint not to use bar graphs encourages more creative exploration of other visualisations such as Joy Plots, Trellis Plots and Heat Maps.

What New Insights or Practices Has Our System Enabled?

  • List of countries/cities affected by the pandemic
  • Temporal analysis of the pandemic spread across countries/cities (i.e. in which order)
  • Severity of disease outbreak in each country/city
  • Identification of symptoms which could be linked to the disease outbreak
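One simple way to express the temporal ordering listed above is to find the date of each city's peak daily count and sort cities by that date. This is a minimal Python sketch of that idea, offered as an assumption about how such an ordering could be derived rather than a description of how PandemViz computes it; the function name `peak_order` and the toy counts are invented.

```python
from datetime import date


def peak_order(daily_counts):
    """Return (city, peak_date) pairs sorted by peak date.

    daily_counts maps city -> {date: count}. When several dates tie for
    the highest count, the earliest date is taken as the peak.
    """
    peaks = {}
    for city, series in daily_counts.items():
        if series:
            # Iterating dates in sorted order makes max() keep the earliest tie.
            peaks[city] = max(sorted(series), key=lambda d: series[d])
    return sorted(peaks.items(), key=lambda kv: kv[1])


# Toy daily admission counts for two hypothetical cities.
toy_counts = {
    "City A": {date(2009, 5, 1): 5, date(2009, 5, 2): 40, date(2009, 5, 3): 12},
    "City B": {date(2009, 4, 28): 30, date(2009, 4, 29): 30, date(2009, 4, 30): 8},
}
for city, peak in peak_order(toy_counts):
    print(city, peak.isoformat())
```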

Future Work

In future, our system could be extended or refined as follows:

  • Incorporate clinical diagnosis codes for better classification of syndromes
  • Incorporate social data analysis for surveillance scanning, and
  • Analyse with external datasets on patients’ medical history, population demography, mobile geospatial data (population & if available, patient), immigration records etc.

Installation guide

The PandemViz tool has been deployed via https://www.shinyapps.io/ to enable easy online access through a browser with internet connection.

User Guide

To access the online PandemViz tool, please click on the URL below. Further instructions and explanations are provided on the dashboards.

https://isss608-g15.shinyapps.io/b6pandemviz/