ISSS608 2016-17T3 Group15 Report
ISSS608 Visual Analytics and Applications
PandemViz: An interactive analytics tool for understanding pandemic outbreaks through data visualisation
By Chua Gim Hong, Huang Liwei and Ngo Siew Hui
Abstract
A pandemic is an epidemic or outbreak of infectious disease that spreads rapidly not only to many people, but also across countries. The unprecedented mobility of people and food over the last 30 years has brought a steady increase in the frequency and diversity of disease outbreaks. No country is immune to this growing global threat. Scientists predict that it is not a matter of if, but when, the next pandemic will happen. Singapore, a small city state with one of the highest population densities in the world and one of the highest volumes of air passenger traffic, is particularly vulnerable.
There are reasons to remain optimistic, as Singapore’s Smart Nation initiatives and the electronic records of modern healthcare systems have opened up new possibilities in the fight against potential infectious disease outbreaks in the country. Data will become increasingly ubiquitous as the world, including Singapore, continues to make significant advances in digitalisation. Insights from the data have the potential to offer a critical line of preparedness through early identification, rapid and effective response, and containment of disease outbreaks. To leverage this increasing availability of data, we need appropriate and affordable tools for data exploration, visualisation and analysis.
In view of this, our project aims to develop an interactive visual analytics tool, PandemViz, using R Shiny and R data visualisation packages to produce joy plots, calendar heat maps and trellis plots. PandemViz will be useful for understanding pandemic outbreaks through data visualisation. In our development, R is used to analyse a synthetic dataset (i.e. computer- and human-generated data) relating to a major disease outbreak that spanned several cities across the world in 2009. In an actual disease outbreak scenario, health officials could use PandemViz to analyse hospitalisation data, understand how the pandemic is spreading across countries, and mount effective responses as part of overall containment efforts.
This presentation consists of four main sections. First, we discuss the motivation and objectives of the project. This is followed by a detailed discussion of the principles and concepts behind the key visualisation methods used. Next, we describe the R packages used to develop the application and the user interface we designed. Using the synthetic dataset, we then demonstrate how the functions of our tool can be used to detect the patterns and attribute distributions that characterise a pandemic spread, and discuss the efficacy of each visual analytics technique in detail. The presentation concludes with insights gained from working on the project and potential application areas of our visualisation tool. We also suggest possibilities for future work that combines hospital records with other data sources.
[VAST Challenge 2010 - Characterisation of Pandemic Spread]
Motivation of the Application
Motivation perspectives
Several perspectives motivate our application:
Epidemiological
Figure 1 shows the staggering number of deaths caused by pandemics throughout history, not to mention the millions of sufferers who survived the ordeal.
Reference: https://www.good.is/infographics/infographic-the-deadliest-disease-outbreaks-in-history
The enemy is microscopic, but the effect is devastating. In some cases, part or all of a city was wiped out (see Figure 2). Epidemiologists are bracing themselves for what has been called the next "Big One" - a disease that could kill tens of millions of people.
Figure 2 - A time perspective of deadliest pandemic outbreaks in history.
Why do experts believe that the next pandemic is a matter of when and not if (see Figure 3)?
Figure 3 - A time perspective of deadliest pandemic outbreaks in history.
National Security
Visual Analytics
We are motivated by “Democratising Data and Analytics With Visual Analytics”, which, based on the experience and observations of Prof Kam, rests on two key factors:
a. Data accessibility. Although data accessibility barriers at both the public and organisational levels have been lowered in recent years, they still exist. Much of this data is stored or distributed in formats that casual users cannot easily understand or readily use.
b. Analysis tools. Another barrier to data democratisation is the availability of appropriate tools to help analyse the data. These tools are needed to allow those without a data analysis background to easily extract meaning from the data.
Leveraging National Initiatives & Health Systems
There are reasons to remain optimistic, as Singapore’s Smart Nation initiatives and the electronic records of modern healthcare systems have opened up new possibilities to democratise data in the fight against potential infectious disease outbreaks in the country. Data will become increasingly ubiquitous as the world, including Singapore, continues to make significant advances in digitalisation. Insights from the data have the potential to offer a critical line of preparedness through early identification, rapid and effective response, and containment of disease outbreaks. To leverage this increasing availability of data, we need appropriate and affordable tools for data exploration, visualisation and analysis.
Why R?
Our team has chosen the R programming language because of its strong package ecosystem and charting capabilities:
1. R is not only free, but also open source.
(Reference: Applications Of R Programming In R-eal World https://elearningindustry.com/applications-r-programming-r-eal-world, By Vaishnavi Agrawal, February 25, 2016.)
2. Integrates with other languages: C/C++, Java, Python.
(Reference: R (programming language) https://en.wikipedia.org/wiki/R_(programming_language)#cite_note-33)
3. Works on Windows, Macintosh, Linux and Unix platforms
(Reference: Statistical Consulting Group UCLA, Academic Technology Services, Technical Report Series, 2006 January 30, revised 2007 February 27, Report Number 1, Comment Number 1, R Relative to Statistical Packages: Comment 1 on Technical Report Number 1 (Version 1.0), Strategically using General Purpose Statistics Packages: A Look at Stata, SAS and SPSS http://www.burns-stat.com/pages/Tutor/R_relative_statpack.pdf)
4. Connects to many data sources, including ODBC-compliant databases, Excel and Access files, and other statistical packages (SAS, Stata, SPSS, Minitab).
5. Records each step of an analysis explicitly in code, which makes it easy to reproduce and update a report, to try many ideas quickly and to correct issues.
6. The capabilities of R are extended through user-created packages, which allow specialized statistical techniques, graphical devices (such as the ggplot2 package developed by Hadley Wickham), import/export capabilities, reporting tools (knitr, Sweave), etc. These packages are developed primarily in R, and sometimes in Java, C, C++, and Fortran. A core set of packages is included with the installation of R, with more than 11,000 additional packages (as of July 2017) available at the Comprehensive R Archive Network (CRAN), Bioconductor, Omegahat, GitHub, and other repositories.
(Reference: R (programming language) https://en.wikipedia.org/wiki/R_(programming_language)#cite_note-33)
7. Shiny, an R-based tool for producing interactive, web-ready data visualisations. Shiny is an R package and web application framework that makes it easy to build interactive web applications (apps) quickly, straight from R and in the same environment. Shiny also has a comprehensive list of widgets for implementing interactive features such as selection buttons and input sliders. It also supports user-interface interactions such as click, hover and brush, so that users can explore the data in more depth.
(Reference: Why use Shiny? https://www.lynda.com/RStudio-tutorials/Why-use-Shiny/452087/490039-4.html)
Limitations of R
(Reference: Why R? The pros and cons of the R language http://www.infoworld.com/article/2940864/application-development/r-programming-language-statistical-data-analysis.html By Paul Krill, Editor at Large, InfoWorld | Jun 30, 2015.)
The basic principles of R derive from programming languages built in the 1960s. R's shortcomings are in security and memory management.
1. Memory management, speed, and efficiency are probably the biggest challenges R faces. The design of the language can sometimes pose problems when working with very large data sets, because data has to be stored in physical memory. As computers have gained more memory, however, this has become less of an issue.
2. Capabilities such as security were not built into the R language. Also, R cannot be embedded in a web browser, i.e. it cannot be used directly for web-like or Internet-like apps. It was basically impossible to use R as a back-end server for calculations because of its lack of security over the Web. The security issue, however, has been lessened by developments such as the use of virtual containers on the Amazon Web Services cloud platform.
However, the benefits of R outweigh the limitations, and strides have been, and are still being, made on those fronts.
Review and Critique of Past Works
The scenario, which also applies to our project, was a major epidemic outbreak that spanned 11 cities across the world in 2009. The disease tended to spread fast and was fairly difficult to combat. The past works that we review aimed to analyse the illness across these cities to help understand the spread of the disease.
The dataset comprises 22 CSV files, 11 each for patient admissions and deaths, with dates ranging from 16 April to 29 June 2009:
- There are about 14 million admission records and 350,000 death records in total.
- The records span 11 cities in different countries.
- The fields in the admission and death records include patient ID, gender, age, symptoms and date.
In our initial data exploration and analysis, we found that 92 out of 1,294 syndrome categories made up 97% of values. This was not highlighted in the VAST Challenge 2010 winner’s report. It means that although the dataset is large, with 1,294 syndrome categories, we need to focus on only 92 of them and group them appropriately. This is an example of what we mean by democratising data analytics by making data and tools accessible. Data has to be stored or distributed in a format that casual users can easily understand or readily use. This synthetic dataset is messy, with many inconsistent category names and spelling errors, and would have been very difficult to work with had it not been for the 92 categories that make up the vast majority of values. Appropriate tools must also be available to help analyse the data, so that those without a data analysis background can easily extract meaning from it.
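The coverage check described above (how many of the most frequent categories account for a given share of records) can be sketched as follows. This is only an illustration in Python using the standard library, not the R code used in the project; the function name `categories_needed` and the toy syndrome labels are assumptions made for the example.

```python
from collections import Counter

def categories_needed(labels, target_share):
    """Return how many of the most frequent categories are needed to
    cover at least `target_share` (between 0 and 1) of all records."""
    counts = Counter(labels)
    total = sum(counts.values())
    covered = 0
    # Walk through categories from most to least frequent.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target_share:
            return rank
    return len(counts)

# Hypothetical toy data: three common syndromes plus two misspelt variants.
toy_labels = (["FEVER"] * 50 + ["VOMITING"] * 30 + ["RASH"] * 18
              + ["FEVR", "VOMITTING"])
needed = categories_needed(toy_labels, 0.97)
print(needed, "of", len(set(toy_labels)), "categories cover 97% of records")
```

On the real dataset the same idea would be applied to the syndrome field of the admission records, after which the remaining rare categories can be reviewed for spelling variants.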
Critique | Past Works | References |
---|---|---|
Figure 6: Lacks clarity and aesthetic appeal because of small fonts in the title, axis labels and legend. | | University of Constance - Applied Visual Analytics, VAST 2010 Challenge, Hospitalization Records - Characterization of Pandemic Spread |
Figure 7: Lacks clarity and aesthetic appeal because the vertical text orientation on the x-axis is difficult to read. | | Bangor - VASTvis - MC2, VAST 2010 Challenge, Hospitalization Records - Characterization of Pandemic Spread |
Figure 8: Lacks clarity and aesthetic appeal because the same colours are used for different categories. | | Hospitalization Records - Characterization of Pandemic Spread |
Figure 9: Lacks clarity and aesthetic appeal because excessive salience in non-data components distracts from the data. | | Bangor - VASTvis - MC2, VAST 2010 Challenge, Hospitalization Records - Characterization of Pandemic Spread |
Figure 10: Lacks clarity and aesthetic appeal because sizes are difficult to differentiate without numerical values. | | Purdue University: Vaccinated, VAST 2010 Challenge, Hospitalization Records - Characterization of Pandemic Spread |
Figure 11: Lacks clarity and aesthetic appeal because only 11 countries are affected. | | Periscopic Aggregate Symptoms Visualization, VAST 2010 Challenge, Hospitalization Records - Characterization of Pandemic Spread |
Figure 12: Lacks clarity and aesthetic appeal because only 11 countries are affected. | | Purdue University: Vaccinated, VAST 2010 Challenge, Hospitalization Records - Characterization of Pandemic Spread |
Figure 13: Lacks clarity and aesthetic appeal because extremely messy lines are bundled up. | | Periscopic Aggregate Symptoms Visualization, VAST 2010 Challenge, Hospitalization Records - Characterization of Pandemic Spread |
Design Framework
Joyplot
A joyplot is a series of histograms, density plots or time series for a number of data segments, all aligned to the same horizontal scale and presented with a slight overlap. Joyplots can be quite useful for visualising changes in distributions over time or space, and they avoid the overlapping issues that arise when many line plots share one axis.
(Reference: http://blog.revolutionanalytics.com/2017/07/joyplots.html)
The name "Joy Plot" was apparently coined by Jenny Bryan in April 2017, in response to one of Lindberg's earlier visualizations using this style. (The community appears to have settled on 'joyplot' since then.) The name refers to the classic 1979 Joy Division "Unknown Pleasures" album cover, which was in actuality a joyplot of radio intensities from the first known pulsar. The album cover reproduced the design from a 1971 Scientific American article about pulsars.
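The layout idea behind a joyplot, one baseline per data segment with ridges tall enough to overlap their neighbour slightly, can be sketched as below. This is a minimal illustration in Python using the standard library, not the R package used in PandemViz; `ridgeline_layout`, the `overlap` parameter and the toy city series are assumptions made for the example.

```python
def ridgeline_layout(series_by_group, overlap=0.5):
    """Map each group's values to (baseline, plotted heights) for a joyplot.

    Group i is drawn on baseline i. Each series is scaled so that its peak
    rises 1 + overlap above its baseline, so neighbouring ridges overlap
    by `overlap` units. Every series is assumed to be non-empty.
    """
    layout = {}
    for i, (group, values) in enumerate(series_by_group.items()):
        peak = max(values) or 1  # avoid dividing by zero for all-zero series
        scale = (1 + overlap) / peak
        layout[group] = (i, [i + v * scale for v in values])
    return layout

# Hypothetical daily admission counts for two cities on a shared day axis.
toy = {"City A": [0, 2, 4, 2, 0], "City B": [0, 0, 3, 6, 3]}
for city, (baseline, heights) in ridgeline_layout(toy).items():
    print(city, baseline, heights)
```

Scaling each series to its own peak makes the shape and timing of every ridge easy to compare, at the cost of hiding differences in absolute magnitude between groups; a single shared scale is the alternative when magnitudes matter.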
Trellis Plot
The curse of dimensionality (Huber 1985) arises not only in mathematical statistics but also in visual analysis: displaying data with three or more dimensions on a two-dimensional surface is not easy.
• Histograms or boxplots can only handle a single variable.
• Pie charts and bar graphs do not allow easy comparisons across all variables.
• Scatterplots can cope with two continuous variables, and rotating plots with three.
• Mosaic plots (Unwin 1995) can deal with many categorical variables, although interpretation may be hard.
• Trellis displays show data in multiple panels, each plotting some variables while the other variables are held fixed.
• Trellis displays can distinguish variables without the use of colour.
References:
https://pdfs.semanticscholar.org/59a8/cb97df43ddb70776325f43c8aeae4c0fc4fd.pdf
https://onlinelibrary.wiley.com/doi/10.1002/wics.121/abstract
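The panel idea behind a trellis display, splitting the data by a conditioning variable and drawing every panel on the same axes, can be sketched as follows. This is an illustrative Python sketch using the standard library, not the R code used in PandemViz; the function names and the record fields `city`, `day` and `admissions` are hypothetical.

```python
from collections import defaultdict

def trellis_panels(records, condition_on, x, y):
    """Split records into one panel per value of the conditioning variable.

    Each panel keeps only the (x, y) pairs, so within a panel the
    conditioning variable is held fixed.
    """
    panels = defaultdict(list)
    for rec in records:
        panels[rec[condition_on]].append((rec[x], rec[y]))
    return dict(panels)

def shared_y_range(panels):
    """Return one (min, max) y range so all panels use the same scale."""
    ys = [y for points in panels.values() for _, y in points]
    return min(ys), max(ys)

# Hypothetical records: daily admissions for two cities.
toy_records = [
    {"city": "City A", "day": 1, "admissions": 5},
    {"city": "City A", "day": 2, "admissions": 9},
    {"city": "City B", "day": 1, "admissions": 2},
    {"city": "City B", "day": 2, "admissions": 14},
]
panels = trellis_panels(toy_records, "city", "day", "admissions")
print(panels)
print(shared_y_range(panels))
```

Using one shared range for all panels is what lets a reader compare panels directly without relying on colour.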
Heat Map
Heat maps avoid the occlusion found in dot plots, where every dot has equal emphasis and it is difficult to tell which regions have fewer or more points, especially at the farthest zoom levels.
With heat mapping, it is clearer which regions have more points than others.
Reference: https://www.r-bloggers.com/time-series-calendar-heat-maps-using-r/
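For a calendar heat map, the aggregation step is to count events per day and place each day in a week-by-weekday grid whose cells are then coloured by count. The sketch below illustrates that step in Python using the standard library; it is not the R code used in PandemViz, and `calendar_cells` and the toy dates are assumptions made for the example.

```python
from collections import Counter
from datetime import date

def calendar_cells(dates):
    """Count events per day and place each day in a (week, weekday) cell.

    Weeks are numbered from the Monday on or before the earliest date;
    weekday is 0 for Monday through 6 for Sunday.
    """
    counts = Counter(dates)
    first = min(counts)
    origin = first.toordinal() - first.weekday()  # Monday of the first week
    return {
        ((d.toordinal() - origin) // 7, d.weekday()): n
        for d, n in counts.items()
    }

# Hypothetical admission dates within the dataset's date range.
toy_dates = ([date(2009, 4, 16)] * 2 + [date(2009, 4, 17)]
             + [date(2009, 4, 20)] * 3)
print(calendar_cells(toy_dates))
```

Because every day gets exactly one cell, no observation hides another, which is the occlusion problem the heat map is meant to avoid.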
Bar Chart
Figure 17: Customisation made easy with the ‘ggplot2’ package. The mortality rate is unusually high in the affected cities; Aleppo (Syria) and Nairobi (Kenya) are the worst hit.
Demonstration
Please proceed to the Application tab to try the interactive demonstration.
Discussion
What Have We Learned from Our Work?
- Democratising Data and Analytics with Visual Analytics (as discussed above under "Motivation of the Application")
- Although some graphs from the VAST Challenge 2010 submissions served their purpose, closer observation reveals issues with clarity and aesthetics that could be addressed through Joy Plots, Trellis Plots and Heat Maps.
- Constraint drives creativity. While the good old-fashioned bar graph is still useful for comparing data among categories, imposing a constraint not to use bar graphs pushes one to explore other visualisations more creatively, such as Joy Plots, Trellis Plots and Heat Maps.
What New Insights or Practices Has Our System Enabled?
- List of countries/cities affected by the pandemic
- Temporal analysis of the pandemic spread across countries/cities (i.e. in which order)
- Severity of disease outbreak in each country/city
- Identification of symptoms which could be linked to the disease outbreak
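One simple way to derive the order in which cities were affected is to find, for each city, the first day on which daily admissions reach a chosen threshold and then sort cities by that day. The Python sketch below illustrates this idea only; it is not necessarily how PandemViz derives the order, and `outbreak_order`, the threshold value and the toy counts are assumptions.

```python
def outbreak_order(daily_counts, threshold):
    """Order cities by the first day their daily count reaches `threshold`.

    `daily_counts` maps city -> list of (day, count) pairs sorted by day.
    Cities that never reach the threshold are left out of the result.
    """
    onset = {}
    for city, series in daily_counts.items():
        for day, count in series:
            if count >= threshold:
                onset[city] = day
                break
    return sorted(onset.items(), key=lambda item: item[1])

# Hypothetical daily admission counts, keyed by day number.
toy_counts = {
    "City A": [(1, 5), (2, 40), (3, 90)],
    "City B": [(1, 60), (2, 80)],
    "City C": [(1, 1), (2, 2)],
}
print(outbreak_order(toy_counts, threshold=50))
```

The threshold is a judgment call: set it too low and normal background admissions look like an outbreak, set it too high and the detected onset lags the real one.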
Future Work
In future, our system could be extended or refined as follows:
- Incorporate clinical diagnosis codes for better classification of syndromes
- Incorporate social data analysis for surveillance scanning, and
- Analyse with external datasets on patients’ medical history, population demography, mobile geospatial data (population & if available, patient), immigration records etc.
Installation guide
The PandemViz tool has been deployed via https://www.shinyapps.io/ so that it can be accessed online through any browser with an internet connection.
User Guide
To access the online PandemViz tool, please click on the URL below:
https://siewhui-mitb.shinyapps.io/b6pandemviz/