ISSS608 2017-18 T1 Assign Fam Guo Teng Behind the Scenes

From Visual Analytics and Applications


Data Preparation

This section highlights some of the data preparation work behind the analysis and conclusions presented in the previous sections.

Data Cleaning

Figure 14: JMP Pro screen capture of the erroneous Created_at fields.

During the initial data exploration and cleaning, it was discovered that the Created_at field, which records when each microblog post was created, contained some dirty data. In the affected records, the timestamp was corrupted by typos and stray letters (Figure 14). On closer investigation, the messages in these posts were not significant to the analysis. These records were therefore imputed with dummy dates.
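The cleaning itself was done in JMP Pro, but the same idea can be sketched in Python. This is only an illustration: the timestamp layout, the dummy date and the function name `parse_created_at` are assumptions, since the page does not state them.

```python
from datetime import datetime

# Assumed timestamp layout (e.g. "5/18/2011 13:26"); the source does not state it.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M"
# Hypothetical placeholder date, chosen to fall well outside any study period.
DUMMY_DATE = datetime(1900, 1, 1)


def parse_created_at(raw):
    """Return (timestamp, was_imputed) for one raw Created_at string."""
    try:
        return datetime.strptime(raw.strip(), TIMESTAMP_FORMAT), False
    except ValueError:
        # Typos or stray letters in the value: fall back to the dummy date
        # and flag the record so it can be told apart later.
        return DUMMY_DATE, True
```

Returning the flag alongside the date keeps the imputed records identifiable, so they can be filtered out of any time-based chart.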

Besides the Created_at field, the text field, which holds the actual microblog messages written by different people, also contains many typos and messages in other languages. The assumption is that a certain number of such errors is permissible in the analysis. The affected posts show up on the geographical maps as isolated data points and can be dismissed on visual inspection.




Data Exploration

Figure 15: Exploring Data using JMP Pro Graph Builder

Various relationships within the data were explored using preliminary visualizations. This gives a better idea of how the different types of data interact with one another and reveals nuances that cannot be seen from the raw data alone. Figure 15 shows the population density and daytime population density of each Smartpolis area. Westside and Downtown are most likely commercial areas, since both see a large influx of people during the daytime, probably commuting to work.






Data Extraction

Figure 16: JMP Pro screen capture of Location Field and formulae used to derive Latitude and Longitude.

Since the Location field contains both latitude and longitude, it is useful to extract these into separate fields. Figure 16 shows the formulae used to separate them. The latitude and longitude are then used to plot the microblog messages on the Smartpolis map to visualize their respective locations.
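As a rough Python equivalent of the JMP formulae in Figure 16, the split could look like the sketch below. It assumes the Location value is a single string holding latitude then longitude, separated by whitespace; the actual layout is only visible in the screen capture, and `split_location` is a hypothetical name.

```python
def split_location(location):
    """Split an assumed 'latitude longitude' string into two floats."""
    # Unpacking raises ValueError if the string does not have exactly two parts,
    # which surfaces malformed Location values instead of silently plotting them.
    lat_text, lon_text = location.split()
    return float(lat_text), float(lon_text)


# Example with made-up coordinates:
latitude, longitude = split_location("42.22717 93.33772")
```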





Time-scale Data Binning

Figure 17: JMP Pro screen capture of Created_at Field and formulae used to derive Created_at_binned field.

The microblog records in the given data are precise to the nearest minute. For visual analysis, however, it is more useful to aggregate the data into 15-minute intervals. A new field, Created_at_binned (Figure 17), was therefore created using a formula that rounds the time of each microblog message to the nearest 15-minute mark.
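The rounding can be illustrated with a short Python sketch. It is not the JMP formula from Figure 17, only one way to round a minute-precision timestamp to the nearest 15-minute mark; `bin_created_at` and `BIN_MINUTES` are names introduced here for illustration.

```python
from datetime import datetime, timedelta

BIN_MINUTES = 15


def bin_created_at(ts, bin_minutes=BIN_MINUTES):
    """Round a minute-precision timestamp to the nearest bin boundary."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    minutes = ts.hour * 60 + ts.minute
    # Adding half a bin before integer division rounds to the nearest boundary.
    # With 15-minute bins and whole minutes there are no exact ties.
    rounded = (minutes + bin_minutes // 2) // bin_minutes * bin_minutes
    # Using a timedelta lets 23:53 and later roll over to 00:00 of the next day.
    return midnight + timedelta(minutes=rounded)
```

For example, 13:07 falls into the 13:00 bin and 13:08 into the 13:15 bin.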






Keywords Selection

The initial keywords used to segregate the data were derived from the given symptoms of the outbreak. These keywords included terms such as:

Aches, Breath, Chills, Cold,
Cough, Diarrhea, Fatigue, Fever,
Flu, Hospitalized, Lymph Node, Nausea,
Pains, Pneumonia, Sick, Stomach,
Sweats, Vomit

These terms made it possible to display, on the geographic scatterplot, the locations of microblogs discussing these symptoms, and allowed further investigation into the possible causes.
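A minimal Python sketch of keyword tagging with the list above follows. The matching rule (case-insensitive, anchored at the start of a word so that "vomiting" matches "Vomit") is an assumption made here; the page does not describe how JMP Pro matched the terms.

```python
import re

SYMPTOM_KEYWORDS = [
    "aches", "breath", "chills", "cold", "cough", "diarrhea", "fatigue",
    "fever", "flu", "hospitalized", "lymph node", "nausea", "pains",
    "pneumonia", "sick", "stomach", "sweats", "vomit",
]

# \b anchors each keyword at a word start, so "scold" does not match "cold",
# while inflected forms such as "vomiting" still match "vomit".
KEYWORD_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in SYMPTOM_KEYWORDS) + r")",
    re.IGNORECASE,
)


def matched_keywords(text):
    """Return the sorted, de-duplicated symptom keywords found in a message."""
    return sorted({m.group(1).lower() for m in KEYWORD_PATTERN.finditer(text)})
```

Prefix matching is deliberately loose: a short keyword such as "flu" would also match "fluent", so the tagged rows still need a manual check.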


Text Analysis using JMP Pro 13.0 Text Explorer

Figure 18: JMP Pro screen capture of Text Explorer for the phrase 'Truck Accident'.

Analysis with JMP Pro 13.0 Text Explorer also revealed certain notable events that occurred during the epidemic period. These events are also mentioned under the Home link. Figure 18 shows an example of using Text Explorer to examine the phrase "Truck Accident" in greater detail.

JMP Pro Text Explorer lists the terms and phrases used within the microblog messages in descending order of frequency. This makes it much easier to click through the phrases one by one, check what the underlying data consists of, and decide whether they are relevant to the analysis. Some of these terms and phrases were also added to the list of keywords used for further analysis.
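The idea of a frequency-ranked list of terms and phrases can be approximated in Python as sketched below. This is not how Text Explorer works internally (its tokenising and stemming are more elaborate); `ngram_counts` is a hypothetical helper that simply counts word sequences of length `n`.

```python
import re
from collections import Counter


def ngram_counts(messages, n=1):
    """Count n-word phrases across messages, most frequent first."""
    counts = Counter()
    for message in messages:
        # Crude tokenisation: lower-case runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", message.lower())
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts.most_common()


# Made-up messages for illustration only.
sample = ["truck accident on the bridge", "Truck accident!", "truck"]
top_terms = ngram_counts(sample, n=1)
top_phrases = ngram_counts(sample, n=2)
```

With `n=2` the phrase "truck accident" rises to the top of the sample, mirroring how a recurring phrase stands out in the ranked list.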


Data Exclusion

After exploring the data with various visualizations, some of the identified keywords turned out to carry too little information, or to cover too wide an area to be meaningful in the analysis. These keywords were excluded. One example is "Lymph Node": even though it was listed as a symptom of the outbreak, closer examination revealed that very few microblog messages contain the phrase at all, so it was excluded from the analysis.

The remaining keywords are more useful in determining the origin of the spread and the transmission methods.

Among the messages matching the significant keywords, there are also reports made by close relatives of the infected victims. These do not reflect the true number of victims and were therefore excluded as well. The filtering was done with the built-in JMP Pro 13.0 Row Selection function by entering the keyword of choice. The selected rows were then extracted into a separate data view to look for commonalities within the microblog texts. Very often, a few common terms indicate whether a message describes someone else who is ill or is a personal account of the victim's own suffering. These common terms were then used to filter out the non-personal accounts, giving a truer representation of the data for analysis.
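The exclusion of second-hand reports can be sketched in Python as follows. The marker terms below are hypothetical, since the page does not list the common terms that were actually used, and the filtering in the study was done with JMP Pro Row Selection rather than code.

```python
import re

# Hypothetical marker terms for second-hand reports; not taken from the source.
SECOND_HAND_MARKERS = [
    "my mom", "my dad", "my son", "my daughter",
    "my wife", "my husband", "my friend",
]

SECOND_HAND_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in SECOND_HAND_MARKERS) + r")\b",
    re.IGNORECASE,
)


def is_personal_account(text):
    """True when no second-hand marker term appears in the message."""
    return SECOND_HAND_PATTERN.search(text) is None


def personal_accounts(messages):
    """Keep only messages that read as first-person accounts."""
    return [m for m in messages if is_personal_account(m)]
```

A marker list like this is a blunt instrument: a message such as "my friend and I both have a fever" would be dropped too, so the excluded rows are worth reviewing by eye.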


Visualization Experimentation

After much research on JMP Pro 13's Bubble Plot and Tableau's Pages function, various interactive graphs were created to visualize the epidemic data. Static scatterplots and bar charts were also used to showcase some of the trends and data for comparison.


Acknowledgements

Special Thanks To:

1. Prof. Kam Tin Seong - For his guidance and patience


2. SMU MITB Batch Mates for the blood and tears during our discussions:

  • Deng Yuetong
  • He Ziwen
  • Matilda Tan Ying Xuan
  • Rachel Tong


Please visit their respective webpages:


3. YOU, the viewer, for viewing my page.


Feedback

For any feedback or comments, please contact me at:

guoteng.fam.2016@mitb.smu.edu.sg


Reference List

  1. SAS Enterprise. (n.d.). Graphical Displays and Summaries. Retrieved from https://www.jmp.com/en_us/learning-library/graphical-displays-and-summaries.html
  2. SAS Enterprise. (n.d.). JMP Flash Bubble Plot. Retrieved from http://www.jmp.com/support/swfhelp/en/bubbleplot.shtml
  3. Field Epidemiology Manual. (n.d.). Types of Outbreak. Retrieved from https://wiki.ecdc.europa.eu/fem/w/wiki/types-of-outbreak
  4. Centers for Disease Control and Prevention. (2012, May 18). Principles of Epidemiology in Public Health Practice, Third Edition An Introduction to Applied Epidemiology and Biostatistics. Retrieved from https://www.cdc.gov/ophss/csels/dsepd/ss1978/lesson1/section11.html


