ISSS608 2017-18 T1 Assign GOH JUN JIE ANTHONY

From Visual Analytics and Applications
Revision as of 21:11, 15 October 2017 by Anthony.goh.2016 (talk | contribs)

Background

Smartpolis is a major metropolitan area with a population of approximately two million residents. During the last few days, health professionals at local hospitals have noticed a dramatic increase in reported illnesses.

Observed symptoms are largely flu-like and include fever, chills, sweats, aches and pains, fatigue, coughing, breathing difficulty, nausea and vomiting, diarrhea, and enlarged lymph nodes. More recently, there have been several deaths believed to be associated with the current outbreak. City officials fear a possible epidemic and are mobilizing emergency management resources to mitigate the impact.

Two datasets have been provided. The first one contains microblog messages collected from various devices with GPS capabilities. These devices include laptop computers, handheld computers, and cellular phones. The second one contains map information for the entire metropolitan area. The map dataset contains a satellite image with labeled highways, hospitals, important landmarks, and water bodies. Supplemental tables for population statistics and observed weather data are also provided.

We are tasked with the following:

  1. Identify approximately where the outbreak started on the map (ground zero location), outline the affected area and explain how we arrived at the conclusion.
  2. Present a hypothesis on how the infection is being transmitted, e.g. whether the method of transmission is person-to-person, airborne, waterborne etc., and identify the trends that support our hypothesis.
  3. Advise whether the outbreak is contained and whether it is necessary for emergency management personnel to deploy treatment resources outside the affected area, and explain our reasoning.


Data Preparation

In the Microblogs.csv file, the attributes given are ID, Created_at, Location and Text. The "Location" attribute combines the latitude and longitude coordinates in one column, separated by a space, so I had to split it into two columns for subsequent use in Tableau. The following functions were used in Excel to split the latitude and longitude:

  • LEFT(C2, SEARCH(" ",C2,1)-1)
  • RIGHT(C2,LEN(C2)-SEARCH(" ",C2,1))

The README file states that the longitudes are given in degrees West, so I added a negative sign to the longitude coordinates.
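The same split can be sketched outside Excel. The following is a minimal Python illustration, not the method used in this assignment; the function name `split_location` and the sample coordinate string are made up for the example, and it assumes the column holds latitude first, then longitude, separated by whitespace.

```python
from typing import Tuple


def split_location(location: str) -> Tuple[float, float]:
    """Split a 'latitude longitude' string into signed decimal degrees.

    Assumption: the value holds exactly two whitespace-separated numbers,
    latitude first, and the longitude is in degrees West.
    """
    lat_text, lon_text = location.split()
    latitude = float(lat_text)
    # West longitudes are negative in signed decimal degrees.
    longitude = -abs(float(lon_text))
    return latitude, longitude


# Hypothetical sample value, not taken from Microblogs.csv.
print(split_location("42.22717 93.33772"))
```

Using `str.split()` without arguments also discards any leading or trailing spaces, so neither half needs trimming afterwards.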

There are 1,023,077 records in the Microblogs.csv file. To identify the messages relevant to locating the affected area, I used the Text Explorer function in JMP Pro. Text Explorer lists the most commonly used terms and phrases, and from that list I selected the ones linked to the illness and its symptoms, e.g. "sick", "headache", "case of the chills", "sick sucks".

Common Terms and Phrases

After selecting the relevant terms and phrases, I made the matching messages into a data table and saved it as a SAS data set. This leaves 69,729 messages, compared to the original 1,023,077.
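For readers without JMP Pro, the filtering step can be approximated with a keyword match. This is only a rough sketch: the term list contains just the four examples quoted above (the full selection made in Text Explorer is not reproduced here), the helper names are hypothetical, and whole-word, case-insensitive matching is an assumption, so the result would not necessarily match the 69,729 messages reported.

```python
import re

# Only the example terms named in the text; the real selection was larger.
SYMPTOM_TERMS = ["sick", "headache", "case of the chills", "sick sucks"]


def build_pattern(terms):
    """Compile one regex that matches any term as a whole word or phrase."""
    escaped = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + escaped + r")\b", re.IGNORECASE)


def filter_messages(messages, terms=SYMPTOM_TERMS):
    """Keep only the messages that mention at least one term."""
    pattern = build_pattern(terms)
    return [message for message in messages if pattern.search(message)]


# Made-up messages for illustration.
sample = [
    "I feel sick today",
    "Nice weather downtown",
    "Bad case of the CHILLS",
    "homesickness is real",
]
print(filter_messages(sample))
```

The word boundaries keep "homesickness" from matching "sick"; whether that is desirable depends on how the terms were chosen in Text Explorer.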


Data Visualisation

The SAS data set was imported into Tableau.

As Smartpolis is a fictional location, I inserted the Smartpolis_Map.png file as a background image in Tableau. The "Longitude" field was placed under Columns and the "Latitude" field was placed under Rows. The "Created at" field was placed under Pages and "Hour" was selected. We can see from the image below that at the peak of the outbreak, many points are cluttered in the middle.

Many points are cluttered in the middle

To see the intensity of the records more clearly, I did hexagonal binning by creating calculated fields using the HEXBINX and HEXBINY functions in Tableau. The "Number of Records" field was placed under Size so that bins with a higher number of records appear as bigger circles.

Higher intensity of records will appear as bigger circles
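To show what hexagonal binning does to the points, here is a small Python sketch. It is not Tableau's implementation of HEXBINX and HEXBINY; the hexagon orientation (pointy-top), the `width` parameter and the function names are assumptions made for the illustration. Each point is snapped to the nearest hexagon centre, and the points per centre are then counted, which corresponds to the value placed under Size.

```python
import math
from collections import Counter


def hexbin_center(x, y, width=1.0):
    """Return the centre of the hexagon containing (x, y).

    Hexagon centres form two interleaved rectangular lattices; the nearest
    centre overall is the nearer of the two per-lattice candidates.
    """
    dx = width                  # horizontal spacing between centres
    dy = width * math.sqrt(3)   # vertical spacing within one lattice
    # Candidate from the lattice that passes through the origin.
    ax = round(x / dx) * dx
    ay = round(y / dy) * dy
    # Candidate from the lattice shifted by half a cell in both directions.
    bx = (math.floor(x / dx) + 0.5) * dx
    by = (math.floor(y / dy) + 0.5) * dy
    dist_a = (x - ax) ** 2 + (y - ay) ** 2
    dist_b = (x - bx) ** 2 + (y - by) ** 2
    return (ax, ay) if dist_a <= dist_b else (bx, by)


def count_bins(points, width=1.0):
    """Count how many points fall into each hexagonal bin."""
    return Counter(hexbin_center(x, y, width) for x, y in points)


# Made-up points for illustration.
print(count_bins([(0.1, 0.1), (-0.1, 0.05), (0.5, 0.8)]))
```

A smaller `width` gives finer bins; with real coordinates the points would usually be rescaled first so that one bin covers a sensible area of the map.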


Origin of Outbreak

By plotting the "Number of Records" against time (by hour), we can see that the number of messages rose sharply from 1 am on May 18 and peaked at 6 pm. There were 1,810 messages in the 6 pm hour on May 18, whereas on the previous days there were fewer than 100 messages in each hour.

Number of messages increased sharply from May 18, 1 am
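The hourly counts behind this chart can also be computed directly. The sketch below is an illustration only: the timestamp format string, the function names and the sample timestamps (including the year) are assumptions, since the exact layout of the Created_at column is not shown here.

```python
from collections import Counter
from datetime import datetime


def messages_per_hour(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count messages per clock hour.

    Assumption: every timestamp string follows `fmt`; adjust it to the
    actual layout of the Created_at column.
    """
    counts = Counter()
    for text in timestamps:
        parsed = datetime.strptime(text, fmt)
        # Truncate to the start of the hour so all messages in it share a key.
        counts[parsed.replace(minute=0, second=0, microsecond=0)] += 1
    return counts


def peak_hour(counts):
    """Return the (hour, count) pair with the most messages."""
    return max(counts.items(), key=lambda item: item[1])


# Made-up timestamps for illustration.
sample = ["2011-05-18 18:05", "2011-05-18 18:40", "2011-05-18 01:15"]
print(peak_hour(messages_per_hour(sample)))
```

Applied to the filtered messages, the peak returned by `peak_hour` would be expected to correspond to the 6 pm hour on May 18 described above.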