Difference between revisions of "ISSS608 2017-18 T1 Assign GOH JUN JIE ANTHONY"

From Visual Analytics and Applications
Latest revision as of 23:54, 15 October 2017

Background

Smartpolis is a major metropolitan area with a population of approximately two million residents. During the last few days, health professionals at local hospitals have noticed a dramatic increase in reported illnesses.

Observed symptoms are largely flu-like and include fever, chills, sweats, aches and pains, fatigue, coughing, breathing difficulty, nausea and vomiting, diarrhea, and enlarged lymph nodes. More recently, there have been several deaths believed to be associated with the current outbreak. City officials fear a possible epidemic and are mobilizing emergency management resources to mitigate the impact.

Two datasets have been provided. The first one contains microblog messages collected from various devices with GPS capabilities. These devices include laptop computers, handheld computers, and cellular phones. The second one contains map information for the entire metropolitan area. The map dataset contains a satellite image with labeled highways, hospitals, important landmarks, and water bodies. Supplemental tables for population statistics and observed weather data are also provided.

We are tasked with the following:

  1. Identify approximately where the outbreak started on the map (ground zero location), outline the affected area and explain how we arrived at the conclusion.
  2. Present a hypothesis on how the infection is being transmitted, e.g. whether the method of transmission is person-to-person, airborne, waterborne etc., and identify the trends that support our hypothesis.
  3. Advise whether the outbreak is contained and whether it is necessary for emergency management personnel to deploy treatment resources outside the affected area, and explain our reasoning.


Data Preparation

In the Microblogs.csv file, the attributes given are ID, Created_at, Location and Text. The "Location" attribute combines the latitude and longitude coordinates in one column, so I had to separate them for subsequent use in Tableau. The following Excel formulae were used to split the latitude and longitude:

  • LEFT(C2, SEARCH(" ",C2,1))
  • RIGHT(C2,LEN(C2)-SEARCH(" ",C2,1))

The README file indicated that the longitude is West, so I added a negative sign to the longitude coordinates.
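The same split can be sketched outside Excel. The snippet below is a minimal Python illustration, not the method used in this assignment; the function name `split_location` and the sample coordinate are hypothetical, and it assumes the two values are separated by a single space as the formulae above imply.

```python
from typing import Tuple


def split_location(location: str) -> Tuple[float, float]:
    """Split a 'latitude longitude' string into two numbers."""
    # Split on the first space, mirroring SEARCH(" ", C2, 1) in the Excel formulae.
    lat_text, lon_text = location.strip().split(" ", 1)
    latitude = float(lat_text)
    # The README states the longitude is West, so it is stored as a negative value.
    longitude = -abs(float(lon_text))
    return latitude, longitude


# Hypothetical sample value, not taken from Microblogs.csv.
print(split_location("42.22717 93.33772"))
```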

There are 1,023,077 records in the Microblogs.csv file. I needed to identify the relevant messages, which would help in identifying the areas affected by the disease. To do that, I used the Text Explorer function in JMP Pro. The Text Explorer lists the most commonly used terms and phrases, and I selected those linked to the illness and its symptoms, e.g. "sick", "headache", "case of the chills", "sick sucks".

Common Terms and Phrases

After selecting the relevant terms and phrases, I made the matching messages into a data table and saved the file as a SAS data set. This left 69,729 messages, down from the original 1,023,077.
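As a rough illustration of this filtering step, the Python sketch below keeps only messages containing one of the selected terms. It is a plain substring match, not a reproduction of JMP Pro's Text Explorer; the term list holds only the four examples quoted above, and the sample messages are made up.

```python
# Only the example terms quoted in the text; the full list used in JMP Pro is not shown.
SYMPTOM_TERMS = ["sick", "headache", "case of the chills", "sick sucks"]


def is_relevant(text, terms=SYMPTOM_TERMS):
    """Return True if the message mentions any of the selected terms."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


# Hypothetical messages for illustration only.
messages = [
    "Woke up with a terrible headache",
    "Traffic on the highway is bad again",
    "Being sick sucks",
]
relevant = [message for message in messages if is_relevant(message)]
print(len(relevant), "of", len(messages), "messages kept")
```

A simple substring match like this will also catch words that merely contain a term, so the counts it produces would not necessarily match the 69,729 messages obtained with Text Explorer.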


Data Visualisation

The SAS data set was imported into Tableau.

As Smartpolis is a fictional location, I inserted the Smartpolis_Map.png file as a background image in Tableau. The "Longitude" field was placed under Columns and the "Latitude" field was placed under Rows. The "Created at" field was placed under Pages and "Hour" was selected. We can see from the image below that at the peak of the outbreak, so many points overlap in the middle of the map that it becomes cluttered.

Many points are cluttered in the middle

To see the intensity of the records more clearly, I applied hexagonal binning by creating calculated fields using the HEXBINX and HEXBINY functions in Tableau. The "Number of Records" field was placed under Size so that bins containing more records appear as bigger circles.

Higher intensity of records will appear as bigger circles
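The idea behind hexagonal binning can be sketched in Python as follows. This is a generic nearest-hexagon-centre calculation (axial coordinates with cube rounding) and is not claimed to reproduce the exact grid that Tableau's HEXBINX and HEXBINY produce; the function name `hexbin_center`, the `size` parameter and the sample points are all hypothetical.

```python
import math
from collections import Counter


def hexbin_center(x, y, size=1.0):
    """Return the centre of the pointy-top hexagon containing (x, y).

    `size` is the distance from a hexagon's centre to its corners.
    """
    # Convert the point to fractional axial hexagon coordinates (q, r).
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r
    # Cube rounding: round all three, then fix the one with the largest error
    # so that q + r + s stays equal to zero.
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    # Convert the rounded axial coordinates back to x/y.
    center_x = size * math.sqrt(3) * (rq + rr / 2)
    center_y = size * 1.5 * rr
    return center_x, center_y


# Hypothetical points; each bin's count plays the role of "Number of Records".
points = [(0.1, 0.1), (0.2, -0.1), (1.7, 0.05)]
counts = Counter(hexbin_center(x, y) for x, y in points)
for center, n in counts.items():
    print(center, n)
```

Sizing each bin's marker by its count is what makes the dense areas stand out instead of being hidden behind overlapping points.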


Origin of Outbreak

By plotting the "Number of Records" against time (hour), we can see that the number of messages rose sharply from 1 am on May 18 and peaked at 6 pm. There were 1,810 messages at 6 pm on May 18, while every hour on the previous days had fewer than 100.

Number of messages increased sharply from May 18, 1 am
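The hourly timeline amounts to counting messages per hour. The sketch below shows that aggregation in Python; the timestamp format string, the year in the sample values and the function name `hourly_counts` are assumptions, since the exact format of the "Created_at" field is not shown here.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps, fmt="%m/%d/%Y %H:%M"):
    """Count messages per hour; `fmt` is an assumed timestamp format."""
    counts = Counter()
    for stamp in timestamps:
        created = datetime.strptime(stamp, fmt)
        # Truncate to the start of the hour so all messages in that hour share a key.
        counts[created.replace(minute=0, second=0, microsecond=0)] += 1
    return counts


# Hypothetical timestamps for illustration only.
sample = ["5/18/2011 1:05", "5/18/2011 1:40", "5/18/2011 8:15"]
for hour, n in sorted(hourly_counts(sample).items()):
    print(hour, n)
```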

We can see from the image below that at 12 am on May 18, not many messages were posted. However, at 1 am, there was a large spike in the number of messages from Downtown and Uptown. Using the Lasso Selection function, I noted that the number of messages from Downtown and Uptown increased from 8 at 12 am to 77 at 1 am. The number of messages increased even more at 8 am, when people woke up. At 8 am, 596 of the 810 messages (74%) were from Downtown and Uptown.

Everything was normal at May 18, 12 am

Spike in messages from Downtown and Uptown at 1 am

Greater spike in messages from Downtown and Uptown at 8 am
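The 74% figure is the share of the 8 am messages that fall inside the lasso selection. A minimal check of that arithmetic, using the counts quoted above and a hypothetical helper named `share_percent`:

```python
def share_percent(part, total):
    """Return `part` as a percentage of `total`."""
    return 100 * part / total


# Counts quoted in the text for May 18, 8 am.
downtown_uptown = 596
all_messages = 810
print(f"{share_percent(downtown_uptown, all_messages):.1f}%")  # prints 73.6%
```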

From the above, we can deduce that the ground zero location and the affected area are Downtown and Uptown.


Epidemic Spread


By Air

From the Weather.csv file, we see that the wind was blowing from the west on May 18. If the infection were airborne, we would expect people in Eastside and some parts of Suburbia and Lakeside to be infected. However, from the image below taken at 11 pm on May 18, we do not see a large spike in the number of messages from Eastside. A point to note from my analysis is that many of the messages mentioned other people who were sick. Therefore, messages from a town would not necessarily mean that people in that town were infected. If people in Eastside were infected, the size of the circle should be almost as big as the one in Downtown and Uptown.

No spike in messages from Eastside

From the above, the infection is unlikely to be spread by air.


By Person to Person

From the Population.csv file, we see that Downtown had 89,286 residents but its daytime population was much higher at 258,928. Many people travelled to Downtown to work in the day. The same goes for Uptown, which had 29,762 residents but a daytime population of 116,072. Since many people go to Downtown and Uptown every day, we would expect the number of infections to rise exponentially if the disease could be spread from person to person. However, based on the two images below taken at 11 pm on May 19 and 20, there was no evidence of an increase in the intensity of messages from the other towns.

No spike in messages from other towns on May 19

No spike in messages from other towns on May 20

The infection is therefore also unlikely to be spread from person to person.


By Water

Residents and businesses get their drinking water by pumping water from nearby reservoirs or rivers. Noting that the infection was contained within Downtown and Uptown from May 18 to May 20, it is likely that the infection was transmitted by water. There was a sudden spike in messages at 1 am on May 18 and an even larger spike at 8 am when people woke up. The reservoir or river water could have been contaminated before 1 am on May 18, which resulted in the first spike of messages. When more people drank the contaminated water in the morning, the number of messages increased drastically.


Containment

We can see from the image below that the number of messages peaked on May 18 at 6 pm (1,810) and stabilised on May 19 and 20 (a maximum of about 1,000 in an hour). This would mean that the number of people infected did not increase and that the infection was a one-off event on May 18. As this group of people would need time to recover, the messages would not disappear overnight. The data only runs until 11 pm on May 20, but I would expect the number of messages to dwindle as the days went by.

Number of messages stabilised on May 19 and 20

Using the Lasso Selection function, I noted that the number of messages from Downtown and Uptown dropped from 427 (May 18, 11 pm) to 307 (May 20, 11 pm). This is a sign that people were recovering and the outbreak was contained. It is not necessary for emergency management personnel to deploy treatment resources outside the affected area as the evidence above showed that the infection did not spread to the other towns.
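The size of the drop in Downtown and Uptown can be expressed as a percentage change. The short sketch below uses the two lasso-selection counts quoted above; the helper name `percent_change` is hypothetical.

```python
def percent_change(before, after):
    """Return the change from `before` to `after` as a percentage of `before`."""
    return 100 * (after - before) / before


# Counts quoted in the text for 11 pm on May 18 and 11 pm on May 20.
change = percent_change(427, 307)
print(f"{change:.1f}%")  # prints -28.1%
```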


URL link to the web-based interactive data visualisation


https://public.tableau.com/profile/goh.jun.jie.anthony#!/vizhome/Assignment_241/Map?publish=yes

https://public.tableau.com/profile/goh.jun.jie.anthony#!/vizhome/Assignment_241/Timeline?publish=yes