ISSS608 2017-18 T1 Assign WANG SHANG

From Visual Analytics and Applications
Revision as of 14:44, 15 October 2017 by Shang.wang.2016

Mini Challenge: What's happened in Smartpolis?

Background

Smartpolis is a major metropolitan area with a population of approximately two million residents. During the last few days, health professionals at local hospitals have noticed a dramatic increase in reported illnesses.

I want to use visual analytics tools to track how the illness spreads, and to give the government insights into what it can do to control the spread better.


Data Description

I have three datasets and one Smartpolis map for analysis. The first dataset contains microblog messages collected from various devices with GPS capabilities, including laptop computers, handheld computers, and cellular phones. The other two contain population statistics and observed weather data. Some additional information is also provided in a Word file.


Data Preparation

In the microblog dataset, one column records the text that each person published to the social platform, and the dataset also gives me the creation time and location of each message. I import this dataset into JMP and use the Word function to split the location data into two columns, latitude and longitude. Then I use a Text Explorer analysis to split each text record into words and phrases with no stemming. I assume that someone who is ill usually posts a message about the illness. So if I can find a word that represents a symptom or illness in a text, the author of that message has probably gotten this illness. Hence, I can extract a key symptom to represent the current status of a person.
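
The location split can also be sketched outside JMP. The snippet below is only an illustration: it assumes each location value is a single string with latitude and longitude separated by whitespace, which may not match the real column format. The function name `split_location` and the sample value are made up.

```python
from typing import Tuple


def split_location(location: str) -> Tuple[float, float]:
    """Split a 'latitude longitude' string into two floats.

    Assumption: the two values are separated by whitespace.
    """
    lat_text, lon_text = location.split()
    return float(lat_text), float(lon_text)


# Made-up sample value, for illustration only.
latitude, longitude = split_location("42.25 93.5")
print(latitude, longitude)
```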

For example, I use flu-like, fever, chills, sweats, aches and pains, fatigue, coughing, breathing difficulty, nausea, vomiting, diarrhea, and enlarged lymph nodes, which are provided in the overview part of the assignment introduction page, as my illness word list. I search for each word from this list in the JMP Text Explorer analysis, collect the related text records, and put them into a new table. In this new table, I create a new column called Key_Symptom with the matched word as its value.

pic1. finding key symptom
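
The keyword lookup was done interactively in JMP. As a rough equivalent, the sketch below tags each message with every term from the illness word list that it contains. The function names `find_symptoms` and `tag_messages` are hypothetical, matching is on whole words or phrases with no stemming, and a message that mentions several terms yields several rows, which is an assumption about how the per-word tables combine.

```python
import re

# Illness word list from the assignment overview, as quoted in the text.
SYMPTOM_TERMS = [
    "flu-like", "fever", "chills", "sweats", "aches and pains", "fatigue",
    "coughing", "breathing difficulty", "nausea", "vomiting", "diarrhea",
    "enlarged lymph nodes",
]


def find_symptoms(text, terms=SYMPTOM_TERMS):
    """Return every listed term that appears in text as a whole word or phrase.

    No stemming is applied, so a message saying "cough" does not match
    the term "coughing".
    """
    lowered = text.lower()
    return [t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", lowered)]


def tag_messages(messages):
    """Build (Key_Symptom, text) rows, one row per matched term."""
    return [(term, text) for text in messages for term in find_symptoms(text)]


# Made-up messages, for illustration only.
for row in tag_messages(["I have a fever and chills tonight", "Nice weather"]):
    print(row)
```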

After repeating the same process for all the words in my illness word list, I concatenate the resulting tables to generate the table I use for visualization. Before I import it into Tableau, I also create a new column named DayNight based on the Created_at column. In this column, the value "1" means Day, where the hour of the created time is between 6 and 17, and the value "2" means Night, where the hour is less than 6 or larger than 17. This completes the data preparation. I will use this table together with the weather and population data for the visual analysis.
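
The DayNight rule can be written as a small function. This is a minimal sketch, assuming Created_at has already been parsed into a `datetime`. The function name `day_night` is made up, and the year in the sample timestamps is a placeholder.

```python
from datetime import datetime


def day_night(created_at: datetime) -> int:
    """Return 1 (Day) if the hour is between 6 and 17 inclusive, else 2 (Night)."""
    return 1 if 6 <= created_at.hour <= 17 else 2


# Placeholder timestamps (the year is arbitrary), for illustration only.
print(day_night(datetime(2000, 5, 18, 9, 30)))   # hour 9 falls in the Day range
print(day_night(datetime(2000, 5, 18, 22, 15)))  # hour 22 falls in the Night range
```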


Tasks & Solutions

Task 1: Origin and Epidemic Spread

As the picture below shows, in my opinion ground zero is around the place in the red circle, and there are two affected regions, marked by the two yellow circles next to the red circle.

pic2. ground zero and affected places

After data preparation, I import the new microblog data into Tableau. First, in the DayNight column, I recode the value "1" to Day and the value "2" to Night. Then I add the map image as the background for the latitude and longitude columns and use different colors to identify different symptoms.

To find ground zero and the affected places, I need to find places where many more points appear on the map than before. I use the day of the Created_at column as my filter and put it on the Pages shelf to show the distribution of points for each day. The situation is normal before 18th May. On the 18th, a lot of points suddenly appear in the downtown and uptown region. Just one day later, a lot of points suddenly appear in another region, the downstream part of the river.

pic3. 18th outbreak
pic4. 19th outbreak
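
The day-by-day comparison was done by eye in Tableau. A numeric approximation is to count messages per day and flag days whose count jumps relative to the previous day. The sketch below assumes one `date` per symptom-tagged message. The function name `spike_days`, the `factor` threshold of 2.0, and the sample data are hypothetical.

```python
from collections import Counter
from datetime import date


def spike_days(days, factor=2.0):
    """Return the days whose message count is at least `factor` times
    the count of the previous day that has any messages."""
    counts = Counter(days)
    ordered = sorted(counts)
    return [
        current
        for previous, current in zip(ordered, ordered[1:])
        if counts[current] >= factor * counts[previous]
    ]


# Made-up sample: one date per tagged message (the year is a placeholder).
sample = (
    [date(2000, 5, 16)] * 3
    + [date(2000, 5, 17)] * 3
    + [date(2000, 5, 18)] * 9
)
print(spike_days(sample))
```

With real data the threshold would need tuning, and the same count would have to be repeated per map region to separate the two affected areas.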

So the two affected places are the downtown and uptown region and the banks of the downstream part of the river.

I also want to know exactly what kinds of symptoms people had. So I count the number of records for each symptom and get the graph below.

pic5. The number of symptoms
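
The per-symptom counts behind such a graph can be sketched as a nested tally of records by symptom and day. The snippet assumes each record is a `(Key_Symptom, day)` pair, with the day given as the day of the month. The function names and the sample records are made up, and `busiest_day` is only a simple stand-in for reading the outbreak day off the chart.

```python
from collections import Counter, defaultdict


def counts_by_symptom_and_day(records):
    """Tally records into a {symptom: Counter({day: count})} table."""
    table = defaultdict(Counter)
    for symptom, day in records:
        table[symptom][day] += 1
    return table


def busiest_day(table):
    """Return {symptom: day with the most records} for each symptom."""
    return {symptom: days.most_common(1)[0][0] for symptom, days in table.items()}


# Made-up records, for illustration only.
records = [("fever", 17), ("fever", 18), ("fever", 18), ("nausea", 19)]
table = counts_by_symptom_and_day(records)
print(dict(table["fever"]))
print(busiest_day(table))
```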

I find that headache, breathing problem, chill, cough, fatigue, fever and sweat break out on the 18th. Diarrhea, flu, nausea and pain break out on the 19th. Vomit breaks out on the 20th. I also visualize the distribution of points for each symptom and find that headache, breathing problem, chill, cough, fatigue, fever, sweat and flu are concentrated in the center region (the downtown and uptown region). This suggests that most people who had the symptoms that broke out on the 18th got flu on the next day. So the illness in this region is possibly a flu-like illness.

Diarrhea, nausea, and vomit are concentrated along the banks of the downstream part of the river. As with flu, I think the outbreak of the vomit symptom is simply delayed by the reaction time of the human body. Based on the symptom words of the two regions, these are obviously two different illnesses: a flu-like illness breaks out in the center region, and a stomach-related problem breaks out along the banks of the downstream part of the river. (Below are the pictures of the point distribution for each symptom.)

pic6. Point distribution on 18th
pic7. Point distribution on 19th and 20th

Notice that the symptom pain breaks out both in the center region and along the banks of the downstream part of the river. I think this is because people feel pain whether they have a flu-like illness or a stomach problem. So I decide to remove it from my further analysis. Headache actually has the same problem, but I think that is mainly because I merged the symptom ache into headache, which is why many of its points appear near the banks of the downstream part of the river. So I decide to keep headache.

After finding the affected places, I can locate ground zero based on what I have found. First, for the stomach-related problem, given the nature of the symptoms and the fact that the outbreak region is next to the water, I think people got this illness because of the river. According to the additional information in the README file, the river flows south, which is also why I call this region the downstream part of the river. So something in the water must have flowed from upstream to downstream and caused the illness.

Second, for the center region, the symptoms include breathing problem, cough, flu and so on, so I think there may be something wrong with the air. According to the weather data, on both the 17th and 18th the wind blows from west to east. That means something happened in the west part of the center region and polluted the air.

Combining the possible causes for the two affected places, I think ground zero is around Highway 601, inside the red circle I mentioned before. To check this, I run another Text Explorer analysis on the text data from the 17th and 18th. The result shows that the word fire has the highest count. So I guess that some event caused a fire, and that the fire led to air pollution and put a lot of contaminants into the river, which resulted in the illnesses in the two regions.
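
The word count on the messages from the 17th and 18th was produced in JMP. A rough stand-in is a plain frequency count over the message texts, sketched below. The stop-word list is a small hypothetical one, the function name `top_words` is made up, and the sample messages are invented and are not taken from the dataset.

```python
import re
from collections import Counter

# Hypothetical minimal stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "from", "near", "is", "at"}


def top_words(messages, n=5):
    """Return the n most frequent non-stop words across all messages."""
    counts = Counter()
    for message in messages:
        words = re.findall(r"[a-z]+", message.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)


# Invented messages, for illustration only.
toy_messages = [
    "Huge fire near the highway",
    "Fire trucks everywhere",
    "Smoke from the fire",
]
print(top_words(toy_messages, 3))
```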


Task 2: Epidemic Spread

Question 1: How is the infection being transmitted?

In this case, there are three ways the illness can spread.

  1. Person-to-person
  2. Airborne
  3. Waterborne

Because the illnesses in the two affected regions are different, I analyze them individually.

Banks of downstream of river

For this region, the illness is a stomach-related problem. As I said in Task 1, I think the illness in this region is probably transmitted by water, because people usually get stomach problems when they eat something bad. Here, I cannot collect any information about food, and this region is very close to the river bank. So water should be the main medium that transmits the illness. Below is the distribution of stomach-related points on the 19th and 20th.