Difference between revisions of "ISSS608 2017-18 T1 Assign XING SIYUAN Data Preparation"

From Visual Analytics and Applications

Latest revision as of 20:53, 16 October 2017

Epidemic Spread in Smartpolis - Origin and Transmission



Data Preparation

Description Illustration
1.Data Cleaning


Tools: JMP
Method:
1. Split Created_at into Date and Time.
2. Split Location into Latitude and Longitude.
3. Exclude 21 rows with invalid Time input.
4. Export table into CSV format.
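
The four cleaning steps were done in JMP. For readers without JMP, they can be approximated in Python as a minimal sketch. The sample rows, the column names other than Created_at and Location, the timestamp format (month/day/year hour:minute) and the space-separated "latitude longitude" layout are all assumptions for illustration, not details confirmed by the original table. As an extra guard, the sketch also skips rows whose location does not parse.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows; layout, formats and values are assumptions.
RAW = """ID,Created_at,Location,text
1,5/18/2011 13:26,42.22717 93.33772,feeling fine today
2,5/18/2011 25:99,42.25000 93.40000,row with an invalid time
3,5/19/2011 08:05,42.30112 93.51234,I have a fever
"""

def clean_rows(reader):
    """Split Created_at and Location; skip rows that fail to parse."""
    cleaned, skipped = [], 0
    for row in reader:
        try:
            stamp = datetime.strptime(row["Created_at"], "%m/%d/%Y %H:%M")
            lat_text, lon_text = row["Location"].split()
            lat, lon = float(lat_text), float(lon_text)
        except ValueError:
            # Invalid timestamp (or malformed location): exclude the row.
            skipped += 1
            continue
        cleaned.append({
            "ID": row["ID"],
            "Date": stamp.date().isoformat(),
            "Time": stamp.time().isoformat(timespec="minutes"),
            "Latitude": lat,
            "Longitude": lon,
            "text": row["text"],
        })
    return cleaned, skipped

cleaned, skipped = clean_rows(csv.DictReader(io.StringIO(RAW)))

# Step 4: export the cleaned table as CSV (here into an in-memory buffer).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(cleaned[0].keys()))
writer.writeheader()
writer.writerows(cleaned)
```

Counting the skipped rows makes it easy to compare against the 21 exclusions reported above when the sketch is run on the real file.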

SY Clean data.png
2.Identify infected patients


Tools: Tableau
After loading the cleaned data into Tableau, we draw a heat map to visualize macroblog density per day in each location. The heat map of macroblog counts shows a sharp increase in the number of macroblogs posted on 19 and 20 May, which suggests that one or more major events occurred around those dates.

From the locations of the macroblogs posted on 19 and 20 May (see the macroblog distribution figure), there is a clear high density of macroblogs around the hospital of Smartpolis (highlighted with a black square). These posts were therefore very likely written by people infected by the epidemic. Investigating what those people posted and where they had been in the preceding days can help us find where the outbreak started, how the infection is transmitted, and whether the outbreak is contained.
Select the macroblogs on the map that are located around the hospitals. Group the user_id values of these posts and create a set named patients. Export a CSV file containing the IDs of everyone in the patients set.
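
The selection above was made by hand on the Tableau map. A rough programmatic equivalent is a bounding-box filter, sketched below. The box coordinates, the sample posts, the year in the dates and the function names are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical bounding boxes around hospitals:
# (min_lat, max_lat, min_lon, max_lon). Real coordinates are not given here.
HOSPITAL_BOXES = [
    (42.20, 42.22, 93.40, 93.42),
]

# Hypothetical cleaned posts.
posts = [
    {"user_id": "u1", "Date": "2011-05-19", "Latitude": 42.21, "Longitude": 93.41},
    {"user_id": "u2", "Date": "2011-05-19", "Latitude": 42.30, "Longitude": 93.50},
    {"user_id": "u3", "Date": "2011-05-20", "Latitude": 42.205, "Longitude": 93.415},
    {"user_id": "u1", "Date": "2011-05-20", "Latitude": 42.215, "Longitude": 93.405},
    {"user_id": "u4", "Date": "2011-05-10", "Latitude": 42.21, "Longitude": 93.41},
]

def near_hospital(lat, lon, boxes=HOSPITAL_BOXES):
    """Return True if the point falls inside any hospital bounding box."""
    return any(
        min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
        for min_lat, max_lat, min_lon, max_lon in boxes
    )

def patient_ids(posts, dates):
    """Collect the distinct user IDs that posted near a hospital on the given dates."""
    return {
        post["user_id"]
        for post in posts
        if post["Date"] in dates
        and near_hospital(post["Latitude"], post["Longitude"])
    }

patients = patient_ids(posts, {"2011-05-19", "2011-05-20"})
```

Using a set mirrors the Tableau set: a user who posts near the hospital several times is counted once.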

Heatmap of Number of Macroblogs by days:

SY num dis.png

Macroblogs distribution in the last day:

SY patients.png
3.Identify Symptom of Infected Patients - Data Preparation


Tools: JMP
Load the patients ID file into JMP and join it with the Macroblogs table. JMP's Text Explorer then generates the words and phrases most often mentioned by the infected people (see the Text Explorer figure). Filtering for words related to symptoms of the epidemic shows that most patients were suffering from fever, cough, headache, diarrhea, vomiting, sore throat, aching muscles, runny nose, difficulty breathing and so on.

On closer inspection, the symptoms appear to fall into two categories: one related to gastrointestinal discomfort and the other to inhalation discomfort. Hence, it is possible that the epidemic involves two types of disease and may have two origins and multiple transmission methods. We chose 7 words from the inhalation symptoms and 4 from the gastrointestinal symptoms (shown in the words table) to identify the origin and transmission method of the epidemic.

Create 11 columns, each named after one of the 11 selected words, and check whether the text in each row contains the corresponding word. If yes, output 1. If no, output 0.
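
The indicator columns were built with a JMP formula. The same check can be sketched in Python as follows. The word lists come from the words table, while the function name and the case-insensitive substring matching are assumptions about how "contains" was applied.

```python
# The 11 words from the words table, grouped by symptom type.
INHALATION = ["chill", "flu", "sore throat", "breath", "pneumonia", "fever", "cough"]
GASTROINTESTINAL = ["stomachache", "diarrhea", "vomit", "nausea"]
SYMPTOM_WORDS = INHALATION + GASTROINTESTINAL

def symptom_flags(text):
    """Return one 0/1 flag per symptom word, using a case-insensitive substring test."""
    lowered = text.lower()
    return {word: int(word in lowered) for word in SYMPTOM_WORDS}

flags = symptom_flags("Fever and a sore throat, coughing all night")
```

Because this is a substring test, "cough" also matches "coughing", but a short word such as "flu" would also match unrelated words like "influence". Whether that trade-off is acceptable depends on the data.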

Words detection:

Text Explorer
SY words.png

Words table:

Symptom Type | Words
Inhalation | chill, flu, sore throat, breath, pneumonia, fever, cough
Gastrointestinal | stomachache, diarrhea, vomit, nausea

Check Symptoms:

Formula
SY formula.png
4.Identify Origin of the Epidemic - Data Visualization


Tools: Tableau
To identify the origin of the epidemic, we need to identify both when and where the symptoms start to appear.
First, load the newly created table into Tableau and join it with the original data on the id and time of each row. To identify the start time of the epidemic, we use a line chart to visualize the change in the frequency of the symptoms over time (see the frequency chart). The pink lines represent the frequency of inhalation symptoms and the yellow lines represent the frequency of gastrointestinal symptoms.
To identify the start location of the epidemic, we visualize the locations of the macroblogs that mention the symptoms. To make the visualization more interactive, we write a formula in Tableau that groups the symptoms into inhalation and gastrointestinal, then add the two groups as a filter so that the origin of each type of disease can be viewed separately.
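
The grouping formula and the frequency chart were built in Tableau. As a minimal sketch of the underlying logic, the 0/1 symptom flags can be collapsed into the two groups and counted per day. The sample rows, the year in the dates and the function names are hypothetical.

```python
from collections import Counter

INHALATION = ["chill", "flu", "sore throat", "breath", "pneumonia", "fever", "cough"]
GASTROINTESTINAL = ["stomachache", "diarrhea", "vomit", "nausea"]

def symptom_groups(flags):
    """Return the symptom groups for which at least one word is flagged."""
    groups = set()
    if any(flags.get(word, 0) for word in INHALATION):
        groups.add("inhalation")
    if any(flags.get(word, 0) for word in GASTROINTESTINAL):
        groups.add("gastrointestinal")
    return groups

def daily_counts(rows):
    """Count posts per (date, symptom group); a post can count in both groups."""
    counts = Counter()
    for row in rows:
        for group in symptom_groups(row["flags"]):
            counts[(row["Date"], group)] += 1
    return counts

# Hypothetical rows carrying the flags produced in the previous step.
rows = [
    {"Date": "2011-05-18", "flags": {"fever": 1, "cough": 1}},
    {"Date": "2011-05-18", "flags": {"nausea": 1}},
    {"Date": "2011-05-19", "flags": {"fever": 1, "vomit": 1}},
    {"Date": "2011-05-19", "flags": {"fever": 0}},
]

counts = daily_counts(rows)
```

Plotting these counts against the date gives one line per group, and the first date on which a group's count rises marks a candidate start time for that type of disease.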

Visualization:

Appearance Frequency | Appearance Location
SY 02.png | SY 03.png
5.Identify Major Events in Smartpolis


Tools: Tableau, JMP
The epidemic broke out on 18 May, so it is highly likely that people were infected on 17 May. It is therefore worth investigating where people went on 17 May and whether any major event on that day spread the disease.

Use a Tableau filter to keep only the macroblogs posted on 17 May. There appears to be an unusual change in density at the intersection of the red road and the Vast River. Select the points around this area and create a set called event. Export the user IDs of the event set.

In JMP, join the event table with the Macroblogs table and keep only the rows posted on 17 May. The Text Explorer output suggests that a serious truck accident happened on 17 May around the intersection of Westside, Northville, Downtown and Plainville.
To further explore the major events in Smartpolis, we ran text analysis on the macroblogs posted in each week (see the weekly keyword figures). It appears that on 6 May a car accident happened along the boundary of Riverside and Suburbia, on 7 May a music festival was held in the Downtown area, and on 17 May a truck accident happened at the intersection of road 610 and the Vast River. The first two events took place well before the outbreak of the epidemic, so they are unlikely to have affected its spread.
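
The weekly text analysis was done with JMP's Text Explorer. A much simpler stand-in, sketched below, counts the most frequent non-stopword terms per ISO calendar week. The sample posts, the year in the dates, the stopword list and the choice of ISO weeks are all assumptions, and the sketch does not reproduce Text Explorer's phrase detection.

```python
import re
from collections import Counter, defaultdict
from datetime import date

# A tiny illustrative stopword list; a real analysis would need a fuller one.
STOPWORDS = {"a", "the", "on", "in", "at", "was", "is", "there", "to", "of", "and", "i"}

def top_terms_by_week(posts, n=3):
    """Return the n most frequent non-stopword terms for each ISO week number."""
    by_week = defaultdict(Counter)
    for post in posts:
        week = date.fromisoformat(post["Date"]).isocalendar()[1]
        words = re.findall(r"[a-z']+", post["text"].lower())
        by_week[week].update(word for word in words if word not in STOPWORDS)
    return {
        week: [word for word, _ in counter.most_common(n)]
        for week, counter in by_week.items()
    }

# Hypothetical posts, not taken from the real data set.
posts = [
    {"Date": "2011-05-07", "text": "Music festival downtown"},
    {"Date": "2011-05-07", "text": "The festival was great"},
    {"Date": "2011-05-17", "text": "Truck accident on the bridge"},
    {"Date": "2011-05-17", "text": "Huge truck accident"},
]

weekly_terms = top_terms_by_week(posts, n=2)
```

Terms that dominate a single week but are rare in the others are the ones worth checking against the map, as was done for the truck accident.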


Event detection:

Group | Text Explorer
SY 18.png | SY 17.png

Weekly keyword exploration:

Week 1 | Week 2 | Week 3
Week1.png | Week2.png | Week3.png

Key Events

Music Festival | Car Accident | Truck Accident
7 May | 6 May | 17 May
SY 26.png | SY 27.png | SY 28.png