Difference between revisions of "ISS608 2017-18 T1 Assign KyonghwanKim Data Preparation"

From Visual Analytics and Applications
 
(7 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
[[file:title.png]]
 
[[file:title.png]]
<div style=background:#B4CE20 border:#D8CBA4>
+
<div style=background:#8B8B8B border:#8B8B8B>
 
<font size = 6; color="#000000">Vastropolis Epidemic Report</font>
 
<font size = 6; color="#000000">Vastropolis Epidemic Report</font>
 
</div>
 
</div>
Line 6: Line 6:
 
<!--MAIN HEADER -->
 
<!--MAIN HEADER -->
 
{|style="background-color:#B4CE20;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0"  |
 
{|style="background-color:#B4CE20;" width="100%" cellspacing="0" cellpadding="0" valign="top" border="0"  |
| style="font-family:Century Gothic; font-size:100%; solid #000000; background:#B4CE20; text-align:center;" width="16.67%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #000000; background:#8B8B8B; text-align:center;" width="20%" |  
 
;
 
;
 
[[ISS608_2017-18_T1_Assign_KyonghwanKim| <font color="#000000">Background</font>]]
 
[[ISS608_2017-18_T1_Assign_KyonghwanKim| <font color="#000000">Background</font>]]
  
| style="font-family:Century Gothic; font-size:100%; solid #D8CBA4; background:#F18D50; text-align:center;" width="16.67%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #8B8B8B; background:#EEDB1A; text-align:center;" width="20%" |  
 
;
 
;
 
[[ISS608_2017-18_T1_Assign_KyonghwanKim_Data_Preparation| <font color="#000000">Data Preparation</font>]]
 
[[ISS608_2017-18_T1_Assign_KyonghwanKim_Data_Preparation| <font color="#000000">Data Preparation</font>]]
  
| style="font-family:Century Gothic; font-size:100%; solid #D8CBA4; background:#B4CE20; text-align:center;" width="16.67%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #8B8B8B; background:#8B8B8B; text-align:center;" width="20%" |  
 
;
 
;
 
[[ISS608_2017-18_T1_Assign_KyonghwanKim_Visualization| <font color="#000000">Visualization</font>]]
 
[[ISS608_2017-18_T1_Assign_KyonghwanKim_Visualization| <font color="#000000">Visualization</font>]]
  
| style="font-family:Century Gothic; font-size:100%; solid #D8CBA4; background:#B4CE20; text-align:center;" width="16.67%" |  
+
| style="font-family:Century Gothic; font-size:100%; solid #8B8B8B; background:#8B8B8B; text-align:center;" width="20%" |  
 
;
 
;
[[ISS608_2017-18_T1_Assign_KyonghwanKim_Answer| <font color="#000000">Answer</font>]]
+
[[ISS608_2017-18_T1_Assign_KyonghwanKim_Solution| <font color="#000000">Solution</font>]]
  
| style="font-family:Century Gothic; font-size:100%; solid #D8CBA4; background:#B4CE20; text-align:center;" width="16.67%" |
+
| style="font-family:Century Gothic; font-size:100%; solid #8B8B8B; background:#8B8B8B; text-align:center;" width="20%" |  
;
 
[[ISS608_2017-18_T1_Assign_KyonghwanKim_Reference| <font color="#000000">Reference</font>]]
 
 
 
| style="font-family:Century Gothic; font-size:100%; solid #D8CBA4; background:#B4CE20; text-align:center;" width="16.67%" |  
 
 
;
 
;
 
[[Talk:ISS608_2017-18_T1_Assign_KyonghwanKim_Feedback| <font color="#000000">Feedback</font>]]
 
[[Talk:ISS608_2017-18_T1_Assign_KyonghwanKim_Feedback| <font color="#000000">Feedback</font>]]
Line 33: Line 29:
 
|}
 
|}
 
<br/>
 
<br/>
 
  
  
Line 47: Line 42:
 
*Created_at column is splitted to Date and Time columns. Date column is used in other analytics.<br/>
 
*Created_at column is splitted to Date and Time columns. Date column is used in other analytics.<br/>
 
*Also, Location column is splitted to Latitude and Longitude columns. These data is used to plot in Vastropolis map.
 
*Also, Location column is splitted to Latitude and Longitude columns. These data is used to plot in Vastropolis map.
|[[file:microblog_split.png]]
+
|[[file:microblog_split.png|450px]]
 
|-
 
|-
 
|'''2. Outliers'''<br/>
 
|'''2. Outliers'''<br/>
 
*There are 21 items with invalid time format. They are removed from analysis.<br/>
 
*There are 21 items with invalid time format. They are removed from analysis.<br/>
 
*Also, there are 6 items with Longitude outside of given map range. They are removed as well so that all data are within parameters.
 
*Also, there are 6 items with Longitude outside of given map range. They are removed as well so that all data are within parameters.
*Total 27 rows are removed and 1,023,050 rows are used for analysis with file name "Microblog_Final.csv".
+
*Total 27 rows are removed and 1,023,050 rows are used for analysis with file name '''''"Microblog_Final.csv"'''''.
|[[file:missing_time.png]]    [[file:outlier_Longitude.png]]
+
|[[file:missing_time.png|150px]]    [[file:outlier_Longitude.png|300px]]
 
|-
 
|-
 
|}
 
|}
  
 
==2. Key Words==
 
==2. Key Words==
Following 17 key words and 2 phrases are used for text analysis. Additionally, word "flu" and "cold" are categorized as '''Diagnosis'''. This is because these words actually describe our target status.
+
Some of key words are given. However, there may be additional key words to enhance accuracy of analysis. <br/>
<br/>
+
 
{| class="wikitable"
+
[[file:outbreak.png|400px]]
|-
+
 
|style="text-align:center;" |'''Diagnosis'''
+
Key word ''"flu"'' and ''"cold"'' are chosen as they are diagnosis words whereas other words are symptoms. Above graph shows the text distribution by "Date" that contains diagnosis words. Text traffic shoots up from May 18th and remain high until 20th. Word Cloud during 3 days of outbreak using JMP text explorer are shown below.
|style="text-align:center;" |'''Symptoms'''
+
 
|-
+
[[file:cloud.png|300px]]
|flu, cold
+
 
|fever, chill, fatigue, cough, breath, nausea, vomit, diarrhea, sweat, pain, ''sore throat'', muscle, letharg (-y or -ic), ''runny nose'', doctor, sick
+
Therefore, following key words are chosen for analysis.
|-
+
*Given: '''''flu''''', fever, chill(s), sweat(s), <span style="color: blue"><del>aches</del></span>, pain(s), fatigue, cough(ing), ''breathing difficulty'', nausea, vomit(ing), diarrhea, ''enlarged lymph nodes''
|}
+
*Enhanced: '''''cold''''', <span style="color: blue">headache</span>, sick, ''shortness of breath'', ''declining health'', ''hurts to move'', ''aching muscles'', ''sore throat'', ''runny nose'', ''problems breathing'', pneumonia
If only '''Diagnosis''' key words are selected from dataset, contagion shoot up since May 18th, 2011 as shown on below graph. According to CDC, ''flu symptoms start 1 to 4 days after the virus enters the body. That means that you may be able to pass on the flu to someone else before you know you are sick, as well as while you are sick.''[https://www.cdc.gov/flu/about/disease/spread.htm] This analysis look closely from 4 days prior to outbreak, which is May 14th, 2011.<br/>
+
 
 +
==3. Contagion Flag==
 +
Text containing above 23 words and phrases are chosen from dataset '''''"Microblog_Final.csv"'''''.
 +
 
 +
[[file:contagion_flag.png]]
  
[[file:outbreak.png]]
+
'''Diagnosis Flag''' is text containing diagnosis words: ''"flu"'' and ''"cold"''. '''Symptom Flag''' is text containing at least 2 of all other key words apart from diagnosis words. Any text containing at least 1 of Diagnosis words or at least 2 of Symptom words are classified as '''Contagion Flag''' (20,466 rows) which is used for Visualization analysis. '''''"Contagion_Flag.csv"'''''

Latest revision as of 23:13, 15 October 2017

Title.png

Vastropolis Epidemic Report


 



Microblog

1. Data cleaning

Description Illustration
1. Split of Columns
  • The Created_at column is split into Date and Time columns. The Date column is used in the other analyses.
  • The Location column is likewise split into Latitude and Longitude columns, which are used to plot points on the Vastropolis map.
Microblog split.png
2. Outliers
  • 21 items have an invalid time format. They are removed from the analysis.
  • 6 items have a Longitude outside the given map range. They are removed as well, so that all remaining data fall within the map bounds.
  • In total, 27 rows are removed and 1,023,050 rows are used for analysis, with file name "Microblog_Final.csv".
Missing time.png Outlier Longitude.png
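
The split and outlier steps above could be sketched in Python roughly as follows. This is only an illustration, not the procedure used for the report: the timestamp format, the space-separated "latitude longitude" layout, and the longitude bounds in LON_RANGE are all assumptions, and the names split_row and clean are hypothetical.

```python
from datetime import datetime

# Assumed, not taken from the report: timestamp layout and map bounds.
TIME_FORMAT = "%m/%d/%Y %H:%M"
LON_RANGE = (93.19, 93.57)  # hypothetical longitude range of the map


def split_row(created_at, location):
    """Return (date, time, latitude, longitude), or None if the row is invalid."""
    try:
        stamp = datetime.strptime(created_at.strip(), TIME_FORMAT)
        lat_text, lon_text = location.split()  # assumes "lat lon"
        lat, lon = float(lat_text), float(lon_text)
    except ValueError:
        return None  # invalid time format or malformed location
    if not LON_RANGE[0] <= lon <= LON_RANGE[1]:
        return None  # longitude outside the map range
    return stamp.date().isoformat(), stamp.strftime("%H:%M"), lat, lon


def clean(rows):
    """Split every (created_at, location) row and count the rows dropped."""
    kept, removed = [], 0
    for created_at, location in rows:
        parsed = split_row(created_at, location)
        if parsed is None:
            removed += 1
        else:
            kept.append(parsed)
    return kept, removed
```

Counting the dropped rows makes it easy to check the result against the totals reported above.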

2. Key Words

Some key words are given. However, additional key words may improve the accuracy of the analysis.

Outbreak.png

The key words "flu" and "cold" are chosen because they are diagnosis words, whereas the other words are symptoms. The graph above shows the distribution by "Date" of texts that contain diagnosis words. Text traffic shoots up on May 18th and remains high until the 20th. A word cloud for the 3 days of the outbreak, produced with JMP Text Explorer, is shown below.

Cloud.png
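
The per-date count of diagnosis texts behind the outbreak graph could be reproduced along these lines. This is a minimal sketch assuming whole-word, case-insensitive matching, which the report does not specify. The function name diagnosis_counts_by_date and the (date, text) row layout are hypothetical.

```python
import re
from collections import Counter

# Whole-word, case-insensitive match for the two diagnosis words (assumed rule).
DIAGNOSIS_RX = re.compile(r"\b(?:flu|cold)\b", re.IGNORECASE)


def diagnosis_counts_by_date(rows):
    """Count, per date, the texts that mention at least one diagnosis word."""
    counts = Counter()
    for date, text in rows:
        if DIAGNOSIS_RX.search(text):
            counts[date] += 1
    return counts
```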

Therefore, the following key words are chosen for analysis.

  • Given: flu, fever, chill(s), sweat(s), aches (struck out), pain(s), fatigue, cough(ing), breathing difficulty, nausea, vomit(ing), diarrhea, enlarged lymph nodes
  • Enhanced: cold, headache, sick, shortness of breath, declining health, hurts to move, aching muscles, sore throat, runny nose, problems breathing, pneumonia

3. Contagion Flag

Texts containing any of the 23 words and phrases above are selected from the dataset "Microblog_Final.csv".

Contagion flag.png

The Diagnosis Flag marks text containing a diagnosis word: "flu" or "cold". The Symptom Flag marks text containing at least 2 of the other key words. Any text containing at least 1 diagnosis word or at least 2 symptom words is classified under the Contagion Flag (20,466 rows), which is used for the Visualization analysis, with file name "Contagion_Flag.csv".
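
The flag rule can be written compactly in Python. The sketch below is an assumption-laden illustration rather than the JMP procedure used for the report. It assumes whole-word, case-insensitive matching, allows only the suffixes shown in parentheses in the key word list, and counts each key word at most once per text. The names count_hits and contagion_flag are hypothetical.

```python
import re

# 2 diagnosis words and 21 symptom words/phrases (23 in total), as regex fragments.
DIAGNOSIS = ["flu", "cold"]
SYMPTOMS = [
    "fever", "chills?", "sweats?", "pains?", "fatigue", "cough(?:ing)?",
    "breathing difficulty", "nausea", "vomit(?:ing)?", "diarrhea",
    "enlarged lymph nodes", "headache", "sick", "shortness of breath",
    "declining health", "hurts to move", "aching muscles", "sore throat",
    "runny nose", "problems breathing", "pneumonia",
]


def _compile(patterns):
    # Whole-word, case-insensitive matching is an assumption of this sketch.
    return [re.compile(r"\b(?:%s)\b" % p, re.IGNORECASE) for p in patterns]


DIAGNOSIS_RX = _compile(DIAGNOSIS)
SYMPTOM_RX = _compile(SYMPTOMS)


def count_hits(text, compiled):
    """Number of distinct key words from the list that occur in the text."""
    return sum(1 for rx in compiled if rx.search(text))


def contagion_flag(text):
    """True if the text has >= 1 diagnosis word or >= 2 distinct symptom words."""
    diagnosis_flag = count_hits(text, DIAGNOSIS_RX) >= 1
    symptom_flag = count_hits(text, SYMPTOM_RX) >= 2
    return diagnosis_flag or symptom_flag
```

Requiring two distinct symptom words keeps a single common word such as "sick" or "pain" from flagging a text on its own.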