ISS608 2017-18 T1 Assign KyonghwanKim Data Preparation

From Visual Analytics and Applications
Revision as of 04:01, 15 October 2017


Vastropolis Epidemic Report


Data Preparation







1. Data cleaning

Description Illustration
1. Split of Columns
  • The Created_at column is split into Date and Time columns. The Date column is used in the other analyses.
  • The Location column is likewise split into Latitude and Longitude columns. These data are used to plot the microblogs on the Vastropolis map.
Microblog split.png
2. Outliers
  • There are 21 items with an invalid time format. They are removed from the analysis.
  • There are also 6 items with a Longitude outside the given map range. They are removed as well, so that all remaining data fall within the map boundaries.
  • In total, 27 rows (21 + 6) are removed, and the remaining 1,023,050 rows are used for analysis, saved under the file name "Microblog_Final.csv".
Missing time.png, Outlier Longitude.png
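
The two cleaning steps above can be sketched in Python. This is a minimal illustration only, since the page does not say which tool was actually used. The function names, the timestamp format "%m/%d/%Y %H:%M", the space-separated "latitude longitude" layout of the Location column, and the longitude bounds passed by the caller are all assumptions, not details taken from the dataset description.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "5/18/2011 13:26" (not confirmed by the source).
ASSUMED_TIME_FORMAT = "%m/%d/%Y %H:%M"


def split_row(created_at, location, time_format=ASSUMED_TIME_FORMAT):
    """Split Created_at into date and time, and Location into latitude and longitude.

    Returns None when the timestamp does not match the expected format.
    Location is assumed to always hold two space-separated numbers.
    """
    try:
        stamp = datetime.strptime(created_at.strip(), time_format)
    except ValueError:
        return None  # invalid time format
    lat_text, lon_text = location.split()
    return stamp.date(), stamp.time(), float(lat_text), float(lon_text)


def clean(rows, lon_min, lon_max, time_format=ASSUMED_TIME_FORMAT):
    """Keep rows with a valid timestamp and a longitude inside [lon_min, lon_max].

    `rows` is an iterable of (created_at, location) string pairs.
    Returns (kept_rows, removed_count).
    """
    kept = []
    removed = 0
    for created_at, location in rows:
        parsed = split_row(created_at, location, time_format)
        if parsed is None or not (lon_min <= parsed[3] <= lon_max):
            removed += 1
            continue
        kept.append(parsed)
    return kept, removed
```

Calling clean() on the raw rows with the map's longitude limits returns the rows that would be written to the final file, together with a count of removed rows that can be checked against the 27 reported above.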

2. Key Words

The following 17 key words and 2 phrases are used for text analysis. The words "flu" and "cold" are categorized separately as Diagnosis, because they name the target condition itself rather than a symptom of it.

Diagnosis: flu, cold
Symptoms: fever, chill, fatigue, cough, breath, nausea, vomit, diarrhea, sweat, pain, sore throat, muscle, letharg (-y or -ic), runny nose, doctor, sick
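
One way to apply this key word list is a simple case-insensitive substring match, sketched below. The matching rule is an assumption, because the page does not describe how the text was actually searched. The names KEYWORDS and categorize are hypothetical. Substring matching lets the stem "letharg" cover both "lethargy" and "lethargic", but it also produces false positives (for example, "flu" matches "influence"), so a stricter word-boundary match may be preferable in practice.

```python
# Key words and phrases from the table above, grouped by category.
KEYWORDS = {
    "Diagnosis": ["flu", "cold"],
    "Symptoms": [
        "fever", "chill", "fatigue", "cough", "breath", "nausea", "vomit",
        "diarrhea", "sweat", "pain", "sore throat", "muscle", "letharg",
        "runny nose", "doctor", "sick",
    ],
}


def categorize(text):
    """Return {category: [matched key words]} for one microblog message.

    Matching is a case-insensitive substring test (an assumed rule).
    Categories with no match are omitted from the result.
    """
    lowered = text.lower()
    result = {}
    for category, keys in KEYWORDS.items():
        matched = [key for key in keys if key in lowered]
        if matched:
            result[category] = matched
    return result
```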

If only the Diagnosis key words are selected from the dataset, the contagion shoots up from May 18th, 2011, as shown in the graph below. According to the CDC, flu symptoms start 1 to 4 days after the virus enters the body. That means a person may be able to pass the flu on to someone else before knowing they are sick, as well as while they are sick.[1] This analysis therefore looks closely at the data starting 4 days prior to the outbreak, that is, from May 14th, 2011.
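
The date arithmetic and the daily count of Diagnosis mentions can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the input is assumed to be (date, text) pairs from the cleaned file, and the substring matching rule is an assumption rather than the method documented on this page.

```python
from collections import Counter
from datetime import date, timedelta

DIAGNOSIS_KEYWORDS = ("flu", "cold")
OUTBREAK_START = date(2011, 5, 18)   # day the Diagnosis mentions shoot up
MAX_INCUBATION_DAYS = 4              # upper end of the 1 to 4 day range cited from the CDC


def analysis_start(outbreak=OUTBREAK_START, lookback_days=MAX_INCUBATION_DAYS):
    """Return the first day of the analysis window."""
    return outbreak - timedelta(days=lookback_days)


def daily_diagnosis_counts(records, start):
    """Count messages per day that mention a Diagnosis key word.

    `records` is an iterable of (day, text) pairs, where day is a datetime.date.
    Messages dated before `start` are ignored.
    """
    counts = Counter()
    for day, text in records:
        lowered = text.lower()
        if day >= start and any(key in lowered for key in DIAGNOSIS_KEYWORDS):
            counts[day] += 1
    return counts
```

With the defaults, analysis_start() gives May 14th, 2011, which matches the start date stated above.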
