Data Preparation Process

From Visual Analytics and Applications
Revision as of 23:29, 15 October 2017 by Chen.zhou.2016 (talk | contribs)

1.jpg ISSS608 Visual Analytics and Applications Assign ZHOU CHEN


 


Raw data processing

1. Exclude rows with missing values

5-1.png
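Outside JMP, the same exclusion can be sketched in a few lines of Python. This is only an illustration: the sample rows and the column names (ID, Created_at, Location, text) are hypothetical and are not taken from the assignment data.

```python
import csv
import io

# Hypothetical sample; the column names and values are assumptions,
# not the real schema of the assignment data.
SAMPLE = """ID,Created_at,Location,text
1,2000-01-01 13:26,42.25 93.5,feeling fine
2,2000-01-01 13:27,,row with a missing location
3,,42.25 93.4,row with a missing timestamp
"""


def drop_incomplete_rows(rows):
    """Keep only rows in which every field is non-empty after stripping."""
    return [r for r in rows if all((v or "").strip() for v in r.values())]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
complete = drop_incomplete_rows(rows)
print([r["ID"] for r in complete])
```

Only the first sample row survives, because the other two each have one empty field.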


2. Split location column

  • Split the location column into two columns, named N and W
  • Recode the values in column W as negative numbers, to match the orientation of the given map
5-2.png
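A minimal sketch of the split and recode, assuming the location is stored as two numbers separated by whitespace (north first, then west). That storage format and the function name split_location are assumptions made for illustration.

```python
def split_location(location):
    """Split 'north west' text into (N, W), with W recoded as negative.

    Assumes two whitespace-separated numbers; this format is an assumption.
    """
    north_text, west_text = location.split()
    n = float(north_text)
    # -abs() keeps the recode idempotent: a value that is already
    # negative stays negative instead of flipping back to positive.
    w = -abs(float(west_text))
    return n, w


print(split_location("42.25 93.5"))
```

With W negative, plotting W on the x-axis and N on the y-axis places points that lie further west further to the left, as on a conventional map.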


3. Label the area zone of each data point

  • Use Graph Builder to plot all the points on the map: choose the map as the background in Graph Builder and select N as the y-axis and W as the x-axis
  • Use the Lasso tool to select the points zone by zone, and use the Magnifier tool to double-check the points located on zone borders
  • Label the selected rows with their zone
5-3.png


5-4.png
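The lasso selection is a manual step. A programmatic analogue, which is not what was done here, is a point-in-polygon test using ray casting. The sketch below assumes each zone is available as a list of (W, N) vertices; the two rectangular zones and their coordinates are hypothetical and do not correspond to the real map.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # The edge matters only if it straddles the horizontal line through y.
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def label_zone(w, n, zones):
    """Return the name of the first zone containing (w, n), or None."""
    for name, polygon in zones.items():
        if point_in_polygon(w, n, polygon):
            return name
    return None


# Hypothetical zones given as (W, N) vertices; not the real zone boundaries.
ZONES = {
    "ZoneA": [(-93.5, 42.2), (-93.4, 42.2), (-93.4, 42.3), (-93.5, 42.3)],
    "ZoneB": [(-93.4, 42.2), (-93.3, 42.2), (-93.3, 42.3), (-93.4, 42.3)],
}

print(label_zone(-93.45, 42.25, ZONES))
```

Points that fall exactly on a shared border are assigned somewhat arbitrarily by this test, so they would still need the kind of manual check described above.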


Text mining

The data set consists of microblog messages sent by everyone on any topic, whereas this assignment is interested only in messages related to flu. Hence, it is important to extract the messages that relate to the flu topic.

  • List all the words related to flu symptoms: fever, chills, sweats, aches and pains, fatigue, coughing, breathing difficulty, nausea and vomiting, diarrhea, lymph, flu, killingme, sick, ill, die, death
  • Use Text Explorer to extract the messages that contain each word and label them
5-5.png


5-6.png
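As a rough equivalent of this keyword labelling outside Text Explorer, the sketch below matches each listed term as a whole word or phrase, ignoring case. The whole-word matching rule is an assumption; Text Explorer may tokenize or stem words differently, so the results need not be identical.

```python
import re

# Terms copied from the list above.
FLU_TERMS = [
    "fever", "chills", "sweats", "aches and pains", "fatigue", "coughing",
    "breathing difficulty", "nausea and vomiting", "diarrhea", "lymph",
    "flu", "killingme", "sick", "ill", "die", "death",
]


def compile_term(term):
    """Whole-word, case-insensitive pattern for one term or phrase."""
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)


PATTERNS = {term: compile_term(term) for term in FLU_TERMS}


def matched_terms(message):
    """Return the listed terms found in the message, in list order."""
    return [term for term, pattern in PATTERNS.items() if pattern.search(message)]


print(matched_terms("I have a fever and feel sick"))
```

The word boundaries keep a short term such as "ill" from matching inside unrelated words like "still".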


Generate target table

In some cases people mention certain symptoms without being ill themselves; they may simply be talking about others who have those symptoms. To prevent such false positives, a stricter filter for people with flu-like illness is needed.

  • Count the number of the words extracted above that appear in each message, and store it in a column named Symptoms
  • Count the number of the more representative words (fever, chills, sweats, aches and pains, fatigue, coughing, breathing difficulty, nausea and vomiting, diarrhea, lymph) in each message, and store it in a column named Symp_Details. This is needed because some words counted in the first step (die, etc.) also occur in other topics; this second category is more specific.
5-7.png
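The two counts can be sketched as follows. The sketch assumes that each term is counted at most once per message (distinct terms rather than repeated occurrences) and that terms match as whole words; both choices are assumptions, since the exact counting rule is not stated here.

```python
import re

# The more representative symptom terms (used for Symp_Details).
DETAIL_TERMS = [
    "fever", "chills", "sweats", "aches and pains", "fatigue", "coughing",
    "breathing difficulty", "nausea and vomiting", "diarrhea", "lymph",
]
# The remaining, more general terms from the full word list.
GENERAL_TERMS = ["flu", "killingme", "sick", "ill", "die", "death"]
ALL_TERMS = DETAIL_TERMS + GENERAL_TERMS


def count_terms(message, terms):
    """Number of distinct terms that occur as whole words in the message."""
    return sum(
        1
        for term in terms
        if re.search(r"\b" + re.escape(term) + r"\b", message, re.IGNORECASE)
    )


def symptom_counts(message):
    """Return the Symptoms and Symp_Details counts for one message."""
    return {
        "Symptoms": count_terms(message, ALL_TERMS),
        "Symp_Details": count_terms(message, DETAIL_TERMS),
    }


print(symptom_counts("fever and chills, I could die"))
```

A message such as "my boss makes me sick" gets a Symptoms count of 1 but a Symp_Details count of 0, which is the kind of case the second column is meant to expose.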


  • Label each ID that is more likely to be really ill as 1, and every other ID as 0, using the following formula
5-8.png


5-9.png


  • Select the people with a high likelihood of illness and find the date, time and location of the first symptom-related message sent by each of them
5-10.png
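The selection of each person's first symptom-related message can be sketched as below. The set of flagged IDs is taken as given, because the labelling formula appears only in the screenshot above and is not reproduced here. The record fields (id, time, n, w, symptoms) and the sample values are hypothetical.

```python
from datetime import datetime


def first_symptom_messages(messages, ill_ids):
    """Earliest symptom-related message for each flagged ID."""
    first = {}
    for msg in messages:
        # Skip people who are not flagged and messages without symptom words.
        if msg["id"] not in ill_ids or msg["symptoms"] == 0:
            continue
        current = first.get(msg["id"])
        if current is None or msg["time"] < current["time"]:
            first[msg["id"]] = msg
    return first


# Hypothetical records; field names and values are for illustration only.
MESSAGES = [
    {"id": 7, "time": datetime(2000, 1, 1, 9, 30), "n": 42.25, "w": -93.45, "symptoms": 0},
    {"id": 7, "time": datetime(2000, 1, 2, 8, 0), "n": 42.26, "w": -93.44, "symptoms": 2},
    {"id": 7, "time": datetime(2000, 1, 1, 21, 15), "n": 42.24, "w": -93.46, "symptoms": 1},
    {"id": 9, "time": datetime(2000, 1, 1, 10, 0), "n": 42.21, "w": -93.39, "symptoms": 1},
]
ILL_IDS = {7}  # assumed to come from the 0/1 label computed earlier

FIRST = first_symptom_messages(MESSAGES, ILL_IDS)
print(FIRST[7]["time"], FIRST[7]["n"], FIRST[7]["w"])
```

The result holds one row per flagged ID, carrying the date, time and location of that person's first symptom-related message.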


This filter significantly reduces the number of IDs, leaving 17892 rows. In exchange, it increases accuracy and gives much stronger assurance that the IDs in the table are truly ill. With this new table, it is easier to visualize the data and to find the pattern of the flu more accurately.