Difference between revisions of "ISSS608 2017-18 T1 Assign WU YUQING Data Preparation"

From Visual Analytics and Applications

Revision as of 15:14, 14 October 2017

Yqwu pic.jpg   Vast Challenge 2011 MC1: Characterization of an Epidemic Spread


Data preparation

Yqwu datapre.png

1. Split the column ‘Location’ into two separate columns, ‘Latitude’ and ‘Longitude’. Because Smartpolis lies in the Northern and Western Hemispheres, the latitude should be positive and the longitude negative.
2. Split the column ‘Created_at’ into two separate columns, ‘Date’ and ‘Time’, so that they can be used separately in the following steps.
3. Check the missing-data pattern of ‘Date’ and ‘Time’. The results show 21 rows with missing values in ‘Date’ and ‘Time’, as shown below. The details of these 21 rows show that the posting time in ‘Created_at’ was lost and the field holds text instead.
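For readers working outside JMP, steps 1 to 3 can be sketched in plain Python. This is only an illustration under stated assumptions: the sample rows are invented, ‘Location’ is assumed to hold "latitude longitude" separated by a space, and ‘Created_at’ is assumed to follow a month/day/year hour:minute layout. None of these formats is confirmed by the text above.

```python
from datetime import datetime

# Hypothetical sample rows; the value formats are assumptions, not the real data.
rows = [
    {"Created_at": "5/18/2011 13:26", "Location": "42.22717 93.33772"},
    {"Created_at": "feeling awful today", "Location": "42.30000 93.50000"},
]

def split_location(location):
    """Split 'lat lon' and force a positive latitude and a negative longitude."""
    lat_text, lon_text = location.split()
    return abs(float(lat_text)), -abs(float(lon_text))

def split_created_at(created_at, fmt="%m/%d/%Y %H:%M"):
    """Return (date, time), or (None, None) when the value is not a timestamp."""
    try:
        stamp = datetime.strptime(created_at, fmt)
    except ValueError:
        return None, None
    return stamp.date(), stamp.time()

for row in rows:
    row["Latitude"], row["Longitude"] = split_location(row["Location"])
    row["Date"], row["Time"] = split_created_at(row["Created_at"])

# Step 3: rows whose 'Created_at' could not be parsed have missing Date and Time.
missing = [row for row in rows if row["Date"] is None]
print(len(missing), "row(s) with missing Date/Time")
```

Returning `None` for unparseable values, rather than raising, keeps the problem rows in the table so their pattern can be inspected before deciding what to do with them.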


4. Exclude the 21 rows mentioned above that lack a posting time. Because these rows are so few, excluding them from the dataset should not affect the analysis results.
5. Create a new column ‘Flu’, using the following formula, to label the rows whose microblog contains the word ‘flu’, so that flu-related microblogs can be checked and detected quickly later.
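A rough Python equivalent of steps 4 and 5 is sketched below. The sample rows and the ‘text’ column name are hypothetical, and the matching rule (a case-insensitive substring test) is an assumption; the JMP formula may match differently.

```python
# Hypothetical sample rows; 'text' is an assumed name for the message column.
rows = [
    {"Date": "2011-05-19", "text": "Caught the flu, staying home"},
    {"Date": None, "text": "no timestamp on this one"},
    {"Date": "2011-05-20", "text": "Great concert tonight"},
]

# Step 4: drop the rows whose posting time is missing.
kept = [row for row in rows if row["Date"] is not None]

# Step 5: flag messages that mention 'flu' (assumed: case-insensitive substring).
for row in kept:
    row["Flu"] = 1 if "flu" in row["text"].lower() else 0

print([row["Flu"] for row in kept])
```

A substring test also matches longer words such as "influenza", which may or may not be desirable for a quick first check.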

6. Subset the records with ‘Flu’=1, which are the messages with ‘flu’ in the text (8727 rows), and plot the distribution of these records by date. The distribution below shows that the messages with ‘flu’ increase significantly on the last two days (May 19th and May 20th). Thus, before further analysis, we can tentatively assume that the epidemic broke out in the last few days of the period.
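The per-day tally behind such a distribution can be sketched as follows. The rows are invented for illustration and do not reproduce the real counts.

```python
from collections import Counter

# Hypothetical rows, already carrying the 'Date' and 'Flu' columns.
rows = [
    {"Date": "2011-05-18", "Flu": 1},
    {"Date": "2011-05-19", "Flu": 1},
    {"Date": "2011-05-19", "Flu": 0},
    {"Date": "2011-05-19", "Flu": 1},
    {"Date": "2011-05-20", "Flu": 1},
    {"Date": "2011-05-20", "Flu": 1},
    {"Date": "2011-05-20", "Flu": 1},
]

# Count only the flagged messages, grouped by date.
flu_per_day = Counter(row["Date"] for row in rows if row["Flu"] == 1)

# Sort chronologically (ISO dates sort correctly as strings).
daily = sorted(flu_per_day.items())
for day, count in daily:
    print(day, count)
```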

7. Perform text mining to find the flu-related key words. From the previous distribution, we assume that the epidemic broke out in the last few days. Thus, we can use Text Explorer in JMP Pro to identify the flu-related key words in the messages of May 19th and May 20th.
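The idea of ranking unigrams and bigrams by frequency can be illustrated with a small Python sketch. It is not a reimplementation of Text Explorer: the tokenizer is a simple assumption (lowercase runs of letters and apostrophes), there is no stemming or stop-word handling, and the messages are invented.

```python
import re
from collections import Counter

# Hypothetical messages standing in for the May 19th and 20th microblogs.
messages = [
    "I have a runny nose and a fever",
    "fever and chills all night",
    "runny nose again",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

unigrams = Counter()
bigrams = Counter()
for message in messages:
    tokens = tokenize(message)
    unigrams.update(tokens)
    # Pair each token with its successor to form bigrams within one message.
    bigrams.update(zip(tokens, tokens[1:]))

print(unigrams.most_common(5))
print(bigrams.most_common(5))
```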


Based on the most frequent unigrams and bigrams, combined with the introductory materials and an exploration of the messages, the following unigrams and bigrams were chosen as the flu-related key words: ‘headache’, ‘die’, ‘medicine’, ‘death’, ‘breath’, ‘stomach’, ‘pneumonia’, ‘diarrhea’, ‘cough’, ‘flu’, ‘fever’, ‘chill’, ‘sweat’, ‘ache’, ‘pain’, ‘fatigue’, ‘nausea’, ‘vomit’, ‘lymph’, ‘sick’, ‘throat sore’, ‘runny nose’, ‘muscle’, ‘sleep’, ‘ill’. These key words will be used to recognize the flu-related messages in the following steps.
8. Create a new column ‘No. Keyword’ to count the number of flu-related key words mentioned above in each microblog. The following formula is used to count the number.
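One way to approximate the count in step 8 outside JMP is sketched below. The matching rule is an assumption: a key word counts once per message if some word in the message starts with it, so ‘cough’ also matches "coughing" while ‘ache’ is not counted again inside "headache". The JMP formula may count differently.

```python
import re

KEYWORDS = [
    "headache", "die", "medicine", "death", "breath", "stomach", "pneumonia",
    "diarrhea", "cough", "flu", "fever", "chill", "sweat", "ache", "pain",
    "fatigue", "nausea", "vomit", "lymph", "sick", "throat sore", "runny nose",
    "muscle", "sleep", "ill",
]

# Assumed rule: a key word must begin at a word boundary, ignoring case.
PATTERNS = [re.compile(r"\b" + re.escape(word), re.IGNORECASE) for word in KEYWORDS]

def count_keywords(text):
    """Return how many distinct key words occur in the message."""
    return sum(1 for pattern in PATTERNS if pattern.search(text))

print(count_keywords("Fever and a bad cough, headache too"))
print(count_keywords("Lovely weather today"))
```

Word-start matching is still loose (for example, ‘die’ would match "diet"), so the resulting counts are best treated as a screening aid rather than an exact measure.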

9. Create a new column ‘Report_illness’, a binary variable that labels the microblogs containing at least two key words, using the following formula.
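The threshold rule in step 9 amounts to a single comparison. A minimal sketch, with invented counts:

```python
# Hypothetical rows that already carry the 'No. Keyword' count from step 8.
rows = [
    {"No. Keyword": 0},
    {"No. Keyword": 1},
    {"No. Keyword": 3},
]

MIN_KEYWORDS = 2  # "at least two key words", as stated in step 9

for row in rows:
    row["Report_illness"] = 1 if row["No. Keyword"] >= MIN_KEYWORDS else 0

print([row["Report_illness"] for row in rows])
```

Requiring two key words rather than one trades some recall for fewer false positives from messages that mention a single ambiguous word such as ‘sleep’ or ‘die’.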

10. Subset the records with ‘Report_illness’=1 and save them as a new dataset, ‘preprocess.txt’, which holds all the flu-related microblogs (35744 rows). The sample data from ‘preprocess.txt’ is as follows.
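The final subset-and-export step could look like the following in Python. The tab delimiter, the ‘ID’ and ‘text’ column names, and the sample rows are assumptions; the sketch writes to an in-memory buffer so that it runs without touching the disk.

```python
import csv
import io

# Hypothetical rows; the column names other than 'Report_illness' are assumed.
rows = [
    {"ID": "1", "text": "fever and cough", "Report_illness": 1},
    {"ID": "2", "text": "nice day", "Report_illness": 0},
]

# Keep only the messages flagged as reporting illness.
flu_related = [row for row in rows if row["Report_illness"] == 1]

def write_tab_delimited(records, handle):
    """Write the records with a header row, one tab-separated line each."""
    writer = csv.DictWriter(
        handle,
        fieldnames=list(records[0].keys()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)

buffer = io.StringIO()
write_tab_delimited(flu_related, buffer)
output = buffer.getvalue()
print(output)
```

To produce an actual file, the same function can be given a handle from `open("preprocess.txt", "w", newline="")` instead of the buffer.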