Difference between revisions of "ISSS608 2017-18 T1 Assign WU YUQING Data Preparation"

From Visual Analytics and Applications

Latest revision as of 13:43, 15 October 2017

VAST Challenge 2011 MC1: Characterization of an Epidemic Spread


Data preparation  Tool: JMP Pro

Yqwu datapre.png

1. Split the column ‘Location’ into two separate columns, ‘Latitude’ and ‘Longitude’. Because Smartpolis lies in the Northern and Western Hemispheres, the latitude should be a positive value and the longitude a negative value.

2. Split the column ‘Created_at’ into two separate columns ‘Date’ and ‘Time’ so that they can be used separately in the following steps.
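
Steps 1 and 2 were carried out in JMP Pro. As a rough equivalent, the Python sketch below shows the same two splits. The input formats are assumptions, not taken from the source: ‘Location’ is assumed to hold two unsigned numbers separated by whitespace, and ‘Created_at’ is assumed to look like `5/18/2011 13:26`. The function names are hypothetical.

```python
from datetime import datetime


def split_location(location):
    """Split a 'lat lon' string into signed (latitude, longitude)."""
    lat_text, lon_text = location.split()
    latitude = abs(float(lat_text))     # Northern Hemisphere: positive
    longitude = -abs(float(lon_text))   # Western Hemisphere: negative
    return latitude, longitude


def split_created_at(created_at, fmt="%m/%d/%Y %H:%M"):
    """Split a timestamp string into (date, time).

    Returns (None, None) when the value does not match the assumed
    format, for example when the time part is missing.
    """
    try:
        stamp = datetime.strptime(created_at.strip(), fmt)
    except ValueError:
        return None, None
    return stamp.date(), stamp.time()
```

Values that fail to parse come back as missing in both ‘Date’ and ‘Time’, which is the pattern examined in step 3.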

3. Check the missing-data pattern of ‘Date’ and ‘Time’. The results show 21 rows with missing values in ‘Date’ and ‘Time’, as shown below. The details of these 21 rows show that the time of these messages in the column ‘Created_at’ is lost and the value is shown as text.
Yqwu missing pattern.png   Yqwu 21rows.png

4. Exclude the 21 rows mentioned above, which lack the posting time of the microblog. Because so few rows are affected, excluding them from the dataset should not materially affect the analysis results.
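
A minimal sketch of steps 3 and 4 in Python, assuming each row is a dictionary whose ‘Date’ and ‘Time’ entries are `None` when missing (the row layout and the function name are illustrative assumptions):

```python
def drop_rows_missing_time(rows):
    """Return (kept_rows, dropped_count) after excluding rows that
    lack a 'Date' or a 'Time' value."""
    kept = [
        row for row in rows
        if row.get("Date") is not None and row.get("Time") is not None
    ]
    return kept, len(rows) - len(kept)
```

Reporting the number of dropped rows alongside the filtered data makes it easy to confirm that only the expected handful of rows is excluded.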

5. Create a new column ‘Flu’ using the following formula to flag the rows whose microblog text contains the word ‘flu’, so that flu-related microblogs can be checked and detected quickly later.
Yqwu flu formula.png
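
The JMP formula itself is only available as an image, so the following Python sketch is an approximation rather than a transcription. It assumes a case-insensitive substring test, which also matches words such as ‘influenza’; the function name is hypothetical.

```python
def flu_flag(text):
    """Return 1 if the message contains 'flu' (case-insensitive), else 0.

    Substring matching is an assumption: it also flags words that
    merely contain 'flu', such as 'influenza' or 'fluent'.
    """
    return 1 if "flu" in text.lower() else 0
```

If stricter matching is wanted, a word-boundary regular expression could replace the substring test, at the cost of missing variants such as ‘flu-like’ only if the pattern is written too narrowly.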

6. Subset the records with ‘Flu’=1, i.e. the messages with ‘flu’ in the text (8727 rows), and plot the distribution of these records by Date. The distribution below shows that the messages with ‘flu’ increase sharply on the last two days (May 19th and May 20th). Before further analysis, we can therefore tentatively assume that the epidemic broke out in the last few days of the period.
Yqwu lu.png
Distribution of Messages with 'flu'
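
The distribution in step 6 amounts to a count of flagged messages per day. A minimal Python sketch, assuming rows are dictionaries with a sortable ‘Date’ value and the 0/1 ‘Flu’ flag from step 5 (the function name is hypothetical):

```python
from collections import Counter


def flu_counts_by_date(rows):
    """Count the messages flagged with Flu == 1 for each date,
    returned as a dict ordered by date."""
    counts = Counter(row["Date"] for row in rows if row["Flu"] == 1)
    return dict(sorted(counts.items()))
```

Plotting these counts as a bar chart gives the same kind of view as the JMP distribution above.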

7. Perform text mining to identify the flu-related key words. Based on the previous distribution, we assume that the epidemic broke out in the last few days. We can therefore use the Text Explorer in JMP Pro to identify the flu-related key words in the messages of May 19th and May 20th. To make the word clouds easier to read, some very frequent words and symbols such as ‘!’, ‘good’, ‘come’, ‘go’, ‘just’, ‘now’ and ‘today’ can be manually added to the stop word list.
WordCloud 0519.png
WordCloud of May 19th
Yqwu WordCloud 0520.png
WordCloud of May 20th

Based on the most frequent unigrams and bigrams, together with the introduction materials and an exploration of the messages, the following unigrams and bigrams are selected as the flu-related key words:
‘headache’, ‘die’, ‘medicine’, ‘death’, ‘breath’, ‘stomach’, ‘pneumonia’, ‘diarrhea’, ‘cough’, ‘flu’, ‘fever’, ‘chill’, ‘sweat’, ‘ache’, ‘pain’, ‘fatigue’, ‘nausea’, ‘vomit’, ‘lymph’, ‘sick’, ‘throat sore’, ‘runny nose’, ‘muscle’, ‘sleep’, ‘ill’
These key words will be used to detect the flu-related messages in the following steps.
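
The unigram and bigram counts in step 7 come from the JMP Pro Text Explorer. The Python sketch below is a simplified stand-in and not a reproduction of that tool: the tokenizer (lowercase letters and apostrophes only, which also drops symbols such as ‘!’) and the function name are assumptions, while the stop words are the ones listed in step 7.

```python
import re
from collections import Counter

# Stop words named in step 7; punctuation is removed by the tokenizer.
STOP_WORDS = {"good", "come", "go", "just", "now", "today"}


def term_frequencies(messages, stop_words=STOP_WORDS):
    """Count unigrams and bigrams across messages after stop-word removal.

    Bigrams are built from the filtered token list, so two words that
    were separated only by a stop word are treated as adjacent.
    """
    unigrams, bigrams = Counter(), Counter()
    for message in messages:
        tokens = [
            token for token in re.findall(r"[a-z']+", message.lower())
            if token not in stop_words
        ]
        unigrams.update(tokens)
        bigrams.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return unigrams, bigrams
```

Calling `unigrams.most_common(20)` on the result lists the most frequent terms, which can then be screened by hand in the same way as the word clouds.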

8. Create a new column ‘No. Keyword’ that counts the flu-related key words listed above in each microblog. The following formula is used to count the keywords in each message.
Yqwu No.keys formula.png

9. Create a new binary column ‘Report_illness’ that flags the microblogs containing at least two key words in the text, using the following formula.
Yqwu report illness formula.png
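
Because the formulas for steps 8 and 9 are only shown as images, the Python sketch below is one possible reading of them, not the author's exact logic. It assumes that ‘No. Keyword’ counts the distinct keywords present in a message and that a keyword matches at the start of a word (so ‘chills’ matches ‘chill’, but ‘will’ does not match ‘ill’). The keyword list is taken from step 7; the function names are hypothetical.

```python
import re

KEYWORDS = [
    "headache", "die", "medicine", "death", "breath", "stomach",
    "pneumonia", "diarrhea", "cough", "flu", "fever", "chill", "sweat",
    "ache", "pain", "fatigue", "nausea", "vomit", "lymph", "sick",
    "throat sore", "runny nose", "muscle", "sleep", "ill",
]

# Assumption: a keyword must begin at a word boundary to count.
PATTERNS = [re.compile(r"\b" + re.escape(keyword)) for keyword in KEYWORDS]


def keyword_count(text):
    """Number of distinct flu-related keywords found in the message."""
    lowered = text.lower()
    return sum(1 for pattern in PATTERNS if pattern.search(lowered))


def report_illness(text, threshold=2):
    """Return 1 if the message contains at least `threshold` keywords."""
    return 1 if keyword_count(text) >= threshold else 0
```

The matching rule matters: plain substring matching would count ‘ill’ inside ‘chill’ and ‘ache’ inside ‘headache’, which would push single-keyword messages over the threshold of two.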

10. Subset the records with ‘Report_illness’=1 and save them as a new dataset, ‘preprocess.txt’, which represents all the flu-related microblogs (35744 rows). A sample of the data in ‘preprocess.txt’ is shown below.
Yqwu preprocess sample.png