Difference between revisions of "ISSS608 2017-18 T1 Assign WU YUQING Data Preparation"

From Visual Analytics and Applications
 
(14 intermediate revisions by the same user not shown)
Line 2: Line 2:
  
 
[[Image:yqwu_pic.jpg|150px]]  
 
[[Image:yqwu_pic.jpg|150px]]  
<font size =3; color="#FFFFFF"><b>&nbsp;&nbsp;Vast Challenge 2011 MC1:  Characterization of an Epidemic Spread</b></font>
+
<font size =3; color="#FFFFFF"><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Vast Challenge 2011 MC1:  Characterization of an Epidemic Spread</b></font>
 
</div>
 
</div>
 
<!--MAIN HEADER -->
 
<!--MAIN HEADER -->
Line 12: Line 12:
 
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#2B3856; text-align:center;" width="20%" |  
 
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#2B3856; text-align:center;" width="20%" |  
 
;
 
;
[[ISSS608 2017-18 T1 Assign WU YUQING Data Descrption| <b><font color="#FFFFFF">Data Description</font></b>]]
+
[[ISSS608 2017-18 T1 Assign WU YUQING Data Description| <b><font color="#FFFFFF">Data Description</font></b>]]
  
 
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#2B3856; text-align:center;" width="20%" |  
 
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#2B3856; text-align:center;" width="20%" |  
 
;
 
;
[[<font size=2>ISSS608 2017-18 T1 Assign WU YUQING Data Preparation</font>| <b><font color="#FFFFFF">Data Preparation</font></font></b>]]
+
[[ISSS608 2017-18 T1 Assign WU YUQING Data Preparation| <b><font color="#FFFFFF">Data Preparation</font></b>]]
  
 
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#2B3856; text-align:right;" width="20%" |  
 
| style="font-family:Century Gothic; font-size:100%; solid #1B338F; background:#2B3856; text-align:right;" width="20%" |  
Line 28: Line 28:
 
|  &nbsp;
 
|  &nbsp;
 
|}
 
|}
<font size="5"><font color="#8B4513">'''Data preparation'''</font></font><br>
+
<font size="5"><font color="#8B4513">'''Data preparation'''</font></font>&nbsp; <font size="3"><font color="#008000">'''Tool: JMP Pro'''</font></font><br>
 
[[Image:yqwu_datapre.png|right|300px]]
 
[[Image:yqwu_datapre.png|right|300px]]
 
1. Split the column ‘Location’ into two separate columns ‘Latitude’ and ‘Longitude’. As Smartpolis is located in the Western Hemisphere and northern latitude, the separate latitude and longitude should be positive and negative value respectively.<br>
 
1. Split the column ‘Location’ into two separate columns ‘Latitude’ and ‘Longitude’. As Smartpolis is located in the Western Hemisphere and northern latitude, the separate latitude and longitude should be positive and negative value respectively.<br>
2. Split the column ‘Created_at’ into two separate columns ‘Date’ and ‘Time’ so that they can be used separately in the following steps.
+
 
3. Check the missing data pattern of ‘Date’ and ‘Time’. From the results, we can see that there are 21 rows with missing values in ‘Date’ and ‘Time’ as shown below. From the details of these 21 rows, we can see that the time of these messages in the column ‘Created_at’ are lost and shown as text.
+
2. Split the column ‘Created_at’ into two separate columns ‘Date’ and ‘Time’ so that they can be used separately in the following steps.<br>
 +
 
 +
3. Check the missing data pattern of ‘Date’ and ‘Time’. From the results, we can see that there are 21 rows with missing values in ‘Date’ and ‘Time’ as shown below. From the details of these 21 rows, we can see that the time of these messages in the column ‘Created_at’ are lost and shown as text.<br>
 +
[[Image:yqwu_missing_pattern.png|border|320px]]&nbsp;&nbsp;      [[Image:yqwu_21rows.png|border|400px]]<br>
 
   
 
   
 +
4. Exclude these 21 rows without the time of posted microblogs mentioned above. As the number of these rows is small, it should be reasonable to exclude them from the dataset, which will not affect the analysis result.<br>
 +
 +
5. Create a new column ‘Flu’ using the following formula to label the rows with word ‘flu’ in the microblog so that we can have a quick check and detection of the flu-related microblogs later.<br>
 +
[[Image:yqwu_flu_formula.png|border|250px]]<br>
 
   
 
   
4. Exclude these 21 rows without the time of posted microblogs mentioned above. As the number of these rows is small, it should be reasonable to exclude them from the dataset, which will not affect the analysis result.
+
6. Subset the records with ‘Flu’=1, which are the messages with ‘flu’ in the text (8727 rows) and plot these record’s distribution by Date. The distribution below shows that the messages with ‘flu’ increase significantly on the last two days (May 19th and May 20th). Thus, before the further analysis, we can tentatively assume that the epidemic outbroke in the last few days of the period.<br>
5. Create a new column ‘Flu’ using the following formula to label the rows with word ‘flu’ in the microblog so that we can have a quick check and detection of the flu-related microblogs later.
+
[[Image:yqwu_lu.png|border|350px]]<br>
+
''<font size=2><b>Distribution of Messages with 'flu'</b></font>''
6. Subset the records with ‘Flu’=1, which are the messages with ‘flu’ in the text (8727 rows) and plot these record’s distribution by Date. The distribution below shows that the messages with ‘flu’ increase significantly on the last two days (May 19th and May 20th). Thus, before the further analysis, we can tentatively assume that the epidemic outbroke in the last few days of the period.
+
 
+
7. Perform the text mining to find out the flu-related key words. From the previous distribution, we assume that the epidemic outbroke on the last few days. Thus, we can use the Text Explorer in the JMP Pro to recognize the flu-related key words from the messages of May 19th and May 20th. In order to achieve a better visual performance, some very frequent words and symbols such as ‘!’, ‘good’, ’come’, ’go’, ‘just’, ‘now’ and ‘today’ can be manually added into the stop word lists.<br>
7. Perform the text mining to find out the flu-related key words. From the previous distribution, we assume that the epidemic outbroke on the last few days. Thus, we can use the Text Explorer in the JMP Pro to recognize the flu-related key words from the messages of May 19th and May 20th.  
+
[[Image:WordCloud_0519.png|border|350px]]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [[Image:Yqwu_WordCloud_0520.png|border|350px]]<br>
   
+
''<font size=2><b>WordCloud of May 19th</b></font>''&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
 +
''<font size=2><b>WordCloud of May 20th</b></font>''<br>
 +
 
 +
Based on the top frequent unigrams and bigrams and combining the introduction materials and exploration of the messages, the following unigrams and bigrams are determined to be used as the flu-related key words:<br>
 +
‘headache’, 'die', 'medicine', 'death', 'breath', 'stomach', 'pneumonia', 'diarrhea', 'cough', 'flu', 'fever', 'chill', 'sweat', 'ache', ’pain', 'fatigue', 'nausea', 'vomit', 'lymph', 'sick', 'throat sore', 'runny nose', 'muscle', ‘sleep', 'ill'<br>
 +
These key words will be used to detect the flu-related messages in the following steps.<br>
  
 +
8. Create a new column ‘No. Keyword’ to count the number of flu-related key words mentioned above in each microblog. The following formula is used to count the number of every keyword in each message.<br>
 +
[[Image:yqwu_No.keys formula.png|border|300px]]<br>
 
   
 
   
From the top frequent unigrams and bigrams and combination of the introduction materials and exploration of the message, the following unigrams and bigrams are determined to be used as the flu-related key words:
+
9. Create a new column ‘Report_illness’, which is a binary variable to label the microblogs with at least two key words in the text using the following formula.<br>
‘headache’, 'die', 'medicine', 'death', 'breath', 'stomach', 'pneumonia', 'diarrhea', 'cough', 'flu', 'fever', 'chill', 'sweat', 'ache', ’pain', 'fatigue', 'nausea', 'vomit', 'lymph', 'sick', 'throat sore', 'runny nose', 'muscle', ‘sleep', 'ill'
+
[[Image:yqwu_report_illness_formula.png|border|150px]]<br>
These key words will be used to recognize the flu-related messages in the following steps.
 
8. Create a new column ‘No. Keyword’ to count the number of flu-related key words mentioned above in each microblog. The following formula is used to count the number.
 
 
9. Create a new column ‘Report_illness’, which is a binary variable to label the microblogs with at least two key words in the text using the following formula.
 
 
   
 
   
10. Subset the records with ‘Report_illness=1’ and create a new dataset ‘preprocess.txt’ to stand for all the flu-related microblogs (35744 rows). The sample data from ‘preprocess.txt’ is as follows.
+
10. Subset the records with ‘Report_illness=1’ and create a new dataset ‘preprocess.txt’ to stand for all the flu-related microblogs (35744 rows). The sample data from ‘preprocess.txt’ is as follows.<br>
 +
[[Image:yqwu_preprocess_sample.png|border|600px]]<br>

Latest revision as of 13:43, 15 October 2017

Vast Challenge 2011 MC1: Characterization of an Epidemic Spread

Data preparation  Tool: JMP Pro

Yqwu datapre.png

1. Split the column ‘Location’ into two separate columns, ‘Latitude’ and ‘Longitude’. Because Smartpolis lies in the Northern and Western Hemispheres, the latitude should be positive and the longitude negative.
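
As an illustration only, the split in step 1 could be sketched in Python as below. The original work was done in JMP Pro, so this is not the author's formula. It assumes that ‘Location’ holds two whitespace-separated numbers, latitude first; the function name split_location is hypothetical.

```python
from typing import Tuple


def split_location(location: str) -> Tuple[float, float]:
    """Split an assumed 'lat lon' string into signed (latitude, longitude)."""
    lat_text, lon_text = location.split()
    latitude = abs(float(lat_text))    # northern latitude: force positive
    longitude = -abs(float(lon_text))  # western longitude: force negative
    return latitude, longitude


print(split_location("42.22717 93.33772"))  # (42.22717, -93.33772)
```

Taking the absolute value first means the result has the right sign whether or not the raw longitude already carries a minus sign.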

2. Split the column ‘Created_at’ into two separate columns ‘Date’ and ‘Time’ so that they can be used separately in the following steps.
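
A minimal Python sketch of step 2 follows, offered as an illustration rather than the JMP Pro procedure actually used. The timestamp layout in CREATED_AT_FORMAT is an assumption about how ‘Created_at’ is stored, and split_created_at is a hypothetical helper name.

```python
from datetime import datetime
from typing import Optional, Tuple

# Assumed layout of 'Created_at', e.g. "5/18/2011 13:26".
CREATED_AT_FORMAT = "%m/%d/%Y %H:%M"


def split_created_at(created_at: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (date, time) strings, or (None, None) if the value is not a timestamp."""
    try:
        stamp = datetime.strptime(created_at.strip(), CREATED_AT_FORMAT)
    except ValueError:
        return None, None  # treated as missing in both new columns
    return stamp.date().isoformat(), stamp.strftime("%H:%M")


print(split_created_at("5/18/2011 13:26"))
print(split_created_at("some text"))
```

Returning None for both parts when parsing fails keeps unparseable rows visible as missing values instead of raising an error partway through the table.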

3. Check the missing-data pattern of ‘Date’ and ‘Time’. The results show 21 rows with missing values in both columns, as shown below. Inspecting these 21 rows shows that their ‘Created_at’ values contain text instead of a timestamp, so the posting time is lost.
Yqwu missing pattern.png   Yqwu 21rows.png

4. Exclude the 21 rows that lack a posting time. Because they are few, excluding them is reasonable and should not materially affect the analysis result.
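
Steps 3 and 4 could be sketched as follows. This is an illustrative Python equivalent, not the JMP Pro missing-data tool. The timestamp layout, the ‘ID’ column and the function names are assumptions introduced for the example.

```python
from datetime import datetime

# Assumed layout of 'Created_at', e.g. "5/18/2011 13:26".
CREATED_AT_FORMAT = "%m/%d/%Y %H:%M"


def has_valid_timestamp(row: dict) -> bool:
    """True if 'Created_at' parses as a timestamp in the assumed layout."""
    try:
        datetime.strptime(row["Created_at"].strip(), CREATED_AT_FORMAT)
    except ValueError:
        return False
    return True


def drop_rows_without_time(rows: list) -> tuple:
    """Return (kept_rows, number_of_dropped_rows)."""
    kept = [row for row in rows if has_valid_timestamp(row)]
    return kept, len(rows) - len(kept)


sample = [
    {"ID": 1, "Created_at": "5/18/2011 13:26"},
    {"ID": 2, "Created_at": "feeling awful"},  # text where a timestamp should be
]
kept, dropped = drop_rows_without_time(sample)
print(len(kept), dropped)
```

Reporting the number of dropped rows makes it easy to confirm that the count matches what the missing-data check found.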

5. Create a new column ‘Flu’ using the following formula to flag the rows whose microblog text contains the word ‘flu’, which gives a quick first check for flu-related microblogs.
Yqwu flu formula.png
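
The JMP formula itself appears only in the screenshot above. A rough Python equivalent, assuming a simple case-insensitive substring test, might look like this (flu_flag is a hypothetical name):

```python
def flu_flag(message: str) -> int:
    """Return 1 if the message contains 'flu' (case-insensitive), else 0."""
    return 1 if "flu" in message.lower() else 0


print(flu_flag("I think I have the FLU"))  # 1
print(flu_flag("Nice day at the park"))    # 0
```

A substring test also matches longer words such as ‘influenza’ or ‘fluent’, which is acceptable for a quick first look but not for a final label.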

6. Subset the records with ‘Flu’=1, i.e. the messages containing ‘flu’ (8727 rows), and plot their distribution by date. The distribution below shows that messages containing ‘flu’ increase sharply on the last two days (May 19th and May 20th). Before further analysis, we can therefore tentatively assume that the epidemic broke out in the last few days of the period.
Yqwu lu.png
Distribution of Messages with 'flu'
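
The per-day counts behind such a distribution could be computed as in the sketch below. It assumes each row is a dictionary with a ‘Date’ string and a ‘Flu’ flag; the sample rows are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def flu_counts_by_date(rows: list) -> Counter:
    """Count rows flagged Flu == 1 for each value of 'Date'."""
    return Counter(row["Date"] for row in rows if row["Flu"] == 1)


# Made-up rows, only to show the shape of the input.
sample = [
    {"Date": "2011-05-18", "Flu": 0},
    {"Date": "2011-05-19", "Flu": 1},
    {"Date": "2011-05-19", "Flu": 1},
    {"Date": "2011-05-20", "Flu": 1},
]
for date, count in sorted(flu_counts_by_date(sample).items()):
    print(date, count)
```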

7. Perform text mining to find the flu-related key words. Based on the previous distribution, we assume that the epidemic broke out in the last few days, so we use the Text Explorer in JMP Pro to identify flu-related key words in the messages of May 19th and May 20th. To make the word clouds easier to read, very frequent but uninformative words and symbols such as ‘!’, ‘good’, ‘come’, ‘go’, ‘just’, ‘now’ and ‘today’ can be manually added to the stop word list.
WordCloud 0519.png       Yqwu WordCloud 0520.png
WordCloud of May 19th                                                             WordCloud of May 20th

Based on the most frequent unigrams and bigrams, the introductory materials and an exploration of the messages, the following unigrams and bigrams were chosen as the flu-related key words:
‘headache’, ‘die’, ‘medicine’, ‘death’, ‘breath’, ‘stomach’, ‘pneumonia’, ‘diarrhea’, ‘cough’, ‘flu’, ‘fever’, ‘chill’, ‘sweat’, ‘ache’, ‘pain’, ‘fatigue’, ‘nausea’, ‘vomit’, ‘lymph’, ‘sick’, ‘throat sore’, ‘runny nose’, ‘muscle’, ‘sleep’, ‘ill’
These key words will be used to detect the flu-related messages in the following steps.
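
The unigram and bigram counting that the Text Explorer performs can be approximated in plain Python, as sketched below. This is not how JMP Pro works internally. The tokenizer, the function names and the sample messages are assumptions; only the stop words come from the step above.

```python
import re
from collections import Counter

# Stop words named in step 7; '!' never appears because the tokenizer keeps letters only.
STOP_WORDS = {"good", "come", "go", "just", "now", "today"}


def tokenize(message: str) -> list:
    """Lowercase the message, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z']+", message.lower()) if w not in STOP_WORDS]


def term_frequencies(messages: list) -> tuple:
    """Return (unigram_counts, bigram_counts) over all messages."""
    unigrams, bigrams = Counter(), Counter()
    for message in messages:
        tokens = tokenize(message)
        unigrams.update(tokens)
        # Bigrams are formed after stop-word removal, so they may bridge a removed word.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


# Made-up messages, only to show the call.
unigrams, bigrams = term_frequencies(
    ["Just got a fever today!", "fever and runny nose now", "runny nose again"]
)
print(unigrams.most_common(3))
print(bigrams.most_common(1))
```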

8. Create a new column ‘No. Keyword’ that counts the flu-related key words listed above in each microblog. The following formula computes this count for each message.
Yqwu No.keys formula.png
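
The counting formula is shown only as a screenshot, so the sketch below is an assumption about its behaviour: it counts how many of the listed key words occur in a message, using case-insensitive substring matching. The name count_keywords is hypothetical.

```python
KEYWORDS = [
    "headache", "die", "medicine", "death", "breath", "stomach", "pneumonia",
    "diarrhea", "cough", "flu", "fever", "chill", "sweat", "ache", "pain",
    "fatigue", "nausea", "vomit", "lymph", "sick", "throat sore", "runny nose",
    "muscle", "sleep", "ill",
]


def count_keywords(message: str) -> int:
    """Number of listed key words that occur in the message as substrings."""
    text = message.lower()
    return sum(1 for keyword in KEYWORDS if keyword in text)


print(count_keywords("cough and fever"))  # 2
```

Substring matching lets stems such as ‘vomit’ and ‘chill’ match ‘vomiting’ and ‘chills’, but overlapping key words are then counted together: a message containing ‘headache’ also matches ‘ache’, and ‘chills’ also matches ‘ill’. Matching at word boundaries would avoid this if the overlap is not wanted.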

9. Create a new column ‘Report_illness’, a binary variable that flags the microblogs containing at least two key words, using the following formula.
Yqwu report illness formula.png

10. Subset the records with ‘Report_illness’=1 and save them as a new dataset, ‘preprocess.txt’, which represents all the flu-related microblogs (35744 rows). Sample data from ‘preprocess.txt’ is shown below.
Yqwu preprocess sample.png
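
Steps 9 and 10 could be sketched together as below, assuming each row already carries a ‘No. Keyword’ count. The tab delimiter, the ‘ID’ column and the function names are assumptions; the actual layout of ‘preprocess.txt’ is visible only in the screenshot above. The example writes to an in-memory buffer, and an open file handle could be passed instead.

```python
import csv
import io


def report_illness(keyword_count: int) -> int:
    """Binary flag: 1 if the message contains at least two key words."""
    return 1 if keyword_count >= 2 else 0


def write_flu_subset(rows: list, handle) -> int:
    """Write rows with Report_illness == 1 as tab-delimited text; return the row count."""
    if not rows:
        return 0
    fieldnames = list(rows[0].keys()) + ["Report_illness"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    written = 0
    for row in rows:
        flag = report_illness(row["No. Keyword"])
        if flag == 1:
            writer.writerow({**row, "Report_illness": flag})
            written += 1
    return written


# Made-up rows, only to show the call.
sample = [{"ID": 1, "No. Keyword": 3}, {"ID": 2, "No. Keyword": 1}]
buffer = io.StringIO()
print(write_flu_subset(sample, buffer))
print(buffer.getvalue())
```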