Difference between revisions of "ISSS608 2017-18 T1 Assign WU YUQING Data Description"

From Visual Analytics and Applications
Data preparation
 
1. Split the column ‘Location’ into two separate columns, ‘Latitude’ and ‘Longitude’. Because Smartpolis lies in the Northern and Western Hemispheres, the latitude should be positive and the longitude negative.
 
2. Split the column ‘Created_at’ into two separate columns ‘Date’ and ‘Time’ so that they can be used separately in the following steps.
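
Steps 1 and 2 can be sketched in Python as follows. This is only an illustration, not the procedure used for the assignment: it assumes ‘Location’ holds two space-separated numbers (latitude first) and ‘Created_at’ looks like `5/18/2011 13:26`. Both formats, and the function names, are assumptions.

```python
from datetime import datetime


def split_location(location):
    """Split an assumed 'lat lon' string into signed (latitude, longitude)."""
    lat_text, lon_text = location.split()
    # Northern Hemisphere -> positive latitude; Western Hemisphere -> negative longitude.
    return abs(float(lat_text)), -abs(float(lon_text))


def split_created_at(created_at, fmt="%m/%d/%Y %H:%M"):
    """Split a timestamp into (date, time); return (None, None) if it cannot be parsed."""
    try:
        stamp = datetime.strptime(created_at.strip(), fmt)
    except ValueError:
        # The timestamp is missing or has been replaced by free text.
        return None, None
    return stamp.date().isoformat(), stamp.strftime("%H:%M")
```

Returning `(None, None)` for unparseable values makes rows with a lost timestamp easy to detect afterwards.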
 
3. Check the missing-data pattern of ‘Date’ and ‘Time’. The results show 21 rows with missing values in ‘Date’ and ‘Time’. Inspecting these 21 rows shows that the time in the column ‘Created_at’ has been lost and the field contains text instead.
 
 
 
4. Exclude the 21 rows that lack a posting time. Because so few rows are affected, excluding them from the dataset is reasonable and should not materially affect the analysis result.
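
A minimal sketch of the exclusion in steps 3 and 4, assuming each record is a Python dictionary with ‘Date’ and ‘Time’ keys that are `None` or empty when the timestamp is missing (the record layout and the function name are hypothetical):

```python
def drop_missing_timestamps(rows):
    """Return (kept_rows, n_dropped), dropping rows without a 'Date' or 'Time' value."""
    kept = [row for row in rows if row.get("Date") and row.get("Time")]
    return kept, len(rows) - len(kept)
```

Returning the number of dropped rows lets it be compared with the count of missing values reported in step 3.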
 
5. Create a new column ‘Flu’ that labels the rows whose microblog text contains the word ‘flu’, so that flu-related microblogs can be checked and detected quickly later.
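
The original formula is not reproduced in this text. One possible Python equivalent, assuming a case-insensitive substring match (so words such as ‘influenza’ would also be flagged), is:

```python
def flu_flag(text):
    """Return 1 if the message contains 'flu' (case-insensitive substring), else 0."""
    return 1 if "flu" in text.lower() else 0
```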
 
 
6. Subset the records with ‘Flu’ = 1, which are the messages with ‘flu’ in the text (8,727 rows), and plot their distribution by Date. The distribution shows that the messages with ‘flu’ increase significantly on the last two days (May 19th and May 20th). Before further analysis, we can therefore tentatively assume that the epidemic broke out in the last few days of the period.
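
The daily distribution can be tabulated with a short sketch like the one below. It assumes each record is a dictionary with an ISO-formatted ‘Date’ string and a ‘Flu’ flag; both the layout and the function name are illustrative assumptions.

```python
from collections import Counter


def flu_counts_by_date(rows):
    """Count messages with Flu == 1 per date, returned in chronological order."""
    counts = Counter(row["Date"] for row in rows if row["Flu"] == 1)
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return dict(sorted(counts.items()))
```

The resulting dictionary can be passed to any plotting tool to draw the distribution by date.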
 
 
7. Perform text mining to identify the flu-related key words. Since the previous distribution suggests that the epidemic broke out in the last few days, we use the Text Explorer in JMP Pro to extract the flu-related key words from the messages of May 19th and May 20th.
 
 
 
 
 
Based on the most frequent unigrams and bigrams, combined with the introductory materials and an exploration of the messages, the following unigrams and bigrams are selected as the flu-related key words:
 
‘headache’, ‘die’, ‘medicine’, ‘death’, ‘breath’, ‘stomach’, ‘pneumonia’, ‘diarrhea’, ‘cough’, ‘flu’, ‘fever’, ‘chill’, ‘sweat’, ‘ache’, ‘pain’, ‘fatigue’, ‘nausea’, ‘vomit’, ‘lymph’, ‘sick’, ‘throat sore’, ‘runny nose’, ‘muscle’, ‘sleep’, ‘ill’
 
These key words will be used to recognize the flu-related messages in the following steps.
 
8. Create a new column ‘No. Keyword’ that counts how many of the flu-related key words listed above appear in each microblog.
 
 
9. Create a new column ‘Report_illness’, a binary variable that labels the microblogs with at least two key words in the text.
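
The formulas for steps 8 and 9 are not reproduced in this text. The sketch below shows one way they could work in Python, assuming case-insensitive substring matching. Under that assumption overlapping key words are counted separately (a message containing ‘headache’ also matches ‘ache’), which may differ from the original formula. The function names are hypothetical.

```python
KEYWORDS = [
    "headache", "die", "medicine", "death", "breath", "stomach", "pneumonia",
    "diarrhea", "cough", "flu", "fever", "chill", "sweat", "ache", "pain",
    "fatigue", "nausea", "vomit", "lymph", "sick", "throat sore", "runny nose",
    "muscle", "sleep", "ill",
]


def count_keywords(text):
    """Count how many distinct key words occur in the message (substring match)."""
    lowered = text.lower()
    return sum(1 for keyword in KEYWORDS if keyword in lowered)


def report_illness(text, threshold=2):
    """Return 1 if the message contains at least `threshold` key words, else 0."""
    return 1 if count_keywords(text) >= threshold else 0
```

For example, `report_illness("I have a fever and a cough")` returns 1, because two key words match.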
 
 
10. Subset the records with ‘Report_illness’ = 1 and save them as a new dataset, ‘preprocess.txt’, which represents all the flu-related microblogs (35,744 rows).
 

Revision as of 00:30, 14 October 2017

Data Description

Microblog Messages

The provided CSV file contains 1,023,077 microblog messages posted by users between April 30th, 2011 and May 20th, 2011. The following four attributes are provided in the dataset (see the sample below):
• ID – personal identifier of the individual posting the message
• Created_at – date and time of the post
• Location – latitude and longitude coordinates of the mobile device at the time of post
• Text – the posted message
Yqwu dataset sample.png

Smartpolis Map

The provided PNG image file contains map information for the entire metropolitan area with labeled highways, hospitals, important landmarks, and water bodies.
Yqwu map.png

Population Statistics

The following three attributes are provided:
• Zone_Name – the name of one of the 13 city zones within the metropolitan area
• Population_Density – the number of residents in the zone
• Daytime_Population – the estimated population in the zone due to commuting during work hours

Observed Weather

The following four attributes are provided:
• Date – date of observed weather by weather station
• Weather – weather conditions for a particular day
• Average_Wind_Speed – measured in miles per hour
• Wind_Direction – the direction from which the wind is blowing or from which it originates

Additional Information

1) Economy – The economy of Smartopolis is based on commerce, entertainment, finance, trucking services, shipping services, health care, and industry.
2) Water Supply – Residents and businesses get their drinking water by pumping water from nearby reservoirs or rivers. These distributed water systems are both public and privately owned.
3) Entertainment – Smartopolis has two stadiums (Vastopolis Dome and Westside Stadium) for sports, concerts, and other events. The various lakes and the Vast River, which flows south at a steady rate of three miles per hour, are used for water-based sports and recreation.
4) City Administration – Smartopolis has several locations of significance including a state courthouse, a capitol building, convention center, and a large airport.