Difference between revisions of "ISSS608 2017-18 T1 Assign WU YUQING Data Description"

From Visual Analytics and Applications
 

Revision as of 16:58, 13 October 2017

Data Description

Microblog Messages

The provided CSV file contains 1,023,077 microblog messages posted by users between April 30th, 2011 and May 20th, 2011. The following four attributes are provided in the dataset (please see the sample below):
• ID – personal identifier of the individual posting the message
• Created_at – date and time of the post
• Location – latitude and longitude coordinates of the mobile device at the time of post
• Text – the posted message

Smartpolis Map

The provided PNG image file contains map information for the entire metropolitan area with labeled highways, hospitals, important landmarks, and water bodies.

Population Statistics

The following three attributes are provided:
• Zone_Name – the name of one of the 13 city zones within the metropolitan area
• Population_Density – the number of residents in the zone
• Daytime_Population – the estimated population in the zone due to commuting during work hours

Observed Weather

The following four attributes are provided:
• Date – date of observed weather by weather station
• Weather – weather conditions for a particular day
• Average_Wind_Speed – measured in miles per hour
• Wind_Direction – the direction from which the wind is blowing or from which it originates

Additional Information

1) Economy – The economy of Smartopolis is based on commerce, entertainment, finance, trucking services, shipping services, health care, and industry.
2) Water Supply – Residents and businesses get their drinking water by pumping water from nearby reservoirs or rivers. These distributed water systems are both publicly and privately owned.
3) Entertainment – Smartopolis has two stadiums (Vastopolis Dome and Westside Stadium) for sports, concerts, and other events. The various lakes and the Vast River, which flows south at a steady rate of three miles per hour, are used for water-based sports and recreation.
4) City Administration – Smartopolis has several locations of significance, including a state courthouse, a capitol building, a convention center, and a large airport.

Data Preparation

1. Split the column ‘Location’ into two separate columns, ‘Latitude’ and ‘Longitude’. Because Smartpolis lies in the Northern and Western Hemispheres, latitude should be positive and longitude should be negative.
2. Split the column ‘Created_at’ into two separate columns, ‘Date’ and ‘Time’, so that they can be used separately in the following steps.
3. Check the missing data pattern of ‘Date’ and ‘Time’. The results show 21 rows with missing values in ‘Date’ and ‘Time’. In these 21 rows, the timestamp in the column ‘Created_at’ was lost and appears as plain text instead.
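Steps 1 to 3 were carried out in JMP Pro; for readers working outside JMP, they can be sketched in Python as follows. This is a minimal sketch, not the author's procedure: it assumes that ‘Location’ holds two whitespace-separated numbers and that ‘Created_at’ looks like `5/18/2011 13:26`. Both layouts, and the helper names, are assumptions to adjust against the real file.

```python
from datetime import date, datetime, time
from typing import Iterable, Optional, Tuple

# Assumed timestamp layout, e.g. "5/18/2011 13:26"; adjust to the real file.
CREATED_AT_FORMAT = "%m/%d/%Y %H:%M"


def split_location(location: str) -> Tuple[float, float]:
    """Split 'lat lon' text into (latitude, longitude) with hemisphere signs."""
    lat_text, lon_text = location.split()
    latitude = abs(float(lat_text))    # northern latitude: positive
    longitude = -abs(float(lon_text))  # western longitude: negative
    return latitude, longitude


def split_created_at(created_at: str) -> Tuple[Optional[date], Optional[time]]:
    """Return (date, time), or (None, None) when the value is not a timestamp."""
    try:
        stamp = datetime.strptime(created_at.strip(), CREATED_AT_FORMAT)
    except ValueError:
        return None, None
    return stamp.date(), stamp.time()


def count_missing_timestamps(values: Iterable[str]) -> int:
    """Count 'Created_at' values that cannot be split into date and time."""
    return sum(1 for value in values if split_created_at(value) == (None, None))
```

Rows whose ‘Created_at’ value is free text rather than a timestamp come back as `(None, None)`, which is what the missing-data check in step 3 looks for.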


4. Exclude the 21 rows whose posting time is missing. Because so few rows are affected, removing them should not change the analysis result.
5. Create a new column ‘Flu’ with a formula that labels the rows containing the word ‘flu’ in the microblog, so that flu-related microblogs can be checked and detected quickly later.
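The JMP formula for the ‘Flu’ column is not reproduced on this page, so the following is only a sketch of steps 4 and 5. It assumes each row is a dictionary with ‘Date’, ‘Time’ and ‘Text’ keys and that the label is a case-insensitive substring match on ‘flu’; the function names are hypothetical.

```python
from typing import Dict, List


def drop_missing_timestamps(rows: List[Dict]) -> List[Dict]:
    """Keep only rows that have both a 'Date' and a 'Time' value."""
    return [
        row for row in rows
        if row.get("Date") is not None and row.get("Time") is not None
    ]


def add_flu_flag(rows: List[Dict]) -> List[Dict]:
    """Return copies of the rows with 'Flu' = 1 when the text contains 'flu'."""
    flagged = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        # Assumption: case-insensitive substring match, so 'influenza' also counts.
        new_row["Flu"] = 1 if "flu" in row["Text"].lower() else 0
        flagged.append(new_row)
    return flagged
```

A substring match is the loosest reading of "rows with word ‘flu’"; a whole-word match would give a smaller count.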

6. Subset the records with ‘Flu’=1, which are the messages with ‘flu’ in the text (8,727 rows), and plot the distribution of these records by date. The distribution shows that messages containing ‘flu’ increase significantly on the last two days (May 19th and May 20th). Before further analysis, we can therefore tentatively assume that the epidemic broke out in the last few days of the period.
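The daily distribution in step 6 amounts to counting flagged rows per date. A minimal sketch, assuming rows are dictionaries with a ‘Date’ value that sorts chronologically (for example ISO strings) and a ‘Flu’ value of 0 or 1; the function name is hypothetical:

```python
from collections import Counter
from typing import Dict, List


def flu_counts_by_date(rows: List[Dict]) -> Dict[str, int]:
    """Count rows with 'Flu' == 1 for each date, ordered by date."""
    counts = Counter(row["Date"] for row in rows if row["Flu"] == 1)
    return dict(sorted(counts.items()))
```

The resulting mapping can be passed to any plotting tool to reproduce the bar chart by date.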

7. Perform text mining to find the flu-related keywords. From the previous distribution, we assume that the epidemic broke out in the last few days, so we use the Text Explorer in JMP Pro to identify flu-related keywords in the messages of May 19th and May 20th.
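The Text Explorer in JMP Pro produces the term frequencies directly. Outside JMP, a rough equivalent is to count unigrams and bigrams by hand, as in the sketch below. The tokenisation rule (lowercase runs of letters and apostrophes) is an assumption and will not match JMP's tokeniser exactly.

```python
import re
from collections import Counter
from typing import Iterable, Tuple


def ngram_counts(messages: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count unigrams and bigrams over messages; bigrams never span two messages."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for message in messages:
        # Assumed tokeniser: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", message.lower())
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams
```

Calling `unigrams.most_common(20)` and `bigrams.most_common(20)` on the messages from May 19th and May 20th gives candidate terms to review by hand.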


Based on the most frequent unigrams and bigrams, combined with the introductory materials and exploration of the messages, the following unigrams and bigrams were chosen as the flu-related keywords: ‘headache’, ‘die’, ‘medicine’, ‘death’, ‘breath’, ‘stomach’, ‘pneumonia’, ‘diarrhea’, ‘cough’, ‘flu’, ‘fever’, ‘chill’, ‘sweat’, ‘ache’, ‘pain’, ‘fatigue’, ‘nausea’, ‘vomit’, ‘lymph’, ‘sick’, ‘throat sore’, ‘runny nose’, ‘muscle’, ‘sleep’, ‘ill’.

These keywords are used to identify the flu-related messages in the following steps.

8. Create a new column ‘No. Keyword’ that counts how many of the flu-related keywords listed above appear in each microblog.

9. Create a new column ‘Report_illness’, a binary variable that labels the microblogs with at least two keywords in the text.
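The JMP formulas for ‘No. Keyword’ and ‘Report_illness’ are not reproduced on this page. The sketch below shows one plausible reading: each keyword counts once per message if it occurs anywhere in the lowercased text, and a message is labelled 1 when at least two keywords occur. Substring matching is an assumption; it means ‘headache’ also matches ‘ache’ and ‘will’ matches ‘ill’, so the original formula may count differently.

```python
# Keyword list as given in the text above.
FLU_KEYWORDS = [
    "headache", "die", "medicine", "death", "breath", "stomach", "pneumonia",
    "diarrhea", "cough", "flu", "fever", "chill", "sweat", "ache", "pain",
    "fatigue", "nausea", "vomit", "lymph", "sick", "throat sore", "runny nose",
    "muscle", "sleep", "ill",
]


def count_keywords(text: str) -> int:
    """Number of distinct keywords found in the text (assumed substring match)."""
    lowered = text.lower()
    return sum(1 for keyword in FLU_KEYWORDS if keyword in lowered)


def report_illness(text: str, threshold: int = 2) -> int:
    """Return 1 when the text contains at least `threshold` keywords, else 0."""
    return 1 if count_keywords(text) >= threshold else 0
```

Requiring two keywords rather than one trades some recall for fewer false positives from common words such as ‘sleep’ or ‘die’.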

10. Subset the records with ‘Report_illness’=1 and save them as a new dataset, ‘preprocess.txt’, which represents all the flu-related microblogs (35,744 rows). The sample data from ‘preprocess.txt’ is as follows.